Preface


This book is written for high school and college students learning about probability 
for the first time. Most of the book is very practical, with a large number of concrete 
examples and worked-out problems. However, there are also parts that are a bit 
theoretical (at least for an introductory book), with many mathematical derivations. 
All in all, if you are looking for a book that serves as a quick reference, this may not 
be the one for you. But if you are looking for a book that starts at the beginning and 
derives everything from scratch in a comprehensive manner, then you’ve come to 
the right place. In short, this book will appeal to the reader who has a healthy level 
of enthusiasm for understanding how and why the standard results of probability 
come about. 

Probability is a very accessible (and extremely fun!) subject, packed with
challenging problems that don’t require substantial background or serious math. The
examples in Chapter 2 are a testament to this. Of course, there are plenty of
challenging topics in probability that do require a more formal background and some
heavy-duty math. This will become evident in Chapters 4 and 5 (and the latter part 
of Chapter 3). However, technically the only math prerequisite for this book is a 
comfort with algebra. Calculus isn’t relied on, although there are a few problems 
that do involve calculus. These are marked clearly. 

All of the problems posed at the ends of the chapters have solutions included. 
The difficulty is indicated by stars; most problems have two stars. One star means 
plug and chug, while three stars mean some serious thinking. Be sure to give a solid 
effort when solving a problem, and don’t look at the solution too soon. If you can’t 
solve a problem right away, that’s perfectly fine. Just set it aside and come back to 
it later. It’s better to solve a problem later than to read the solution now. If you do 
eventually need to look at a solution, cover it up with a piece of paper and read one 
line at a time, to get a hint to get started. Then set the book aside and work things 
out for real. That way, you can still (mostly) solve it on your own. You will learn 
a great deal this way. If you instead head right to the solution and read it straight 
through, you will learn very little. 

For instructors using this book as the assigned textbook for a course, a set of 
homework exercises is posted at www.people.fas.harvard.edu/~djmorin/book.html. 
A solutions manual is available to instructors upon request. When sending a request, 
please point to a syllabus and/or webpage for the course. 

The outline of this book is as follows. Chapter 1 covers combinatorics, which 
is the study of how to count things. Counting is critical in probability, because 
probabilities often come down to counting the number of ways that something can
happen. In Chapter 2 we dive into actual probability. This chapter includes a large
number of examples, ranging from coins to cards to four classic problems presented 
in Section 2.4. Chapter 3 covers expectation values, including the variance and 
standard deviation. A section on the “sample variance” is included; this is rather 
mathematical and can be skipped on a first reading. In Chapter 4 we introduce the 
concept of a continuous distribution and then discuss a number of the more
common probability distributions. In Chapter 5 we see how the binomial and Poisson
distributions reduce to a Gaussian (or normal) distribution in certain limits. We 
also discuss the law of large numbers and the central limit theorem. Chapter 6 is 
somewhat of a stand-alone chapter, covering correlation and regression. Although 
these topics are usually found in books on statistics, it makes sense to include them 
here, because all of the framework has been set. Chapter 7 contains six appendices. 
Appendix C deals with approximations to (1 + a)^n, which are critical in the
calculations in Chapter 5, Appendix E lists all of the main results we derive in the book,
and Appendix F contains a glossary of notation; you may want to refer to this when 
starting each chapter. 

A few informational odds and ends: This book contains many supplementary
remarks that are separated off from the main text; these end with a shamrock, “♣”.
The letters N, n, and k generally denote integers, while x and t generally denote
continuous quantities. Upper-case letters like X denote a random variable, while 
lower-case letters like x denote the value that the random variable takes. We
refer to the normal distribution by its other name, the “Gaussian” distribution. The
numerical plots were generated with Mathematica. I will sometimes use “they” as 
a gender-neutral singular pronoun, in protest of the present failing of the English 
language. And I will often use an “ ’s” to indicate the plural of one-letter items (like 
6’s on dice rolls). Lastly, we of course take the frequentist approach to probability 
in this introductory book. 

I would particularly like to thank Carey Witkov for meticulously reading through 
the entire book and offering many valuable suggestions. Joe Swingle provided many 
helpful comments and sanity checks throughout the writing process. Other friends 
and colleagues whose input I am grateful for are Jacob Barandes, Sharon
Benedict, Joe Blitzstein, Brian Hall, Theresa Morin Hall, Paul Horowitz, Dave Patterson,
Alexia Schulz, and Corri Taylor. 

Despite careful editing, there is essentially zero probability that this book is 
error free (as you can show in Problem 4.16!). If anything looks amiss, please check 
the webpage www.people.fas.harvard.edu/~djmorin/book.html for a list of typos, 
updates, additional material, etc. And please let me know if you discover
something that isn’t already posted. Suggestions are always welcome.


David Morin 
Cambridge, MA 



Chapter 1 


Combinatorics 


TO THE READER: This book is available as both a paperback and an eBook. I 
have made a few chapters available on the web, but it is possible (based on past 
experience) that a pirated version of the complete book will eventually appear on 
file-sharing sites. In the event that you are reading such a version, I have a request: 

If you don't find this book useful (in which case you probably would have returned 
it, if you had bought it), or if you do find it useful but aren’t able to afford it, then 
no worries; carry on. However, if you do find it useful and are able to afford the 
Kindle eBook (priced below $10), then please consider purchasing it (available 
on Amazon). If you don’t already have the Kindle reading app for your computer, 
you can download it free from Amazon. I chose to self-publish this book so that I 
could keep the cost low. The resulting eBook price of around $10, which is very 
inexpensive for a 350-page math book, is less than a movie and a bag of popcorn, 
with the added bonus that the book lasts for more than two hours and has zero 
calories (if used properly!). 

- David Morin 


Combinatorics is the study of how to count things. By “things” we mean the various 
combinations, permutations (different orderings), subgroups, and so on, that can be 
formed from a given set of objects/people/etc. For example, how many different 
outcomes are possible if you flip a coin four times? How many different full-house 
hands are there in poker? How many different committees of three people can be 
chosen from five people? What if we additionally designate one person as the
committee’s president? Knowing how to count these types of things is critical for an
understanding of probability, because when calculating the probability of a given 
event, we often need to count the number of ways that the event can happen. 

The outline of this chapter is as follows. In Section 1.1 we introduce the
concept of factorials, which are ubiquitous in the study of probability. In Section 1.2
we learn how to count the number of possible permutations (orderings) of a set of 
objects. Section 1.3 covers the number of possible combined outcomes of a repeated 
experiment, where each repetition has an identical set of possible results. Examples 
include rolling dice and flipping coins. In Section 1.4 we learn how to count the
number of subgroups that can be formed from a given set of objects, where the
order within the subgroup matters. An example is choosing a committee of people
in which all of the positions are distinct. Section 1.5 covers the related question of 
the number of subgroups that can be formed from a given set of objects, where the 
order within the subgroup doesn’t matter. An example is a poker hand; the order 
of the cards in the hand is irrelevant. We find that the answer takes the form of a 
binomial coefficient. In Section 1.6 we summarize the various results we have found 
so far. We discover that one result is missing from our counting repertoire, and we 
remedy this in Section 1.7. In Section 1.8 we look at the binomial coefficients in 
more detail. 

After learning in this chapter how to count all sorts of things, we’ll see in
Chapter 2 how the counting can be used to calculate probabilities. It’s usually a trivial
step to obtain a probability once you’ve counted the relevant things, so the work we 
do here will prove well worth it. 


1.1 Factorials 

Before getting into the discussion of actual combinatorics, we first need to look at a 
certain quantity that comes up again and again. This quantity is called the factorial. 
We’ll see throughout this chapter that when dealing with a situation that involves
an integer N, we often need to consider the product of the first N integers. This
product is called “N factorial,” and it is denoted by “N!”.¹ For the first few integers,
we have:


      1! = 1,
      2! = 1 · 2 = 2,
      3! = 1 · 2 · 3 = 6,
      4! = 1 · 2 · 3 · 4 = 24,                          (1.1)
      5! = 1 · 2 · 3 · 4 · 5 = 120,
      6! = 1 · 2 · 3 · 4 · 5 · 6 = 720.


As N increases, N! gets very large very fast. For example, 10! = 3,628,800, and
20! ≈ 2.43 · 10^18. In Chapter 2 we will introduce an approximation to N! called
Stirling’s formula. This formula makes it clear what we mean by the statement, “N!
gets very large very fast.”

We should add that 0! is defined to be 1. Of course, 0! doesn’t make much sense, 
because when we talk about the product of the first N integers, it is understood that
we start with 1. Since 0 is below this starting point, it is unclear what 0! actually 
means. However, there is no need to try too hard to make sense of it, because as 
we’ll see below, if we simply define 0! to be 1, then a number of formulas turn out 
to be very nice. 
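
If you want to check the above values on a computer, the following is a minimal Python sketch. The function name factorial_product is our own label (it isn’t notation used elsewhere in this book); it simply multiplies the integers from 1 up to N. For N = 0 the loop has nothing to multiply, so the function returns 1, which agrees with the definition 0! = 1.

```python
import math


def factorial_product(n: int) -> int:
    """Return the product 1 * 2 * ... * n (an empty product, 1, when n = 0)."""
    result = 1
    for k in range(1, n + 1):
        result *= k
    return result


# Print N! for the values mentioned in the text.
for n in [0, 1, 2, 3, 4, 5, 6, 10, 20]:
    print(f"{n}! = {factorial_product(n)}")

# Cross-check against the standard library's math.factorial.
assert all(factorial_product(n) == math.factorial(n) for n in range(21))
```

Python integers don’t overflow, so the rapid growth of N! shows up only as longer and longer printed numbers.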

¹ I don’t know why someone long ago picked the exclamation mark for this notation. But just
remember that it has nothing to do with the more common grammatical use of the exclamation mark for
emphasis. So try not to get too excited when you see “N!”!

Having defined N!, we can now start counting things. With the exception of the
result in Section 1.3, all of the main results in this chapter involve factorials. 


1.2 Permutations 

A permutation of a set of objects is a way of ordering them. For example, if we have 
three people - Alice, Bob, and Carol - then one permutation of them is Alice, Bob, 
Carol. Another permutation is Carol, Alice, Bob. Another is Bob, Alice, Carol. It 
turns out that there are six permutations in all, as we will see below. The goal of this 
section is to learn how to count the number of possible permutations. We’ll do this 
by starting off with the very simple case where we have only one object. Then we’ll 
consider two objects, then three, and so on, until we see a pattern. The route we 
take here will be a common one throughout this book: Although many of the results 
can be derived in a few lines of reasoning, we’ll take the longer route where we 
start with a few simple examples and then generalize until we arrive at the desired 
results. Concrete examples always make it easier to understand a general result. 

One object 

If we have only one object, then there is clearly only one way to “order” it; there is 
no ordering to be done. A list of one object simply consists of that one object, and 
that’s that. If we use the notation where P_N stands for the number of permutations
of N objects, then we have P_1 = 1.

Two objects 

With two objects, things aren’t completely trivial like they are in the one-object 
case, but they’re still very simple. If we label our two objects as 1 and 2, then we 
can order them in two ways: 

      1 2    or    2 1

So we have P_2 = 2. At this point, you might be thinking that this result, along with
the above P_1 = 1 result, suggests that P_N = N for any positive integer N. This
would mean that there should be three different ways to order three objects. Well, 
not so fast... 

Three objects 

Things get more interesting with three objects. If we call them 1, 2, and 3, then we 
can list out the possible orderings. The permutations are shown in Table 1.1.

      1 2 3    2 1 3    3 1 2
      1 3 2    2 3 1    3 2 1


Table 1.1: Permutations of three objects. 
So we have P_3 = 6. Note that we’ve grouped these six permutations into three
subgroups (the three columns), according to which number comes first. It isn’t
necessary to group them this way, but we’ll see below that this method of organization
has definite advantages. It will simplify how we think about the case where the 
number of objects is a general number N. 

Remark: There is no need to use the numbers 1, 2, 3 to represent the three objects. You can 
use whatever symbols you want. For example, the letters A, B, C work fine, as do the letters
H, Q, Z. You can even use symbols like ®, *, 9. Or you can mix things up with O, W, 7. The
point is that the numbers/letters/symbols/whatever simply stand for three different things, and 
they need not have any meaningful properties except for their different appearances when you 
write them down. However, having said this, there is certainly something simple about the 
numbers 1, 2, 3, ..., or the letters A, B, C, ..., so we’ll generally work with these. In any 
case, it is usually a good idea to be as economical as possible and not write down the full 
names, such as Alice, Bob, Carol, etc. ♣


Four objects 

The pattern so far is P_1 = 1, P_2 = 2, and P_3 = 6. Although you might be able to
guess the general rule from these three results, it will be easier to see the pattern 
if we look at the next case with four objects. Taking a cue from the above list of 
six permutations of three objects, let’s organize the permutations of four objects 
(labeled 1, 2, 3, 4) according to which number comes first. We end up with the 24 
permutations shown in Table 1.2. 


      1 2 3 4    2 1 3 4    3 1 2 4    4 1 2 3
      1 2 4 3    2 1 4 3    3 1 4 2    4 1 3 2
      1 3 2 4    2 3 1 4    3 2 1 4    4 2 1 3
      1 3 4 2    2 3 4 1    3 2 4 1    4 2 3 1
      1 4 2 3    2 4 1 3    3 4 1 2    4 3 1 2
      1 4 3 2    2 4 3 1    3 4 2 1    4 3 2 1


Table 1.2: Permutations of four objects. 


If we look at the last column, where all the permutations start with 4, we see that if 
we strip off the 4, we’re simply left with the six permutations of the three numbers 
1, 2, 3 that we listed in Table 1.1. A similar thing happens with the column of
permutations that start with 3. If we strip off the 3, we’re left with the six permutations
of the numbers 1, 2, 4. Likewise for the columns of permutations that start with 2 
or 1. The 24 permutations listed in Table 1.2 can therefore be thought of as four 
groups (the four columns), each consisting of six permutations. 
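
The column structure described above can also be checked with a short Python sketch. This is only an illustration; the variable names (objects, all_perms, columns) are our own, and we rely on the standard library function itertools.permutations to generate the orderings rather than writing them out by hand.

```python
from itertools import permutations

objects = [1, 2, 3, 4]
all_perms = list(permutations(objects))  # tuples such as (1, 2, 3, 4)

# Group the permutations into "columns" according to their first entry.
columns = {first: [p for p in all_perms if p[0] == first] for first in objects}

for first, column in columns.items():
    print(first, len(column), ["".join(map(str, p)) for p in column])

# Strip the leading 4 from the last column; what remains should be
# the permutations of the three numbers 1, 2, 3.
stripped = [p[1:] for p in columns[4]]
print(sorted(stripped) == sorted(permutations([1, 2, 3])))
```

Each of the four columns has six entries, for a total of 24 permutations.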


Five objects 

For five objects, you probably don’t want to write down all the permutations,
because it turns out that there are 120 of them. But you can imagine writing them
all down. And for the present purposes, that’s just as good as (or even better than) 
actually writing them down for real. 
Consider the permutations of 1, 2, 3, 4, 5 that start with 1. From the above result
for the N = 4 case, the other four numbers 2, 3, 4, 5 can be permuted in 24 ways.
So there are 24 permutations that start with 1. Likewise, there are 24 permutations 
that start with 2. And similarly for 3, 4, and 5. So we have five groups (columns, 
if you want to imagine writing them that way), each consisting of 24 permutations. 
The total number of permutations of five objects is therefore 5 · 24 = 120.


General case of N objects 

Collecting the above results, we have 

      P_1 = 1,   P_2 = 2,   P_3 = 6,   P_4 = 24,   P_5 = 120.        (1.2)


Do these numbers look familiar? Yes indeed, they are simply the N! results in
Eq. (1.1). Does this equivalence make sense? Yes, due to the following reasoning. 

• P_1 = 1, of course.

• P_2 = 2, which can be written in the suggestive form, P_2 = 2 · 1.

• For P_3, Table 1.1 shows that P_3 = 6 can be thought of as three groups
(characterized by which number appears first) of the P_2 = 2 permutations of the
second and third numbers. So we have P_3 = 3P_2 = 3 · 2 · 1.

• Similarly, for P_4, Table 1.2 shows that P_4 = 24 can be thought of as four
groups (characterized by which number appears first) of the P_3 = 6
permutations of the second, third, and fourth numbers. So we have P_4 = 4P_3 =
4 · 3 · 2 · 1.

• Likewise, the above reasoning for N = 5 shows that P_5 = 5P_4 = 5 · 4 · 3 · 2 · 1.
And so on and so forth. Therefore:

• At each stage, we have P_N = N · P_{N−1}. Since the sequence of numbers starts
with P_1 = 1, this relation is easily seen to be satisfied by the general formula,

      P_N = N!                                          (1.3)

Basically, you just need to tack on a factor of N at each stage, due to the
fact that the permutations can start with any of the N numbers (or whatever
objects you’re dealing with). The number of permutations of N objects is
therefore N!.
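
The step-by-step reasoning above amounts to a recursion, which can be written out as a short Python sketch. The function name count_permutations is our own, and the sketch assumes N is a positive integer. It starts from the one-object case and tacks on a factor of N at each stage, and the result is then compared with the standard library’s math.factorial.

```python
import math


def count_permutations(n: int) -> int:
    """Number of orderings of n objects, via P_N = N * P_(N-1) with P_1 = 1."""
    if n == 1:
        return 1  # one object can be "ordered" in only one way
    return n * count_permutations(n - 1)


for n in range(1, 6):
    print(n, count_permutations(n), math.factorial(n))
```

The two numbers printed on each line agree, which is just the statement that the recursion is satisfied by N!.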


The strategy of assigning seats 

An equivalent way of thinking about the P_N = N! result is the following. For
concreteness, let’s say that we have four people, Alice, Bob, Carol, and Dave. And
let’s assume that they need to be assigned to four seats arranged in a line. The N!
result tells us that there are 4! = 24 different permutations they can take. We’ll now 
give an alternative derivation that shows how these 24 orderings can be understood 
easily by imagining the seats being filled one at a time. We’ll get a lot of mileage 
out of this type of “seat filling” argument throughout this chapter and the next. 
• There are four possibilities for who is assigned to the first seat.

• For each of these four possibilities, there are three possibilities for who is
assigned to the second seat (because we’ve already assigned one person, so
there are only three people left). There are therefore 4 · 3 = 12 possibilities
for how the inhabitants of the first two seats are chosen.

• For each of these 12 possibilities, there are two possibilities for who is
assigned to the third seat (because there are only two people left). There are
therefore 4 · 3 · 2 = 24 possibilities for how the inhabitants of the first three
seats are chosen.

• Finally, for each of these 24 possibilities, there is only one possibility for who
is assigned to the fourth seat (because there is only one person left, so we’re
stuck with him/her). There are therefore 4 · 3 · 2 · 1 = 24 possibilities for
how the inhabitants of all four seats are chosen. The 1 here doesn’t matter, of
course; it just makes the formula look nicer.

You can see how this counting works for the N = 4 case in Table 1.2. There 
are four possibilities for the first entry, which stands for the person assigned to the 
first seat if we label the people by 1, 2, 3, 4. Once we pick the first entry, there are
three possibilities for the second entry. And once we pick the second entry, there 
are two possibilities for the third entry. And finally, once we pick the third entry, 
there is only one possibility for the fourth entry. You can verify all these statements 
by looking at the table. 
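
The seat-filling argument can be mimicked in a few lines of Python. This is a rough sketch under our own naming (the function fill_seats and the one-letter labels for the people are hypothetical choices). It fills the seats one at a time and records how many partial seatings exist after each stage.

```python
def fill_seats(people):
    """Fill the seats one at a time; return all seatings and the count after each stage."""
    partial = [()]  # before any seat is filled, there is one (empty) seating
    stage_counts = []
    for _ in people:
        # Extend every partial seating by each person who isn't seated yet.
        partial = [seated + (p,) for seated in partial for p in people if p not in seated]
        stage_counts.append(len(partial))
    return partial, stage_counts


seatings, counts = fill_seats(["A", "B", "C", "D"])
print(counts)         # [4, 12, 24, 24]
print(len(seatings))  # 24
```

The counts 4, 12, 24, 24 are the products 4, 4 · 3, 4 · 3 · 2, and 4 · 3 · 2 · 1 from the four stages above.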

If you want to think in terms of a picture, the above process is depicted in the 
branching tree in Fig. 1.1. We’ve changed the numbers 1, 2, 3, 4 to the letters A, 
B, C, D, with the different possibilities at each branch being listed in alphabetical 
order, left to right. We’ve listed the four possibilities in the first stage and the twelve 
possibilities in the second stage. However, we haven’t listed the 24 possibilities in 
each of the last two stages, because there isn’t room in the figure. But one possibility 
in each stage is shown. 





Figure 1.1: The branching tree for permutations of four objects. The number of branches in 
each fork decreases by one at each successive stage. 

It should be emphasized that when dealing with situations that involve statements
such as, “There are a possibilities for Outcome 1, and for each of these there
are b possibilities for Outcome 2, and for each of these there are c possibilities for
Outcome 3, and so on...,” the total number of different possibilities when all of the
outcomes are listed together is the product (not the sum!) of the numbers of
possibilities for the different outcomes, that is, a · b · c ···. You should stare at Table 1.2 and
Fig. 1.1 until you’re comfortable with this. The reason for the product boils down to
the words, “...for each of these...,” in the above statement. As a simple analogy, if
7 people are each carrying 3 books, then there are 7 · 3 = 21 (not 7 + 3 = 10) books
in all.
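
As a quick illustration of the product rule, here is a small Python sketch. The three lists of labels are made-up placeholders with a = 2, b = 3, and c = 4 possibilities, and we assume for simplicity that the possibilities at each stage don’t depend on the earlier choices. The sketch lists every combined outcome and compares the count with both the product and the sum.

```python
from itertools import product
from math import prod

# Hypothetical possibilities for three outcomes (a = 2, b = 3, c = 4).
stages = [["a1", "a2"], ["b1", "b2", "b3"], ["c1", "c2", "c3", "c4"]]

combined = list(product(*stages))  # every way of picking one item per stage

print(len(combined))                 # number of combined outcomes
print(prod(len(s) for s in stages))  # a * b * c
print(sum(len(s) for s in stages))   # a + b + c, which is NOT the count
```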


Example (Five plus four): Nine people are to be assigned to nine seats in a row, with 
the stipulation that five specific people go in the left five seats, and the remaining four 
people go in the right four seats. How many different assignments can be made? 

Solution: There are five ways to put someone (from the five specific people) in the 
leftmost seat, and then for each of these five ways there are four ways to put someone 
(from the remaining four of the five specific people) in the next seat, and so on. So 
there are 5! = 120 ways to assign the five specific people to the left five seats. For each 
of these 5! ways, there are 4! = 24 ways to assign the remaining four people to the right
four seats (by the same reasoning as above). The total number of ways of assigning
the nine people (with the given restriction) is therefore 5! · 4! = 120 · 24 = 2,880.
Note that this result is much smaller than the 9! = 362,880 result in the case where
there is no restriction, that is, where any person can sit in any seat. The ratio of these
two results is 9!/(5! · 4!) = 126. This sort of number (a quotient involving three
factorials) will play a huge role in Section 1.5. 
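
If you don’t trust the 5! · 4! reasoning, you can check it by brute force. The following Python sketch is one way to do this, under our own labeling assumption that the people are numbered 0 through 8, with people 0 through 4 being the five specific people who must sit in the left five seats. It runs through all 9! seatings (a few hundred thousand, which a computer handles quickly) and counts the ones that satisfy the restriction.

```python
from itertools import permutations
from math import factorial

direct = factorial(5) * factorial(4)  # the 5! * 4! result from the solution

# seating[i] is the person in seat i; seats 0-4 are the left five seats.
brute = sum(
    1
    for seating in permutations(range(9))
    if all(person < 5 for person in seating[:5])
)

print(direct, brute)           # the two counts should agree
print(factorial(9) // direct)  # ratio of unrestricted to restricted counts
```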


1.3 Ordered sets, repetitions allowed 

In this section we’ll learn how to count the number of possible outcomes of repeated 
identical processes/trials/experiments, where the order of the individual results
matters. This scenario is the first of four related scenarios we’ll discuss in this chapter.
These are summarized later in Tables 1.11 and 1.12, with the present scenario being 
the upper-left one in the table. Two common examples are repeated rolls of a die 
and repeated flips of a coin. We’ll discuss these below, but let’s start off with an 
example that involves drawing balls from a box. 

Let’s say that we have a box containing five balls labeled A, B, C, D, E. We 
reach in and pick a ball and write down the letter. We then put the ball back in the 
box, shake the box around, and pick a second ball (which might be the same as the 
first ball) and write down this letter next to the first one, to the right of it (so the 
order matters). Equivalently, we can imagine having two boxes (a left one and a 
right one) with identical sets of balls labeled A, B, C, D, E, and we pick one ball 
from each box. We can think about it either way. The point is that the process of 
picking a ball is identical each time. We’ll refer to this kind of setup in various 
equivalent ways, but you should remember that all of the following phrases mean 
the same thing: 
• identical trials

• with replacement 

• repetitions allowed 

Basically, identical trials can be constructed by placing the ball you just drew 
back in the box, which means that it’s possible for a future ball to be a repeat of a 
ball you’ve already drawn. Of course, with things like dice and coins, the trials are 
inherently identical, which means that repetitions are automatically allowed. So we 
don’t need to talk about replacement. You don’t remove the dots on a die after you 
roll it! 

How many possible pairs of letters (where repetition is allowed and where the 
order matters) can we pick in the above five-ball example? More generally, how 
many different ordered sets of letters can we pick if we do n trials instead of only 
two? Or if we have N balls instead of five? 

In the case of N = 5 balls and n = 2 trials, the various possibilities are shown
in Table 1.3. There are five possibilities for the first pick (represented by the five
columns in the table), and then for each of these there are five possibilities for the
second pick (represented by the five different entries in each column, or equivalently
by the five rows). The total number of possible pairs of letters is therefore 5 · 5 = 25.
Remember that the order matters. So AC is different from CA, for example. 


      AA   BA   CA   DA   EA
      AB   BB   CB   DB   EB
      AC   BC   CC   DC   EC
      AD   BD   CD   DD   ED
      AE   BE   CE   DE   EE


Table 1.3: Drawing two balls from a box containing five balls, with replacement. 
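
A table like this one can be generated with a short Python sketch. The variable names are our own, and we use the standard library function itertools.product with repeat=2 to stand for two identical picks with replacement.

```python
from itertools import product

balls = "ABCDE"

# All ordered pairs (first pick, second pick), with repetition allowed.
pairs = ["".join(pick) for pick in product(balls, repeat=2)]
print(len(pairs))

# Lay the pairs out with one column per first pick and one row per second pick.
for second in balls:
    print("  ".join(first + second for first in balls))
```

Note that both AC and CA appear in the list, because the order matters.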

If we do only n = 1 trial instead of two, then there are of course just 5^1 = 5
possibilities. Instead of the square in Table 1.3, we simply have one column (just
looking at the second letter in each pair), or one row (just looking at the first letter
in each pair).

If we increase the number of trials to n = 3, then the square in Table 1.3 becomes
a cube, with the third axis (pointing into the page) representing the third pick. For
each of the 5^2 possibilities in Table 1.3 for the first two letters, there are five
possibilities for the third, yielding 5^2 · 5 = 5^3 = 125 possible triplets in all. Again
remember that the order matters. So AAD is different from ADA, for example.

Similarly, n = 4 trials yield 5^3 · 5 = 5^4 = 625 possibilities. In this case the
corresponding geometrical shape is a 4-dimensional hypercube - not exactly an easy
thing to visualize! Now, the point of listing out the possibilities in a convenient
geometrical shape is that it can help you do the counting. However, if the geometrical
shape is a pain to visualize, then you shouldn’t bother with it. Fortunately there is
no need to visualize higher-dimensional cubes. The above pattern of reasoning tells
us that there are 5^n different possible results when doing n trials of picking a letter
from a 5-letter box, with replacement and with the order mattering.
More generally, if we do n trials involving a box that contains N letters instead
of the specific number 5, then the total number of possible results is N^n. This is
true because there are N possible results for the first pick. And then for each of
these N results, there are N possible results for the second pick, yielding N^2 results
for the first two picks. And then for each of these N^2 results for the first two picks,
there are N possible results for the third pick, yielding N^3 results for the first three
picks. And so on. Remember (as we noted near the end of Section 1.2) that the
total number of results of n trials here is the product (not the sum) of the N possible
results for each trial. So we obtain N^n (and not nN).

Our main result in this section is therefore: The number of possible outcomes 
when picking n objects from a box containing N distinct objects (with replacement 
after each stage, and with the order mattering) is: 


      Number of possible outcomes = N^n                 (1.4)
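
As a sanity check on this result, the following Python sketch counts the ordered outcomes by explicitly generating them and compares the count with N^n. The function name count_ordered_outcomes is our own, and generating every outcome is practical only for small N and n.

```python
from itertools import product


def count_ordered_outcomes(num_objects: int, num_trials: int) -> int:
    """Count ordered lists of num_trials picks from num_objects objects, with replacement."""
    return sum(1 for _ in product(range(num_objects), repeat=num_trials))


for num_objects, num_trials in [(5, 1), (5, 2), (5, 3), (5, 4), (3, 4)]:
    counted = count_ordered_outcomes(num_objects, num_trials)
    print(num_objects, num_trials, counted, num_objects**num_trials)
```

In each line, the explicit count agrees with N^n.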


This N^n “power-law” result is demonstrated pictorially for N = 3 in the branching
tree in Fig. 1.2. At each vertex, we have a choice of three paths. A diagonally 
leftward path corresponds to picking the letter A, an upward path corresponds to 
the letter B, and a diagonally rightward path corresponds to the letter C. 



[Branching tree for N = 3: 3, 9, 27, and 81 outcomes after one, two, three, and four trials.]


Figure 1.2: The branching tree for ordered lists chosen from three objects, with replacement. 


After n = 1 trial, there are 3 possibilities: A, B, C. After n = 2 trials, there are
3^2 = 9 possibilities: AA, AB, AC, BA, BB, BC, CA, CB, CC. After n = 3 trials,
there are 3^3 = 27 possibilities. We haven’t labeled them in the figure, because they
wouldn’t fit, but they are listed in Table 1.4 (grouped in a reasonable manner).

After n = 4 trials, there are 3^4 = 81 possibilities: AAAA, AAAB, etc. We have
indicated one of these in Fig. 1.2, namely ACAB. (The arrow points at the middle 
branch of the relevant top-level triplet.) If you want to list out all 81 possibilities, 
you can put an A in front of all 27 entries in Table 1.4, and then a B in front of all 
of them, and then finally a C in front of all of them. 

After another trial or two, the branches in Fig. 1.2 become too small to
distinguish the different outcomes. But as with the hypercubes mentioned above, there is
fortunately no need to write down the entire branching tree. The tree simply helps
in understanding the N^n result.

      AAA   BAA   CAA
      AAB   BAB   CAB
      AAC   BAC   CAC
      ABA   BBA   CBA
      ABB   BBB   CBB
      ABC   BBC   CBC
      ACA   BCA   CCA
      ACB   BCB   CCB
      ACC   BCC   CCC


Table 1.4: The 27 ordered lists of three objects chosen from a set of three objects, with
replacement.


There are two differences between the present N^n result and the N! permutation
result in Eq. (1.3). First, the factors in N^n are all N’s, because there are N possible
outcomes for each of the identical trials (because we put the ball back in the box
after each trial), whereas the factors in N! start with N and decrease to 1 (because
there is one fewer possibility at each stage; once a letter/number is used up, we can’t
use it again). This difference is evident in Figs. 1.1 and 1.2. In the latter, the number
of branches is always the same at each stage, whereas in the former, the number of
branches decreases by one at each stage. The second difference is that the N^n result
involves your choice of the number n of trials (which may very well be larger than
N; there is no restriction on the size of n), whereas N! involves exactly the N factors
from N down to 1, because we’re looking at orderings of the entire set of N objects.

Let’s now look at two classic examples, involving dice and coins, where the N^n
type of counting comes up. 


Example 1 (Rolling dice): If you roll a standard six-sided die twice (or equivalently, 
roll two dice), how many different possible ordered outcomes are there? 

Solution: There are six possibilities for what the first die shows, and six for the 
second. So there are 6^2 = 36 possibilities in all. If you want to list them out, they are
shown in Table 1.5. 


      1,1   2,1   3,1   4,1   5,1   6,1
      1,2   2,2   3,2   4,2   5,2   6,2
      1,3   2,3   3,3   4,3   5,3   6,3
      1,4   2,4   3,4   4,4   5,4   6,4
      1,5   2,5   3,5   4,5   5,5   6,5
      1,6   2,6   3,6   4,6   5,6   6,6


Table 1.5: The 36 possible ordered outcomes for two dice rolls. 


Since we are assuming that the order matters, a 2,5 is different from a 5,2. That is, 
rolling a 2 on the first die (or, say, the left die if you’re rolling both at once) and then a 
5 on the second die (or the right die) is different from rolling a 5 and then a 2. All 36 
outcomes in Table 1.5 are distinct. 

Remark: This remark gets a little ahead of things, but as a precursor to our discussion
of probability in the next chapter, we can ask the question: what is the probability of 
obtaining a sum of 7 when rolling two dice? If we look at Table 1.5, we see that six 
different outcomes yield a sum of 7. They are 1,6; 2,5; 3,4; 4,3; 5,2; 6,1. Since all 
36 possibilities are equally likely (because the probability of any number showing up 
on any roll is the same, namely 1/6), and since six of the possibilities yield the desired 
sum of 7, the probability of rolling a sum of 7 is 6/36 = 1/6 ≈ 16.7%. From the table,
you can quickly verify that 7 is the sum that has the most outcomes corresponding to 
it. So 7 is the most probable sum. We’ll discuss all the various nuances and subtleties 
about probability in the next chapter. For now, the lesson to take away from this is that 
the ability to count things is extremely important in calculating probabilities! * 
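
The counting in this remark can be reproduced with a short Python sketch. It is only an illustration under the same assumption made in the text (all 36 ordered outcomes equally likely); the variable names are ours.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

faces = range(1, 7)
outcomes = list(product(faces, repeat=2))          # the 36 ordered rolls of Table 1.5

# Number of ordered outcomes that produce each possible sum.
sum_counts = Counter(a + b for a, b in outcomes)

# Probability = (favorable outcomes) / (total outcomes), kept as an exact fraction.
p_seven = Fraction(sum_counts[7], len(outcomes))

print(len(outcomes), sum_counts[7], p_seven)
print(sorted(sum_counts.items()))
```

Printing the full tally shows how the number of outcomes rises to a peak at a sum of 7 and then falls off symmetrically.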


Example 2 (Flipping coins): If you flip a coin four times (or equivalently, flip four 
coins), how many different possible ordered outcomes are there? 

Solution: There are two possibilities (Heads or Tails) for what the first coin shows, 
and two for the second, and two for the third, and two for the fourth. So there are 
2 • 2 • 2 • 2 = $2^4$ = 16 possibilities in all. If you want to list them out, they are shown
in Table 1.6. 


HHHH   THHH
HHHT   THHT
HHTH   THTH
HHTT   THTT
HTHH   TTHH
HTHT   TTHT
HTTH   TTTH
HTTT   TTTT


Table 1.6: The 16 possible ordered outcomes for four coin flips. 

We have grouped the various possibilities into two columns according to whether the 
first coin shows a Heads or a Tails. Each column has eight entries, because $2^3 = 8$ is
the number of possible outcomes for three coins. (Just erase the first entry in each
four-coin outcome, and then each column gives the eight possible three-coin outcomes.)
Similarly, it’s easy to see why five coins yield $2^5 = 32$ possible outcomes. We just
need to take all 16 of the four-coin outcomes and tack on an H at the beginning, and 
then take all 16 again and tack on a T at the beginning. This gives 2 • 16 = 32 possible 
five-coin outcomes. 

Remark: As another probability teaser, we can ask: What is the probability of obtaining
exactly two Heads in four coin flips? Looking at Table 1.6, we see that six outcomes
have two Heads. They are HHTT, HTHT, HTTH, THHT, THTH, and TTHH.
Since all 16 possibilities are equally likely (because the probability of either letter 
showing up on any flip is the same, namely 1/2), and since six of the possibilities 
yield the desired outcome of two Heads, the probability of obtaining two Heads is 
6/16 = 3/8 = 37.5%. As with the sum of 7 in the previous example, you can quickly 
verify by looking at Table 1.6 that two Heads is the most likely number. * 
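
As with the dice, the coin counting can be checked by enumeration. The following Python sketch is our own illustration (the names are assumptions); it builds the 16 outcomes of Table 1.6 and tallies them by number of Heads.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2^4 ordered outcomes of four flips, written as strings such as "HTTH".
flips = ["".join(t) for t in product("HT", repeat=4)]

# Number of outcomes for each possible number of Heads (0 through 4).
heads_counts = Counter(s.count("H") for s in flips)

p_two_heads = Fraction(heads_counts[2], len(flips))

print(len(flips), sorted(heads_counts.items()), p_two_heads)
```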


1.4 Ordered sets, repetitions not allowed

In this section we will answer the question: How many different sets of n objects
can be chosen from a given set of N objects, where the order matters and where
repetitions are not allowed? This is the second of the four related scenarios summarized
in Tables 1.11 and 1.12. The present scenario is the upper-right one in the table. In
Section 1.5 we will answer the question for the case where the order doesn’t matter
(and where repetitions are again not allowed). Note that in both of these cases (unlike
in Section 1.3 where repetitions were allowed), we must of course have n ≤ N,
because we can’t use a given object/person more than once.

When dealing with situations where repetitions are not allowed, it is customary 
to talk about committees of people, because repeating a person is of course not 
possible (no cloning allowed!). For example, we might have 13 people, and our 
goal might be to assign four of them to a committee, where the order within the 
committee matters. 

The answer to our initial question above (namely, how many ordered sets of n 
objects can be chosen from a given set of N objects, without replacement?) can be 
obtained quickly with only a slight modification of either of the N! or $N^n$ results
in the preceding two sections (as we’ll see below). But let’s first get a feel for 
the problem by considering a simple example with small numbers (as is our usual 
strategy). Let’s say that we want to choose a committee of two people from a group 
of five people. We’ll assume that the positions on the committee are distinct. For 
example, one of the members might be the president. In other words, the order 
within the pair matters if we’re listing the president first. We’ll present two ways of 
counting the number of ordered pairs. The second method is the one that we’ll be 
able to extend to the general case of an ordered set of n objects chosen from a given 
set of N objects. 


Example (Two chosen from five): How many different ordered pairs of people can 
be chosen from a group of five people? 

First solution: Let the five given people be labeled A, B, C, D, E. We’ll write down all
of the possible ordered pairs of letters, temporarily including repetitions, even though
we can’t actually repeat a person. As we saw in Table 1.3, there are five possibilities
for the first entry, and also five possibilities for the second entry, so we end up with 
the 5 by 5 square of possible pairs shown in Table 1.7. 


[AA]   BA    CA    DA    EA
 AB   [BB]   CB    DB    EB
 AC    BC   [CC]   DC    EC
 AD    BD    CD   [DD]   ED
 AE    BE    CE    DE   [EE]


Table 1.7: Determining the number of ordered pairs chosen from five people. The
five bracketed pairs along the diagonal aren’t allowed.


However, the five pairs with repeated letters (the ones along the diagonal of the
square) aren’t allowed, because the two people on the committee must of course be
different. We therefore end up with $5^2 - 5 = 20$ ordered pairs. So that’s our answer.
More generally, if we want to pick an ordered pair from a group of N people, we can
imagine writing down an N by N square, which yields $N^2$ pairs, and then subtracting
off the N pairs with repeated letters, which leaves us with $N^2 - N$ pairs. Note that this
can be written as N(N - 1).

Second solution: This second method is superior to the first one, partly because it is 
quicker, and partly because it can be generalized easily to larger numbers of people. 
(For example, we might want to pick an ordered group of four people from a group 
of 13 people.) Our strategy will be to pick the two committee members one at a time, 
just as we did at the end of Section 1.2 when we assigned people to seats. 

If we have two seats that need to be filled with the two committee members, then 
there are five possibilities for who goes in the first seat. And then for each of these 
possibilities, there are four possibilities for who goes in the second seat, because there 
are only four people left. So there are 5 • 4 = 20 ways to plop down the two people 
in the two seats. This is exactly the same reasoning as with the N! ways to assign N
people to N seats, except that we’re stopping the assignment process after two seats.
So we have only the product 5 • 4 instead of the product 5 • 4 • 3 • 2 • 1. The number of
ordered pairs we can pick from five people is therefore 5 • 4 = 20, as we found above.

The preceding reasoning generalizes easily to the case where we pick ordered pairs 
from N people. There are N possibilities for who goes in the first seat, and then for 
each of these, there are N - 1 possibilities for who goes in the second seat. The total 
number of possible ordered pairs is therefore N(N - 1). 
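
Both solutions can be mirrored in a few lines of Python. This is a sketch for illustration only, with variable names of our own choosing: the first part builds the full 5 by 5 square and discards the diagonal, and the second part fills the two seats one at a time using the standard-library function `itertools.permutations`.

```python
from itertools import permutations, product

people = "ABCDE"
N = len(people)

# First solution: all N^2 pairs, minus the N pairs that repeat a person.
all_pairs = list(product(people, repeat=2))
allowed = [pair for pair in all_pairs if pair[0] != pair[1]]

# Second solution: N choices for the first seat, then N - 1 for the second.
seated = list(permutations(people, 2))

print(len(all_pairs), len(allowed), len(seated), N * (N - 1))
```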


Let’s now consider the general case where we pick an ordered set of n objects
(without replacement) from a given set of N objects. Equivalently, we’re picking a
committee of n people from a group of N people, where all n positions on the
committee are distinct.

The first method in the above example works for any value of N, provided that
n = 2. However, for larger values of n, it quickly becomes intractable. As in
Section 1.3, this is due to the fact that instead of the nice 2-D square we have in
Table 1.7, we have a 3-D cube in the n = 3 case, and then higher-dimensional objects
for larger values of n. Even if you don’t want to think about things geometrically,
the analogous counting is still difficult, because it is harder to get a handle on the
n-tuples with doubly-counted (or triply-counted, etc.) people. In Table 1.7 it was
clear that we simply needed to subtract off five pairs from the 25 total (or more
generally, N pairs from the $N^2$ total). But in the n = 3 case, it is harder to determine
the number of triplets that need to be subtracted off from the naive answer of $5^3$.
However, see Problem 1.3 if you want to think about how the counting works out.

In contrast with the intractability of the first method above when applied to 
larger values of n, the second method generalizes quickly. If we imagine assigning 
people to n ordered seats, there are N ways to assign a person to the first seat. And 
then for each of these possibilities, there are N - 1 ways to assign a person to the 
second seat (because there are only N - 1 people left). So there are N(N - 1) 
possibilities for the first two seats. And then for each of these possibilities, there are 
N - 2 ways to assign a person to the third seat (because there are only N - 2 people
left). So there are N(N - 1)(N - 2) possibilities for the first three seats. And so on,
until there are N(N - 1) ··· (N - (n - 1)) possibilities for all n seats. The last factor
here is N - (n - 1) because there are only N - (n - 1) people left when choosing the
person for the nth seat, since n - 1 people have already been chosen. Alternatively,
the last factor is N - (n - 1) because that makes there be n factors in the product;
this is certainly true in the simple cases of n = 2 and n = 3.

If we denote by ${}_N P_n$ the number of ordered sets of n objects chosen from N
objects (without repetition), then we can write our result as

$${}_N P_n = N(N-1)(N-2) \cdots (N-(n-1)). \tag{1.5}$$

If we multiply this by 1 in the form of (N - n)!/(N - n)!, we see that the number of
ordered sets of n objects chosen from N objects can be written in the concise form

$${}_N P_n = \frac{N!}{(N-n)!} \qquad \text{(ordered subgroups)} \tag{1.6}$$
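
To see that the truncated product and the factorial form agree, here is a small Python sketch. The function name `ordered_sets` is hypothetical (introduced only for this illustration); `math.factorial` and `math.perm` are standard-library functions, the latter available in Python 3.8 and later.

```python
import math

def ordered_sets(N, n):
    """Number of ordered sets of n objects chosen from N, without repetition.

    Multiplies the n decreasing factors N, N-1, ..., N-(n-1).
    """
    result = 1
    for k in range(n):
        result *= N - k
    return result

# Compare the truncated product with N!/(N-n)! and with the built-in math.perm.
N, n = 13, 4
print(ordered_sets(N, n),
      math.factorial(N) // math.factorial(N - n),
      math.perm(N, n))
```

The loop form avoids computing large factorials that would mostly cancel anyway, which is also the sensible way to do the calculation by hand.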


The ordered sets of n objects chosen from N objects are often called partial permutations
(because we’re permuting a partial set of the N objects) or k-permutations
(because the letter k is often used in place of the n that we’ve been using). Note that
${}_N P_N = N!$ (remember that 0! = 1) of course, because if n = N then we’re forming
an ordered list of all N objects. That is, we’re forming a permutation of all N
objects. So the product in Eq. (1.5) runs from N all the way down to 1.

As mentioned near the beginning of this section, our result for ${}_N P_n$ can be obtained
with a quick modification to the reasoning in either of the preceding two
sections. From Eq. (1.5) we see that the permutation reasoning in Section 1.2 is
modified by simply truncating the product N(N - 1)(N - 2) ··· after n terms, instead
of including all N terms. The modification to Fig. 1.1 is that we stop the
branching at the nth level. The reasoning in Section 1.3 (involving ordered sets but
with repetitions allowed) is modified by simply replacing the $N^n$ product of equal
factors N with the N(N - 1) ··· (N - (n - 1)) product of decreasing factors. The
factors get smaller because at each stage there is one fewer object/person available,
since repetitions aren’t allowed. (These decreasing factors lead to the n ≤ N restriction,
as we noted above.) The modification to Fig. 1.2 is that we decrease the
number of branches by one at each stage (with the restriction n ≤ N).


1.5 Unordered sets, repetitions not allowed 

In the preceding section, we considered committees/subgroups in which the order 
mattered. But what if the order doesn’t matter? For example, how many ways 
can we pick a committee of four people from 13 people, where all members of the 
committee are equivalent? This is the third of the four related scenarios summarized 
in Tables 1.11 and 1.12. The present scenario is the lower-right one in the table. As 
usual, let’s start off with an example involving small numbers. 


Example (Two chosen from five): How many different unordered pairs of people
can be chosen from a group of five people? 

First solution: Let the five given people be labeled A, B, C, D, E. We’ll write down
all of the possible pairs of letters, including repetitions (even though we can’t actually
repeat a person) and including different orderings (even though the order doesn’t
matter). As in Table 1.7, there are five possibilities for the first entry, and also five
possibilities for the second entry, so we end up with the 5 by 5 square of possible pairs 
shown in Table 1.8. 


[AA]   BA    CA    DA    EA
 AB   [BB]   CB    DB    EB
 AC    BC   [CC]   DC    EC
 AD    BD    CD   [DD]   ED
 AE    BE    CE    DE   [EE]


Table 1.8: Determining the number of unordered pairs chosen from five people. The
five bracketed pairs along the diagonal aren’t allowed, and the other pairs are all double counted.


However, as with Table 1.7, the five pairs with repeated letters (the ones along
the diagonal of the square) aren’t allowed, because the two people on the committee
must of course be different. Additionally, since we aren’t concerned with the order
within a given pair, the lower-left triangle of 10 pairs in the table is equivalent to the
upper-right triangle of 10 pairs. These two triangles are shown separated in Table 1.9.
We see that we have counted every pair twice in Table 1.8. For example, AB represents
the same pair as BA, and CE is the same as EC, etc. We therefore have $(5^2 - 5)/2 = 10$
unordered pairs. The subtraction of 5 gets rid of the pairs with repeated letters, and
the division by 2 gets rid of the double counting due to the duplicate triangles.





      BA    CA    DA    EA
AB          CB    DB    EB
AC    BC          DC    EC
AD    BD    CD          ED
AE    BE    CE    DE





Table 1.9: Equivalent sets of unordered pairs of people. 


More generally, if we want to pick an unordered pair from a group of N people, we can
imagine writing down an N by N square, which yields $N^2$ pairs, and then subtracting
the N pairs with repeated letters. This gives $N^2 - N$ pairs. But we must then divide by
2 to get rid of the double counting; for every pair XY there is an equivalent pair YX.
This yields $(N^2 - N)/2$ unordered pairs, which can also be written as N(N - 1)/2.

Second solution: As in the second solution in the example in Section 1.4, we can 
imagine picking the committee members one at a time. And as before, this method 
will generalize quickly to larger numbers of people. If we have two seats that need 
to be filled with the two committee members, there are five possibilities for who goes 
in the first seat. And then for each of these possibilities, there are four possibilities 
for who goes in the second seat, because there are only four people left. So there are 
5 • 4 = 20 ways to plop down the two people in the two seats. However, we double 
counted every pair in this reasoning; we counted the pair XY as distinct from the pair
YX. So we need to divide by 2 since we don’t care about the order. The number of
unordered pairs we can pick from five people is therefore (5 • 4)/2 = 10, as we found
above. 

The preceding reasoning generalizes easily to the case where we pick unordered pairs 
from N people. There are N possibilities for who goes in the first seat, and then for 
each of these, there are N - 1 possibilities for who goes in the second seat. This 
gives N(N - 1) possibilities. But since we don’t care about the order, this reasoning 
double counts every pair. We therefore need to divide by 2, yielding the final result of 
N(N - 1)/2, as we found above.
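
The double counting can be made explicit with a short Python sketch (an illustration with names of our own choosing). It lists the 20 ordered pairs, collapses each one to an unordered set, and compares the result with the standard-library function `itertools.combinations`.

```python
from itertools import combinations, permutations

people = "ABCDE"
N = len(people)

ordered = list(permutations(people, 2))      # XY and YX counted separately
unordered = list(combinations(people, 2))    # each pair listed once

# Collapsing every ordered pair to a set merges XY with YX.
collapsed = {frozenset(pair) for pair in ordered}

print(len(ordered), len(collapsed), len(unordered), N * (N - 1) // 2)
```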


Let’s now consider the general case where we pick an unordered set of n objects
(without replacement) from a given set of N objects. Equivalently, we’re picking a
committee of n people from a group of N people, where all n positions on the
committee are equivalent.

As in Section 1.4, the first method above works for any value of N, provided
that n = 2. But for larger values of n, it again quickly becomes intractable. In
contrast, the second method generalizes easily. From Eq. (1.6) we know that there
are ${}_N P_n = N!/(N-n)!$ ways of assigning people to n ordered seats. However,
this expression counts every unordered n-tuplet n! times, due to the fact that our
permutation result in Eq. (1.3) tells us that there are n! ways to order any group of
n people. In our ${}_N P_n$ counting, we counted all of these groups as distinct. Since
they are not distinct in the present scenario where the order doesn’t matter, we must
divide by n! to get rid of this overcounting. For example, if we’re considering
committees of three people, the six triplets XYZ, XZY, YXZ, YZX, ZXY, ZYX are
distinct according to the ${}_N P_n$ counting. So we must divide by 3! = 6 to get rid of
this overcounting. We therefore arrive at the general result: The number of sets of
n objects that can be chosen from N objects (where the order doesn’t matter, and
where repetitions are not allowed) is

$$\frac{{}_N P_n}{n!} = \frac{N!}{n!\,(N-n)!}. \tag{1.7}$$


This result is commonly denoted by the binomial coefficient $\binom{N}{n}$, which is read as
“N choose n.” We’ll have much more to say about binomial coefficients in Section 1.8.
Another notation for the above result is ${}_N C_n$, where the C stands for
“combinations.” The result in Eq. (1.7) can therefore be written as

$${}_N C_n = \binom{N}{n} = \frac{N!}{n!\,(N-n)!} \qquad \text{(unordered subgroups)} \tag{1.8}$$


For example, the number of ways to pick an unordered committee of four people
from six people is

$$\binom{6}{4} = \frac{6!}{4!\,2!} = 15. \tag{1.9}$$


You should check this result by explicitly listing out the 15 groups of four people. 
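
If you would like to compare your handwritten list against a machine-generated one, the following Python sketch produces the groups. It is an illustration only; the labels A through F for the six people are our assumption, and `math.comb` is a standard-library function (Python 3.8 and later).

```python
import math
from itertools import combinations

people = "ABCDEF"   # hypothetical labels for the six people

# Every unordered committee of four, written as a sorted string such as "ABCE".
committees = ["".join(group) for group in combinations(people, 4)]

print(committees)
print(len(committees), math.comb(6, 4))
```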

Note that because of our definition of 0! = 1 in Section 1.1, Eq. (1.8) is valid
even in the case of n = N, because we have $\binom{N}{N} = N!/(N!\,0!) = 1$. And indeed, there
is only one way to pick N people from N people. You simply pick them all. Another
special case is n = 0. This gives $\binom{N}{0} = N!/(0!\,N!) = 1$. It’s a matter of semantics to
say that there is one way to pick zero people from N people; you simply don’t pick
any of them, and that’s the one way. But we’ll see later on, especially when dealing
with the binomial theorem, that $\binom{N}{0} = 1$ makes perfect sense.

In the end, the only difference between the $\binom{N}{n}$ result in this section (where
the order doesn’t matter) and the ${}_N P_n$ result in Section 1.4 (where the order does
matter) is the division by n! to get rid of the overcounting. Remember that neither
of these results allows repetitions. 


Example (Equal binomial coefficients): We found above that $\binom{6}{4}$ = 6!/(4!2!) = 15.
But note that $\binom{6}{2}$ = 6!/(2!4!) also equals 15. Both $\binom{6}{4}$ and $\binom{6}{2}$ involve the product of
2! and 4! in the denominator, and since the order doesn’t matter in this product, the
result is the same. We also have, for example, $\binom{11}{3} = \binom{11}{8}$. Both of these binomial
coefficients equal 165. In short, any two n’s that add up to N yield the same value of
$\binom{N}{n}$.

(a) Demonstrate this fact mathematically. 

(b) Explain in words why it is true. 

Solution: 

(a) Let the two n values be $n_1$ and $n_2$. If they add up to N, then they must take the
forms of $n_1 = a$ and $n_2 = N - a$, for some value of a. (The above example with
N = 11 was generated by either a = 3 or a = 8.) Our goal is to show that $\binom{N}{n_1}$
equals $\binom{N}{n_2}$. And indeed,

$$\binom{N}{n_1} = \frac{N!}{n_1!\,(N-n_1)!} = \frac{N!}{a!\,(N-a)!},$$
$$\binom{N}{n_2} = \frac{N!}{n_2!\,(N-n_2)!} = \frac{N!}{(N-a)!\,(N-(N-a))!} = \frac{N!}{(N-a)!\,a!}. \tag{1.10}$$

The order of the a! and (N - a)! factors in the denominators doesn’t matter, so
the two results are equal, as desired.

In practice, when calculating $\binom{N}{n}$ by hand or on a calculator, you want to cancel
the larger of the factorials in the denominator. For example, you can quickly
cancel the 8! in both $\binom{11}{3}$ and $\binom{11}{8}$ and write them as (11 • 10 • 9)/(3 • 2 • 1) = 165.

(b) Imagine picking n objects from N objects and then putting them in a box. The
number of ways to do this is $\binom{N}{n}$. But note that you generated two sets of objects
in this process. You generated the n objects in the box, and you also generated
the N - n objects outside the box. There’s nothing special about being inside
the box versus being outside, so you can equivalently consider your process as
a way of picking the group of N - n objects that remain outside the box. Said in
another way, a perfectly reasonable way of picking a committee of n members
is to pick the N - n members who are not on the committee. There is a
one-to-one correspondence between each set of n objects and the complementary
(remaining) set of N - n objects. The number of different sets of n objects is
therefore equal to the number of different sets of N - n objects, as we wanted to
show.


Let’s now mix things up a bit and consider an example involving a committee 
that consists of distinct positions, but with some of the positions being held by more 
than one person. 


Example (Three different titles): From ten people, how many ways can you form a 
committee of seven people consisting of a president, two (equivalent) vice presidents, 
and four (equivalent) regular members? We’ll give four solutions. 


First solution: We can start by picking an ordered set of seven people to sit in seven
seats in a row. There are ${}_{10}P_7$ = 10 • 9 • 8 • 7 • 6 • 5 • 4 ways to do this. Let’s assume
that the president goes in the first seat, the two vice presidents go in the next two, and
the four regular members go in the last four. Then the order in which the two vice
presidents sit doesn’t matter, so ${}_{10}P_7$ overcounts the number of distinct committees
by a factor of 2!. Likewise, the order in which the four regular members sit doesn’t
matter, so ${}_{10}P_7$ overcounts the number of distinct committees by an additional factor
of 4!. The actual number of distinct committees is therefore

$$\frac{{}_{10}P_7}{4!\,2!} = \frac{10 \cdot 9 \cdot 8 \cdot 7 \cdot 6 \cdot 5 \cdot 4}{4! \cdot 2!} = 12{,}600. \tag{1.11}$$


Second solution: There are 10 (or more precisely, $\binom{10}{1}$) ways to pick the president.
And then for each of these possibilities, there are $\binom{9}{2}$ ways to pick the two vice
presidents from the remaining nine people; the order doesn’t matter between these two
people. And then for each scenario of president and vice presidents, there are $\binom{7}{4}$ ways
to pick the four regular members from the remaining seven people; again, the order
doesn’t matter among these four people. The total number of possible committees is
therefore

$$\binom{10}{1}\binom{9}{2}\binom{7}{4} = \frac{10}{1!} \cdot \frac{9 \cdot 8}{2!} \cdot \frac{7 \cdot 6 \cdot 5 \cdot 4}{4!} = 12{,}600. \tag{1.12}$$

We chose not to cancel the factor of 4 in $\binom{7}{4}$ here, so that the agreement with Eq. (1.11)
would be clear.


Third solution: There is no reason why the president has to be picked first, so let’s
instead pick, say, the four regular members first, and then the two vice presidents, and
then the president. (Other orders will work perfectly well too.) There are $\binom{10}{4}$ ways to
pick the four regular members, and then $\binom{6}{2}$ ways to pick the two vice presidents from
the remaining six people, and then $\binom{4}{1}$ ways to pick the president from the remaining
four people. The total number of possible committees is therefore

$$\binom{10}{4}\binom{6}{2}\binom{4}{1} = \frac{10 \cdot 9 \cdot 8 \cdot 7}{4!} \cdot \frac{6 \cdot 5}{2!} \cdot \frac{4}{1!} = 12{,}600. \tag{1.13}$$

We see that the order in which you pick the various subparts of the committee doesn’t
matter. It had better not matter, of course, because the number of possible committees
is a definite number and can’t depend on your method of counting it (assuming your
method is a valid one!). Mathematically, all of the above solutions yield the same
result because all of the calculations have the same product 10 • 9 • 8 • 7 • 6 • 5 • 4 in the
numerator and the same product 1! • 2! • 4! in the denominator.


Fourth solution: We can do the counting in yet another way. We can first pick all
seven members; there are $\binom{10}{7}$ ways to do this. We can then pick the president from
these seven members; there are $\binom{7}{1}$ ways to do this. We can then pick the two vice
presidents from the remaining six members; there are $\binom{6}{2}$ ways to do this. We’re then
stuck with the remaining four members as regular members. The total number of
possible committees is therefore

$$\binom{10}{7}\binom{7}{1}\binom{6}{2} = \frac{10 \cdot 9 \cdot 8}{3!} \cdot \frac{7}{1!} \cdot \frac{6 \cdot 5}{2!} = 12{,}600. \tag{1.14}$$


If we multiply this expression by 4 over 4, then we have all the same factors in the 
numerator and denominator as we had in the previous solutions. 

Of course, after picking the seven members, we could alternatively then pick, say, the 
four regular members from these seven, and then pick the two vice presidents from 
the remaining three. You can verify that this again gives 12,600 possible committees. 
The moral of all the above solutions is that there are usually many different ways to 
count things! 
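
The agreement among the four solutions (and the alternative mentioned at the end of the fourth) is easy to confirm numerically. The Python sketch below is our own illustration; the variable names are assumptions, and `math.comb` and `math.perm` are standard-library functions available in Python 3.8 and later.

```python
from math import comb, factorial, perm

# First solution: seat seven people in order, then remove the overcounting.
first = perm(10, 7) // (factorial(4) * factorial(2))

# Second solution: president, then vice presidents, then regular members.
second = comb(10, 1) * comb(9, 2) * comb(7, 4)

# Third solution: regular members, then vice presidents, then president.
third = comb(10, 4) * comb(6, 2) * comb(4, 1)

# Fourth solution: pick all seven members, then assign titles among them.
fourth = comb(10, 7) * comb(7, 1) * comb(6, 2)

# Alternative: pick all seven, then the four regular members, then the two VPs.
alternative = comb(10, 7) * comb(7, 4) * comb(3, 2)

print(first, second, third, fourth, alternative)
```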


For another example, let’s do some card counting. A standard deck of cards 
consists of 52 cards, with four cards (the four suits) for each of the 13 values: 2, 3,
..., 9, 10, J (Jack), Q (Queen), K (King), A (Ace). There is a nearly endless number
of subgroup-counting examples relevant to the card game of poker. In the following 
example, the ordering will matter in some cases but not in others. 


Example (Full houses): How many different full-house hands are possible in standard
five-card poker? A full house consists of three cards of one value plus two cards
of another. An example is 999QQ. (The suits don’t matter.) 

Solution: Our strategy will be to determine how many hands there are of a given 
type (999QQ is one type; 88833 is another; etc.) and then multiply this result by the 
number of different types. 

If the hand consists of, say, three 9’s and two queens, then there are $\binom{4}{3} = 4$ ways to
choose the three 9’s from the four 9’s (the four suits) in the deck, and likewise $\binom{4}{2} = 6$
ways to choose the two Q’s from the four Q’s in the deck. So there are 4 • 6 = 24
possible full houses of the type 999QQ. Note that we used ${}_4C_3 = \binom{4}{3}$ and ${}_4C_2 = \binom{4}{2}$
here, instead of ${}_4P_3$ and ${}_4P_2$, because the order of the 9’s and the order of the Q’s in
the hand doesn’t matter.

How many different AAABB types are there? There are 13 different values of cards 
in the deck, so there are 13 ways to pick the value that occurs three times. And then 
there are 12 ways to pick the value that occurs twice, from the remaining 12 values. 
So there are 13 • 12 = 156 different types. Note that this result is ${}_{13}P_2$ = 13 • 12, and
not ${}_{13}C_2 = \binom{13}{2}$ = (13 • 12)/2, because the order does matter. Having three 9’s and
two Q’s is different from having three Q’s and two 9’s. The total number of possible
full-house hands is therefore

$$13 \cdot 12 \cdot \binom{4}{3} \cdot \binom{4}{2} = 156 \cdot 24 = 3{,}744. \tag{1.15}$$

This should be compared with the total number of possible poker hands, which is
much larger: $\binom{52}{5} = 2{,}598{,}960$. The 3,744 full-house hands account for only about
0.14% of the total number of hands. Many more examples of counting poker hands
are given in Problem 1.10.
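
The full-house formula can be tested against a brute-force count. Looping over all 2,598,960 hands of a real deck is slow in pure Python, so the sketch below checks the formula on a reduced deck of five values (an assumption made purely to keep the run short) and then evaluates the formula for the standard 13 values. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import comb, perm

def count_full_houses(num_values, num_suits=4):
    """Brute-force count of full houses in a deck of num_values * num_suits cards."""
    deck = [(value, suit) for value in range(num_values) for suit in range(num_suits)]
    count = 0
    for hand in combinations(deck, 5):
        # A full house has exactly two distinct values, appearing 2 and 3 times.
        multiplicities = sorted(Counter(value for value, _ in hand).values())
        if multiplicities == [2, 3]:
            count += 1
    return count

def full_house_formula(num_values, num_suits=4):
    """(ordered choice of triple value and pair value) * (suits for triple) * (suits for pair)."""
    return perm(num_values, 2) * comb(num_suits, 3) * comb(num_suits, 2)

print(count_full_houses(5), full_house_formula(5))   # reduced deck of 20 cards
print(full_house_formula(13), comb(52, 5))           # standard deck
```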

Remark: With regard to the 13 • 12 = 156 number of AAABB types, you can alternatively
arrive at this by first noting that there are $\binom{13}{2}$ = (13 • 12)/2 possibilities for
the two values that appear in the hand, and then realizing that you need to multiply by
2 because each pair of values represents two different types, depending on which of
the two values occurs three times. If poker hands instead consisted of only four cards,
and if a full house were defined to be a hand of the type AABB, then the number
of different types would be $\binom{13}{2}$, because the A’s and B’s are equivalent; each occurs
twice. *


1.6 What we know so far 


In Sections 1.2 through 1.5 we learned how to count various things. Here is a 
summary of the results: 


Section 1.2: Permutations of N objects:

$$N!$$

Section 1.3: Ordered sets (n objects chosen from N), with repetitions allowed:

$$N^n$$

Section 1.4: Ordered sets (n objects chosen from N), with repetitions not
allowed:

$${}_N P_n = \frac{N!}{(N-n)!}$$

Section 1.5: Unordered sets (n objects chosen from N), with repetitions not
allowed:

$${}_N C_n = \binom{N}{n} = \frac{N!}{n!\,(N-n)!}$$


As we derived these results, we commented along the way on how they relate to 
each other. It is instructive to pause for a moment and collect all of these relations 
in one place. They are shown in Fig. 1.3 and summarized as follows. 

If we start with the N! result for permutations, we can obtain the ${}_N P_n$ result
(for subgroups where the order matters) by simply truncating the product
N(N - 1)(N - 2) ··· after n terms instead of including all N terms down to 1. The $\binom{N}{n}$
result (for subgroups where the order doesn’t matter) is then obtained from ${}_N P_n$
by dividing by n! to get rid of the overcounting of the equivalent subgroups with
different orderings.

If we instead start with the $N^n$ result for a set of n objects chosen from N objects
with replacement, we can obtain the ${}_N P_n$ result (where there is no replacement) by
simply changing the product of the n factors N • N • N ··· to the product of the n
factors N(N - 1)(N - 2) ···. Each factor decreases by 1 because there is one fewer
possibility for each pick, since there is no replacement.
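
These relations can be watched in action on a small case. The Python sketch below is an illustration under our own choice of N = 5 and n = 3 (these values and the variable names are assumptions). It starts with all ordered lists with replacement, discards those with a repeated object, and then merges the lists that differ only in ordering.

```python
from itertools import product
from math import comb, factorial, perm

objects = "ABCDE"
N, n = len(objects), 3

# Ordered, with replacement: N^n lists.
with_replacement = list(product(objects, repeat=n))

# Ordered, without replacement: keep only the lists with n distinct objects.
no_repeats = [t for t in with_replacement if len(set(t)) == n]

# Unordered, without replacement: merge lists containing the same objects.
unordered = {frozenset(t) for t in no_repeats}

print(len(with_replacement), N ** n)
print(len(no_repeats), perm(N, n))
print(len(unordered), comb(N, n), len(no_repeats) // factorial(n))
```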


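[Figure 1.3: the N! result (permutations) shown alongside a righthand column containing the $N^n$, ${}_N P_n$, and $\binom{N}{n}$ results.]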



1.7 Unordered sets, repetitions allowed 

In Fig. 1.3, the N! result for permutations is somewhat of a different result from the
other three (the ones in the righthand column), in that these three involve picking a
general number n of objects from N objects. Permutations involve the special case
where n = N. So let’s concentrate on the three results in the righthand column.
Since there are two possibilities with regard to replacement (we can replace things
or not), and also two possibilities with regard to the order mattering (it matters or
it doesn’t), there are 2 • 2 = 4 possible scenarios when picking a general number
n of objects from N objects. One scenario is therefore missing in the righthand
column. This missing scenario is the one where we pick n objects from N objects,
with replacement, and with the order not mattering. This is indicated in Table 1.10.



                        with replacement    without replacement

order matters                 N^n               N!/(N-n)!

order doesn’t matter           ?                N!/(n!(N-n)!)


Table 1.10: The missing result: unordered sets with repetitions allowed. 

The missing scenario indicated by the question mark doesn’t come up as often 
as the other three when solving standard problems in combinatorics and probability. 
And it also doesn’t relate to the other three as simply as they relate to each other in 
Fig. 1.3. But it certainly does have its uses, so let’s try to figure it out. We can’t just 
let the question mark sit there, after all! An everyday example of this scenario is 
the game of Yahtzee™, where five dice are rolled in a group. The order of the dice 
doesn’t matter, so the setup is equivalent to drawing n = 5 balls from a box (with
replacement, and with the order not mattering), with the balls being labeled with the
N = 6 numbers 1 through 6.

Before determining what the question mark in Table 1.10 is, let’s do a few examples
to get a feel for things. These examples will allow us to see a pattern and
make a conjecture.


Example 1 (n = 4 chosen from N = 3): Pick n = 4 letters (with replacement, and 
with the order not mattering) from a hat containing N = 3 letters: A, B, C. How many 
different sets of four letters are possible? It turns out that there are only four different 
basic types of sets, so let’s list them out. 

• All four letters can be the same, for example AAAA. There are three sets of this 
type, because the common letter can be A, B, or C. 

• We can have three of one letter and one of another, for example AAAB. (Remember
that the order doesn’t matter, so AAAB, AABA, ABAA, and BAAA
are all equivalent.) There are 3 · 2 = 6 sets of this type, because there are three
choices for the letter that appears three times, and then for each of these choices
there are two choices for the letter that appears once.

• We can have two of one letter and two of another, for example AABB. There are
three sets of this type, because there are $\binom{3}{2} = 3$ ways to choose the two letters
that appear. Note that there are only $\binom{3}{2}$ ways, and not 3 · 2 = 6 ways as there
are for the AAAB type, because two A’s and two B’s are the same as two B’s
and two A’s.

• We can have two of one letter, one of a second, and one of the third; for example
AABC. There are three sets of this type, because there are three ways to choose
the letter that appears twice.

We can summarize the above results for the numbers of the different types of sets:

AAAA    AAAB    AABB    AABC

  3       6       3       3

The total number of ways to pick four letters from a set of three letters (with replacement,
and with the order not mattering) is therefore 3 + 6 + 3 + 3 = 15. Note that 15
can be written as $\binom{6}{4}$. Be careful not to confuse the actual number of different sets (15
here) with the number of different types of sets (4 here).

Remark: In the above discussion, we counted the number of different possible sets. 
We weren’t concerned with the probability of actually obtaining a given set. Although 
we won’t tackle probability until Chapter 2, we'll make one comment here. 

Consider the set with, say, three C’s and one A, which we label as CCCA; remember 
that the order of the letters doesn’t matter. And consider another set with, say, four 
B’s, which we label as BBBB. Each of these sets is one of the 15 sets that we counted 
above. As far as counting the possible sets goes, these two sets count equally. However,
if we’re concerned with the probability of obtaining a given set, then we must
take into account the fact that while four B’s can occur in only one way, three C’s and 
one A can occur in four different ways, namely CCCA, CCAC, CACC, and ACCC. 
Three C’s and one A are therefore four times as likely to occur as four B's. (We’re 
assuming that the three letters A, B, C are equally likely to be drawn on each of the 
four draws.) 

Note that when we list out each of the ordered sets (such as CCCA, CCAC, CACC,
and ACCC) associated with a particular unordered set, we are now in the realm of the
$N^n$ result in Eq. (1.4). Each of the four ordered sets just mentioned counts equally
as one set in the total number of $N^n = 3^4 = 81$ ordered sets. For more on how this
example relates to the $N^n$ result in Eq. (1.4), see the last example in this section.

But to emphasize, in the present section this difference in probabilities is irrelevant. 
We are simply counting the number of different unordered sets, paying no attention 
to the actual probability of each set. We'll have plenty to say about probability in 
Chapter 2. ♣


Example 2 (n = 3 chosen from N = 4): Pick n = 3 letters (with replacement, and 
with the order not mattering) from a hat containing N = 4 letters: A, B, C, D. There 
are now only three different basic types of sets. We’ll just list them out, along with the 
number of each: 


AAA AAB ABC 

4 12 4 

You can verify that the three numbers here are correct. For example, there are 12 sets 
of the AAB type, because there are four ways to choose the letter that appears twice, 
and then three ways to choose the letter that appears once. Remember that there is a 
fourth letter D now. 

The total number of ways to pick three letters from a set of four letters (with replacement,
and with the order not mattering) is therefore 4 + 12 + 4 = 20. Note that 20 can
be written as $\binom{6}{3}$.


Example 3 (n = 5 chosen from N = 3): Pick n = 5 letters (with replacement, and
with the order not mattering) from a hat containing N = 3 letters: A, B, C. There are
now five different basic types of sets, and we have:

AAAAA AAAAB AAABB AAABC AABBC 

3 6 6 3 3 

Again you can verify that the five numbers here are correct. For example, there are six 
sets of the AAABB type, because there are three ways to choose the letter that appears 
three times, and then two ways to choose the letter that appears twice. And there are 
three sets of the AABBC type, because there are three ways to choose the letter that 
appears once. 

The total number of ways to pick five letters from a set of three letters (with replacement,
and with the order not mattering) is therefore 3 + 6 + 6 + 3 + 3 = 21. Note that
21 can be written as $\binom{7}{2}$.


The above results suggest that the answer to our problem (namely, how many
ways are there to pick n objects from N objects, with replacement, and with the
order not mattering?) most likely involves binomial coefficients. And a little trial
and error shows that the above three results of $\binom{6}{4}$, $\binom{6}{3}$, and $\binom{7}{2}$ are all consistent
with the expression $\binom{n + (N - 1)}{N - 1}$. We’ll now explain why this is indeed the general
form of the answer.
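The three examples above can also be checked by brute force. The following is a minimal sketch, assuming Python's standard `itertools.combinations_with_replacement` as the way to list unordered picks with repeats; the function name is an illustrative choice. It compares each listed count with the conjectured binomial coefficient.

```python
from itertools import combinations_with_replacement
from math import comb

def count_unordered_with_replacement(letters: str, n: int) -> int:
    """Count unordered picks of n letters, repeats allowed, by listing them."""
    return sum(1 for _ in combinations_with_replacement(letters, n))

# The three examples: (N = 3, n = 4), (N = 4, n = 3), (N = 3, n = 5).
for letters, n in [("ABC", 4), ("ABCD", 3), ("ABC", 5)]:
    N = len(letters)
    listed = count_unordered_with_replacement(letters, n)
    conjectured = comb(n + N - 1, N - 1)
    print(N, n, listed, conjectured)
```

Agreement in a handful of cases is of course only evidence for the conjecture, not a proof.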

In the above examples, we concentrated on the different basic types of sets. 
However, although this gave us enough information to make an educated guess for 
the general result, this method becomes intractable when dealing with large values 
of n and N. We therefore need to think about the counting in a different way. This 
new way of counting is the following. 

The different sets (of which there are $\binom{n + (N - 1)}{N - 1}$, but let’s pretend we don’t know
this yet) are characterized by how many A’s, B’s, etc., there are. And since the order
of the letters doesn’t matter, we might as well list out the n letters by putting all the
A’s first, and then all the B’s, and so on. We can therefore imagine putting n objects
in a row and labeling them with various letters in alphabetical order. (We’ll write
these objects as stars, for a reason we’ll get to below.) As a concrete example, let’s
work with n = 6 and N = 3. So we’re picking six letters from A, B, C. Two possible
sets are shown in Fig. 1.4. The second set happens to have no A’s.


A  A  A  B  C  C
★  ★  ★  ★  ★  ★

B  B  B  B  B  C
★  ★  ★  ★  ★  ★


Figure 1.4: Two possible unordered sets of n = 6 objects chosen with replacement from
N = 3 objects.

In writing down an arbitrary set of letters, the decision of how many of each
letter to include is equivalent to the decision of where to put the transitions between
the letters. If these transitions are represented by vertical dividing lines, then the
two sets in Fig. 1.4 can be represented by the two configurations in Fig. 1.5.

★ ★ ★ | ★ | ★ ★

| ★ ★ ★ ★ ★ | ★

Figure 1.5: Separating the different letters in Fig. 1.4.

The number of stars before the first dividing line is the number of A’s (which is zero 
in the second configuration), the number of stars between the two dividing lines 
is the number of B’s, and the number of stars after the second dividing line is the 
number of C’s. Each different placement of the dividing lines produces a different 
set of letters. So our task reduces to determining how many different ways there are 
to plop down the dividing lines. 

In the present setup with n = 6 and N = 3, we have six stars and two dividing
lines, so we have eight things in all. We can therefore imagine eight spaces lined up
that need to be filled, in one way or another, with six stars and two bars.² The two
configurations in Fig. 1.5 then become the two shown in Fig. 1.6.


★ ★ ★ | ★ | ★ ★

| ★ ★ ★ ★ ★ | ★


Figure 1.6: The stars-and-bars representations of the two sets in Fig. 1.4.

How many different ways can we plop down the stars and bars? The answer is
just $\binom{8}{2} = 28$, because we simply need to pick two of the eight spaces as the ones
where the bars go. Equivalently, the number of ways is $\binom{8}{6} = 28$, because we need
to pick six of the eight spaces as the ones where the stars go. As an exercise, you
can verify this result of 28 by explicitly counting the different sets, as we did in the
above examples.
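The stars-and-bars counting can be sketched directly: choose which spaces hold the bars, fill the rest with stars, and read off the letter counts. This is an illustrative sketch, not a prescribed method. The function names are hypothetical, and an ASCII `*` stands in for each star.

```python
from itertools import combinations

def stars_and_bars(n: int, N: int):
    """Yield every row of n stars and N - 1 bars as a string."""
    slots = n + (N - 1)
    for bar_slots in combinations(range(slots), N - 1):
        yield "".join("|" if i in bar_slots else "*" for i in range(slots))

def letter_counts(row: str) -> list:
    """Number of stars in each region separated by the bars (A's, B's, C's, ...)."""
    return [len(region) for region in row.split("|")]

rows = list(stars_and_bars(6, 3))
print(len(rows))                    # number of configurations for n = 6, N = 3
print(letter_counts("***|*|**"))    # the AAABCC set
print(letter_counts("|*****|*"))    # the BBBBBC set
```

Each row gives a different list of letter counts, so counting rows is the same as counting unordered sets.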

We now see where the $\binom{n + (N - 1)}{N - 1}$ result comes from. If we have N different
letters, then we have N - 1 bars signifying the transitions between them. If we’re
picking n letters (with replacement, and with the order not mattering), then we have
n stars and N - 1 bars that need to be plopped down in a total of n + (N - 1) spaces
arranged in a line. There are $\binom{n + (N - 1)}{N - 1}$, or equivalently $\binom{n + (N - 1)}{n}$, ways to do this.

For example, if we’re picking n = 6 letters (with replacement, and with the
order not mattering) from N = 4 letters A, B, C, D, then there are $\binom{9}{3} = 84$ ways to
do this. The ABBBDD set, for example, is represented by the configuration of six
stars and three bars shown in Fig. 1.7.

★ | ★ ★ ★ | | ★ ★

Figure 1.7: The stars-and-bars representation of the ABBBDD set for n = 6 and N = 4.

² In keeping with the normal convention, we’ll refer to the dividing lines as bars. The objects that
we’re placing down are then stars and bars, and who could possibly dislike a rhyming name like that?


If we let ${}_N U_n$ (with the U standing for “unordered”) denote the number of ways
to pick n objects from N objects (with replacement, and with the order not mattering),
we can write our result as


${}_N U_n = \binom{n + (N - 1)}{N - 1}$    (unordered, with repetition)    (1.16)


Remember that it is N (the number of distinct letters, or whatever) that has the 1
subtracted from it in the $\binom{n + (N - 1)}{N - 1}$ expression, and not n (the number of picks you
make). We can now fill in the missing result in Table 1.10, as shown in Table 1.11.



                          with replacement                                      without replacement

order matters             $N^n$                                                 $\frac{N!}{(N-n)!}$

order doesn’t matter      $\frac{(n+N-1)!}{n!(N-1)!} = \binom{n+N-1}{N-1}$      $\frac{N!}{n!(N-n)!} = \binom{N}{n}$


Table 1.11: Filling in the missing result in Table 1.10.

Let’s quickly verify that Eq. (1.16) holds in two simple cases. 

• If n is arbitrary and N = 1, then Eq. (1.16) gives ${}_1 U_n = \binom{n}{0} = 1$. This is
correct, because if there is only N = 1 possible result (call it A) for each of
the n picks, then there is only one combined result for all n picks, namely
AAAA....

• If n = 1 and N is arbitrary, then Eq. (1.16) gives ${}_N U_1 = \binom{N}{N-1} = N$. This is
correct, because if there are N possible results (call them $A_1, A_2, \ldots, A_N$) for
each pick, then there are simply N possible results for the n = 1 pick, namely
$A_1$ or $A_2$ or ... $A_N$.


See Problems 1.11 and 1.12 for two other simple cases, namely n = 2 with
arbitrary N, and arbitrary n with N = 2. See also Problem 1.13 for an alternative
proof of Eq. (1.16) that doesn’t use the “stars and bars” reasoning.

Let’s now do an example that might seem like a non sequitur at first, but in fact
is essentially the same as the “with replacement, order doesn’t matter” case that
we’ve been discussing in this section.

Example (Dividing the money): You have ten one-dollar bills that you want to divide
among four people. How many different ways can you do this? For example, if the
four people are seated in a line, they might receive amounts of 4,0,3,3, or 1,6,2,1, or
0,10,0,0, etc. The dollar bills are identical, but the people are not. So, for example,
4,0,3,3 is different from 3,4,0,3.


Solution: This setup is equivalent to a setup where we draw n = 10 letters (with 
replacement, and with the order not mattering) from a box containing N = 4 letters A, 
B, C, D. This equivalence can be seen as follows. Label the four people as A, B, C, 
D. Reach into the box and pull out a letter, and then give a dollar to the person labeled 
with that letter. Replace the letter, and then pick a second letter and give a dollar to 
that person. And so on, a total of ten times. This process will generate a string of 
letters, for example, CBBDCBDBAD. 


Now, since it doesn't matter when you give a particular dollar bill to a person, the 
order of the letters in the string is irrelevant. The above string is therefore equivalent 
to ABBBBCCDDD, if we arbitrarily list the letters in alphabetical order. The four 
people A, B, C, D receive, respectively, 1, 4, 2, 3 dollars. There is a one-to-one 
correspondence between unordered strings of letters and the partitions of the dollar 
bills. So the desired number of partitions is equal to the number of unordered strings 
of n = 10 letters chosen from N = 4 letters, which we know from Eq. (1.16) is equal
to


$\binom{10 + (4 - 1)}{4 - 1} = \binom{13}{3} = 286.$    (1.17)


We have now seen two equivalent scenarios where the stars-and-bars result in
Eq. (1.16) applies: unordered strings of letters/objects, and money partitions. Another
equivalent scenario is the number of ways to throw n identical balls into N
boxes. This is equivalent to the money-partition scenario, because in that setup
we’re basically throwing n dollar bills at N people. Yet another equivalent scenario
is the number of ways that N non-negative integers can add up to n. For example, if
N = 4 and n = 10, then one way is 2 + 2 + 4 + 2 = 10. Another is 3 + 0 + 6 + 1 = 10.
And yet another is 0 + 3 + 6 + 1 = 10. (We’ll assume that the order of the numbers
matters.) This is equivalent to the money-partition scenario, because the four numbers
in the preceding sums correspond to the amounts of money that the four people
get.
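The "N non-negative integers adding up to n" version lends itself to a direct brute-force check. The sketch below (the function name is an illustrative choice) simply tries every ordered N-tuple of amounts between 0 and n and keeps the ones with the right total, then compares with the stars-and-bars count.

```python
from itertools import product
from math import comb

def count_divisions(n: int, N: int) -> int:
    """Count ordered N-tuples of non-negative integers that sum to n."""
    return sum(1 for amounts in product(range(n + 1), repeat=N)
               if sum(amounts) == n)

print(count_divisions(10, 4))        # ten dollar bills, four people
print(comb(10 + (4 - 1), 4 - 1))     # stars-and-bars count for comparison
```

This listing approach is only practical for small n and N, which is exactly why the closed-form result is useful.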

The common underlying process in all of these equivalent scenarios is that we’re
always effectively just throwing down n identical objects onto N spaces. In our
original setup with strings of letters, you can imagine throwing n darts at N letters;
the number of times a letter gets hit is the number of times we write it down.
(Picking a letter randomly from a box is equivalent to throwing a dart randomly at
the letters in the box.) In the case of N non-negative integers adding up to n, you
can imagine throwing n darts at N spaces in a line; the number of times a space gets
hit is the number that we write down in that space when forming the sum.

Let’s now check that the two entries in the left column in Table 1.11 relate
properly, at least in one particular case. The $N^n$ result in the table holds for ordered
sets (with replacement), while the $\binom{n + (N - 1)}{N - 1}$ result holds for unordered sets (again
with replacement). We should be able to extract the former from the latter if we
consider how many different ways we can order each of our unordered sets. This is
demonstrated in the following example for n = 4 and N = 3.


Example (Reproducing $N^n$): The $N^n$ result tells us that there are $3^4 = 81$ different
ordered sets that are possible when drawing n = 4 objects (with replacement) from a
hat containing N = 3 objects. Let’s reproduce this result by considering the number of
possible orderings of each of the four different basic types of unordered sets (AAAA,
AAAB, AABB, AABC) in Example 1 at the beginning of this section. This is clearly
a laborious way of producing the simple $3^4 = 81$ result for ordered sets, but it can’t
hurt to check that it does indeed work.

Solution: From Example 1, we know that the numbers of unordered sets of the four 
types (AAAA, AAAB, AABB, AABC) are, respectively, 3, 6, 3, and 3. 

Consider the first type, where all the letters are the same. For a given unordered set of
this type, there is only one way to order the letters, since they are all the same. So the
total number of ordered sets associated with the three unordered sets of the AAAA
type is simply 3 · 1 = 3.

Now consider the second type of set, with three of one letter and one of another. For a
given unordered set of this type, there are four ways to order the letters (for example,
AAAB, AABA, ABAA, BAAA). So the total number of ordered sets associated with
the six unordered sets of the AAAB type is 6 · 4 = 24.

Now consider the third type of set, with two of one letter and two of another. For a
given unordered set of this type, there are six ways to order the letters (for example,
AABB, ABAB, ABBA, BAAB, BABA, BBAA). So the total number of ordered sets
associated with the three unordered sets of the AABB type is 3 · 6 = 18.

Finally, consider the fourth type of set, with two of one letter, one of another, and one
of the third. For a given unordered set of this type, there are 12 ways to order the
letters. (We won’t list them out, but for AABC there are four places to put the B, and
then three places to put the C. Alternatively, there are $\binom{4}{2} = 6$ ways to assign the A’s,
and then two possible ways to order B and C.) So the total number of ordered sets
associated with the three unordered sets of the AABC type is 3 · 12 = 36.

The complete total of the number of ordered sets involving n = 4 letters chosen from
N = 3 letters (with replacement) is therefore 3 + 24 + 18 + 36 = 81, as desired.
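The bookkeeping in this example can be mirrored in code. The sketch below lists every unordered set of four letters from A, B, C, counts its distinct orderings by brute force, and groups the totals by type. Labeling a type by its pattern of repeat counts, such as (2, 1, 1) for AABC, is a convention chosen here for illustration. The names are likewise hypothetical.

```python
from collections import Counter
from itertools import combinations_with_replacement, permutations

def count_orderings(unordered_set) -> int:
    """Number of distinct ways to order the letters, found by listing them."""
    return len(set(permutations(unordered_set)))

ordered_by_type = Counter()
for letters in combinations_with_replacement("ABC", 4):
    # The type is the pattern of repeat counts, e.g. (2, 1, 1) for AABC.
    set_type = tuple(sorted(Counter(letters).values(), reverse=True))
    ordered_by_type[set_type] += count_orderings(letters)

for set_type, ordered_count in ordered_by_type.items():
    print(set_type, ordered_count)
print(sum(ordered_by_type.values()))   # total number of ordered sets
```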


As mentioned at the beginning of this section, the case of unordered sets with 
replacement doesn’t come up as often in standard probability setups as the other 
three cases in Table 1.11. One reason for this is that the sets in each of the other 
three cases are all equally likely (assuming that all of the objects in a box at a given 
time are equally likely to be drawn), whereas the sets in the case of unordered draws 
with replacement are not all equally likely (as we noted in the remark in the first 
example in this section). Counting sets in the latter case is therefore not as useful in 
probability as it is in the other three cases. 

As we noted at the beginning of Section 1.3, the phrase “with replacement” 
(or “with repetition”) means the same thing as “identical trials.” Examples include 
rolling dice, flipping coins, and drawing balls from a box with replacement. In
contrast, “without replacement” means the same thing as “depleting trials,” that
is, trials where the number of possible outcomes decreases by 1 after each trial. 
Examples include picking committees of people (because a given person can’t be 
repeated, of course), assigning people to seats, and drawing balls from a box without 
replacement. 

As far as the “order matters” and “order doesn’t matter” descriptors in Table 1.11
go, you can simply imagine the ordered objects appearing in a line, and the unordered
objects appearing in an amorphous group, or blob. We can therefore describe
the four possibilities in Table 1.11 with the four phrases given in Table 1.12.
A standard example of each case is given in Table 1.13.



                          with replacement                without replacement

order matters             identical trials, in a line     depleting trials, in a line

order doesn’t matter      identical trials, in a blob     depleting trials, in a blob


Table 1.12: Descriptions of the four possibilities with regard to ordering and replacement.



                          with replacement          without replacement

order matters             ordered dice rolls        committees with distinct assignments

order doesn’t matter      unordered dice rolls      committees with equivalent members


Table 1.13: Examples of the four possibilities with regard to ordering and replacement.


1.8 Binomial coefficients 

1.8.1 Coins and Pascal’s triangle 

Let’s look at the coin-flipping example at the end of Section 1.3 in more detail. We
found that with four coins there are six different ways to obtain exactly two Heads.
How many ways are there to obtain other numbers of Heads? From Table 1.6, we
see that the numbers of ways to obtain exactly zero, one, two, three, or four Heads
are, respectively, 1, 4, 6, 4, 1. (These same numbers are relevant for Tails too, of
course.) The sum of these numbers correctly equals the total number of possibilities,
which is $2^4 = 16$ from Eq. (1.4) with N = 2 and n = 4.

If we instead consider only three coins, the $2^3 = 8$ possibilities are obtained by
taking either column in Table 1.6 and removing the first letter. We quickly see that
the numbers of ways to obtain exactly zero, one, two, or three Heads are 1, 3, 3, 1.
With two coins, the numbers for zero, one, or two Heads are 1, 2, 1. And for one
coin, the numbers for zero or one Heads are just 1, 1. Also, for zero coins, you can
only obtain zero Heads, and there’s just one way to do this; you simply don’t list
anything down, and that’s that. This is somewhat a matter of semantics, but if we
use a “1” for this case, it will fit in nicely with the rest of the results below.

Note that for three coins, $1 + 3 + 3 + 1 = 2^3$. And for two coins, $1 + 2 + 1 = 2^2$.
And for one coin, 1 + 1 = 2. So in each case the total number of possibilities for n
flips ends up being $2^n$, consistent with Eq. (1.4).

We can collect the above results and arrange them as shown in Table 1.14. Each 
row lists the number of different ways we can obtain the various possible numbers 
of Heads. These numbers range from 0 to n. 


n = 0:           1
n = 1:         1   1
n = 2:       1   2   1
n = 3:     1   3   3   1
n = 4:   1   4   6   4   1


Table 1.14: Pascal’s triangle up to n = 4. 


The arrangement in Table 1.14 is known as Pascal’s triangle (up to n = 4). Do these
numbers look familiar? A couple more rows might help. If you figure things out
for the n = 5 and n = 6 coin-flipping cases by explicitly listing out the possibilities,
you will arrive at Table 1.15.


n = 0:               1
n = 1:             1   1
n = 2:           1   2   1
n = 3:         1   3   3   1
n = 4:       1   4   6   4   1
n = 5:     1   5  10  10   5   1
n = 6:   1   6  15  20  15   6   1


Table 1.15: Pascal’s triangle up to n = 6.


At this point, you might be getting a feeling of déjà vu with the 10’s and 15’s,
since we’ve seen them before at various times in this chapter. You might then make
the (correct) guess that the entries in Table 1.15 are nothing other than the binomial
coefficients! We defined these coefficients in Section 1.5 as the number of ways of
picking unordered subgroups; see Eq. (1.8). Written out explicitly in terms of the
binomial coefficients, Table 1.15 becomes Table 1.16.


n = 0:   $\binom{0}{0}$
n = 1:   $\binom{1}{0}$  $\binom{1}{1}$
n = 2:   $\binom{2}{0}$  $\binom{2}{1}$  $\binom{2}{2}$
n = 3:   $\binom{3}{0}$  $\binom{3}{1}$  $\binom{3}{2}$  $\binom{3}{3}$
n = 4:   $\binom{4}{0}$  $\binom{4}{1}$  $\binom{4}{2}$  $\binom{4}{3}$  $\binom{4}{4}$
n = 5:   $\binom{5}{0}$  $\binom{5}{1}$  $\binom{5}{2}$  $\binom{5}{3}$  $\binom{5}{4}$  $\binom{5}{5}$
n = 6:   $\binom{6}{0}$  $\binom{6}{1}$  $\binom{6}{2}$  $\binom{6}{3}$  $\binom{6}{4}$  $\binom{6}{5}$  $\binom{6}{6}$


Table 1.16: Binomial coefficients up to n = 6.


Now, observing a pattern and guessing the correct rule is most of the battle. But
is there a way to prove that the entries in Table 1.16 (which are the numbers of ways
of obtaining the various numbers of Heads) are in fact equal to the binomial coefficients
(which we defined as the numbers of ways of picking unordered subgroups)?
For example, can we demonstrate that the number of ways of obtaining two Heads
in six coin flips is $\binom{6}{2}$? Indeed we can. It’s actually almost a matter of definition, as
the following reasoning shows.

If we flip six coins, we can imagine having six blank spaces on the paper that 
need to be filled with H’s and T’s. If we’re considering the scenarios where two 
Heads come up, then we need to fill two of the blanks with H’s and four of them 
with T’s. So the question reduces to: How many different ways can we plop down 
two H’s in six possible spots? But this is exactly the same question as: How many 
different (unordered) committees of two people can we form from six people? The 
equivalence of these two questions is made clear if we imagine six people sitting in 
a row, and we plop down an H on two of them, with the understanding that the two 
people who get tagged with an H are the two people on the committee. 

In general, the $\binom{n}{k}$ ways that k Heads can come up in n flips of a coin correspond
exactly to the $\binom{n}{k}$ committees of k people that can be chosen from n people. Each
coin flip corresponds to a person, and that person is declared to be on the committee
if the result of the coin flip is Heads.
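The coin-flip rows of Pascal's triangle can be generated by explicitly listing out the possibilities, as suggested above. In this minimal sketch (the function name is an illustrative choice), every sequence of n flips is listed and tallied by its number of Heads, and each row is compared with the binomial coefficients from `math.comb`.

```python
from collections import Counter
from itertools import product
from math import comb

def heads_distribution(n: int) -> list:
    """For k = 0..n, how many of the 2**n flip sequences have exactly k Heads."""
    tally = Counter(seq.count("H") for seq in product("HT", repeat=n))
    return [tally[k] for k in range(n + 1)]

for n in range(7):
    row = heads_distribution(n)
    matches = row == [comb(n, k) for k in range(n + 1)]
    print(n, row, matches)
```

For n = 0 there is exactly one (empty) sequence, which is the "1" at the top of the triangle.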


1.8.2 $(a + b)^n$ and Pascal’s triangle


A quick examination of Pascal’s triangle in Table 1.16 shows (as we observed
above) that the sum of the numbers in a given row equals $2^n$. For example,

$\binom{4}{0} + \binom{4}{1} + \binom{4}{2} + \binom{4}{3} + \binom{4}{4} = 2^4,$    (1.18)

or more generally,

$\binom{n}{0} + \binom{n}{1} + \binom{n}{2} + \cdots + \binom{n}{n-1} + \binom{n}{n} = 2^n.$    (1.19)

We know that this relation must be true, because both sides represent the total number
of possible outcomes for n coin flips (with the lefthand side enumerated according
to how many Heads appear). But is there a way to demonstrate this equality
without invoking the fact that both sides are relevant to coin flips? Indeed there is.
We’ll give the proof in Section 1.8.3, but first we need some background.

Consider the quantity $(a + b)^n$. You can quickly show that $(a + b)^2 = a^2 + 2ab + b^2$.
And then you can multiply this by (a + b) to arrive at $(a + b)^3 = a^3 + 3a^2 b + 3ab^2 + b^3$.
And then you can multiply by (a + b) again to obtain the expression for
$(a + b)^4$, and so on. The results are shown in Table 1.17.


$(a + b)^1 = a + b$
$(a + b)^2 = a^2 + 2ab + b^2$
$(a + b)^3 = a^3 + 3a^2 b + 3ab^2 + b^3$
$(a + b)^4 = a^4 + 4a^3 b + 6a^2 b^2 + 4ab^3 + b^4$
$(a + b)^5 = a^5 + 5a^4 b + 10a^3 b^2 + 10a^2 b^3 + 5ab^4 + b^5$
$(a + b)^6 = a^6 + 6a^5 b + 15a^4 b^2 + 20a^3 b^3 + 15a^2 b^4 + 6ab^5 + b^6$


Table 1.17: Binomial expansion up to n = 6.


The coefficients here are exactly the numbers in Table 1.15! And there is a very
good reason for this. Consider, for example, $(a + b)^5$. This is shorthand for

(a + b)(a + b)(a + b)(a + b)(a + b).    (1.20)

In multiplying this out, we obtain a number of terms; 32 of them in fact, although
many take the same form. There are 32 terms because in multiplying out the five
factors of (a + b), every term in the result will involve either the a or the b from
the first (a + b) factor, and similarly either the a or the b from the second (a + b)
factor, and so on with the third, fourth, and fifth (a + b) factors. Since there are two
possibilities (the a or the b) for each factor, we end up with $2^5 = 32$ different terms.

However, many of the terms are equivalent. For example, if we pick the a from
the first and third factors, and the b from the second, fourth, and fifth factors, then we
obtain ababb, which equals $a^2 b^3$. Alternatively, we can pick, say, the a from the
second and fifth factors, and the b from the first, third, and fourth factors, which gives
babba, which also equals $a^2 b^3$.

How many ways can we obtain an $a^2 b^3$ product? Well, we have five choices
(the five (a + b) factors) of where to pick the three b’s from (or equivalently five
choices of where to pick the two a’s from). So the number of ways to obtain an
$a^2 b^3$ product is $\binom{5}{3} = 10$ (or equivalently $\binom{5}{2} = 10$), in agreement with Table 1.17.
This reasoning makes it clear why the coefficients of the terms in the expansion of
$(a + b)^n$ take the general form of $\binom{n}{k}$, where k is the power of b in a given term. (It
also works if k is the power of a.)

In general, just as with the coin flips in Section 1.8.1, the $\binom{n}{k}$ ways that k b’s
can be chosen from the n factors of (a + b) correspond exactly to the $\binom{n}{k}$ committees
of k people that can be chosen from n people. Each factor of (a + b) corresponds to
a person, and that person is declared to be on the committee if the b is chosen from
that factor.
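To make the "which factors supply the b" picture concrete, the following sketch lists every product that takes b from exactly k of the n factors. Writing each product as a string such as ababb follows the text; the function name is a hypothetical choice.

```python
from itertools import combinations

def terms_with_k_bs(n: int, k: int) -> list:
    """List each product (e.g. 'ababb') that takes b from exactly k of the n factors."""
    terms = []
    for b_factors in combinations(range(n), k):
        # b_factors holds the positions of the factors from which b is chosen.
        terms.append("".join("b" if i in b_factors else "a" for i in range(n)))
    return terms

terms = terms_with_k_bs(5, 3)
print(len(terms))                            # coefficient of a^2 b^3 in (a + b)^5
print("ababb" in terms, "babba" in terms)    # the two products mentioned in the text
```

Choosing the positions of the b's is the same act as choosing committee members, which is why `combinations` does the job in both settings.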

To sum up, we have encountered three situations (committees, coins, and
$(a + b)^n$) that involve binomial coefficients. And they all involve binomial coefficients
for the same reason: they all deal with the number of ways that k things can be
chosen from n things (unordered, and without repetition). The answer to all three
of the following questions is the binomial coefficient $\binom{n}{k}$:

• How many different (unordered) committees of k people can be chosen from
n people?

• Flip a coin n times. How many different outcomes involve exactly k Heads?

• Expand $(a + b)^n$. What is the coefficient of $a^{n-k} b^k$?

In each case, a binary choice is made n times, with k choices having the same
result: k of the n people are given a “yes” to be on the committee, or k of the n coin
flips are Heads, or k of the n factors of (a + b) have a b chosen from them. Note
that, as we have observed on various occasions, the three bullet points above still
have the answer of $\binom{n}{k}$ if we make the complementary substitution of $k \to n - k$.
This substitution is equivalent to picking k people to not be on the committee, or
replacing Heads with Tails, or replacing $a^{n-k} b^k$ with $a^k b^{n-k}$.

A word on the order of our logic in this section. We originally defined the 
binomial coefficients in Section 1.5 as the number of ways of picking unordered 
subgroups (see Eq. (1.8)), and then we showed here that the binomial coefficients 
are also the coefficients that arise in the binomial expansion. This might seem a little 
backwards, because the name “binomial coefficients” suggests that they should be 
defined via the binomial expansion. But since we encountered unordered subgroups 
first in this chapter, we chose to take the “backwards” route. In the end, it doesn’t 
matter, of course, because both results are equal to $\frac{N!}{n!(N - n)!}$, and it’s just
semantics what name we use for this quantity.

See Problem 1.8 for a generalization of the binomial coefficient, called the multinomial
coefficient.


1.8.3 Properties of Pascal’s triangle 

Having established that the coefficients of the terms in the expansion of $(a + b)^n$
take the form of $\binom{n}{k}$, we can now quickly explain why the relation in Eq. (1.19)
holds, without invoking anything about coin flips. The general form of the results
in Table 1.17 is


$(a + b)^n = \sum_{k=0}^{n} \binom{n}{k} a^{n-k} b^k$

$= \binom{n}{0} a^n + \binom{n}{1} a^{n-1} b + \binom{n}{2} a^{n-2} b^2 + \cdots + \binom{n}{n-1} a b^{n-1} + \binom{n}{n} b^n.$    (1.21)


This is known as the binomial expansion, or binomial theorem, or binomial formula.
It holds for any values of a and b. Therefore, since we are free to pick a and b to
be whatever we want, let’s pick them both to be 1. Multiplication by 1 doesn’t
affect anything, so we can just erase all of the a’s and b’s on the righthand side
of Eq. (1.21). We then see that the righthand side is equal to the lefthand side of
Eq. (1.19). But the lefthand side of Eq. (1.21) is $(1 + 1)^n$, which is simply $2^n$, which
is equal to the righthand side of Eq. (1.19). We have therefore proved Eq. (1.19).
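As a concrete instance of this substitution, take the n = 3 row (chosen here only for illustration) and set a = b = 1:

```latex
\begin{align*}
(a + b)^3 &= a^3 + 3a^2 b + 3ab^2 + b^3, \\
(1 + 1)^3 &= 1 + 3 + 3 + 1 = 8 = 2^3 .
\end{align*}
```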

Another nice property of Pascal’s triangle, which you can verify by looking at
Table 1.15, is that each number is the sum of the two numbers above it (or just the
“1” above it, if it occurs at the end of a line). For example, in the n = 6 line, 20 is
the sum of the two 10’s above it (that is, $\binom{6}{3} = \binom{5}{2} + \binom{5}{3}$). And the first 15 is the
sum of the 5 and 10 above it (that is, $\binom{6}{2} = \binom{5}{1} + \binom{5}{2}$). Likewise for the second 15
and all the other numbers. Written out explicitly, the general property is


$\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}.$    (1.22)


The task of Problem 1.15 is to give a mathematical proof of this relation, using the 
explicit form of the binomial coefficients. But let’s demonstrate it here in a more 
intuitive way by taking advantage of what the binomial coefficients mean in terms 
of choosing committees. Relations among binomial coefficients often have intuitive 
proofs like this which involve no (or very little) math. 

In words, Eq. (1.22) says that the number of ways to pick k people from n people
equals the number of ways to pick k - 1 people from n - 1 people, plus the number
of ways to pick k people from n - 1 people. Does this make sense? Yes indeed, due
to the following reasoning.

Let’s single out one of the n people, whom we will call Alice. There are two
types of committees of k people: those that contain Alice, and those that don’t. How
many committees of each type are there? If the committee does contain Alice, then
the other k - 1 members must be chosen from the remaining n - 1 people. There
are $\binom{n-1}{k-1}$ ways to do this. If the committee does not contain Alice, then all k of the
members must be chosen from the remaining n - 1 people. There are $\binom{n-1}{k}$ ways to
do this. Since each of the total number of committees falls into one or the other
of these two categories, we therefore arrive at Eq. (1.22).
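The Alice argument can be mirrored directly by listing committees and sorting them into the two types. This is only an illustrative sketch: the helper `split_by_alice` and the placeholder names `P1`, `P2`, ... are our own, and it assumes Python 3.8 or later.

```python
from itertools import combinations
from math import comb


def split_by_alice(n, k):
    """Count k-person committees from n people, split by whether Alice is on them."""
    people = ["Alice"] + [f"P{i}" for i in range(1, n)]
    with_alice = 0
    without_alice = 0
    for committee in combinations(people, k):
        if "Alice" in committee:
            with_alice += 1
        else:
            without_alice += 1
    return with_alice, without_alice


n, k = 6, 3
with_alice, without_alice = split_by_alice(n, k)
# The two counts should match C(n-1, k-1) and C(n-1, k), and add up to C(n, k).
print(with_alice, comb(n - 1, k - 1))
print(without_alice, comb(n - 1, k))
print(with_alice + without_alice, comb(n, k))
```

For n = 6 and k = 3, both types contain 10 committees, which reproduces the statement that 20 is the sum of the two 10’s above it.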

The task of Problem 1.16 is to reproduce the reasoning in the preceding paragraph
to demonstrate Eq. (1.22), but instead in the language of coin flips or the
$(a + b)^n$ binomial expansion.


1.9 Summary


In this chapter we learned how to count things. In particular, we learned: 

• N! (“N factorial”) is defined to be the product of the first N integers:

$$N! = 1 \cdot 2 \cdot 3 \cdots (N - 2) \cdot (N - 1) \cdot N. \qquad (1.23)$$


• The number of different permutations of N objects (that is, the number of
different ways of ordering them) is N!.

• Consider a process for which there are N possible results each time it is
performed. If it is performed n times, then the total number of possible combined
outcomes, where the order does matter, equals

$$N^n \quad \text{(ordered, with repetition)} \qquad (1.24)$$

Examples include rolling an N-sided die n times, or drawing one of N balls
from a box n times, with replacement each time (so that all of the draws are
equivalent).

• Given N people, the number of different ways to choose an n-person committee
where the order does matter (for example, where there are n distinct
positions) equals

$${}_N P_n = \frac{N!}{(N - n)!} \quad \text{(ordered, without repetition)} \qquad (1.25)$$

• Given N people, the number of different ways to choose an n-person committee
where the order doesn’t matter is denoted by $\binom{N}{n}$, and it equals

$${}_N C_n = \binom{N}{n} = \frac{N!}{n!(N - n)!} \quad \text{(unordered, without repetition)} \qquad (1.26)$$


• Consider a process for which there are N possible results each time it is
performed. If it is performed n times, then the total number of possible combined
outcomes, where the order doesn’t matter, equals

$${}_N U_n = \binom{n + (N - 1)}{N - 1} \quad \text{(unordered, with repetition)} \qquad (1.27)$$


• The binomial coefficients $\binom{n}{k}$, which appear in Pascal’s triangle, are relevant
in three situations we have discussed: (1) choosing committees, (2) flipping
coins, and (3) expanding $(a + b)^n$. All three of these situations involve counting
the number of ways that k things can be chosen from n things (unordered,
and without repetition).
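As a quick cross-check of the four counting results in this summary, we can list the selections explicitly and compare the totals with Eqs. (1.24) through (1.27). This is a sketch under a few assumptions: it uses the standard-library `itertools` functions as stand-ins for the four kinds of sets, the function names are our own, and it requires n ≤ N in the cases without repetition.

```python
from itertools import (combinations, combinations_with_replacement,
                       permutations, product)
from math import comb, factorial


def count_by_enumeration(N, n):
    """Count the four kinds of n-item selections from N objects by listing them."""
    items = range(N)
    return {
        "ordered, with repetition": sum(1 for _ in product(items, repeat=n)),
        "ordered, without repetition": sum(1 for _ in permutations(items, n)),
        "unordered, without repetition": sum(1 for _ in combinations(items, n)),
        "unordered, with repetition": sum(
            1 for _ in combinations_with_replacement(items, n)),
    }


def count_by_formula(N, n):
    """The same four counts from Eqs. (1.24) through (1.27); assumes 0 <= n <= N."""
    return {
        "ordered, with repetition": N ** n,
        "ordered, without repetition": factorial(N) // factorial(N - n),
        "unordered, without repetition": comb(N, n),
        "unordered, with repetition": comb(n + N - 1, N - 1),
    }


print(count_by_enumeration(4, 2))
print(count_by_formula(4, 2))
```

With N = 4 and n = 2, both dictionaries should contain the values 16, 12, 6, and 10, in the order listed above.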


1.10 Exercises 

See www.people.fas.harvard.edu/~djmorin/book.html for a supply of problems 
without included solutions. 





Chapter 1. Combinatorics 


1.11 Problems 

Section 1.2: Permutations 

1.1. Assigning seats * 

Six girls and four boys are to be assigned to ten seats in a row, with the 
stipulations that a girl sits in the third seat and a boy sits in the eighth seat. 
How many arrangements are possible? 

Section 1.3: Ordered sets, repetitions allowed 

1.2. Number of outcomes * 

One person rolls two six-sided dice, and another person flips six two-sided 
coins. Which setup has the larger number of possible outcomes, assuming 
that the order matters? 

Section 1.4: Ordered sets, repetitions not allowed 

1.3. Subtracting the repeats ** 

(a) From Eq. (1.6) we know that the number of ordered sets of three people
chosen from five people is 5 · 4 · 3 = 60. Reproduce this result by
starting with the naive answer of $5^3 = 125$ ordered sets where repetitions
are allowed, and then subtracting off the number of triplets that have
repeated people.

(b) It’s actually not much more difficult to solve this problem in the general 
case where triplets are chosen from N people, instead of five. Repeat 
part (a) for a general N. 

1.4. Subtracting the repeats, again ** 

Repeat the task of Problem 1.3(a), but now in the case where you pick
quadruplets (instead of triplets) from five people.

Section 1.5: Unordered sets, repetitions not allowed 

1.5. Sum from 1 to N *

In Table 1.9 we saw that if we pick two (unordered) people from a group of 
five people, the $\binom{5}{2} = 10$ possibilities can be listed as shown in Table 1.18.

AB
AC  BC
AD  BD  CD
AE  BE  CE  DE

Table 1.18: Unordered pairs chosen from five people. 

If we look at the number of pairs in each row, we see that we can write 10
as 1 + 2 + 3 + 4. If we add on a sixth person, we’ll need to add on a fifth
row (AF, BF, CF, DF, EF), so we see that the number of possibilities, namely
$\binom{6}{2} = 15$, can be written as 1 + 2 + 3 + 4 + 5. This pattern continues, and
we find that the number of possible (unordered) pairs that we can pick from
N people equals the sum of 1 through N - 1. But we already know that the
number of pairs equals $\binom{N}{2} = N(N - 1)/2$. So it must be the case that the sum
of 1 through N - 1 equals N(N - 1)/2. Equivalently, if we replace N - 1 by
N here, it must be the case that the sum of 1 through N equals (N + 1)N/2,
which people usually write as N(N + 1)/2. Demonstrate this result in two
other ways:

(a) Write down the numbers 1 through N in increasing order in a horizontal 
line. And then below this string of numbers, write them down again but 
in decreasing order. Then add each number to the one above/below it, 
and take it from there. 

(b) First, quickly verify that the result holds for N = 1. Second, demonstrate
mathematically that if the result holds for the sum of 1 through N,
then it also holds for the sum of 1 through N + 1. Since the latter sum
is simply N + 1 larger than the former, this amounts to demonstrating
that N(N + 1)/2 + (N + 1) = (N + 1)(N + 2)/2. (The righthand side
here is the proposed result, with N replaced by N + 1.) Third, explain
why the preceding two facts imply that the result is valid for all N. The
technique here is called mathematical induction. (This problem is an
exercise more in mathematical induction than in combinatorics. But it’s
included here because the induction technique is something that everyone
should see at least once!)

1.6. Many ways to count * 

How many different orderings are there of the six letters: A, A, A, B, B, C? 
How many different ways can you think of to answer this question? 

1.7. Committees with a president ** 

Two students are given the following problem: From N people, how many 
ways are there to choose a committee of n people, with one person chosen as 
the president? One student gives an answer of $n\binom{N}{n}$, while the other student
gives an answer of $N\binom{N-1}{n-1}$.

(a) By writing out the binomial coefficients, show that the two answers are 
equal. 

(b) Explain the (valid) reasonings that lead to these two (correct) answers. 

1.8. Multinomial coefficients ** 

(a) A group of ten people are divided into three committees. Three people 
are on committee A, two are on committee B, and five are on committee 
C. How many different ways are there to divide up the people? 

(b) A group of N people are divided into k committees. $n_1$ people are on
committee 1, $n_2$ people are on committee 2, ..., and $n_k$ people are on
committee k, with $n_1 + n_2 + \cdots + n_k = N$. How many different ways
are there to divide up the people?

1.9. One heart and one 7 **

How many different five-card poker hands contain exactly one heart and
exactly one 7? (If the hand contains the 7 of hearts, then this one card satisfies
both requirements.)

1.10. Poker hands *** 

In a standard 52-card deck, how many different five-card poker hands are 
there of each of the following types? For each type, it is understood that we 
don’t count hands that also fall into a higher category. For example, when 
counting the three-of-a-kind hands, we don’t count the full-house or four-of-a-kind
hands, even though they technically contain three cards of the same
kind. 

(a) Full house (three cards of one value, two of another). We already solved 
this in the last example in Section 1.5, but we’re listing it again here so 
that all of the results for the various hands are contained in one place. 

(b) Straight flush (five consecutive values, all of the same suit). In the spirit 
of being realistic, assume that aces can be either high (above kings) or 
low (below 2’s). 

(c) Flush (five cards of the same suit), excluding straight flushes. 

(d) Straight (five consecutive values), excluding straight flushes. 

(e) One pair. 

(f) Two pairs. 

(g) Three of a kind. 

(h) Four of a kind. 

(i) None of the above. 

Section 1.7: Unordered sets, repetitions allowed 

1.11. Rolling two dice * 

(a) Two standard 6-sided dice are rolled. Find the total number of unordered
outcomes by looking at Table 1.5.

(b) Find the total number of unordered outcomes by using Eq. (1.16).

(c) By taking the lead from Table 1.5, find the total number of unordered
outcomes for two N-sided dice, and then verify that your result agrees
with Eq. (1.16).

1.12. Unordered coins * 

If you flip n coins and write down the unordered list of Heads and Tails that 
you obtain, what does Eq. (1.16) give for the number of possible outcomes? 

The simplicity of the result you just obtained suggests that there is an alternative
way of deriving it. Give an intuitive explanation of your answer that doesn’t
rely on Eq. (1.16).

1.13. Proof without stars and bars ***

This problem gives a (longer) proof of Eq. (1.16) that doesn’t rely on the 
stars-and-bars reasoning that we used in Section 1.7. 

(a) When explicitly counting (that is, without using Eq. (1.16)) the number
of unordered outcomes for n identical trials, each with N possible outcomes,
we saw in Problem 1.12 that it is helpful to list the outcomes
according to how many times a given individual result (such as Heads)
appears. Use this strategy to count the number of possible outcomes for
the N = 3 case (with arbitrary n). You may assume that you already
know the result for the N = 2 case in Problem 1.12. You will need to
use the result from Problem 1.5.

(b) The way in which the N = 3 result (with arbitrary n) follows from the
N = 2 result (with arbitrary n) suggests an inductive proof of Eq. (1.16)
for general N. By again listing (or imagining listing) the outcomes according
to how many times a given individual result appears, and by
making use of Problem 1.17 below (so you should look at that problem
before solving this one), show inductively that if Eq. (1.16) holds for
N - 1, then it also holds for N. (See Problem 1.5 for an explanation of
mathematical induction.)

1.14. Yahtzee 

In the game of Yahtzee™, five dice are rolled in a group, with the order not 
mattering. 

(a) Using Eq. (1.16), how many unordered rolls (sets) are possible? 

(b) In the spirit of the examples at the beginning of Section 1.7, reproduce 
the result in part (a) by determining how many unordered rolls there 
are of each general type (for example, three of one number and two of 
another, etc.). 

(c) In the spirit of the example at the end of Section 1.7, show that the total
number of ordered Yahtzee rolls is $6^5 = 7776$.


Section 1.8: Binomial coefficients 


1.15. Pascal sum 1 * 

Using $\binom{n}{k} = n!/k!(n - k)!$, show that

$$\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}. \qquad (1.28)$$





1.16. Pascal sum 2 ** 

At the end of Section 1.8.3, we demonstrated the relation
$\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}$
by using an argument involving committees. Repeat this reasoning, but now
in terms of:

(a) coin flips, 

(b) the $(a + b)^n$ binomial expansion.

1.17. Pascal diagonal sum ** 

(a) If we pick an unordered committee of three people from five people (A,
B, C, D, E), we can list the $\binom{5}{3} = 10$ possibilities as shown in Table 1.19.
We have grouped them according to which letter comes first. (The order
of letters doesn’t matter, so we’ve written each triplet in increasing
alphabetical order.) The columns in the table tell us that we can think of
10 as equaling 6 + 3 + 1. Explain why it makes sense to write this sum
as $\binom{4}{2} + \binom{3}{2} + \binom{2}{2}$.

ABC
ABD
ABE
ACD  BCD
ACE  BCE
ADE  BDE  CDE

Table 1.19: Unordered triplets chosen from five people. 


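(b) You can also see from Tables 1.15 and 1.16 that, for example,
$\binom{6}{3} = \binom{5}{2} + \binom{4}{2} + \binom{3}{2} + \binom{2}{2}$. More generally,

$$\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-2}{k-1} + \cdots + \binom{k}{k-1} + \binom{k-1}{k-1}. \qquad (1.29)$$

In words: A given number (for example, $\binom{6}{3}$) in Pascal’s triangle equals
the sum of the numbers in the diagonal string that starts with the number
that is above and to the left of the given number ($\binom{5}{2}$ in this case) and
then proceeds upward to the right. So the string contains $\binom{5}{2}$, $\binom{4}{2}$, $\binom{3}{2}$,
and $\binom{2}{2}$ in this case.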

Prove Eq. (1.29) by making repeated use of Eq. (1.22), which says that 
each number in Pascal’s triangle is the sum of the two numbers above it 
(or just the “1” above it, if it occurs at the end of a line). Hint: No math 
needed! You just need to draw a few pictures of Pascal’s triangle after 
successive applications of Eq. (1.22). 

1.12 Solutions

1.1. Assigning seats 

There are six ways to pick the girl who sits in the third seat, and then for each of these
choices there are four ways to pick the boy who sits in the eighth seat. For each of
these 6 · 4 = 24 combinations, there are 8! = 40,320 permutations of the remaining
eight people in the remaining eight seats. The total number of possible arrangements
with the given stipulations is therefore 24 · 40,320 = 967,680. This is smaller than
the answer of 10! in the case with no stipulations, by a factor of (6 · 4 · 8!)/10! =
(6 · 4)/(10 · 9) ≈ 0.27.

1.2. Number of outcomes 

In the case of the two six-sided dice, using N = 6 and n = 2 in Eq. (1.4) gives $6^2 = 36$
possible outcomes. In the case of the six two-sided coins, using N = 2 and n = 6 in
Eq. (1.4) gives $2^6 = 64$ possible outcomes. The latter setup therefore has the larger
number of possible outcomes.

If we replace the number 6 in this problem with, say, 20 (for example, we can roll
the icosahedral die on the cover of this book), and if we keep the 2 the same, then
the above two results become, respectively, $20^2 = 400$ and $2^{20} = 1,048,576$. The
latter result is larger than the former by a factor of about 2600, whereas in the original
problem the factor was only about 1.8. The two results are equal if we replace the 6
with 4 (which corresponds to a tetrahedral die).

1.3. Subtracting the repeats 

(a) If repetitions are allowed, there are two general types of ordered triplets that 
contain repeated people: all three people can be the same (such as AAA), or 
two people can be the same, with the third being different (such as AAB). Since 
we are choosing from five people, there are five triplets of the first type (AAA 
through EEE). 

How many triplets are there of the second type? There are five ways to pick
the letter that appears twice, and then four ways to pick the letter that appears
once from the remaining four letters. And then for each of these 5 · 4 = 20
combinations, there are three ways to order the letters (AAB, ABA, BAA). So
there are 20 · 3 = 60 ordered triplets of the general type AAB.

The total number of ordered triplets that contain repeated people is therefore
5 + 60 = 65. Subtracting this from the $5^3 = 125$ total number of ordered
triplets (with repetitions allowed) gives 125 - 65 = 60 ordered triplets without
repetitions, as desired.

(b) Again, if repetitions are allowed, there are two general types of ordered triplets
that contain repeated people: AAA and AAB. Since we are choosing from N
people, there are now N possible letters, so there are N triplets of the first type.
How many triplets are there of the second type? There are N ways to pick the
letter that appears twice, and then N - 1 ways to pick the letter that appears
once from the remaining N - 1 letters. And then for each of these N(N - 1)
combinations, there are three ways to order the letters (AAB, ABA, BAA). So
there are N(N - 1) · 3 ordered triplets of the general type AAB.

The total number of ordered triplets that contain repeated people is therefore
$N + 3N(N - 1) = 3N^2 - 2N$. Our goal is to show that when this is subtracted from
the $N^3$ total number of ordered triplets (with repetitions allowed), we obtain the
N(N - 1)(N - 2) result in Eq. (1.6) for triplets without repetitions. So we want
to show that

$$N^3 - (3N^2 - 2N) = N(N - 1)(N - 2). \qquad (1.30)$$

If you multiply out the righthand side, you will quickly see that the desired
equality holds.
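If you would like to see the count of repeated triplets confirmed by brute force, the following sketch lists every ordered triplet and keeps those in which some person appears more than once. The function name `repeated_triplets` is our own, and people are represented by the integers 0 through N - 1 purely for convenience.

```python
from itertools import product


def repeated_triplets(N):
    """Count ordered triplets from N people in which someone appears more than once."""
    return sum(1 for t in product(range(N), repeat=3) if len(set(t)) < 3)


for N in range(1, 8):
    repeats = repeated_triplets(N)
    # Columns: N, brute-force count, 3N^2 - 2N, triplets left over, N(N-1)(N-2)
    print(N, repeats, 3 * N ** 2 - 2 * N, N ** 3 - repeats, N * (N - 1) * (N - 2))
```

For N = 5 the brute-force count is 65, which leaves 125 - 65 = 60 triplets without repetitions, in agreement with part (a).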

1.4. Subtracting the repeats, again 

Our goal is to show that when the number of ordered quadruplets with repeated people
is subtracted from the $5^4 = 625$ total number of ordered quadruplets (with repetitions
allowed), we obtain the correct number 5 · 4 · 3 · 2 = 120 of ordered quadruplets
without repetitions. If repetitions are allowed, there are four general types of ordered
quadruplets that contain repeated people: AAAA, AAAB, AABB, and AABC. Let’s
look at each of these in turn.

• First type: Since we are choosing from five people, there are five quadruplets
of this type (AAAA through EEEE).

• Second type: There are five ways to pick the letter that appears three times,
and then four ways to pick the letter that appears once from the remaining four
letters. And then for each of these 5 · 4 = 20 combinations, there are four ways
to order the letters (AAAB, AABA, ABAA, BAAA). So there are 20 · 4 = 80
ordered quadruplets of the general type AAAB.

• Third type: There are $\binom{5}{2} = 10$ ways to pick the two letters that appear. And
then for each of these combinations, there are $\binom{4}{2} = 6$ ways to order the letters
(AABB, ABAB, ABBA, BBAA, BABA, BAAB). So there are 10 · 6 = 60
ordered quadruplets of the general type AABB.

• Fourth type: There are five ways to pick the letter that appears twice, and then
$\binom{4}{2} = 6$ ways to pick the other two letters from the remaining four letters. And
then for each of these 5 · 6 = 30 combinations, there are 12 ways to order the
letters (four ways to pick the location of one of the single letters, and then three
for the other). So there are 30 · 12 = 360 ordered quadruplets of the general type
AABC.

The total number of ordered quadruplets that contain repeated people is therefore 5 +
80 + 60 + 360 = 505. Subtracting this from the $5^4 = 625$ total number of ordered
quadruplets (with repetitions allowed) gives 625 - 505 = 120 ordered quadruplets
without repetitions, as desired.

In the same manner as in Problem 1.3(b), you can solve this problem in the general
case where quadruplets are chosen from N people, instead of five. The math gets a
little messy, but in the end it comes down to replacing every 5 in the above solution
with an N, and replacing the appropriate 4’s with (N - 1)’s.
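The four types of repeated quadruplets can also be tallied by brute force. In this sketch (our own construction, not part of the original solution), each quadruplet is labeled by how many times each distinct letter appears, so that (3, 1) stands for type AAAB, (2, 1, 1) for type AABC, and so on.

```python
from collections import Counter
from itertools import product


def quadruplet_types(N=5):
    """Tally ordered quadruplets from N people by their multiplicity pattern."""
    tally = Counter()
    for q in product(range(N), repeat=4):
        # Example: q = (0, 2, 0, 1) has letter counts {0: 2, 2: 1, 1: 1} -> (2, 1, 1)
        pattern = tuple(sorted(Counter(q).values(), reverse=True))
        tally[pattern] += 1
    return tally


tally = quadruplet_types(5)
for pattern, count in sorted(tally.items(), reverse=True):
    print(pattern, count)
```

For five people the tallies should be 5, 80, 60, and 360 for the four repeated types, plus 120 for the pattern (1, 1, 1, 1) with no repetitions, for a total of 625.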

1.5. Sum from 1 to N

(a) Our instructions are to write down the following two horizontal strings of
numbers:

1    2    3    ...    N-2    N-1    N
N    N-1  N-2  ...    3      2      1

Note that every column of two numbers has the same sum, namely N + 1. And
since there are N columns, the total sum of the two rows (viewed as N columns)
is N(N + 1). We have counted every number twice, so the sum of the numbers 1
through N is half of N(N + 1), that is, N(N + 1)/2. As we’ve seen many times
throughout this chapter, and as we’ll see many more times, things become much
clearer if you group objects in certain ways!
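The two-row pairing can be written out in a few lines of code. This is just an illustrative sketch (the name `gauss_sum` is ours); it builds both rows, checks that every column adds up to N + 1, and then halves the total.

```python
def gauss_sum(N):
    """Sum 1 through N by pairing the increasing row with the decreasing row."""
    increasing = list(range(1, N + 1))
    decreasing = increasing[::-1]
    column_sums = [a + b for a, b in zip(increasing, decreasing)]
    # Every column should have the same sum, N + 1.
    assert all(s == N + 1 for s in column_sums)
    # The two rows together count every number twice, so halve the total.
    return sum(column_sums) // 2


print(gauss_sum(100))  # 5050
print(gauss_sum(10), 10 * 11 // 2)
```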

Remark: One day when he was in grade school (or so the story goes), the German
mathematician Carl Friedrich Gauss (1777-1855) encountered the above
problem. His teacher was trying to quiet the students by giving them the task of
adding up the numbers 1 through 100, thinking that it would occupy them for a
while. But to the teacher’s amazement, Gauss quickly came up with the correct
answer, 5050, by cleverly thinking of the above method on the spot. *

(b) Our first task is easy. If N = 1 then the sum of the numbers 1 through N = 1 is
simply 1, which equals N(N + 1)/2 when N = 1.

For our second task, if we assume that the sum of 1 through N equals
N(N + 1)/2, then the sum of 1 through N + 1 is N + 1 more than that, so it equals

$$\begin{aligned}
1 + 2 + 3 + \cdots + N + (N + 1) &= (1 + 2 + 3 + \cdots + N) + (N + 1) \\
&= \frac{N(N + 1)}{2} + (N + 1) \\
&= (N + 1)\left(\frac{N}{2} + 1\right) \\
&= \frac{(N + 1)(N + 2)}{2}, \qquad (1.31)
\end{aligned}$$

which is the proposed result with N replaced by N + 1, as desired.

Now for the third task. We have demonstrated two facts: First, we have shown
that the result (that the sum of 1 through N equals N(N + 1)/2) holds for N = 1.
And second, we have shown that if the result holds for N, then it also holds
for N + 1. (This second fact is called the inductive step in the proof.) The
combination of these two facts implies that the result holds for all N, by the
following reasoning. Since the result holds for N = 1, the second fact implies
that it also holds for N = 2. And then since it holds for N = 2, the second fact
implies that it also holds for N = 3. And then since it holds for N = 3, the
second fact implies that it also holds for N = 4. And so on. The result therefore
holds for all N (positive integers).

Remarks: This method of proof (mathematical induction) requires that you
already have a guess for what the answer is. The induction reasoning then lets
you rigorously prove that your guess is correct. If you don’t already know the
answer (which is N(N + 1)/2 in the present case), then mathematical induction
doesn’t help you. In short, with mathematical induction, you can prove a result,
but you can’t derive it.

Note that although it was trivial to demonstrate, the first of the above two facts
(that the result holds for N = 1) is critical in an inductive proof. The second fact
alone isn’t sufficient for the proof. As an example of why this is true, let’s say
that someone proposes that the sum 1 + 2 + 3 + ··· + N equals N(N + 1)/2 + 73.
(Any other additive constant would serve the purpose here just as well.) This
expression is obviously incorrect, even though it does satisfy the inductive step.
This can be seen by tacking a 73 on to the N(N + 1)/2 term in the second line of
Eq. (1.31). So our new (incorrect) guess does indeed satisfy the statement, “If it
holds for N, then it also holds for N + 1.” The problem, however, is that the “if”
part of this statement is never satisfied. The guess doesn’t hold for N = 1 (or
any other value of N), so there is no number at which we can start the inductive
chain of reasoning. *

1.6. Many ways to count 

We’ll present four solutions: 

First solution: There are $\binom{6}{3} = 20$ ways to choose where the three A’s go in the six
possible places. For each of these 20 ways, there are $\binom{3}{2} = 3$ ways to choose where
the two B’s go in the remaining three places (or equivalently $\binom{3}{1} = 3$ ways to choose
where the one C goes). The total number of orderings is therefore 20 · 3 = 60.

Second solution: There are $\binom{6}{2} = 15$ ways to choose where the two B’s go in the six
possible places. For each of these 15 ways, there are $\binom{4}{3} = 4$ ways to choose where
the three A’s go in the remaining four places (or equivalently $\binom{4}{1} = 4$ ways to choose
where the one C goes). The total number of orderings is therefore 15 · 4 = 60.

Third solution: There are $\binom{6}{1} = 6$ ways to choose where the C goes in the six possible
places. For each of these 6 ways, there are $\binom{5}{3} = 10$ ways to choose where the three
A’s go in the remaining five places (or equivalently $\binom{5}{2} = 10$ ways to choose where
the two B’s go). The total number of orderings is therefore 6 · 10 = 60.

Fourth solution: Let’s forget for a moment that the three A’s, along with the two B’s,
are equivalent. If we treat all six letters as distinguishable, then there are 6! = 720
ways to order them. However, since the three A’s are in fact indistinguishable, we
have overcounted the number of orderings by a factor of 3!, because that is the number
of ways to order the three A’s. Similarly, the two B’s are indistinguishable, so we
have also overcounted by 2!. The actual number of different orderings is therefore
6!/(3!2!) = 720/(6 · 2) = 60.
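A fifth way to answer the question is to let a computer list the orderings. The sketch below (with a function name of our own choosing) generates all 6! = 720 orderings of the six letters as if they were distinguishable, and then discards the duplicates by collecting them in a set.

```python
from itertools import permutations
from math import comb, factorial


def distinct_orderings(letters):
    """Return the set of distinguishable orderings of a string of letters."""
    return set(permutations(letters))


orderings = distinct_orderings("AAABBC")
print(len(orderings))
# The same count from the first and fourth solutions:
print(comb(6, 3) * comb(3, 2))
print(factorial(6) // (factorial(3) * factorial(2)))
```

All three printed values should be 60.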

1.7. Committees with a president 


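(a) If we write out the binomial coefficients, the equality to be demonstrated is

$$\begin{aligned}
n \binom{N}{n} &= N \binom{N-1}{n-1} \\
\iff \quad n \cdot \frac{N!}{n!(N - n)!} &= N \cdot \frac{(N - 1)!}{(n - 1)!(N - n)!} \\
\iff \quad \frac{N!}{(n - 1)!(N - n)!} &= \frac{N!}{(n - 1)!(N - n)!}, \qquad (1.32)
\end{aligned}$$

which is indeed true.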


(b) First student’s reasoning: Imagine first picking the n committee members
(there are $\binom{N}{n}$ ways to do this), and then picking the president from these n
people (there are n ways to do this). The total number of ways to form a committee
with a president is therefore $n\binom{N}{n}$.

Second student’s reasoning: Imagine first picking the president from the complete
set of N people (there are N ways to do this), and then picking the other
n - 1 committee members from the remaining N - 1 people (there are $\binom{N-1}{n-1}$
ways to do this). The total number of ways to form a committee with a president
is therefore $N\binom{N-1}{n-1}$.
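The two students’ procedures can be carried out literally, to confirm that they produce the very same set of committees. In this sketch (function names are our own), each outcome is stored as a pair consisting of the president and the full set of committee members.

```python
from itertools import combinations
from math import comb


def first_student(N, n):
    """Pick the n members first, then pick the president from among them."""
    return {(p, frozenset(c)) for c in combinations(range(N), n) for p in c}


def second_student(N, n):
    """Pick the president first, then the other n - 1 members from the rest."""
    outcomes = set()
    for p in range(N):
        others = [x for x in range(N) if x != p]
        for rest in combinations(others, n - 1):
            outcomes.add((p, frozenset(rest) | {p}))
    return outcomes


N, n = 6, 3
a = first_student(N, n)
b = second_student(N, n)
print(a == b)
print(len(a), n * comb(N, n), N * comb(N - 1, n - 1))
```

For N = 6 and n = 3, the two sets should be identical, and each of the three printed counts should be 60.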

1.8. Multinomial coefficients


(a) First solution: There are $\binom{10}{3}$ ways to choose the three members of committee
A. And then from the remaining seven people, there are $\binom{7}{2}$ ways to choose the
two members of committee B. The five remaining people are then on committee
C. The total number of ways to choose the committees is therefore

$$\binom{10}{3}\binom{7}{2} = \frac{10!}{3!7!} \cdot \frac{7!}{2!5!} = \frac{10!}{3!2!5!} = 2,520. \qquad (1.33)$$


Alternatively, we can use the above reasoning but consider the committees in a 
different order. For example, we can first pick the two members of committee 
B, and then the five members of committee C. This yields an answer of 


$$\binom{10}{2}\binom{8}{5} = \frac{10!}{2!8!} \cdot \frac{8!}{5!3!} = \frac{10!}{2!5!3!} = 2,520. \qquad (1.34)$$


Considering the committees in any other order will give the same answer, as you 
can check. One of the factorials will always cancel, and you will be left with 
the product 3!2!5! in the denominator. 


Second solution: Since the numbers of people on the committees are 3, 2, and 
5, the appearance of the product 3!2!5! in the denominator suggests that there 
is a more streamlined way of obtaining the answer. And indeed, imagine lining 
up ten seats, with the first three labeled A, the next two labeled B, and the last 
five labeled C. There are 10! different ways to assign the ten people to the ten 
seats. But the 3! possible permutations of the first three people don't change 
the committee A assignments, because we don't care about the order of people 
within a committee. So the 10! figure overcounts the number of committee 
assignments by 3!. We therefore need to divide 10! by 3!. Likewise, the 2! 
permutations of the people in the B seats and the 5! permutations of the people 
in the C seats don't change the committee assignments. So we also need to 
divide by 2! and 5!. The correct number of different committee assignments is 
therefore 10!/(3!2!5!).

(b) The reasoning in the second solution above immediately extends to the general
case, so the answer is

$$\frac{N!}{n_1! n_2! \cdots n_k!}. \qquad (1.35)$$

In short, there are N! ways to assign N people to N seats in a row. But the $n_i!$
permutations of the people within each committee don’t change the committee
assignments. So N! overcounts the true number of assignments by the product
$n_1! n_2! \cdots n_k!$. We must therefore divide N! by this product.

Alternatively, we can use the reasoning in the first solution above. There are
$\binom{N}{n_1}$ ways to choose the $n_1$ members of committee 1. And then from the
remaining $N - n_1$ people, there are $\binom{N - n_1}{n_2}$ ways to choose the $n_2$ members of
committee 2. And so on. The total number of ways to choose the committees is
therefore

$$\begin{aligned}
&\binom{N}{n_1} \binom{N - n_1}{n_2} \binom{N - n_1 - n_2}{n_3} \cdots \\
&\quad = \frac{N!}{n_1!(N - n_1)!} \cdot \frac{(N - n_1)!}{n_2!(N - n_1 - n_2)!} \cdot \frac{(N - n_1 - n_2)!}{n_3!(N - n_1 - n_2 - n_3)!} \cdots \\
&\quad = \frac{N!}{n_1! n_2! \cdots n_k!}. \qquad (1.36)
\end{aligned}$$

Most of the factorials cancel in pairs. The last factorial in the denominator,
namely $(N - n_1 - n_2 - \cdots - n_k)!$, equals 0! = 1, because the sum of the $n_i$
equals N.

The above result can be extended quickly to the case where only a subset of the
N people are assigned to be on the committees, that is, where $\sum n_i < N$. In
this case, we can simply pretend that the leftover people are on one additional
committee. So we now have k + 1 committees, where $\sum n_i = N$. For example,
if the task of this problem were instead to pick the three committees (with 3, 2,
and 5 people) from a set of 16 people, then the number of possible ways would
be 16!/(3!2!5!6!), which is about 20 million.
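Both routes to the committee count are easy to compute. The following sketch uses two function names of our own: `multinomial` divides N! by the product of the $n_i!$ as in Eq. (1.35), while `multinomial_sequential` multiplies successive binomial coefficients as in the first solution. It assumes Python 3.8 or later for `math.comb`.

```python
from math import comb, factorial


def multinomial(*sizes):
    """N! / (n_1! n_2! ... n_k!), where N is the sum of the committee sizes."""
    result = factorial(sum(sizes))
    for n_i in sizes:
        result //= factorial(n_i)  # each division is exact
    return result


def multinomial_sequential(*sizes):
    """Choose committee 1, then committee 2 from whoever remains, and so on."""
    remaining = sum(sizes)
    result = 1
    for n_i in sizes:
        result *= comb(remaining, n_i)
        remaining -= n_i
    return result


print(multinomial(3, 2, 5), multinomial_sequential(3, 2, 5))
# Committees of 3, 2, and 5 from 16 people: treat the 6 leftover people as a
# fourth "committee."
print(multinomial(3, 2, 5, 6))
```

The first line should print 2520 twice, and the second should print 20180160, which is the “about 20 million” quoted above.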


Remark: The expression in Eq. (1.35) is called a multinomial coefficient
(analogous to the binomial coefficient) and is denoted by

$$\binom{N}{n_1, n_2, \ldots, n_k} = \frac{N!}{n_1! n_2! \cdots n_k!}, \qquad (1.37)$$

where it is understood that $n_1 + n_2 + \cdots + n_k = N$. In the multinomial-coefficient
notation, the standard binomial coefficient $\binom{N}{n}$ is written as $\binom{N}{n, N-n}$. But
in this k = 2 case, people always just write $\binom{N}{n}$. However, for all other k, the
convention is to list all k numbers in the lower entry of the coefficient.

The multinomial coefficients appear in the expansion,

$$(x_1 + x_2 + \cdots + x_k)^N = \sum_{n_1 + n_2 + \cdots + n_k = N} \binom{N}{n_1, n_2, \ldots, n_k} x_1^{n_1} x_2^{n_2} \cdots x_k^{n_k}. \qquad (1.38)$$


The multinomial coefficients appear here for exactly the same reason they appear
in the above solution involving the number of committees. If we look at a
particular $x_1^{n_1} x_2^{n_2} \cdots x_k^{n_k}$ term on the righthand side of Eq. (1.38), the $n_1$ factors
of $x_1$ can come from any $n_1$ of the N factors of $(x_1 + x_2 + \cdots + x_k)$ on the lefthand
side. Picking these $n_1$ factors is equivalent to picking a specific set of $n_1$ people
to be on committee 1. Likewise for the $x_2^{n_2}$ factor and the $n_2$ people on committee 2.
And so on. The number of ways to pick a particular $x_1^{n_1} x_2^{n_2} \cdots x_k^{n_k}$
product is therefore equal to the number of ways to pick committees of $n_1$, $n_2$,
..., $n_k$ people. That is, the coefficient in the sum in Eq. (1.38) equals the
expression in Eq. (1.35). The reasoning we used here is basically the same as the
reasoning we used in Section 1.8.2 for the case of binomial coefficients. *
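The reasoning in this remark can be imitated directly: expand the product by choosing one variable from each of the N factors, and count how many choices lead to each set of exponents. The sketch below does this by brute force for small k and N; the function names are our own, and the variables $x_1, \ldots, x_k$ are represented by the indices 0 through k - 1.

```python
from collections import Counter
from itertools import product
from math import factorial


def expansion_coefficients(k, N):
    """Expand (x_1 + ... + x_k)^N by picking one variable from each of the N factors."""
    coeffs = Counter()
    for choice in product(range(k), repeat=N):
        # exponents[i] is how many of the N factors contributed the variable x_(i+1)
        exponents = tuple(choice.count(i) for i in range(k))
        coeffs[exponents] += 1
    return coeffs


def multinomial_from_exponents(exponents):
    """N! / (n_1! n_2! ... n_k!) for a given tuple of exponents."""
    result = factorial(sum(exponents))
    for n_i in exponents:
        result //= factorial(n_i)
    return result


coeffs = expansion_coefficients(3, 4)
for exponents, count in sorted(coeffs.items(), reverse=True):
    print(exponents, count, multinomial_from_exponents(exponents))
```

For k = 3 and N = 4, every counted coefficient should agree with the multinomial formula, and the coefficients should add up to $3^4 = 81$.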


1.9. One heart and one 7 

It is easiest to deal with the 7 of hearts separately. If the hand contains this card, then 
none of the other four cards in the hand can be a heart or a 7. There are 12 other 
hearts and three other 7’s. So including the 7 of hearts, 16 cards are ruled out, which 
leaves 36. The number of ways to choose four cards from 36 is = 58,905. This 
is therefore the number of desired hands that contain the 7 of hearts. 

Now consider the hands that don’t contain the 7 of hearts. There are 12 other hearts 
and three other 7’s to choose from. So there are 12 • 3 = 36 ways to choose the two 
cards of the required type. For the remaining three cards, there are again 36 cards to 
choose from, yielding = 7,140 possibilities. The total number of desired hands 
that lack the 7 of hearts is then 36 • 7,140 = 257,040. 

The total number of desired hands (with or without the 7 of hearts) is therefore
58,905 + 257,040 = 315,945. This is about 12% of the $\binom{52}{5}$ = 2,598,960 total number
of poker hands.
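
The arithmetic in this solution can be checked with a few lines of Python. This is only a sketch that mirrors the two cases above; the variable names are our own, and `math.comb` plays the role of the binomial coefficient.

```python
from math import comb

# Case 1: the hand contains the 7 of hearts. The other four cards must come
# from the 36 cards that are neither hearts nor 7's.
with_seven_of_hearts = comb(36, 4)

# Case 2: the hand lacks the 7 of hearts. Pick one of the 12 other hearts,
# one of the 3 other 7's, and three cards from the same 36 "neutral" cards.
without_seven_of_hearts = 12 * 3 * comb(36, 3)

total_desired = with_seven_of_hearts + without_seven_of_hearts
fraction = total_desired / comb(52, 5)

print(with_seven_of_hearts, without_seven_of_hearts, total_desired)
print(f"{fraction:.1%} of all poker hands")
```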

1.10. Poker hands 

(a) Full house: There are 13 ways to choose the value that appears three times, and
$\binom{4}{3}$ = 4 ways to choose the specific three cards from the four (the four suits)
that have this value. And then there are 12 ways to choose the value that appears
twice from the remaining 12 values, and $\binom{4}{2}$ = 6 ways to choose the specific two
cards from the four that have this value. The total number of full-house hands is
therefore

\[ 13 \cdot \binom{4}{3} \cdot 12 \cdot \binom{4}{2} = 3{,}744. \tag{1.39} \]

(b) Straight flush: The five consecutive values can be A, 2, 3, 4, 5, or 2, 3, 4, 5, 6,
and so on until 10, J, Q, K, A. There are 10 of these sequences; remember that
aces can be high or low. Each sequence can occur in four possible suits, so the
total number of straight-flush hands is

\[ 4 \cdot 10 = 40. \tag{1.40} \]

Of these 40 hands, four of them are the Royal flushes, consisting of 10, J, Q, K, A 
(one for each suit). 

(c) Flush: The number of ways to pick five cards from the 13 cards of a given suit
is $\binom{13}{5}$. Since there are four suits, the total number of flush hands is $4 \cdot \binom{13}{5}$ =
5,148. However, 40 of these were already counted in the straight-flush category
above, so that leaves

\[ 4 \cdot \binom{13}{5} - 40 = 5{,}108 \tag{1.41} \]

hands that are “regular” flushes. 

(d) Straight: The 10 sequences listed in part (b) are relevant here. But now there
are four possible choices (the four suits) for each of the five cards. The total
number of straight hands is therefore $10 \cdot 4^5$ = 10,240. However, 40 of these
were already counted in the straight-flush category above, so that leaves

\[ 10 \cdot 4^5 - 40 = 10{,}200 \tag{1.42} \]

hands that are “regular” straights. 

(e) One pair: There are 13 ways to pick the value that appears twice, and $\binom{4}{2}$ = 6
ways to choose the specific two cards from the four that have this value. The
other three values must all be different, and they must be chosen from the
remaining 12 values. There are $\binom{12}{3}$ ways to do this. And then there are four
possible choices (the four suits) for each of these three values, which brings in
a factor of $4^3$. The total number of pair hands is therefore

\[ 13 \cdot \binom{4}{2} \cdot \binom{12}{3} \cdot 4^3 = 1{,}098{,}240. \tag{1.43} \]

Alternatively, you can count this as $13 \cdot \binom{4}{2} \cdot 48 \cdot 44 \cdot 40/3!$ = 1,098,240,
because after picking the pair, there are 48 choices for the third card (because 
one value is off limits), then 44 choices for the fourth card (because two values 
are off limits), and then 40 choices for the fifth card (because three values are 
off limits). But we have counted the 3! possible permutations of a given set 
of third/fourth/fifth cards as distinct. Since the order doesn’t matter, we must 
correct for this by dividing by 3!, which gives the above result. 

Note that when counting the pair hands, we don't need to worry about double 
counting any flushes, because the two cards in the pair necessarily have different 
suits. Likewise, we don't need to worry about double counting any straights, 
because the two cards in the pair have the same value, by definition. 

(f) Two pairs: There are $\binom{13}{2}$ ways to choose the two values for the two pairs. For
each pair, there are $\binom{4}{2}$ = 6 ways to choose the specific two cards from the four
that have that value. This brings in a factor of $6^2$. And then there are 44 choices
for the fifth card, since two values are off limits. The total number of two-pair
hands is therefore

\[ \binom{13}{2} \cdot \binom{4}{2}^2 \cdot 44 = 123{,}552. \tag{1.44} \]

(g) Three of a kind: There are 13 ways to pick the value that appears three times,
and $\binom{4}{3}$ = 4 ways to choose the specific three cards from the four that have this
value. The other two values must be different, and they must be chosen from the
remaining 12 values. There are $\binom{12}{2}$ ways to do this. And then there are four possible
choices for each of these two values, which brings in a factor of $4^2$. The total
number of three-of-a-kind hands is therefore

\[ 13 \cdot \binom{4}{3} \cdot \binom{12}{2} \cdot 4^2 = 54{,}912. \tag{1.45} \]

Alternatively, as in part (e), you can think of this as $13 \cdot \binom{4}{3} \cdot 48 \cdot 44/2!$ = 54,912.

(h) Four of a kind: There are 13 ways to pick the value that appears four times,
and then only $\binom{4}{4}$ = 1 way to choose the specific four cards from the four that
have this value. There are 48 choices for the fifth card, so the total number of
four-of-a-kind hands is

\[ 13 \cdot \binom{4}{4} \cdot 48 = 624. \tag{1.46} \]

(i) None of the above: The easy way to calculate this number is to subtract the
sum of the results in parts (a) through (h) from the total number of possible
poker hands, namely $\binom{52}{5}$ = 2,598,960. But let’s do it the hard way.

We’ll start by considering only the values of the cards and ignoring the suits.
Since we don’t want any pairs, we’re concerned with hands where all five values
are different (for example, 3, 4, 7, J, K). There are $\binom{13}{5}$ ways to pick these five
values. However, we also don’t want any straights (such as 3, 4, 5, 6, 7), so we
must exclude these. As in parts (b) and (d), there are 10 different sequences of
straights (remembering that aces can be high or low). So the number of possible
none-of-the-above sets of values is $\binom{13}{5} - 10$.

We must now account for the possibility of different suits. For each of the
$\binom{13}{5} - 10$ sets of values, each value has four options for its suit, so this brings
in a factor of $4^5$. However, we don't want any flushes, so we must exclude
these. There are four possible flushes (one for each suit) for each set of values,
so the number of possible none-of-the-above suit combinations for each of the
$\binom{13}{5} - 10$ sets of values is $4^5 - 4$. The total number of none-of-the-above hands
is therefore

\[ \left( \binom{13}{5} - 10 \right) \cdot \left( 4^5 - 4 \right) = 1{,}302{,}540. \tag{1.47} \]

These none-of-the-above hands are commonly known as “high card” hands,
because the hand's rank is determined by the highest card it contains (or the second
highest if there is a tie, etc.).

Let’s now check that all of our results correctly add up to the total number
of possible hands, $\binom{52}{5}$ = 2,598,960. The various results (along with their
percentages) are listed in Table 1.20 in order of increasing frequency. We see
that they do indeed add up correctly. Note that one-pair and none-of-the-above
hands account for 92% of the total number of hands.


    Hand type                        Number of hands   Percentage
    Royal flush                                    4     0.00015%
    Straight flush (not Royal)                    36     0.0014%
    Four of a kind                               624     0.024%
    Full house                                 3,744     0.14%
    Flush (not straight flush)                 5,108     0.20%
    Straight (not straight flush)             10,200     0.39%
    Three of a kind                           54,912     2.1%
    Two pairs                                123,552     4.8%
    One pair                               1,098,240     42.3%
    None of the above                      1,302,540     50.1%
    Total                                  2,598,960

Table 1.20: The numbers of different poker hands. 
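
As a cross-check on Table 1.20, the counting formulas from parts (a) through (i) can be evaluated directly. The following Python sketch simply transcribes Eqs. (1.39) through (1.47); the dictionary name `counts` is our own choice.

```python
from math import comb

# Each entry transcribes the corresponding formula from parts (a)-(i).
counts = {
    "Royal flush": 4,
    "Straight flush (not Royal)": 4 * 10 - 4,
    "Four of a kind": 13 * comb(4, 4) * 48,
    "Full house": 13 * comb(4, 3) * 12 * comb(4, 2),
    "Flush (not straight flush)": 4 * comb(13, 5) - 40,
    "Straight (not straight flush)": 10 * 4**5 - 40,
    "Three of a kind": 13 * comb(4, 3) * comb(12, 2) * 4**2,
    "Two pairs": comb(13, 2) * comb(4, 2) ** 2 * 44,
    "One pair": 13 * comb(4, 2) * comb(12, 3) * 4**3,
    "None of the above": (comb(13, 5) - 10) * (4**5 - 4),
}

total = comb(52, 5)
for name, count in counts.items():
    print(f"{name:30s} {count:>9,d}  {100 * count / total:.2g}%")

# The categories are mutually exclusive, so they must add up to all hands.
print(sum(counts.values()) == total)
```

The final line prints True only if the ten categories account for every one of the 2,598,960 hands exactly once, which is the check made in the text.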


1.11. Rolling two dice 


(a) Table 1.5 lists all $6^2$ = 36 ordered outcomes of two rolls. Since we aren’t
concerned with the order here, we are interested only in the upper-right, or the
lower-left, triangle of the square (with non-repeated numbers), along with the
diagonal (with repeated numbers). The upper-right, or the lower-left, triangle
has $\binom{6}{2}$ = 15 entries. And the diagonal has six entries. So the total number of
unordered outcomes is 15 + 6 = 21.

Alternatively, if we ignore the duplicate lower-left triangle, there are six entries 
in the top row, five in the second, four in the third, etc. So the total number of 
unordered outcomes is the sum 6 + 5 + 4 + 3 + 2+ 1 = 21. 

(b) This setup is the N = 6 and n = 2 case of Eq. (1.16), because there are N = 6
possible results for each of the n = 2 rolls. So Eq. (1.16) gives the total number
of unordered outcomes of two rolls as

\[ {}_6U_2 = \binom{2 + (6 - 1)}{6 - 1} = \binom{7}{5} = 21. \tag{1.48} \]


(c) If we generalize Table 1.5 to an N by N square, then the upper-right, or the
lower-left, triangle has $\binom{N}{2}$ = N(N - 1)/2 entries. And the diagonal has N
entries. So the total number of unordered outcomes is N(N - 1)/2 + N =
N(N + 1)/2.

Alternatively, as in part (a), if we ignore the duplicate lower-left triangle, there
are N entries in the top row, N - 1 in the second, N - 2 in the third, etc. So the
total number of unordered outcomes is the sum of 1 through N, which equals
N(N + 1)/2 from Problem 1.5.

This result agrees with Eq. (1.16) when n = 2 (with general N), because that
equation gives

\[ {}_N U_2 = \binom{2 + (N - 1)}{N - 1} = \binom{N + 1}{N - 1} = \frac{(N + 1)N}{2}. \tag{1.49} \]
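
The counts in this problem are small enough to confirm by listing the outcomes. The sketch below uses Python's `itertools.combinations_with_replacement` to generate the unordered outcomes; the function names are our own, and `count_formula` is just Eq. (1.16) written with `math.comb`.

```python
from itertools import combinations_with_replacement
from math import comb

def unordered_outcomes(N, n):
    """List the unordered outcomes of n trials, each with N possible results."""
    return list(combinations_with_replacement(range(1, N + 1), n))

def count_formula(N, n):
    """Eq. (1.16): the number of unordered outcomes."""
    return comb(n + N - 1, N - 1)

two_dice = unordered_outcomes(6, 2)
print(len(two_dice), count_formula(6, 2))

# Part (c): for two rolls of an N-sided die the count is N(N + 1)/2.
for N in range(1, 9):
    assert len(unordered_outcomes(N, 2)) == N * (N + 1) // 2 == count_formula(N, 2)
```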

1.12. Unordered coins 

This is the N = 2 case (with arbitrary n) of Eq. (1.16), because there are N = 2 
possible results for each of the n coin flips. So Eq. (1.16) gives the total number of 
unordered outcomes for n flips as 


\[ {}_2U_n = \binom{n + 1}{1} = n + 1. \tag{1.50} \]



To see why this result makes sense, consider the concrete case with, say, n = 5 flips. 
The possible outcomes are (if we arbitrarily list the Heads first in each string, since 
the order doesn’t matter): 

HHHHH HHHHT HHHTT HHTTT HTTTT TTTTT 


If we label each of these outcomes by the number of Tails, then we can write them as 
0, 1, 2, 3, 4, and 5. There are six possibilities here. More generally, if we have n flips, 
the number of Tails can range from 0 to n. There are n + 1 possibilities here, so this is 
the number of unordered outcomes. 

1.13. Proof without stars and bars 


(a) With the notation ${}_N U_n$ for the result in Eq. (1.16), our goal is to determine ${}_3U_n$.
Let the N = 3 individual results be labeled A, B, and C. We can categorize the
unordered outcomes of the n trials according to the number of A’s that appear. 
Let’s do this for the concrete case of n = 4, to get a feel for what’s going on. 
We’ll then consider general n. We’ll need to list all of the unordered outcomes 
here, as opposed to just one of each general type (as we did in the examples at 
the beginning of Section 1.7). The possible unordered outcomes are shown in 
Table 1.21. 


    BBBB   ABBB   AABB   AAAB   AAAA
    BBBC   ABBC   AABC   AAAC
    BBCC   ABCC   AACC
    BCCC   ACCC
    CCCC

Table 1.21: Unordered lists of n = 4 letters chosen from N = 3 letters, with 
replacement. The lists are grouped in columns according to how many A’s 
appear. 


This table is consistent with the results in the first example in Section 1.7, where 
we found that there are three sets of the AAAA type, six of the AAAB type, 
three of the AABB type, and three of the AABC type. 

Look at each column in the table. The first column has no A’s, so we're forming
sets of n = 4 letters from the N = 2 other letters, B and C. The first column
therefore has ${}_2U_4$ entries, which we see equals 5 (consistent with Problem 1.12).
The second column has one A, so we’re forming sets of 4 - 1 = 3 letters
from the N = 2 letters B and C. The second column therefore has ${}_2U_3$ entries,
which we see equals 4. Similarly, the third column has three entries, the fourth
has two, and the fifth has one.

Note that even if we don't know what all the various ${}_2U_n$ values are, the
reasoning in the preceding paragraph still tells us that if we group the sets according
to the number of A’s that appear, we can write down the relation,

\[ {}_3U_4 = {}_2U_4 + {}_2U_3 + {}_2U_2 + {}_2U_1 + {}_2U_0. \tag{1.51} \]

If we then invoke the ${}_2U_n$ = n + 1 result from Problem 1.12, the righthand side
of Eq. (1.51) equals 5 + 4 + 3 + 2 + 1 = 15. This agrees with the result in
Eq. (1.16) for n = 4 and N = 3.

Now consider a general value of n instead of the specific n = 4 value we used
above (but still with N = 3). The list of unordered outcomes has the same
general form as in Table 1.21, except that there are now n + 1 columns instead
of five. In the first column (with no A’s), we’re forming sets of n letters from the
N = 2 other letters, B and C. In the second column (with one A), we’re forming 
sets of n — 1 letters from the N = 2 letters B and C. And so on, until the last 
column has one set with n A’s. For example, the possible outcomes for n = 6 
are shown in Table 1.22. 


    BBBBBB   ABBBBB   AABBBB   AAABBB   AAAABB   AAAAAB   AAAAAA
    BBBBBC   ABBBBC   AABBBC   AAABBC   AAAABC   AAAAAC
    BBBBCC   ABBBCC   AABBCC   AAABCC   AAAACC
    BBBCCC   ABBCCC   AABCCC   AAACCC
    BBCCCC   ABCCCC   AACCCC
    BCCCCC   ACCCCC
    CCCCCC

Table 1.22: Unordered lists with n = 6 and N = 3. 


The same reasoning that led to Eq. (1.51) carries through here, and we end up
with

\[ {}_3U_n = {}_2U_n + {}_2U_{n-1} + {}_2U_{n-2} + \cdots + {}_2U_1 + {}_2U_0. \tag{1.52} \]

If we then invoke the ${}_2U_n$ = n + 1 result from Problem 1.12, we obtain

\[ {}_3U_n = (n + 1) + n + (n - 1) + \cdots + 2 + 1 = \frac{(n + 1)(n + 2)}{2}, \tag{1.53} \]

in agreement with the $\binom{n+2}{2}$ result in Eq. (1.16) for N = 3. We have used the
result from Problem 1.5 that the sum of the first k integers equals k(k + 1)/2,
with k = n + 1 here.

(b) In the case of general N (and n), we can again group the sets of letters according
to how many times a given individual letter (call it A) appears. If A doesn’t
appear, then we’re forming sets of n letters from the N - 1 other letters, B, C,
.... If A appears once, then we're forming sets of n - 1 letters from the N - 1
other letters. If A appears twice, then we're forming sets of n - 2 letters from the
N - 1 other letters. And so on, until A appears all n times, and we’re forming
sets of zero letters from the N - 1 other letters. (There’s only one way to do that;
simply don’t pick any letters.) If we add up all of these possibilities, we obtain

\[ {}_N U_n = {}_{N-1}U_n + {}_{N-1}U_{n-1} + {}_{N-1}U_{n-2} + \cdots + {}_{N-1}U_1 + {}_{N-1}U_0. \tag{1.54} \]

If we then invoke the inductive hypothesis that ${}_{N-1}U_n$ equals $\binom{n+N-2}{N-2}$ for any
n, from Eq. (1.16), we can rewrite Eq. (1.54) as

\[ {}_N U_n = \binom{n+N-2}{N-2} + \binom{n+N-3}{N-2} + \binom{n+N-4}{N-2} + \cdots + \binom{N-1}{N-2} + \binom{N-2}{N-2}. \tag{1.55} \]

But this sum takes exactly the same form as the sum in the Eq. (1.29) result in
Problem 1.17, which we'll copy here:

\[ \binom{n}{k} = \binom{n-1}{k-1} + \binom{n-2}{k-1} + \binom{n-3}{k-1} + \cdots + \binom{k}{k-1} + \binom{k-1}{k-1}. \tag{1.56} \]

When applying this equation, all we have to do is observe that the two entries
in the binomial coefficient on the lefthand side are each 1 more than the
corresponding entries in the first binomial coefficient on the righthand side. Applying
this result to Eq. (1.55) yields

\[ {}_N U_n = \binom{n+N-1}{N-1}, \tag{1.57} \]



in agreement with Eq. (1.16), as desired. 

Note that if N = 1 (with arbitrary n), then Eq. (1.16) gives ${}_1U_n = \binom{n}{0}$ = 1. This
is correct, because if there is only N = 1 possible outcome (call it A) for each
of the n trials, then there is only one possible combined outcome for all n trials,
namely AAAA....

We have therefore shown two things: (1) Eq. (1.16) holds for N = 1 (and all n),
and (2) if Eq. (1.16) holds for N - 1 (and all n) then it also holds for N (and
all n). It therefore follows inductively that Eq. (1.16) holds for all N (and n), as
desired. 
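
The inductive argument can also be tested numerically. The following Python sketch builds the counts from the grouping relation in Eq. (1.54) and the N = 1 base case alone, and then compares them with the closed form in Eq. (1.16). The function name `U` and the tested ranges are our own choices.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def U(N, n):
    """Number of unordered outcomes, using only the grouping argument."""
    if N == 1:
        return 1  # the single outcome AAAA... (or the empty set if n = 0)
    # Group by the number j of A's that appear; the remaining n - j letters
    # are chosen from the other N - 1 letters, as in Eq. (1.54).
    return sum(U(N - 1, n - j) for j in range(n + 1))

# Compare with the closed form of Eq. (1.16) over a range of N and n.
for N in range(1, 8):
    for n in range(0, 8):
        assert U(N, n) == comb(n + N - 1, N - 1)

print(U(3, 4), U(3, 6))  # the sizes of Tables 1.21 and 1.22
```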


1.14. Yahtzee 

(a) As mentioned near the beginning of Section 1.7, a roll of five dice is equivalent 
to drawing n = 5 balls in succession from a box (with replacement, and with 
the order not mattering), with the balls being labeled with the N = 6 numbers 1 
through 6. So Eq. (1.16) does indeed apply. With n = 5 and N = 6, we obtain
$\binom{5+6-1}{6-1} = \binom{10}{5}$ = 252 possible rolls.

(b) There are seven different basic types of unordered rolls (sets): 

1. All five numbers are the same, for example 11111: There are six sets of 
this type, because the common number can be 1, 2, 3, 4, 5, or 6. 

2. Four of one number and one of another, for example 11112: (Remember 
that the order doesn’t matter, so 11112, 11121, etc. are all equivalent.) 
There are 6 • 5 = 30 sets of this type, because there are six choices for the 
number that appears four times, and then for each of these choices there 
are five choices for the number that appears once. 

3. Three of one number and two of another, for example 11122: There are 
again 6 • 5 = 30 sets of this type, because there are six choices for the 
number that appears three times, and then five choices for the number that 
appears twice. 

4. Three of one number, one of a second, and one of a third, for example
11123: There are 6 • 10 = 60 sets of this type, because there are six choices
for the number that appears three times, and then $\binom{5}{2}$ = 10 ways to choose
the other two numbers from the remaining five.

5. Two of one number, two of a second, and one of a third, for example 11223:
There are again 6 • 10 = 60 sets of this type, because there are six choices
for the number that appears once, and then $\binom{5}{2}$ = 10 ways to choose the
other two (repeated) numbers from the remaining five.

6. Two of one number and one each of three other numbers, for example
11234: There are 6 • 10 = 60 sets of this type, because there are six choices
for the number that appears twice, and then $\binom{5}{3}$ = 10 ways to choose the
other three numbers from the remaining five.

7. One each of five numbers, for example 12345: There are six sets of this 
type, because there are six ways to choose the number that doesn’t appear. 

Let’s summarize the above results for the numbers of each of the different types 
of unordered sets: 


    11111   11112   11122   11123   11223   11234   12345
        6      30      30      60      60      60       6



The total number of (unordered) 5-dice Yahtzee rolls is therefore 

6 + 30 + 30 + 60 + 60 + 60 + 6 = 252, (1.58) 

in agreement with the result in part (a). 

(c) We’ll now determine the number of ordered sets associated with each of the 
above seven types of unordered sets. 

1. All five numbers are the same: For a given unordered set of this type, there 
is only one way to order the numbers, because they are all the same. So 
the total number of ordered sets associated with the six unordered sets of 
the 11111 type is simply 6 • 1 = 6.

2. Four of one number and one of another: For a given unordered set of this 
type, there are five ways to order the numbers, because there are five places 
to put the single number. So the total number of ordered sets associated 
with the 30 unordered sets of the 11112 type is 30 • 5 = 150. 

3. Three of one number and two of another: For a given unordered set of this
type, there are $\binom{5}{2}$ = 10 ways to order the numbers, because there are $\binom{5}{2}$
places to put the two common numbers. So the total number of ordered sets
associated with the 30 unordered sets of the 11122 type is 30 • 10 = 300.

4. Three of one number, one of a second, and one of a third: For a given 
unordered set of this type, there are 20 ways to order the numbers, because 
there are five places to put one of the single numbers, and then four places 
to put the other. So the total number of ordered sets associated with the 60 
unordered sets of the 11123 type is 60 • 20 = 1200. 

5. Two of one number, two of a second, and one of a third: For a given
unordered set of this type, there are 30 ways to order the numbers, because
there are five places to put the single number, and then $\binom{4}{2}$ = 6 ways to
place one of the pairs. So the total number of ordered sets associated with
the 60 unordered sets of the 11223 type is 60 • 30 = 1800.

6. Two of one number and one each of three other numbers: For a given 
unordered set of this type, there are 60 ways to order the numbers, because 
there are five places to put one of the single numbers, four places to put the 
second, and three places to put the third. So the total number of ordered 
sets associated with the 60 unordered sets of the 11234 type is 60 • 60 = 
3600. 

7. One each of five numbers: For a given unordered set of this type, there are 
120 ways to order the numbers, because there are 5! permutations of the 
five numbers. So the total number of ordered sets associated with the 6 
unordered sets of the 12345 type is 6 • 120 = 720. 

These results are summarized in Table 1.23. The entries in the “Unordered”
row are the results from part (b) for the number of unordered sets of each type.
Each entry in the “Ordered” row is the number of ordered sets for each of the
unordered sets. For example, there are 5 ordered sets for each of the 30
unordered sets of the 11112 type. Each entry in the “Total” row is the total number
of ordered sets of a certain type; this is the product of the entries in the
“Unordered” and “Ordered” rows. The complete total number of ordered sets (rolls)
involving n = 5 dice, each with N = 6 sides, is therefore

6 + 150 + 300 + 1200 + 1800 + 3600 + 720 = 7776,    (1.59)

which equals $6^5$, as desired.


    Type        11111   11112   11122   11123   11223   11234   12345
    Unordered       6      30      30      60      60      60       6
    Ordered         1       5      10      20      30      60     120
    Total           6     150     300    1200    1800    3600     720

Table 1.23: Verifying that the total number of ordered rolls of five dice is $6^5$ =
7776.
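
The entries in Table 1.23 can be reproduced by listing all 252 unordered rolls and sorting them by type. In the Python sketch below (names are our own), a roll's type is represented by its sorted pattern of multiplicities, so that (3, 1, 1) stands for the 11123 type. The number of orderings of a given roll is taken to be the multinomial coefficient 5!/(m1! m2! ...), where the m's are the multiplicities.

```python
from collections import Counter
from itertools import combinations_with_replacement
from math import factorial

def orderings(roll):
    """Number of distinct orderings of a multiset of dice values."""
    result = factorial(len(roll))
    for multiplicity in Counter(roll).values():
        result //= factorial(multiplicity)
    return result

unordered = Counter()  # type -> number of unordered sets of that type
ordered = Counter()    # type -> total number of ordered rolls of that type
for roll in combinations_with_replacement(range(1, 7), 5):
    # Sorted multiplicities identify the type, e.g. (3, 1, 1) for 11123.
    pattern = tuple(sorted(Counter(roll).values(), reverse=True))
    unordered[pattern] += 1
    ordered[pattern] += orderings(roll)

for pattern in sorted(unordered, reverse=True):
    print(pattern, unordered[pattern], ordered[pattern])
print(sum(unordered.values()), sum(ordered.values()))
```

The last line should reproduce the two totals in Eqs. (1.58) and (1.59).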


1.15. Pascal sum 1 

Using $\binom{n}{k} = n!/k!(n-k)!$, the righthand side of Eq. (1.28) can be written as

\[ \binom{n-1}{k-1} + \binom{n-1}{k} = \frac{(n-1)!}{(k-1)!\,(n-k)!} + \frac{(n-1)!}{k!\,(n-k-1)!}. \tag{1.60} \]

Let's get a common denominator in these fractions, so that we can add them. The
common denominator is $k!(n-k)!$, so multiplying the first fraction by k/k and the
second by (n - k)/(n - k) gives

\[ \binom{n-1}{k-1} + \binom{n-1}{k} = \frac{k(n-1)!}{k!\,(n-k)!} + \frac{(n-k)(n-1)!}{k!\,(n-k)!}. \tag{1.61} \]

If we cancel the $\pm k(n-1)!$ terms in the numerators, we obtain

\[ \binom{n-1}{k-1} + \binom{n-1}{k} = \frac{n(n-1)!}{k!\,(n-k)!} = \frac{n!}{k!\,(n-k)!} = \binom{n}{k}, \tag{1.62} \]

as desired. 

1.16. Pascal sum 2 

(a) The binomial coefficients give the number of ways of obtaining k Heads in
n coin flips. So to demonstrate the given relation, we want to show that the
number of ways of obtaining k Heads in n coin flips equals the number of ways
of obtaining k - 1 Heads in n - 1 coin flips, plus the number of ways of obtaining
k Heads in n - 1 coin flips. This is true due to the following reasoning.

If we single out the first coin flip, we see that there are two basic ways to obtain
k Heads: either we obtain a Heads on the first flip, or we don’t. How many
possibilities are there for each of these two ways? If the first flip is a Heads,
then the other k - 1 Heads must come from the remaining n - 1 flips. There are
$\binom{n-1}{k-1}$ ways for this to happen. If the first flip isn’t a Heads, then all k Heads
must come from the remaining n - 1 flips. There are $\binom{n-1}{k}$ ways for this to
happen. Since each of the $\binom{n}{k}$ total number of ways of obtaining k Heads falls
into one or the other of these two categories, we therefore arrive at Eq. (1.22).

(b) The binomial coefficients are the coefficients of the terms in the binomial
expansion of $(a + b)^n$. So to demonstrate the given equation, we want to show that
the coefficient of the term involving $b^k$ in $(a + b)^n$ equals the coefficient of the
term involving $b^{k-1}$ in $(a + b)^{n-1}$, plus the coefficient of the term involving $b^k$
in $(a + b)^{n-1}$. This is true due to the following reasoning.

Let’s write $(a + b)^n$ in the form of $(a + b) \cdot (a + b)^{n-1}$, and imagine multiplying
out the $(a + b)^{n-1}$ part. The result contains many terms, but the two relevant
ones are $\binom{n-1}{k-1} a^{n-k} b^{k-1}$ and $\binom{n-1}{k} a^{n-k-1} b^k$. So we have

\[ (a + b)^n = (a + b) \cdot \left( \cdots + \binom{n-1}{k-1} a^{n-k} b^{k-1} + \binom{n-1}{k} a^{n-k-1} b^k + \cdots \right). \tag{1.63} \]

There are two ways to obtain a $b^k$ term on the righthand side. Either the b
in the first factor gets multiplied by the $\binom{n-1}{k-1} a^{n-k} b^{k-1}$ term in the second
factor, or the a in the first factor gets multiplied by the $\binom{n-1}{k} a^{n-k-1} b^k$ term in
the second factor. The net coefficient of the $b^k$ term on the righthand side is
therefore $\binom{n-1}{k-1} + \binom{n-1}{k}$. But the coefficient of the $b^k$ term on the lefthand side
is $\binom{n}{k}$, so we have demonstrated Eq. (1.22).
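
The argument in part (b) translates directly into a short computation. The Python sketch below (the function name is our own) builds the coefficients of $(a + b)^n$ by repeatedly multiplying by $(a + b)$, without using any factorial formula. Each multiplication adds a shifted copy of the coefficient list to itself, which is exactly the sum of two neighboring entries described above.

```python
from math import comb

def expand_binomial(n):
    """Coefficients of (a + b)^n, indexed by the power of b."""
    coeffs = [1]  # (a + b)^0 = 1
    for _ in range(n):
        # Multiplying by a keeps the power of b; multiplying by b raises it by 1.
        times_a = coeffs + [0]
        times_b = [0] + coeffs
        coeffs = [x + y for x, y in zip(times_a, times_b)]
    return coeffs

row = expand_binomial(6)
print(row)

# The coefficients agree with the binomial coefficients from the factorial formula.
for n in range(0, 12):
    assert expand_binomial(n) == [comb(n, k) for k in range(n + 1)]
```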

1.17. Pascal diagonal sum 

(a) The $\binom{4}{2}$ comes from the fact that once we’ve chosen the first letter to be A, there
are $\binom{4}{2}$ = 6 ways to pick the other two letters from B, C, D, E. This yields
the first column in the table. Likewise, the second column has $\binom{3}{2}$ = 3 triplets
starting with B and involving two letters from C, D, E. (We've already listed all
the groups with A.) And the third column has $\binom{2}{2}$ = 1 triplet starting with C and
involving the two letters D, E. (We’ve already listed all the groups with A and
B.)

(b) Consider an arbitrary number in Pascal’s triangle, such as the one represented
by the circled dot in the first triangle in Fig. 1.8. The number happens to be
$\binom{5}{2}$, but the actual value isn’t important here. By Eq. (1.22) this number equals
the sum of the two numbers above it, as shown in the second triangle. At every
stage from here on, we will replace the righthand of the two numbers (that were
just circled) with the two numbers above it. This doesn’t affect the sum, due to
Eq. (1.22). The number that just got replaced will be shown with a dotted circle.
The end result is the four circled numbers in the fifth triangle in the figure; this
is the desired diagonal string of numbers. Since the sum is unaffected by the
replacements at each stage, the sum of the numbers in the diagonal string equals
the original number in the first triangle. In this specific case, we have shown
that $\binom{5}{2} = \binom{4}{1} + \binom{3}{1} + \binom{2}{1} + \binom{1}{1}$. But the result holds for any starting point.


[Figure 1.8 image: five copies of Pascal's triangle drawn as dots, with circled dots marking the successive replacements described above.]

Figure 1.8: Illustration of $\binom{5}{2} = \binom{4}{1} + \binom{3}{1} + \binom{2}{1} + \binom{1}{1}$. Each number (dot) can
be replaced with the two numbers above it.


Remark: We can give another proof of Eq. (1.29) by generalizing what we
observed about Table 1.19 in part (a). Let’s imagine picking a committee of k
people from n people, and let’s label the people as 1, 2, 3, etc. When we list out
the $\binom{n}{k}$ possible committees, we can arrange them in groups according to what
the lowest number in the committee is. For example, some committees have a
1; other committees don’t have a 1 but have a 2; other committees don’t have
a 1 or a 2 but have a 3; and so on. How many committees are there of each of
these types?

If the lowest number is a 1, then the other k - 1 people on the committee must
be chosen from the n - 1 people who are 2 or higher. There are $\binom{n-1}{k-1}$ ways to do
this. Similarly, if the lowest number is a 2, then the other k - 1 people must be
chosen from the n - 2 people who are 3 or higher. There are $\binom{n-2}{k-1}$ ways to do
this. Likewise, if the lowest number is a 3, then the other k - 1 people must be
chosen from the n - 3 people who are 4 or higher. There are $\binom{n-3}{k-1}$ ways to do
this. This method of counting continues until we reach the stage where there are
only k - 1 numbers higher than the lowest one (which occurs when the lowest
number equals n - (k - 1)), in which case there is just $\binom{k-1}{k-1}$ = 1 way to choose
the other k - 1 people. Since the total number of possible committees is $\binom{n}{k}$, we
therefore arrive at Eq. (1.29). *
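
The committee argument in the remark can be carried out explicitly for small n and k. The Python sketch below (the function name is our own) lists every committee, groups the committees by their lowest-numbered member, and compares each group's size with the corresponding binomial coefficient in Eq. (1.29).

```python
from itertools import combinations
from math import comb

def committees_by_lowest(n, k):
    """Count the k-person committees from people 1..n, grouped by lowest label."""
    counts = {}
    for committee in combinations(range(1, n + 1), k):
        lowest = committee[0]  # combinations() yields each tuple in sorted order
        counts[lowest] = counts.get(lowest, 0) + 1
    return counts

n, k = 5, 2
groups = committees_by_lowest(n, k)
print(groups)

# If the lowest label is m, the other k - 1 members come from the n - m people above m.
for m, count in groups.items():
    assert count == comb(n - m, k - 1)
assert sum(groups.values()) == comb(n, k)
```

For n = 5 and k = 2 the group sizes are 4, 3, 2, and 1, which are the four terms in the sum illustrated in Fig. 1.8.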


Chapter 2 

Probability 


Having learned in Chapter 1 how to count things, we can now talk about probability. 
We will find that in many situations it is a trivial matter to generate probabilities 
from our counting results. So we will be justly rewarded for the time and effort we 
spent in Chapter 1. 

The outline of this chapter is as follows. In Section 2.1 we give the definition 
of probability. Although this definition is fairly easy to apply in most cases, there 
are a number of subtleties that come up. These are discussed in Appendix A. In 
Section 2.2 we present the various rules of probability. We show how these can 
be applied in a few simple examples, and then we work through a number of more 
substantial examples in Section 2.3. In Section 2.4 we present four classic
probability problems that many people find counterintuitive. Section 2.5 is devoted to
Bayes’ theorem, which is a relation between certain conditional probabilities.
Finally, in Section 2.6 we discuss Stirling’s formula, which gives an approximation to
the ubiquitous factorial, n!.


2.1 Definition of probability 

Probability gives a measure of how likely it is for something to happen. It can be 
defined as follows: 

Definition of probability: Consider a very large number of identical trials 
of a certain process; for example, flipping a coin, rolling a die, picking a ball 
from a box (with replacement), etc. If the probability of a particular event 
occurring (for example, getting a Heads, rolling a 5, or picking a blue ball) is 
p, then the event will occur in a fraction p of the trials, on average. 

Some examples are: 

• The probability of getting a Heads on a coin flip is 1/2 (or equivalently 50%). 
This is true because the probabilities of getting a Heads or a Tails are equal, 
which means that these two outcomes must each occur half of the time, on 
average. 

• The probability of rolling a 5 on a standard 6-sided die is 1/6. This is true 
because the probabilities of rolling a 1, 2, 3, 4, 5, or 6 are all equal, which 
means that these six outcomes must each happen one sixth of the time, on 
average. 

• If there are three red balls and seven blue balls in a box, then the probabilities 
of picking a red ball or a blue ball are, respectively, 3/10 and 7/10. This 
follows from the fact that the probabilities of picking each of the ten balls are 
all equal (or at least let’s assume they are), which means that each ball will be 
picked one tenth of the time, on average. Since there are three red balls, a red 
ball will therefore be picked 3/10 of the time, on average. And since there 
are seven blue balls, a blue ball will be picked 7/10 of the time, on average. 

Note the inclusion of the words “on average” in the above definition and examples. 
We’ll discuss this in detail in the subsection below. 

Many probabilistic situations have the property that they involve a number of 
different possible outcomes, all of which are equally likely. For example, Heads
and Tails on a coin are equally likely to be tossed, the numbers 1 through 6 on a die 
are equally likely to be rolled, and the ten balls in the above box are all equally likely 
to be picked. In such a situation, the probability of a certain scenario happening is 
given by 


\[ p = \frac{\text{number of desired outcomes}}{\text{total number of possible outcomes}} \qquad \text{(for equally likely outcomes)} \tag{2.1} \]


Calculating a probability then simply reduces to a matter of counting the number 
of desired outcomes, along with the total number of outcomes. For example, the 
probability of rolling an even number on a die is 1/2, because there are three desired 
outcomes (2, 4, and 6) and six total possible outcomes (the six numbers). And the 
probability of picking a red ball in the above example is 3/10, as we already noted, 
because there are three desired outcomes (picking any of the three red balls) and 
ten total possible outcomes (the ten balls). These two examples involved trivial 
counting, but we’ll encounter many examples where it is more involved. This is 
why we did all of that counting in Chapter 1! 
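
To make Eq. (2.1) concrete, here is a minimal Python sketch of the counting recipe, applied to the two examples just mentioned. The helper name `probability` is our own, and the function is meaningful only under the stated assumption that the listed outcomes are equally likely.

```python
from fractions import Fraction

def probability(outcomes, is_desired):
    """Eq. (2.1): desired outcomes over total outcomes (equally likely outcomes)."""
    outcomes = list(outcomes)
    desired = sum(1 for outcome in outcomes if is_desired(outcome))
    return Fraction(desired, len(outcomes))

die = range(1, 7)                    # the six faces of a standard die
box = ["red"] * 3 + ["blue"] * 7     # three red balls and seven blue balls

p_even = probability(die, lambda face: face % 2 == 0)
p_red = probability(box, lambda color: color == "red")
print(p_even, p_red)
```

Using `Fraction` keeps the results exact, so the output can be compared directly with the 1/2 and 3/10 values found above.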

It should be stressed that Eq. (2.1) holds only under the assumption that all of 
the possible outcomes are equally likely. But this usually isn’t much of a restriction, 
because this assumption will generally be valid in the setups we’ll be dealing with in 
this book. In particular, it holds in setups dealing with permutations and subgroups, 
both of which we studied in detail in Chapter 1. Our ability to count these sorts of 
things will allow us to easily calculate probabilities via Eq. (2.1). Many examples 
are given in Section 2.3 below. 

There are three words that people often use interchangeably: “probability,” 
“chance,” and “odds.” The first two of these mean the same thing. That is, the 
statement, “There is a 40% chance that the bus will be late,” is equivalent to the 
statement, “There is a 40% probability that the bus will be late.” However, the word 
“odds” has a different meaning; see Problem 2.1 for a discussion of this. 

The importance of the words “on average” 

The above definition of probability includes the words “on average.” These words 
are critical, because the definition wouldn’t make any sense if we omitted them and 
instead went with something like: “If the probability of a particular event occurring 
is p , then the event will occur in exactly a fraction p of the trials.” This can’t be a 
valid definition of probability, for the following reason. Consider the roll of one die, 
for which the probability of each number occurring is 1/6. This definition would 
imply that on one roll of a die, we will get 1/6 of a 1, and 1/6 of a 2, and so on. But 
this is nonsense; you can’t roll 1/6 of a 1. The number of times a 1 appears on one 
roll must of course be either zero or one. And in general for many rolls, the number 
must be an integer, 0, 1, 2, 3, .... 

There is a second problem with this definition, in addition to the problem of 
non-integers. What if we roll a die six times? This definition would imply that we will 
get exactly (1/6) • 6 = 1 of each number. This prediction is a little better, in that 
at least the proposed numbers are integers. But it still can’t be correct, because if 
you actually do the experiment and roll a die six times, you will find that you are 
certainly not guaranteed to get each of the six numbers exactly once. This scenario 
might happen (we’ll calculate the probability in Section 2.3.4 below), but it is more 
likely that some numbers will appear more than once, while other numbers won’t 
appear at all. 

Basically, for a small number of trials (such as six), the fractions of the time that 
the various events occur will most likely not look much like the various 
probabilities. This is where the words “very large number” in our original definition come 
in. The point is that if you roll a die a huge number of times, then the fractions of 
the time that each of the six numbers appears will be approximately equal to 1/6. 
And the larger the number of rolls, the closer the fractions will generally be to 1/6. 

In Chapter 5 we’ll explain why the fractions are expected to get closer and closer 
to the actual probabilities, as the number of trials gets larger and larger. For now, just 
take it on faith that if you flip a coin 100 times, the probability of obtaining either 
49, 50, or 51 Heads isn’t so large. It happens to be about 24%, which tells you 
that there is a decent chance that the fraction of Heads will deviate moderately from 
1/2. However, if you flip a coin 100,000 times, the probability of obtaining Heads 
between 49% and 51% of the time is 99.999999975%, which tells you that there is 
virtually no chance that the fraction of Heads will deviate much from 1/2. If you 
increase the number of flips to 10^9 (a billion), this result is even more pronounced; 
the probability of obtaining Heads in the narrow range between 49.99% and 50.01% 
of the time is 99.999999975% (the same percentage as above). We’ll discuss such 
matters in detail in Section 5.2. For more commentary on the words “on average,” 
see the last section in Appendix A. 
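
The 24% figure quoted above for 100 flips can be checked with a short exact calculation. The sketch below is an illustration only; it assumes a fair coin, counts the equally likely Heads/Tails sequences with binomial coefficients, and uses a made-up function name `prob_heads_between`.

```python
from fractions import Fraction
from math import comb

def prob_heads_between(n_flips, low, high):
    """Exact probability of getting between low and high Heads (inclusive)
    in n_flips flips of a fair coin."""
    # comb(n, k) counts the sequences with exactly k Heads;
    # there are 2**n equally likely sequences in total.
    favorable = sum(comb(n_flips, k) for k in range(low, high + 1))
    return Fraction(favorable, 2 ** n_flips)

p_100 = prob_heads_between(100, 49, 51)
print(f"{float(p_100):.4f}")  # 0.2356
```

The same function works in principle for larger numbers of flips, although the sums become slow to evaluate exactly; the approximations developed in Chapter 5 are the practical route there.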


2.2 The rules of probability 

So far we’ve talked only about the probabilities of single events, for example, rolling 
an even number on a die, getting a Heads on a coin toss, or picking a blue ball 
from a box. We’ll now consider two (or more) events. Reasonable questions we 
can ask are: What is the probability that both of the events occur? What is the 
probability that either of the events occurs? The rules presented below will answer 
these questions. We’ll provide a few simple examples for each rule, and then we’ll 
work through some longer examples in Section 2.3. 


2.2.1 AND: The “intersection” probability, P(A and B) 

Let A and B be two events. For example, if we roll two dice, we can let A = {rolling 
a 2 on the left die} and B = {rolling a 5 on the right die}. Or we might have A = 
{picking a red ball from a box} and B = {picking a blue ball without replacement 
after the first pick}. What is the probability that A and B both occur? In answering 
this question, we must consider two cases: (1) A and B are independent events, or 
(2) A and B are dependent events. Let’s look at each of these in turn. In each case, 
the probability that A and B both occur is known as the joint probability. 


Independent events 

Two events are said to be independent if they don’t affect each other, or more 
precisely, if the occurrence of one doesn’t affect the probability that the other occurs. 
An example is the first setup mentioned above: rolling two dice, with A = {rolling 
a 2 on the left die} and B = {rolling a 5 on the right die}. The probability of 
obtaining a 5 on the right die is 1/6, independent of what happens with the left die. 
And similarly the probability of obtaining a 2 on the left die is 1/6, independent of 
what happens with the right die. Independence requires that neither event affects 
the other. The events in the second setup mentioned above with the balls in the box 
are not independent; we’ll talk about this below. 

Another example of independent events is picking one card from a deck, with 
A = {the card is a king} and B = {the (same) card is a heart}. The probability of 
the card being a heart is 1/4, independent of whether or not it is a king. And the 
probability of the card being a king is 1/13, independent of whether or not it is a 
heart. Note that it is possible to have two different events even if we have only one 
card. This card has two qualities (its suit and its value), and we can associate an 
event with each of these qualities. 

Remark: A note on terminology: The words “event” and “outcome” sometimes mean the 
same thing in practice, but there is technically a difference. An outcome is the result of an 
experiment. If we draw a card from a deck, then there are 52 possible outcomes; for example, 
the 4 of clubs, the jack of diamonds, etc. An event is a set of outcomes. For example, an event 
might be “drawing a heart.” This event contains 13 outcomes, namely the 13 cards that are 
hearts. A given card may belong to many events. For example, in addition to belonging to the 
A and B events in the preceding paragraph, the king of hearts belongs to the events C = {the 
card is red}, D = {the card’s value is higher than 8}, E = {the card is the king of hearts}, 
and so on. As indicated by the event E, an event may consist of a single outcome. An event 
may also be the empty set (which occurs with probability 0), or the entire set of all possible 
outcomes (which occurs with probability 1), which is known as the sample space. * 

The “And” rule for independent events is: 


• If events A and B are independent, then the probability that they both occur 
equals the product of their individual probabilities: 


$$P(A \text{ and } B) = P(A) \cdot P(B) \tag{2.2}$$


We can quickly apply this rule to the two examples mentioned above. The 
probability of rolling a 2 on the left die and a 5 on the right die is 

$$P(2 \text{ and } 5) = P(2) \cdot P(5) = \frac{1}{6} \cdot \frac{1}{6} = \frac{1}{36}. \tag{2.3}$$

This agrees with the fact that one out of the 36 pairs of (ordered) numbers in Table 
1.5 is “2,5.” Similarly, the probability that a card is both a king and a heart is 

$$P(\text{king and heart}) = P(\text{king}) \cdot P(\text{heart}) = \frac{1}{13} \cdot \frac{1}{4} = \frac{1}{52}. \tag{2.4}$$

This makes sense, because one of the 52 cards in a deck is the king of hearts. 
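
Both results can also be confirmed by brute-force counting. The following is a minimal sketch (the helper `prob` and the rank and suit labels are choices made here, not taken from the text) that lists all 36 ordered dice pairs and all 52 cards, and compares the counted joint probability with the product in Eq. (2.2).

```python
from fractions import Fraction
from itertools import product

def prob(outcomes, event):
    """Fraction of the equally likely outcomes for which event(outcome) is True."""
    outcomes = list(outcomes)
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

# Two dice: each outcome is an ordered pair (left, right).
dice = list(product(range(1, 7), repeat=2))
p_2 = prob(dice, lambda o: o[0] == 2)
p_5 = prob(dice, lambda o: o[1] == 5)
p_2_and_5 = prob(dice, lambda o: o[0] == 2 and o[1] == 5)

# One card: each outcome is a (rank, suit) pair.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = list(product(ranks, suits))
p_king = prob(deck, lambda c: c[0] == "K")
p_heart = prob(deck, lambda c: c[1] == "hearts")
p_king_and_heart = prob(deck, lambda c: c == ("K", "hearts"))

print(p_2_and_5, p_2 * p_5)                 # 1/36 1/36
print(p_king_and_heart, p_king * p_heart)   # 1/52 1/52
```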

The logic behind Eq. (2.2) is the following. Consider N trials of a given process, 
where N is very large. In the case of the two dice, a trial consists of rolling both 
dice. The outcome of such a trial takes the form of an ordered pair of numbers. The 
first number is the result of the left roll, and the second number is the result of the 
right roll. On average, the number of outcomes that have a 2 as the first number 
is (1/6) · N. 

Let’s now consider only this “2-first” group of outcomes and ignore the rest. 
Then on average, a fraction 1/6 of these outcomes have a 5 as the second number. 
This is where we are invoking the independence of the events. As far as the second 
roll is concerned, the set of (1/6) · N trials that have a 2 as the first roll is no different 
from any other set of (1/6) · N trials, so the probability of obtaining a 5 on the second 
roll is simply 1/6. Putting it all together, the average number of trials that have both 
a 2 as the first number and a 5 as the second number is 1/6 of (1/6) · N, which 
equals (1/6) · (1/6) · N. 

In the case of general probabilities P(A) and P(B), it is easy to see that the two 
(1/6)’s in the above result get replaced by P(A) and P(B). So the average number 
of outcomes where A and B both occur is P(A) · P(B) · N. And since we performed N 
trials, the fraction of outcomes where A and B both occur is P(A) · P(B), on average. 
From the definition of probability in Section 2.1, this fraction is the probability that 
A and B both occur, in agreement with Eq. (2.2). 
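
The “N trials” argument above can be imitated with a simulation. This is only a rough sketch under assumptions made here: Python’s pseudo-random generator stands in for real dice, the seed and the number of trials are arbitrary, and the function name `simulate_two_dice` is made up. The observed fractions should land near 1/6 and 1/36 but will not equal them exactly.

```python
import random

def simulate_two_dice(n_trials, seed=0):
    """Roll two dice n_trials times and return
    (fraction of trials with a 2 on the left die,
     fraction of trials with a 2 on the left die AND a 5 on the right die)."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for repeatability
    count_2_first = 0
    count_2_then_5 = 0
    for _ in range(n_trials):
        left, right = rng.randint(1, 6), rng.randint(1, 6)
        if left == 2:
            count_2_first += 1          # the "2-first" group
            if right == 5:
                count_2_then_5 += 1     # the part of that group with a 5 second
    return count_2_first / n_trials, count_2_then_5 / n_trials

frac_2, frac_2_and_5 = simulate_two_dice(200_000)
print(frac_2, 1 / 6)
print(frac_2_and_5, 1 / 36)
```

Increasing the number of trials should generally bring the two fractions closer to 1/6 and 1/36, in line with the “on average” language of the definition.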

If you want to think about the rule in Eq. (2.2) in terms of a picture, then consider 
Fig. 2.1. Without worrying about specifics, let’s assume that different points within 
the overall square represent different outcomes. And let’s assume that they’re all 
equally likely, which means that the area of a region gives the probability that an 
outcome located in that region occurs (assuming that the area of the whole region is 
1). The figure corresponds to P(A) = 0.2 and P(B) = 0.4. Outcomes to the left of 
the vertical line are ones where A occurs, and outcomes to the right of the vertical 
line are ones where A doesn’t occur. Likewise for B and outcomes above and below 
the horizontal line. 

Figure 2.1: A probability square for independent events. (Labels: A is the left 20% of the width and “not A” is the rest; B is the top 40% of the height and “not B” is the rest. The four regions are “A and B,” “B and not A,” “A and not B,” and “not A and not B.”) 


From the figure, we see that not only is 40% of the entire square above the 
horizontal line, but also that 40% of the left vertical strip (where A occurs) is above 
the horizontal line, and likewise for the right vertical strip (where A doesn’t occur). 
In other words, B occurs 40% of the time, independent of whether or not A occurs. 
Basically, B couldn’t care less what happens with A. Similar statements hold with 
A and B interchanged. So this type of figure, with a square divided by horizontal 
and vertical lines, does indeed represent independent events. 

The darkly shaded “A and B” region is the intersection of the region to the left 
of the vertical line (where A occurs) and the region above the horizontal line (where 
B occurs). Hence the word “intersection” in the title of this section. The area of 
the darkly shaded region is 20% of 40% (or 40% of 20%) of the total area, that is, 
(0.2)(0.4) = 0.08 of the total area. The total area corresponds to a probability of 1, 
so the darkly shaded region corresponds to a probability of 0.08. Since we obtained 
this probability by multiplying P(A) by P(B), we have therefore given a pictorial 
proof of Eq. (2.2). 

Dependent events 

Two events are said to be dependent if they do affect each other, or more precisely, if 
the occurrence of one does affect the probability that the other occurs. An example 
is picking two balls in succession from a box containing two red balls and three 
blue balls (see Fig. 2.2), with A = {choosing a red ball on the first pick} and B = 
{choosing a blue ball on the second pick, without replacement after the first pick}. 
If you pick a red ball first, then the probability of picking a blue ball second is 3/4, 
because there are three blue balls and one red ball left. On the other hand, if you 
don't pick a red ball first (that is, if you pick a blue ball first), then the probability of 
picking a blue ball second is 2/4, because there are two red balls and two blue balls 
left. So the occurrence of A certainly affects the probability of B. 

Another example might be something like: A = {it rains at 6:00} and B = {you 
walk to the store at 6:00}. People are generally less likely to go for a walk when 
it’s raining outside, so (at least for most people) the occurrence of A affects the 
probability of B. 

Figure 2.2: A box with two red balls and three blue balls. 


The “And” rule for dependent events is: 


• If events A and B are dependent, then the probability that they both occur 
equals 


$$P(A \text{ and } B) = P(A) \cdot P(B \mid A) \tag{2.5}$$


where P(B|A) stands for the probability that B occurs, given that A occurs. 
It is called a “conditional probability,” because we are assuming a given 
condition, namely that A occurs. It is read as “the probability of B, given A.” 


There is actually no need for the “dependent” qualifier in the first line of this rule, 
as we’ll see in the second remark near the end of this section. 

The logic behind Eq. (2.5) is the following. Consider N trials of a given process, 
where N is very large. In the above setup with the balls in a box, a “trial” consists 
of picking two balls in succession, without replacement. On average, the number 
of outcomes in which a red ball is drawn on the first pick is P(A) · N. Let’s 
now consider only these outcomes and ignore the rest. Then a fraction P(B|A) of 
these outcomes have a blue ball drawn second, by the definition of P(B|A). So 
the number of outcomes where A and B both occur is P(B|A) · P(A) · N. And 
since we performed N trials, the fraction of outcomes where A and B both occur is 
P(A) · P(B|A), on average. This fraction is the probability that A and B both occur, 
in agreement with the rule in Eq. (2.5). 

The reasoning in the previous paragraph is equivalent to the mathematical 
identity, 

$$\frac{n_{A \text{ and } B}}{N} = \frac{n_A}{N} \cdot \frac{n_{A \text{ and } B}}{n_A}, \tag{2.6}$$

where n_A is the number of trials where A occurs, etc. By definition, the lefthand 
side of this equation equals P(A and B), the first term on the righthand side equals 
P(A), and the second term on the righthand side equals P(B|A). So Eq. (2.6) is 
equivalent to the relation, 

$$P(A \text{ and } B) = P(A) \cdot P(B \mid A), \tag{2.7}$$

which is Eq. (2.5). In terms of the Venn-diagram type of picture in Fig. 2.3, Eq. (2.6) 
is the statement that the darkly shaded area (which represents P(A and B)) equals 
the area of the A region (which represents P(A)) multiplied by the fraction of the 
A region that is taken up by the darkly shaded region. This fraction is P(B|A), by 
definition. 

Figure 2.3: Venn diagram for probabilities of dependent events. 


As in Fig. 2.1, we’re assuming in Fig. 2.3 that different points within the 
overall boundary represent different outcomes, and that they’re all equally likely. This 
means that the area of a region gives the probability that an outcome located in 
that region occurs (assuming that the area of the whole region is 1). We’re using 
Fig. 2.3 for its qualitative features only, so we’re drawing the various regions as 
general blobs, as opposed to the specific rectangles in Fig. 2.1, which we used for a 
quantitative calculation. 

Because the “A and B” region in Fig. 2.3 is the intersection of the A and B 
regions, and because the intersection of two sets is usually denoted by A ∩ B, you 
will often see the P(A and B) probability written as P(A ∩ B). That is, 

$$P(A \cap B) = P(A \text{ and } B). \tag{2.8}$$


But we’ll stick with the P(A and B) notation in this book. 

There is nothing special about the order of A and B in Eq. (2.5). We could just 
as well interchange the letters and write P(B and A) = P(B) · P(A|B). However, we 
know that P(B and A) = P(A and B), because it doesn’t matter which event you say 
first when you say that two events both occur. So we can also write P(A and B) = 
P(B) · P(A|B). Combining this with Eq. (2.5), we see that we can write P(A and B) 
in two different ways: 

$$\begin{aligned} P(A \text{ and } B) &= P(A) \cdot P(B \mid A) \\ &= P(B) \cdot P(A \mid B). \end{aligned} \tag{2.9}$$


The fact that P(A and B) can be written in these two ways will be critical when we 
discuss Bayes’ theorem in Section 2.5. 


Example (Balls in a box): Let’s apply Eq. (2.5) to the setup with the balls in the box 
in Fig. 2.2 above. Let A = {choosing a red ball on the first pick} and B = {choosing a 
blue ball on the second pick, without replacement after the first pick}. For shorthand, 
we’ll denote these events by Red_1 and Blue_2, where the subscript refers to the first 
or second pick. We noted above that P(Blue_2|Red_1) = 3/4. And we also know that 
P(Red_1) is simply 2/5, because there are initially two red balls and three blue balls. 
So Eq. (2.5) gives the probability of picking a red ball first and a blue ball second 
(without replacement after the first pick) as 

$$P(\text{Red}_1 \text{ and } \text{Blue}_2) = P(\text{Red}_1) \cdot P(\text{Blue}_2 \mid \text{Red}_1) = \frac{2}{5} \cdot \frac{3}{4} = \frac{3}{10}. \tag{2.10}$$

We can verify that this is correct by listing out all of the possible pairs of balls that can 
be picked. If we label the balls as 1, 2, 3, 4, 5, and if we let 1, 2 be the red balls, and 
3, 4, 5 be the blue balls, then the possible outcomes are shown in Table 2.1. The first 
number stands for the first ball picked, and the second number stands for the second 
ball picked. 


                  Red first   |    Blue first
               ---------------+------------------
  Red second   |   —     21   |   31    41    51
               |   12    —    |   32    42    52
               ---------------+------------------
  Blue second  |   13    23   |   —     43    53
               |   14    24   |   34    —     54
               |   15    25   |   35    45    —

Table 2.1: Ways to pick two balls from the box in Fig. 2.2, without replacement. 

The “—” entries stand for the outcomes that aren't allowed; we can’t pick two of the 
same ball, because we’re not replacing the ball after the first pick. The dividing lines 
are drawn for clarity. The internal vertical line separates the outcomes where a red 
or blue ball is drawn on the first pick, and the internal horizontal line separates the 
outcomes where a red or blue ball is drawn on the second pick. The six pairs in the 
lower left corner are the outcomes where a red ball (numbered 1 and 2) is drawn first 
and a blue ball (numbered 3, 4, and 5) is drawn second. Since there are 20 possible 
outcomes in all, the desired probability is 6/20 = 3/10, in agreement with Eq. (2.10). 
Table 2.1 also gives a verification of the P(Red_1) and P(Blue_2|Red_1) probabilities we 
wrote down in Eq. (2.10). P(Red_1) equals 2/5 because eight of the 20 entries are to 
the left of the vertical line. And P(Blue_2|Red_1) equals 3/4 because six of these eight 
entries are below the horizontal line. 

The task of Problem 2.4 is to verify that the second expression in Eq. (2.9) also gives 
the correct result for P(Red_1 and Blue_2) in this setup. 
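
The listing in Table 2.1 can be reproduced mechanically. The sketch below is an illustration only; it uses the same labeling as the example (balls 1 and 2 red, balls 3, 4, 5 blue), while the variable names are choices made here. It generates the 20 ordered pairs and counts the ones relevant to Eq. (2.10).

```python
from fractions import Fraction
from itertools import permutations

RED = {1, 2}        # labels of the red balls, as in the example
BLUE = {3, 4, 5}    # labels of the blue balls

# All ordered pairs (first pick, second pick) without replacement: 5 * 4 = 20.
outcomes = list(permutations(sorted(RED | BLUE), 2))

red_first = [o for o in outcomes if o[0] in RED]          # left of the vertical line
red_then_blue = [o for o in red_first if o[1] in BLUE]    # lower left corner

p_red1 = Fraction(len(red_first), len(outcomes))
p_blue2_given_red1 = Fraction(len(red_then_blue), len(red_first))
p_red1_and_blue2 = Fraction(len(red_then_blue), len(outcomes))

print(p_red1, p_blue2_given_red1, p_red1_and_blue2)  # 2/5 3/4 3/10
```

`permutations(..., 2)` never pairs a ball with itself, which plays the role of the “—” entries in the table.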


We can think about the rule in Eq. (2.5) in terms of a picture analogous to 
Fig. 2.1. If we consider the above example with the red and blue balls, then the 
first thing we need to do is recast Table 2.1 in a form where equal areas yield equal 
probabilities. If we get rid of the “—” entries in Table 2.1, then all entries have 
equal probabilities, and we end up with Table 2.2. 


    12    21    31    41    51
    13    23    32    42    52
    14    24    34    43    53
    15    25    35    45    54

Table 2.2: Rewriting Table 2.1. 

In the spirit of Fig. 2.1, this table becomes the square shown in Fig. 2.4. The 
upper left region corresponds to red balls on both picks. The lower left region 
corresponds to a red ball and then a blue ball. The upper right region corresponds to 
a blue ball and then a red ball. And the lower right region corresponds to blue balls 
on both picks. This figure makes it clear why we formed the product (2/5) • (3/4) 
in Eq. (2.10). The 2/5 gives the fraction of the outcomes that lie to the left of the 
vertical line (these are the ones that have a red ball first), and the 3/4 gives the 
fraction of these outcomes that lie below the horizontal line (these are the ones that 
have a blue ball second). The product of these fractions gives the overall fraction 
(namely 3/10) of the outcomes that lie in the lower left region. 


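Figure 2.4: Pictorial representation of Table 2.2. (Labels: the left strip, R_1, spans 40% of the width, and its upper region “R_1 and R_2” spans 25% of the height; in the right strip, B_1, the upper region spans 50% of the height.) 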

The main difference between Fig. 2.4 and Fig. 2.1 is that the one horizontal 
line in Fig. 2.1 is now two different horizontal lines in Fig. 2.4. The heights of the 
horizontal lines in Fig. 2.4 depend on which vertical strip we’re dealing with. This 
is the visual manifestation of the fact that the red/blue probabilities on the second 
pick depend on what happens on the first pick. 

Remarks: 

1. The method of explicitly counting the possible outcomes in Table 2.1 shows that you 
don’t have to use the rule in Eq. (2.5), or similarly the rule in Eq. (2.2), to calculate 
probabilities. You can often instead just count up the various outcomes and solve the 
problem from scratch. However, the rules in Eqs. (2.2) and (2.5) allow you to take 
a shortcut that avoids listing out all the outcomes, which might be rather difficult if 
you’re dealing with large numbers. 

2. The rule in Eq. (2.2) for independent events is a special case of the rule in Eq. (2.5) 
for dependent events. This is true because if A and B are independent, then P(B|A) is 
simply equal to P(B), because the probability of B occurring is just P(B), independent 
of whether or not A occurs. Eq. (2.5) then reduces to Eq. (2.2) when P(B|A) = P(B). 
Therefore, there was technically no need to introduce Eq. (2.2) first. We could have 
started with Eq. (2.5), which covers all possible scenarios, and then showed that it 
reduces to Eq. (2.2) when the events are independent. But pedagogically, it is often 
better to start with a special case and then work up to the more general case. 

3. In the above “balls in a box” example, we encountered the conditional 
probability P(Blue_2|Red_1). We can also talk about the “reversed” conditional probability, 
P(Red_1|Blue_2). However, since the second pick happens after the first pick, you 
might wonder how much sense it makes to talk about the probability of the Red_1 
event, given the Blue_2 event. Does the second pick somehow influence the first pick, 
even though the second pick hasn’t happened yet? When you make the first pick, are 
you being affected by a mysterious influence that travels backward in time? 

No, and no. When we talk about P(Red_1|Blue_2), or about any other conditional 
probability in the example, everything we might want to know can be read off from 
Table 2.1. Once the table has been created, we can forget about the temporal 
order of the events. By looking at the Blue_2 pairs (below the horizontal line), we see 
that P(Red_1|Blue_2) = 6/12 = 1/2. This should be contrasted with P(Red_1|Red_2), 
which is obtained by looking at the Red_2 pairs (above the horizontal line); we find 
that P(Red_1|Red_2) = 2/8 = 1/4. Therefore, the probability that your first pick is red 
does depend on whether your second pick is blue or red. But this doesn’t mean that 
there is a backward influence in time. All it says is that if you perform a large number 
of trials of the given process (drawing two balls, without replacement), and if you look 
at all of the cases where your second pick is blue (or conversely, red), then you will 
find that your first pick is red in 1/2 (or conversely, 1/4) of these cases, on average. In 
short, the second pick has no causal influence on the first pick, but the after-the-fact 
knowledge of the second pick affects the probability of what the first pick was. 

4. A trivial yet extreme example of dependent events is the two events: A, and “not A.” 
The occurrence of A highly affects the probability of “not A” occurring. If A occurs, 
then “not A” occurs with probability zero. And if A doesn't occur, then “not A” occurs 
with probability 1. * 
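
The “reversed” conditional probabilities in the third remark can be read off from the same list of 20 outcomes. Here is a minimal sketch, again with balls 1 and 2 taken as red; the helper name `conditional` is made up for this illustration. It restricts attention to the outcomes where the given event holds and then counts within that restricted set, with no reference to the time order of the picks.

```python
from fractions import Fraction
from itertools import permutations

RED = {1, 2}                                     # balls 1, 2 are red; 3, 4, 5 are blue
outcomes = list(permutations(range(1, 6), 2))    # (first pick, second pick)

def conditional(event, given):
    """P(event | given): the fraction of the `given` outcomes in which `event` also holds."""
    restricted = [o for o in outcomes if given(o)]
    return Fraction(sum(1 for o in restricted if event(o)), len(restricted))

def red1(o):
    return o[0] in RED

def red2(o):
    return o[1] in RED

def blue2(o):
    return o[1] not in RED

p_red1_given_blue2 = conditional(red1, blue2)
p_red1_given_red2 = conditional(red1, red2)

print(p_red1_given_blue2, p_red1_given_red2)  # 1/2 1/4
```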

In the second remark above, we noted that if A and B are independent (that is, 
if the occurrence of one doesn’t affect the probability that the other occurs), then 
P(B|A) = P(B). Similarly, we also have P(A|B) = P(A). Let’s prove that one of 
these relations implies the other. Assume that P(B|A) = P(B). Then if we equate 
the two righthand sides of Eq. (2.9) and use P(B|A) = P(B) to replace P(B|A) with 
P(B), we obtain 

$$\begin{aligned} P(A) \cdot P(B \mid A) &= P(B) \cdot P(A \mid B) \\ \Longrightarrow \quad P(A) \cdot P(B) &= P(B) \cdot P(A \mid B) \\ \Longrightarrow \quad P(A) &= P(A \mid B). \end{aligned} \tag{2.11}$$


So P(B|A) = P(B) implies P(A|B) = P(A), as desired. In other words, if B is 
independent of A, then A is also independent of B. We can therefore talk about 
two events being independent, without worrying about the direction of the 
independence. The condition for independence is therefore either of the relations, 

$$P(B \mid A) = P(B) \quad \text{or} \quad P(A \mid B) = P(A) \qquad \text{(independence)} \tag{2.12}$$


Alternatively, the condition for independence may be expressed by Eq. (2.2), 


$$P(A \text{ and } B) = P(A) \cdot P(B) \qquad \text{(independence)} \tag{2.13}$$


because this equation implies (by comparing it with Eq. (2.5), which is valid in any 
case) that P(B|A) = P(B). 
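
The product criterion for independence lends itself to a simple check by counting. The sketch below is only an illustration; the names `prob` and `is_independent` and the card labels are made up here. It applies the criterion to the card example (king and heart) and to the balls example (red first and blue second).

```python
from fractions import Fraction
from itertools import permutations, product

def prob(outcomes, event):
    """Fraction of the equally likely outcomes for which event(outcome) is True."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def is_independent(outcomes, a, b):
    """Check P(A and B) == P(A) * P(B) exactly, by counting outcomes."""
    p_a_and_b = prob(outcomes, lambda o: a(o) and b(o))
    return p_a_and_b == prob(outcomes, a) * prob(outcomes, b)

# One card from a 52-card deck, as (rank, suit) pairs.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = list(product(ranks, suits))
king_heart_independent = is_independent(
    deck, lambda c: c[0] == "K", lambda c: c[1] == "hearts")

# Two balls without replacement; balls 1, 2 are red and 3, 4, 5 are blue.
picks = list(permutations(range(1, 6), 2))
red1_blue2_independent = is_independent(
    picks, lambda o: o[0] in {1, 2}, lambda o: o[1] in {3, 4, 5})

print(king_heart_independent, red1_blue2_independent)  # True False
```

For the balls, P(Red_1) · P(Blue_2) = (2/5)(3/5) = 6/25, which differs from the joint probability 3/10, so the check fails as expected for dependent events.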

2.2.2 OR: The “union” probability, P(A or B) 

Let A and B be two events. For example, let A = {rolling a 2 on a die} and B = 
{rolling a 5 on the same die}. Or we might have A = {rolling an even number 
(that is, 2, 4, or 6) on a die} and B = {rolling a multiple of 3 (that is, 3 or 6) on 
the same die}. A third example is A = {rolling a 1 on one die} and B = {rolling 
a 6 on another die}. What is the probability that either A or B (or both) occurs? 
In answering this question, we must consider two cases: (1) A and B are exclusive 
events, or (2) A and B are nonexclusive events. Let’s look at each of these in turn. 

Exclusive events 

Two events are said to be exclusive if one precludes the other. That is, they can’t both 
happen. An example is rolling one die, with A = {rolling a 2 on the die} and B = 
{rolling a 5 on the same die}. These events are exclusive because it is impossible 
for one number to be both a 2 and a 5. (The events in the second and third scenarios 
mentioned above are not exclusive; we’ll talk about this below.) Another example 
is picking one card from a deck, with A = {the card is a diamond} and B = {the 
card is a heart}. These events are exclusive because it is impossible for one card to 
be both a diamond and a heart. 

The “Or” rule for exclusive events is: 

• If events A and B are exclusive, then the probability that either of them occurs 
equals the sum of their individual probabilities: 


$$P(A \text{ or } B) = P(A) + P(B) \tag{2.14}$$


The logic behind this rule boils down to Fig. 2.5. The key feature of this figure 
is that there is no overlap between the two regions, because we are assuming that A 
and B are exclusive. If there were a region that was contained in both A and B, then 
the outcomes in that region would be ones for which A and B both occur, which 
would violate the assumption that A and B are exclusive. The rule in Eq. (2.14) is 
simply the statement that the area of the union (hence the word “union” in the title 
of this section) of regions A and B equals the sum of their areas. There is nothing 
fancy going on here. This statement is no deeper than the statement that if you have 
two separate bowls, the total number of apples in the two bowls equals the number 
of apples in one bowl plus the number of apples in the other bowl. 

We can quickly apply this rule to the two examples mentioned above. In the 
example with the die, the probability of rolling a 2 or a 5 on one die is 

$$P(2 \text{ or } 5) = P(2) + P(5) = \frac{1}{6} + \frac{1}{6} = \frac{1}{3}. \tag{2.15}$$

This makes sense, because two of the six numbers on a die are the 2 and the 5. In 
the card example, the probability of a card being either a diamond or a heart is 

$$P(\text{diamond or heart}) = P(\text{diamond}) + P(\text{heart}) = \frac{1}{4} + \frac{1}{4} = \frac{1}{2}. \tag{2.16}$$

Figure 2.5: Venn diagram for the probabilities of exclusive events. 


This makes sense, because half of the 52 cards in a deck are diamonds or hearts. 

A special case of Eq. (2.14) is the “Not” rule, which follows from letting B = 
“not A.” 


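$$\begin{aligned} P(A \text{ or } (\text{not } A)) &= P(A) + P(\text{not } A) \\ \Longrightarrow \quad 1 &= P(A) + P(\text{not } A) \\ \Longrightarrow \quad P(\text{not } A) &= 1 - P(A). \end{aligned} \tag{2.17}$$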

The first equality here follows from Eq. (2.14), because A and “not A” are certainly 
exclusive events; you can’t both have something and not have it. To obtain the 
second line in Eq. (2.17), we have used P(A or (not A)) = 1, which holds because 
every possible outcome belongs to either A or “not A.” 
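
For a single die, the “Or” rule for exclusive events and the “Not” rule can both be checked with sets. This is a small sketch for illustration; the variable names are chosen here, and events are represented as sets of die faces so that “or” becomes a set union and “not” becomes a set difference.

```python
from fractions import Fraction

die = set(range(1, 7))   # sample space: the six faces

def prob(event):
    """Probability of an event (a set of faces), with all faces equally likely."""
    return Fraction(len(event & die), len(die))

two = {2}
five = {5}
exclusive = two.isdisjoint(five)   # no face is both a 2 and a 5

p_2_or_5 = prob(two | five)        # probability of the union
p_sum = prob(two) + prob(five)     # righthand side of Eq. (2.14)

not_two = die - two                # the event "not A" for A = {rolling a 2}
p_not_two = prob(not_two)

print(exclusive, p_2_or_5, p_sum, p_not_two)  # True 1/3 1/3 5/6
```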

Nonexclusive events 

Two events are said to be nonexclusive if it is possible for both to happen. An 
example is rolling one die, with A = {rolling an even number (that is, 2, 4, or 6)} 
and B = {rolling a multiple of 3 (that is, 3 or 6) on the same die}. If you roll a 6, 
then A and B both occur. Another example is picking one card from a deck, with 
A = {the card is a king} and B = {the card is a heart}. If you pick the king of hearts, 
then A and B both occur. 

The “Or” rule for nonexclusive events is: 

• If events A and B are nonexclusive, then the probability that either (or both) 
of them occurs equals 


$$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B) \tag{2.18}$$


The “or” here is the so-called “inclusive or,” in the sense that we say “A or B occurs” 
if either or both of the events occur. As with the “dependent” qualifier in the “And” 
rule in Eq. (2.5), there is actually no need for the “nonexclusive” qualifier in the 
“Or” rule here, as we’ll see in the third remark below. 

The logic behind Eq. (2.18) boils down to Fig. 2.6. The rule in Eq. (2.18) is the 
statement that the area of the union of regions A and B equals the sum of their areas 
minus the area of the overlap. This subtraction is necessary so that we don’t double 
count the region that belongs to both A and B. This region isn’t “doubly good” 
just because it belongs to both A and B. As far as the “A or B” condition goes, the 
overlap region is just the same as any other part of the union of A and B. 



Figure 2.6: Venn diagram for the probabilities of nonexclusive events. 

In terms of a physical example, the rule in Eq. (2.18) is equivalent to the 
statement that if you have two bird cages that have a region of overlap, then the total 
number of birds in the cages equals the number of birds in one cage, plus the 
number in the other cage, minus the number in the overlap region. In the situation shown 
in Fig. 2.7, we have 7 + 5 - 2 = 10 birds (which oddly all happen to be flying at the 
given moment). 



Things get more complicated if you have three or more events and you want to 
calculate probabilities like P(A or B or C). But in the end, the main task is to keep 
track of the overlaps of the various regions; see Problem 2.2. 

Because the “A or B” region in Fig. 2.6 is the union of the A and B regions, and 
because the union of two sets is usually denoted by A ∪ B, you will often see the 
P(A or B) probability written as P(A ∪ B). That is, 

$$P(A \cup B) = P(A \text{ or } B). \tag{2.19}$$

But we’ll stick with the P(A or B) notation in this book. 

We can quickly apply Eq. (2.18) to the two examples mentioned above. In the 
example with the die, the only way to roll an even number and a multiple of 3 on a 
single die is to roll a 6, which happens with probability 1/6. So Eq. (2.18) gives the 
probability of rolling an even number or a multiple of 3 as 

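$$\begin{aligned} P(\text{even or mult of 3}) &= P(\text{even}) + P(\text{mult of 3}) - P(\text{even and mult of 3}) \\ &= \frac{1}{2} + \frac{1}{3} - \frac{1}{6} = \frac{4}{6} = \frac{2}{3}. \end{aligned} \tag{2.20}$$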

This makes sense, because four of the six numbers on a die are even numbers or 
multiples of 3, namely 2, 3, 4, and 6. (Remember that whenever we use “or,” it 
means the “inclusive or.”) We subtracted off the 1/6 in Eq. (2.20) so that we didn’t 
double count the roll of a 6. 

In the card example, the only way to pick a king and a heart with a single card 
is to pick the king of hearts, which happens with probability 1/52. So Eq. (2.18) 
gives the probability that a card is a king or a heart as 


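\[
\begin{aligned}
P(\text{king or heart}) &= P(\text{king}) + P(\text{heart}) - P(\text{king and heart}) \\
&= \frac{1}{13} + \frac{1}{4} - \frac{1}{52} = \frac{16}{52} = \frac{4}{13}.
\end{aligned}
\tag{2.21}
\]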


This makes sense, because 16 of the 52 cards in a deck are kings or hearts, namely 
the 13 hearts, plus the kings of diamonds, spades, and clubs; we already counted the 
king of hearts. As in the previous example with the die, we subtracted off the 1/52 
here so that we didn’t double count the king of hearts. 
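
Both of these results can be checked by brute-force counting. The following is a minimal sketch in Python (not part of the original text); the helper names prob and check_or_rule, and the string labels for the ranks and suits, are our own illustrative choices. It counts the “A or B” outcomes directly and compares the result with the right-hand side of Eq. (2.18).

```python
from fractions import Fraction


def prob(event, outcomes):
    """Probability of `event` (a predicate) over equally likely outcomes."""
    favorable = sum(1 for o in outcomes if event(o))
    return Fraction(favorable, len(outcomes))


def check_or_rule(a, b, outcomes):
    """Return (direct count of 'A or B', value given by Eq. (2.18))."""
    direct = prob(lambda o: a(o) or b(o), outcomes)
    via_rule = (prob(a, outcomes) + prob(b, outcomes)
                - prob(lambda o: a(o) and b(o), outcomes))
    return direct, via_rule


# Die example: "even" or "multiple of 3" on a single roll.
die = list(range(1, 7))
die_result = check_or_rule(lambda n: n % 2 == 0, lambda n: n % 3 == 0, die)

# Card example: "king" or "heart" for a single card from a 52-card deck.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "spades", "clubs"]
deck = [(rank, suit) for rank in ranks for suit in suits]
card_result = check_or_rule(lambda c: c[0] == "K", lambda c: c[1] == "hearts", deck)

print(die_result)   # both entries equal 2/3
print(card_result)  # both entries equal 4/13
```

In each case the two entries agree, which is just the statement that subtracting the overlap once undoes the double counting.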

Remarks: 

1. If you want, you can think of the area of the union of A and B in Fig. 2.6 as the area of
only A, plus the area of only B, plus the area of “A and B.” (Equivalently, the number
of birds in the cages in Fig. 2.7 is 5 + 3 + 2 = 10.) This is easily visualizable, because 
these three areas are the ones you see in the figure. However, the probabilities of only 
A and of only B are often a pain to deal with, so it's generally easier to think of the 
area of the union of A and B as the area of A, plus the area of B, minus the area of the 
overlap. This way of thinking corresponds to Eq. (2.18). 

2. As we mentioned in the first remark on page 66, you don’t have to use the above 
rules of probability to calculate things. You can often instead just count up the various 
outcomes and solve the problem from scratch. In many cases you’re doing basically 
the same thing with the two methods, as we saw in the above examples with the die 
and the cards. 

3. As with Eqs. (2.2) and (2.5), the rule in Eq. (2.14) for exclusive events is a special 
case of the rule in Eq. (2.18) for nonexclusive events. This is true because if A and 
B are exclusive, then P(A and B) = 0, by definition. Eq. (2.18) then reduces to 
Eq. (2.14) when P(A and B) = 0. Likewise, Fig. 2.5 is a special case of Fig. 2.6 when
the regions have zero overlap. There was therefore technically no need to introduce 
Eq. (2.14) first. We could have started with Eq. (2.18), which covers all possible 
scenarios, and then showed that it reduces to Eq. (2.14) when the events are exclusive. 
But as in Section 2.2.1, it is often better to start with a special case and then work up 
to the more general case. * 


2.2.3 (In)dependence and (non)exclusiveness 

Two events are either independent or dependent, and they are also either exclusive
or nonexclusive. There are therefore 2 · 2 = 4 combinations of these characteristics.
Let’s see which combinations are possible. You’ll need to read this section
very slowly if you want to keep everything straight. This discussion is given for
curiosity’s sake only, in case you were wondering how the dependent/independent
characteristic relates to the exclusive/nonexclusive characteristic. There is no need
to memorize the results below. Instead, you should think about each situation
individually and determine its properties from scratch.

• Exclusive and Independent: This combination isn’t possible. If two events
are independent, then their probabilities are independent of each other, which
means that there is a nonzero probability (namely, the product of the individual
probabilities) that both events happen. Therefore, they cannot be exclusive.

Said in another way, if two events A and B are exclusive, then the probability 
of B given A is zero. But if they are also independent, then the probability 
of B is independent of what happens with A. So the probability of B must be 
zero, period. Such a B is a very uninteresting event, because it never happens. 

• Exclusive and Dependent: This combination is possible. An example
consists of the events


A = {rolling a 2 on a die}, 

B = {rolling a 5 on the same die}. (2.22) 

Another example consists of A as one event and B = {not A} as the other.
In both of these examples the events are exclusive, because they can’t both
happen. Furthermore, the occurrence of one event certainly affects the probability
of the other occurring, in that the probability P(B|A) takes the extreme
value of zero, due to the exclusive nature of the events. The events are therefore
quite dependent (in a negative sort of way). In short, if two events are
exclusive, then they are necessarily also dependent.

• Nonexclusive and Independent: This combination is possible. An example 
consists of the events 


A = {rolling a 2 on a die},

B = {rolling a 5 on another die}. (2.23) 

Another example consists of the events A = {getting a Heads on a coin flip}
and B = {getting a Heads on another coin flip}. In both of these examples the
events are clearly independent, because they involve different dice or coins.
And the events can both happen (a fact that is guaranteed by their independence,
as mentioned in the “Exclusive and Independent” case above), so they
are nonexclusive. In short, if two events are independent, then they are necessarily
also nonexclusive. This statement is the logical “contrapositive” of the
corresponding statement in the “Exclusive and Dependent” case above.

• Nonexclusive and Dependent: This combination is possible. An example 
consists of the events 


A = {rolling a 2 on a die},

B = {rolling an even number on the same die}. (2.24)

Another example consists of picking balls without replacement from a box
with two red balls and three blue balls, with the events being A = {picking a
red ball on the first pick} and B = {picking a blue ball on the second pick}.
In both of these examples the events are dependent, because the occurrence
of A affects the probability of B. (In the die example, P(B|A) takes on the
extreme value of 1, which isn’t equal to P(B) = 1/2. Also, P(A|B) = 1/3,
which isn’t equal to P(A) = 1/6. Likewise for the box example.) And the
events can both happen, so they are nonexclusive.

To sum up, we see that all exclusive events must be dependent, but nonexclusive 
events can be either independent or dependent. Similarly, all independent events 
must be nonexclusive, but dependent events can be either exclusive or nonexclusive. 
These facts are summarized in Table 2.3, which indicates which combinations are 
possible. 


                Independent    Dependent

Exclusive           NO            YES

Nonexclusive        YES           YES


Table 2.3: Relations between (in)dependence and (non)exclusiveness.
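
The three “YES” entries can be reproduced by counting outcomes for the dice examples in Eqs. (2.22), (2.23), and (2.24). The sketch below is only an illustration: the function classify is a hypothetical helper, and it tests independence with the product criterion P(A and B) = P(A) · P(B), which we take as the working definition here.

```python
from fractions import Fraction
from itertools import product


def prob(event, outcomes):
    """Probability of `event` (a predicate) over equally likely outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def classify(a, b, outcomes):
    """Label a pair of events as (non)exclusive and (in)dependent."""
    p_a, p_b = prob(a, outcomes), prob(b, outcomes)
    p_both = prob(lambda o: a(o) and b(o), outcomes)
    exclusivity = "exclusive" if p_both == 0 else "nonexclusive"
    dependence = "independent" if p_both == p_a * p_b else "dependent"
    return exclusivity, dependence


one_die = list(range(1, 7))
two_dice = list(product(range(1, 7), repeat=2))

# Eq. (2.22): a 2 and a 5 on the same die.
same_die = classify(lambda n: n == 2, lambda n: n == 5, one_die)
# Eq. (2.23): a 2 on one die and a 5 on another die.
different_dice = classify(lambda d: d[0] == 2, lambda d: d[1] == 5, two_dice)
# Eq. (2.24): a 2 and an even number on the same die.
two_and_even = classify(lambda n: n == 2, lambda n: n % 2 == 0, one_die)

print(same_die)        # ('exclusive', 'dependent')
print(different_dice)  # ('nonexclusive', 'independent')
print(two_and_even)    # ('nonexclusive', 'dependent')
```

No choice of events with nonzero probabilities will ever produce the pair ('exclusive', 'independent'), in agreement with the “NO” entry in the table.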


2.2.4 Conditional probability 

In Eq. (2.5) we introduced the concept of conditional probability, with P(B|A)
denoting the probability that B occurs, given that A occurs. In this section we’ll talk
more about conditional probabilities. In particular, we’ll show that two probabilities 
that you might naively think are equal are in fact not equal. Consider the following 
example. 

Fig. 2.8 gives a pictorial representation of the probability that a random person’s
height is greater than 6'3" (6 feet, 3 inches) or less than 6'3", along with the probability
that a random person’s last name begins with Z or not Z. We haven’t tried
to mimic the exact numbers, but we have indicated that the vast majority of people
are under 6'3" (this case takes up most of the vertical span of the square), and also
that the vast majority of people have a last name that doesn’t begin with Z (this case
takes up most of the horizontal span of the square). We’ll assume that the probabilities
involving heights and last-name letters are independent. This independence
manifests itself in the fact that the horizontal and vertical dividers of the square are
straight lines (as opposed to, for example, the shifted lines in Fig. 2.4). This independence
makes things a little easier to visualize, but it isn’t critical in the following
discussion.
                not Z      Z

over 6'3"         d        c

under 6'3"        a        b


Figure 2.8: Probability square for independent events (height, and first letter of last name).


Let’s now look at some conditional probabilities. Let the areas of the four rectangles
in Fig. 2.8 be a, b, c, d, as indicated. The area of a region represents the
probability that a given person is in that region. Let Z stand for “having a last name
that begins with Z,” and let U stand for “being under 6'3" in height.”

Consider the conditional probabilities P(Z|U) and P(U|Z). P(Z|U) deals with
the subset of cases where we know that U occurs. These cases are associated with 
the area below the horizontal dividing line in the figure. So P(Z|U) equals the 
fraction of the area below the horizontal line (which is a + b) that is also to the right 
of the vertical line (which is b). This fraction b/(b + a) is very small. 

In contrast, P(U|Z) deals with the subset of cases where we know that Z occurs. 
These cases are associated with the area to the right of the vertical dividing line in 
the figure. So P(U|Z) equals the fraction of the area to the right of the vertical line 
(which is b + c) that is also below the horizontal line (which is b). This fraction 
b/(b + c) is very close to 1. To sum up, we have

\[
\begin{aligned}
P(Z|U) &= \frac{b}{b + a} \approx 0, \\
P(U|Z) &= \frac{b}{b + c} \approx 1.
\end{aligned}
\tag{2.25}
\]

We see that P(Z|U) is not equal to P(U|Z). If we were dealing with a situation 
where a = c, then these conditional probabilities would be equal. But that is an
exception. In general, the two probabilities are not equal. 

If you’re too hasty in your thinking, you might say something like, “Since U
and Z are independent, one doesn’t affect the other, so the conditional probabilities
should be the same.” This conclusion is incorrect. The correct statement is,
“Since U and Z are independent, one doesn’t affect the other, so the conditional
probabilities are equal to the corresponding unconditional probabilities.” That is,
P(Z|U) = P(Z) and P(U|Z) = P(U). But P(Z) and P(U) are vastly different, with
the former being approximately zero, and the latter being approximately 1.
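
To see these statements with actual numbers, here is a small sketch that builds the four areas a, b, c, d from two made-up probabilities. The values P(U) = 98/100 and P(Z) = 1/100 are purely illustrative assumptions (the text deliberately gives no specific numbers), and independence is imposed by multiplying them.

```python
from fractions import Fraction

# Hypothetical values, chosen only for illustration.
p_under = Fraction(98, 100)  # P(U): height under 6'3"
p_z = Fraction(1, 100)       # P(Z): last name begins with Z

# Areas of the four rectangles in Fig. 2.8, assuming independence.
a = p_under * (1 - p_z)        # under 6'3", not Z
b = p_under * p_z              # under 6'3", Z
c = (1 - p_under) * p_z        # over 6'3", Z
d = (1 - p_under) * (1 - p_z)  # over 6'3", not Z

p_z_given_u = b / (b + a)  # fraction of the "under" region that is also Z
p_u_given_z = b / (b + c)  # fraction of the "Z" region that is also under

print(p_z_given_u)  # 1/100, the same as P(Z)
print(p_u_given_z)  # 49/50, the same as P(U)
```

The two conditional probabilities equal the corresponding unconditional ones, as independence requires, but they are nowhere near each other.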

In order to make it obvious that the two conditional probabilities P(A|B) and
P(B|A) aren’t equal in general, we picked an example where the various probabilities
were all either close to zero or close to 1. We did this solely for pedagogical
purposes; the non-equality of the conditional probabilities holds in general (except
in the a = c case). Another extreme example that makes it clear that the two
conditional probabilities are different is: The probability that a living thing is human,
given that it has a brain, is very small; but the probability that a living thing has a
brain, given that it is human, is 1.

The takeaway lesson here is that when thinking about the conditional probability 
P(A|B), the order of A and B is critical. Great confusion can arise if one forgets this
fact. The classic example of this confusion is the “Prosecutor’s fallacy,” discussed 
below in Section 2.4.3. That example should convince you that a lack of basic 
knowledge of probability can have significant and possibly tragic consequences in 
real life. 


2.3 Examples 


Let’s now do some examples. Introductory probability problems generally fall into 
a few main categories, so we’ve divided the examples into the various subsections 
below. There is no better way to learn how to solve probability problems (or any 
kind of problem, for that matter) than to just sit down and do a bunch of them, so 
we’ve presented quite a few. 

If the statement of a given problem lists out the specific probabilities of the 
possible outcomes, then the rules in Section 2.2 are often called for. However, in 
many problems you encounter, you’ll be calculating probabilities from scratch (by 
counting things), so the rules in Section 2.2 generally don’t come into play. You 
simply have to do lots of counting. This will become clear in the examples below. 
For all of these, be sure to try the problem for a few minutes on your own before 
looking at the solution. 

In virtually all of these examples, we’ll be dealing with situations in which the 
various possible outcomes are equally likely. For example, we’ll be tossing coins, 
picking cards, forming committees, forming permutations, etc. We will therefore 
be making copious use of Eq. (2.1), 


\[
P = \frac{\text{number of desired outcomes}}{\text{total number of possible outcomes}} \qquad \text{(for equally likely outcomes)} \tag{2.26}
\]


We won’t, however, bother to specifically state each time that the different outcomes 
are all equally likely. Just remember that they are, and that this fact is necessary for 
Eq. (2.1) to be valid. 

Before getting into the examples, let’s start off with a problem-solving strategy 
that comes in very handy in certain situations. 


2.3.1 The art of “not” 

There are many setups in which the easiest way to calculate the probability of a 
given event A is not to calculate it directly, but rather to calculate the probability of 
“not A” and then subtract the result from 1. This yields P(A) because we know from
Eq. (2.17) that P(A) = 1 - P(not A). The event “not A” is called the complement
of the event A.

The most common situation of this type involves a question along the lines of, 
“What is the probability of obtaining at least one of such-and-such?” The “at least” 
part appears to make things difficult, because it could mean one, or two, or three, etc. 
It would be at best rather messy, and at worst completely intractable, to calculate
the individual probabilities of all the different numbers and then add them up to 
obtain the answer. The “at least one” question is very different from the “exactly 
one” question. 

The key point that simplifies things is that the only way to not get at least one 
of something is to get exactly zero of it. This means that we can just calculate the 
probability of getting zero, and then subtract the result from 1. We therefore need to 
calculate only one probability, instead of a potentially large number of probabilities. 


Example (At least one 6): Three dice are rolled. What is the probability of obtaining 
at least one 6? 

Solution: We’ll find the probability of obtaining zero 6’s and then subtract the result 
from 1. In order to obtain zero 6’s, we must obtain something other than a 6 on the 
first die (which happens with 5/6 probability), and likewise on the second die (5/6 
probability again), and likewise on the third die (5/6 probability again). These are 
independent events, so the probability of obtaining zero 6’s equals $(5/6)^3 = 125/216$.
The probability of obtaining at least one 6 is therefore $1 - (5/6)^3 = 91/216$, which is
about 42%. 

If you want to solve this problem the long way, you can add up the probabilities of 
obtaining exactly one, two, or three 6’s. This is the task of Problem 2.11. 
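
As a quick check on the “art of not” strategy, the sketch below (our own illustration, not part of the original solution) lists all $6^3 = 216$ ordered outcomes for three dice, counts the ones that contain at least one 6, and compares the result with $1 - (5/6)^3$.

```python
from fractions import Fraction
from itertools import product

# All ordered outcomes for three dice.
rolls = list(product(range(1, 7), repeat=3))

# Direct count of the outcomes containing at least one 6.
with_a_six = sum(1 for roll in rolls if 6 in roll)
by_counting = Fraction(with_a_six, len(rolls))

# The complement method: one minus the probability of zero 6's.
by_complement = 1 - Fraction(5, 6) ** 3

print(by_counting, by_complement)  # 91/216 91/216
print(float(by_complement))        # roughly 0.42
```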

Remark: Beware of the following incorrect reasoning for this problem: There is 
a 1/6 chance of obtaining a 6 on each of the three rolls. The total probability of 
obtaining at least one 6 therefore seems like it should be 3 • (1/6) = 1/2. This is 
incorrect because we’re trying to find the probability of “a 6 on the first roll” or “a 6 
on the second roll” or “a 6 on the third roll.” (This “or” combination is equivalent to 
obtaining at least one 6. Remember that when we write “or,” we mean the “inclusive 
or.”) But from Eq. (2.14) (or its simple extension to three events) it is appropriate to 
add up the individual probabilities only if the events are exclusive. For nonexclusive 
events, we must subtract off the “overlap” probabilities, as we did in Eq. (2.18); see 
Problem 2.2(d) for the case of three events. The above three events (rolling 6’s) are 
clearly nonexclusive, because it is possible to obtain a 6 on, say, both the first roll and 
the second roll. We have therefore double (or triple) counted many of the outcomes, 
and this is why the incorrect answer of 1/2 is larger than the correct answer of 91/216. 
The task of Problem 2.12 is to solve this problem by using the result in Problem 2.2(d) 
to keep track of all the double (and triple) counting. 

Another way of seeing why the “3 • (1/6) = 1/2” reasoning can’t be correct is that it 
would imply that if we had, say, 12 dice, then the probability of obtaining at least one 
6 would be 12 • (1/6) = 2. But probabilities larger than 1 are nonsensical. * 


2.3.2 Picking seats 

Situations often come up where we need to assign various things to various spots.
We’ll generally talk about assigning people to seats. There are two common ways to
solve problems of this sort: (1) You can count up the number of desired outcomes,
along with the total number of outcomes, and then take their ratio via Eq. (2.1),
or (2) you can imagine assigning the seats one at a time, finding the probability of
success at each stage, and using the rules in Section 2.2, or their extensions to more
than two events. It’s personal preference which method you use. But it never hurts
to solve a problem both ways, of course, because that allows you to double check
your answer.


Example 1 (Middle in the middle): Three chairs are arranged in a line, and three 
people randomly take seats. What is the probability that the person with the middle 
height ends up in the middle seat? 

First solution: Let the people be labeled from tallest to shortest as 1, 2, and 3. Then 
the 3! = 6 possible orderings are 

123 132 213 231 312 321 (2.27) 

We see that two of these (123 and 321) have the middle-height person in the middle
seat. So the probability is 2/6 = 1/3.


Second solution: Imagine assigning the people randomly to the seats, and let’s 
assign the middle-height person first, which we are free to do. There is a 1/3 chance 
that this person ends up in the middle seat (or any other seat, for that matter). So 1/3 
is the desired answer. Nothing fancy going on here. 


Third solution: If you want to assign the tallest person first, then there is a 1/3 chance 
that she ends up in the middle seat, in which case there is zero chance that the middle- 
height person ends up there. There is a 2/3 chance that the tallest person doesn’t end 
up in the middle seat, in which case there is a 1/2 chance that the middle-height person 
ends up there (because there are two seats remaining, and one yields success). So the 
total probability that the middle-height person ends up in the middle seat is 


\[
\frac{1}{3} \cdot 0 + \frac{2}{3} \cdot \frac{1}{2} = \frac{1}{3}. \tag{2.28}
\]


Remark: The preceding equation technically comes from one application of Eq. (2.14)
and two applications of Eq. (2.5). If we let T stand for tallest and M stand for
middle-height, and if we use the notation $T_{\text{mid}}$ to mean that the tallest person is in the middle
seat, etc., then we can write

\[
\begin{aligned}
P(M_{\text{mid}}) &= P(T_{\text{mid}} \text{ and } M_{\text{mid}}) + P(T_{\text{not mid}} \text{ and } M_{\text{mid}}) \\
&= P(T_{\text{mid}}) \cdot P(M_{\text{mid}} | T_{\text{mid}}) + P(T_{\text{not mid}}) \cdot P(M_{\text{mid}} | T_{\text{not mid}}) \\
&= \frac{1}{3} \cdot 0 + \frac{2}{3} \cdot \frac{1}{2} = \frac{1}{3}.
\end{aligned}
\tag{2.29}
\]

Eq. (2.14) is relevant in the first line because the two events “$T_{\text{mid}}$ and $M_{\text{mid}}$” and
“$T_{\text{not mid}}$ and $M_{\text{mid}}$” are exclusive events, since T can’t be both in the middle seat and
not in the middle seat.

However, when solving problems of this kind, although it is sometimes helpful to
explicitly write down the application of Eqs. (2.14) and (2.5) as we just did, this often
isn’t necessary. It is usually quicker to imagine a large number of trials and then
calculate the number of these trials that yield success. For example, if we do 600 trials
of the present setup, then (1/3) • 600 = 200 of them (on average) have T in the middle
seat, in which case failure is guaranteed. Of the other (2/3) • 600 = 400 trials where T
isn’t in the middle seat, half of them (which is (1/2) • 400 = 200) have M in the middle
seat. So the desired probability is 200/600 = 1/3. In addition to being more intuitive,
this method is safer than just plugging things into formulas (although it’s really the
same reasoning in the end). *
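
The counting in the first solution and the decomposition in Eq. (2.29) can both be checked by listing the six orderings. This is only a sketch; the labels 1, 2, 3 (tallest to shortest) follow the first solution, while the variable names are our own.

```python
from fractions import Fraction
from itertools import permutations

# Each tuple lists who sits in the left, middle, and right seats.
# People are labeled 1 (tallest), 2 (middle height), 3 (shortest).
orderings = list(permutations([1, 2, 3]))

# First solution: count the orderings with person 2 in the middle seat.
p_direct = Fraction(sum(1 for o in orderings if o[1] == 2), len(orderings))

# Eq. (2.29): split according to whether the tallest person is in the middle.
t_mid = [o for o in orderings if o[1] == 1]
t_not_mid = [o for o in orderings if o[1] != 1]

p_t_mid = Fraction(len(t_mid), len(orderings))
p_t_not_mid = Fraction(len(t_not_mid), len(orderings))
p_m_given_t_mid = Fraction(sum(1 for o in t_mid if o[1] == 2), len(t_mid))
p_m_given_t_not_mid = Fraction(sum(1 for o in t_not_mid if o[1] == 2), len(t_not_mid))

p_decomposed = p_t_mid * p_m_given_t_mid + p_t_not_mid * p_m_given_t_not_mid

print(p_direct, p_decomposed)  # 1/3 1/3
```

Note that the conditional probability of M being in the middle, given that T isn’t, is computed using only the four orderings where T is in an end seat. This is the same “restrict to the relevant trials” reasoning as in the 600-trial argument.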


Example 2 (Order of height in a line): Five chairs are arranged in a line, and five people
randomly take seats. What is the probability that they end up in order of decreasing
height, from left to right?

First solution: There are 5! = 120 possible arrangements of the five people in the 
seats. But there is only one arrangement where they end up in order of decreasing 
height. So the probability is 1/120. 

Second solution: If we randomly assign the tallest person to a seat, there is a 1/5 
chance that she ends up in the leftmost seat. Assuming that she ends up there, there is a 
1/4 chance that the second tallest person ends up in the second leftmost seat (because 
there are only four seats left). Likewise, the chances that the other people end up 
where we want them are 1/3, then 1/2, and then 1/1. (If the first four people end up 
in the desired seats, then the shortest person is guaranteed to end up in the rightmost 
seat.) So the probability is 1/5 • 1/4 • 1/3 • 1/2 • 1/1 = 1/120.

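The product of these five probabilities comes from the extension of Eq. (2.5) to five
events (see Problem 2.2(b) for the three-event case), which takes the form,

\[
\begin{aligned}
P(A \text{ and } B \text{ and } C \text{ and } D \text{ and } E) = {} & P(A) \cdot P(B|A) \cdot P(C|A \text{ and } B) \\
& \cdot P(D|A \text{ and } B \text{ and } C) \\
& \cdot P(E|A \text{ and } B \text{ and } C \text{ and } D).
\end{aligned}
\tag{2.30}
\]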

We will use similar extensions repeatedly in the examples below. 

Alternatively, instead of assigning people to seats, we can assign seats to people. That 
is, we can assign the first seat to one of the five people, and then the second seat to 
one of the remaining four people, and so on. Multiplying the probabilities of success 
at each stage gives the same product as above, 1/5 • 1/4 • 1/3 • 1/2 • 1/1 = 1/120.


Example 3 (Order of height in a circle): Five chairs are arranged in a circle, and 
five people randomly take seats. What is the probability that they end up in order 
of decreasing height, going clockwise? The decreasing sequence of people can start 
anywhere in the circle. That is, it doesn’t matter which seat has the tallest person. 

First solution: As in the previous example, there are 5! = 120 possible arrangements
of the five people in the seats. But now there are five arrangements where they end up
in order of decreasing height. This is true because the tallest person can take five possible
seats, and once her seat is picked, the positions of the other people are uniquely
determined if they are to end up in order of decreasing height. The probability is
therefore 5/120 = 1/24.
Second solution: If we randomly assign the tallest person to a seat, it doesn’t matter
where she ends up, because all five seats in the circle are equivalent. But given that 
she ends up in a certain seat, the second tallest person needs to end up in the seat next 
to her in the clockwise direction. This happens with probability 1/4. Likewise, the 
third tallest person has a 1/3 chance of ending up in the next seat in the clockwise 
direction. And then 1/2 for the fourth tallest person, and 1/1 for the shortest person. 
The probability is therefore 1/4 • 1/3 • 1/2 • 1/1 = 1/24. 

If you want, you can preface this product with a “5/5” for the tallest person, because 
there are five possible seats she can take (this is the denominator), and there are also 
five successful seats she can take (this is the numerator) because it doesn’t matter 
where she ends up. 
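
The line and circle results can be compared side by side by listing all 120 arrangements. In the sketch below (an illustration only), people are labeled 1 through 5 from tallest to shortest, and for the circle each tuple is read as the seats in clockwise order; both conventions are our own assumptions.

```python
from fractions import Fraction
from itertools import permutations

people = (1, 2, 3, 4, 5)  # 1 = tallest, 5 = shortest
arrangements = list(permutations(people))


def decreasing_in_line(seats):
    """True if heights decrease from left to right (labels run 1 to 5)."""
    return seats == people


def decreasing_in_circle(seats):
    """True if heights decrease going clockwise, starting at the tallest person."""
    start = seats.index(1)
    rotated = seats[start:] + seats[:start]
    return rotated == people


p_line = Fraction(sum(1 for s in arrangements if decreasing_in_line(s)), len(arrangements))
p_circle = Fraction(sum(1 for s in arrangements if decreasing_in_circle(s)), len(arrangements))

print(p_line)    # 1/120
print(p_circle)  # 1/24
```

The circle probability is five times the line probability, because each of the five possible seats for the tallest person yields one successful arrangement.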


Example 4 (Three girls and three boys): Six chairs are arranged in a line, and three 
girls and three boys randomly pick seats. What is the probability that the three girls 
end up in the three leftmost seats? 

First solution: The total number of possible seat arrangements is 6! = 720. There are
3! = 6 different ways that the three girls can be arranged in the three leftmost seats, 
and 3! = 6 different ways that the three boys can be arranged in the other three (the 
rightmost) seats. So the total number of successful arrangements is 3! • 3! = 36. The 
desired probability is therefore 3!3!/6! = 36/720 = 1/20. 

Second solution: Let’s assume that the girls pick their seats first, one at a time. The 
first girl has a 3/6 chance of picking one of the three leftmost seats. Then, given that 
she is successful, the second girl has a 2/5 chance of success, because only two of 
the remaining five seats are among the left three. And finally, given that she too is
successful, the third girl has a 1/4 chance of success, because only one of the remaining
four seats is among the left three. If all three girls are successful, then all three
boys are guaranteed to end up in the three rightmost seats. The desired probability is 
therefore 3/6 • 2/5 • 1/4 = 1/20. 

Third solution: The 3!3!/6! result in the first solution looks suspiciously like the
inverse of the binomial coefficient $\binom{6}{3}$ = 6!/3!3!. This suggests that there is another
way to solve the problem. And indeed, imagine randomly choosing three of the six
seats for the girls. There are $\binom{6}{3}$ ways to do this, all equally likely. Only one of
these is the successful choice of the three leftmost seats, so the desired probability is
$1/\binom{6}{3}$ = 3!3!/6! = 1/20.
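
The three solutions can be placed next to each other in a few lines of code. This is a sketch under our own naming (the labels G1, B1, etc. are hypothetical); math.comb requires Python 3.8 or later.

```python
from fractions import Fraction
from itertools import permutations
from math import comb

children = ["G1", "G2", "G3", "B1", "B2", "B3"]

# First solution: count the arrangements with girls in the three leftmost seats.
arrangements = list(permutations(children))
successful = sum(1 for seats in arrangements
                 if all(child.startswith("G") for child in seats[:3]))
p_counting = Fraction(successful, len(arrangements))

# Second solution: the girls pick seats one at a time.
p_sequential = Fraction(3, 6) * Fraction(2, 5) * Fraction(1, 4)

# Third solution: one successful choice out of all sets of three seats.
p_binomial = Fraction(1, comb(6, 3))

print(successful, len(arrangements))         # 36 720
print(p_counting, p_sequential, p_binomial)  # 1/20 1/20 1/20
```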


2.3.3 Socks in a drawer 

Picking colored socks from a drawer is a classic probabilistic setup. As usual, if 
you want to deal with such setups by counting things, then subgroups and binomial 
coefficients will come into play. If, however, you want to imagine picking the socks 
in succession, then you’ll end up multiplying various probabilities and using the 
rules in Section 2.2. 
Example 1 (Two blue and two red): A drawer contains two blue socks and two
red socks. If you randomly pick two socks, what is the probability that you obtain a 
matching pair? 

First solution: There are $\binom{4}{2}$ = 6 possible pairs you can pick. Of these, two are
matching pairs (one blue pair, one red pair). So the probability is 2/6 = 1/3. If you 
want to list out all the pairs, they are (with 1 and 2 being the blue socks, and 3 and 4 
being the red socks): 

1,2   1,3   1,4   2,3   2,4   3,4        (2.31)

The matching pairs are 1,2 (the blue pair) and 3,4 (the red pair).

Second solution: After you pick the first sock, there is one sock of that color (whatever
it may be) left in the drawer, and two of the other color. So of the three socks
left, one gives you a matching pair, and two don’t. The desired probability is therefore 
1/3. See Problem 2.9 for a generalization of this example. 


Example 2 (Four blue and two red): A drawer contains four blue socks and two red 
socks, as shown in Fig. 2.9. If you randomly pick two socks, what is the probability 
that you obtain a matching pair? 



Figure 2.9: A box with four blue socks and two red socks. 


First solution: There are $\binom{6}{2}$ = 15 possible pairs you can pick. Of these, there are
$\binom{4}{2}$ = 6 blue pairs and $\binom{2}{2}$ = 1 red pair. The desired probability is therefore

\[
\frac{\binom{4}{2} + \binom{2}{2}}{\binom{6}{2}} = \frac{7}{15}. \tag{2.32}
\]


Second solution: There is a 4/6 chance that the first sock you pick is blue. If this 
happens, there is a 3/5 chance that the second sock you pick is also blue (because 
there are three blue and two red socks left in the drawer). Similarly, there is a 2/6 
chance that the first sock you pick is red. If this happens, there is a 1/5 chance that the 
second sock you pick is also red (because there are one red and four blue socks left in 
the drawer). The probability that the socks match is therefore 

\[
\frac{4}{6} \cdot \frac{3}{5} + \frac{2}{6} \cdot \frac{1}{5} = \frac{14}{30} = \frac{7}{15}. \tag{2.33}
\]
If you want to explicitly justify the sum on the lefthand side here, it comes from the
sum on the righthand side of the following relation (with $B_1$ standing for a blue sock
on the first pick, etc.):

\[
P(B_1 \text{ and } B_2) + P(R_1 \text{ and } R_2) = P(B_1) \cdot P(B_2|B_1) + P(R_1) \cdot P(R_2|R_1). \tag{2.34}
\]

However, equations like this can be a bit intimidating, so it’s often better to think 
in terms of a large set of trials, as mentioned in the remark in the first example in 
Section 2.3.2. 
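
Both sock examples can be handled by a single counting routine. The sketch below is our own illustration: the function p_matching_pair is a hypothetical helper that lists every unordered pair of socks and counts the ones whose colors match. The last lines repeat the sequential calculation of Eq. (2.33) for comparison.

```python
from fractions import Fraction
from itertools import combinations


def p_matching_pair(n_blue, n_red):
    """Probability that two randomly picked socks have the same color."""
    socks = ["blue"] * n_blue + ["red"] * n_red
    pairs = list(combinations(range(len(socks)), 2))
    matching = sum(1 for i, j in pairs if socks[i] == socks[j])
    return Fraction(matching, len(pairs))


p_two_and_two = p_matching_pair(2, 2)
p_four_and_two = p_matching_pair(4, 2)

# Sequential picking for four blue and two red socks, as in Eq. (2.33).
p_sequential = Fraction(4, 6) * Fraction(3, 5) + Fraction(2, 6) * Fraction(1, 5)

print(p_two_and_two)                 # 1/3
print(p_four_and_two, p_sequential)  # 7/15 7/15
```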


2.3.4 Coins and dice 

There is never a shortage of probability examples involving dice rolls or coin flips. 


Example 1 (One of each number): Six dice are rolled. What is the probability of 
obtaining exactly one of each of the numbers 1 through 6? 


First solution: The total number of possible (ordered) outcomes for what all six dice
show is $6^6$, because there are six possibilities for each die. How many outcomes are
there that have each number appearing once? This is simply the question of how many
permutations there are of six numbers, because we need all six numbers to appear, but
it doesn’t matter in what order. There are 6! permutations, so the desired probability
is

\[
\frac{6!}{6^6} = \frac{5}{324} \approx 1.5\%. \tag{2.35}
\]


Second solution: Let's imagine rolling six dice in succession, with the goal of having 
each number appear once. On the first roll, we get what we get, and there’s no way to 
fail. So the probability of success on the first roll is 1. However, on the second roll, 
we don’t want to get a repeat of the number that appeared on the first roll (whatever 
that number happened to be). Since there are five “good” options left, the probability 
of success on the second roll is 5/6. On the third roll, we don’t want to get a repeat 
of either of the numbers that appeared on the first and second rolls, so the probability 
of success on the third roll (given success on the first two rolls) is 4/6. Likewise, the 
fourth roll has a 3/6 chance of success, the fifth has 2/6, and the sixth has 1/6. The 
probability of complete success all the way through is therefore 


\[
1 \cdot \frac{5}{6} \cdot \frac{4}{6} \cdot \frac{3}{6} \cdot \frac{2}{6} \cdot \frac{1}{6} = \frac{5}{324}, \tag{2.36}
\]

in agreement with the first solution. Note that if we write the initial 1 here as 6/6, then
this expression becomes $6!/6^6$, which is the fraction that appears in Eq. (2.35).
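
The agreement between the two solutions is easy to confirm numerically. In this short sketch (our own illustration), the sequential product is built one roll at a time, with the k-th factor being the number of still-unused values divided by 6.

```python
from fractions import Fraction
from math import factorial

# First solution: permutations of six numbers over all ordered outcomes.
p_counting = Fraction(factorial(6), 6**6)

# Second solution: multiply the success probabilities roll by roll.
p_sequential = Fraction(1)
for rolls_so_far in range(6):
    unused_values = 6 - rolls_so_far
    p_sequential *= Fraction(unused_values, 6)

print(p_counting, p_sequential)  # 5/324 5/324
print(float(p_counting))         # roughly 0.015
```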


Example 2 (Three pairs): Six dice are rolled. What is the probability of getting three 
pairs, that is, three different numbers that each appear twice? 
Solution: We’ll count the total number of (ordered) ways to get three pairs, and then
we’ll divide that by the total number of possible (ordered) outcomes for the six rolls,
which is $6^6$.

There are two steps in the counting. First, how many different ways can we pick the 
three different numbers that show up? We need to pick three numbers from six, so the 
number of ways is $\binom{6}{3}$ = 20.

Second, given the three numbers that show up, how many different (ordered) ways
can two of each appear on the dice? Let’s say the numbers are 1, 2, and 3. We
can imagine plopping two of each of these numbers down on six blank spots (which
represent the six dice) on a piece of paper. There are $\binom{6}{2}$ = 15 ways to pick where the
two 1’s go. And then there are $\binom{4}{2}$ = 6 ways to pick where the two 2’s go in the four
remaining spots. And then finally there is $\binom{2}{2}$ = 1 way to pick where the two 3’s go in
the two remaining spots.

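The total number of ways to get three pairs is therefore $\binom{6}{3} \cdot \binom{6}{2} \cdot \binom{4}{2} \cdot \binom{2}{2}$. So the
probability of getting three pairs is

\[
P = \frac{\binom{6}{3} \cdot \binom{6}{2} \cdot \binom{4}{2} \cdot \binom{2}{2}}{6^6} = \frac{20 \cdot 15 \cdot 6 \cdot 1}{6^6} = \frac{25}{648} \approx 3.9\%. \tag{2.37}
\]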


If you try to solve this problem in a manner analogous to the second solution in the 
previous example (that is, by multiplying probabilities for the successive rolls), then 
things get a bit messy because there are many different scenarios that lead to three 
pairs. 
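
Since there are only $6^6 = 46{,}656$ ordered outcomes, the count of three-pair outcomes can also be obtained by brute force. The sketch below (our own illustration) tallies how often each number appears in a roll with collections.Counter and keeps the rolls whose tallies are exactly 2, 2, 2.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import comb

# Brute force: check every ordered outcome of six dice.
three_pairs = sum(
    1
    for roll in product(range(1, 7), repeat=6)
    if sorted(Counter(roll).values()) == [2, 2, 2]
)
p_brute_force = Fraction(three_pairs, 6**6)

# The counting argument: pick the three numbers, then place the pairs.
ways = comb(6, 3) * comb(6, 2) * comb(4, 2) * comb(2, 2)
p_formula = Fraction(ways, 6**6)

print(three_pairs, ways)         # 1800 1800
print(p_brute_force, p_formula)  # 25/648 25/648
```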


Example 3 (Five coin flips): A coin is flipped five times. Calculate the probabilities 
of getting the various possible numbers of Heads (0 through 5). 


Solution: We'll count the number of (ordered) ways to get the different numbers of 
Heads, and then we'll divide that by the total number of possible (ordered) outcomes 
for the five flips, which is $2^5$. 

There is only $\binom{5}{0}$ = 1 way to get zero Heads, namely TTTTT. There are $\binom{5}{1}$ = 5 ways 
to get one Heads (such as HTTTT), because there are $\binom{5}{1}$ ways to choose the one coin 
that shows Heads. There are $\binom{5}{2}$ = 10 ways to get two Heads, because there are $\binom{5}{2}$ 
ways to choose the two coins that show Heads. And so on. The various probabilities 
are therefore 


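\[
\begin{aligned}
P(0) &= \frac{\binom{5}{0}}{2^5}, & P(1) &= \frac{\binom{5}{1}}{2^5}, & P(2) &= \frac{\binom{5}{2}}{2^5}, \\
P(3) &= \frac{\binom{5}{3}}{2^5}, & P(4) &= \frac{\binom{5}{4}}{2^5}, & P(5) &= \frac{\binom{5}{5}}{2^5}.
\end{aligned}
\tag{2.38}
\]

Plugging in the values of the binomial coefficients gives 

\[
\begin{aligned}
P(0) &= \frac{1}{32}, & P(1) &= \frac{5}{32}, & P(2) &= \frac{10}{32}, \\
P(3) &= \frac{10}{32}, & P(4) &= \frac{5}{32}, & P(5) &= \frac{1}{32}.
\end{aligned}
\tag{2.39}
\]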


The sum of all these probabilities correctly equals 1. The physical reason for this is 
that the number of Heads must be something, which means that the sum of all the 
probabilities must be 1. (This holds for any number of flips, of course, not just 5.) 
The mathematical reason is that the sum of the binomial coefficients (the numerators 
in the above fractions) equals $2^5$ (which is the denominator). See Section 1.8.3 for the 
explanation of this. 
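
The probabilities in Eq. (2.39), and the fact that they sum to 1, can be reproduced with a few lines of Python. This is just a sketch; `heads_distribution` is a name chosen here for illustration.

```python
from fractions import Fraction
from math import comb

def heads_distribution(n):
    """Return [P(0), ..., P(n)] for n flips of a fair coin."""
    # comb(n, k) ordered outcomes have exactly k Heads, out of 2^n in total.
    return [Fraction(comb(n, k), 2**n) for k in range(n + 1)]

probs = heads_distribution(5)
for k, p in enumerate(probs):
    print(f"P({k}) = {comb(5, k)}/32 = {float(p):.5f}")
print("sum =", sum(probs))
```

Calling `heads_distribution` with any other number of flips also gives probabilities that sum to exactly 1, because the binomial coefficients for n flips add up to $2^n$.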


2.3.5 Cards 

We already did a lot of card counting in Chapter 1 (particularly in Problem 1.10), 
and some of those results will be applicable here. As we have mentioned a number 
of times, exercises in probability are often just exercises in counting. There is effectively 
an endless number of probability questions we can ask about cards. In the 
following examples, we will always assume a standard 52-card deck. 


Example 1 (Royal flush from seven cards): A few variations of poker involve being 
dealt seven cards (in one way or another) and forming the best five-card hand that can 
be made from these seven cards. What is the probability of being able to form a Royal 
flush in this setup? A Royal flush consists of 10, J, Q, K, A, all from the same suit. 

Solution: The total number of possible seven-card hands is $\binom{52}{7}$ = 133,784,560. The 
number of seven-card hands that contain a Royal flush is $4 \cdot \binom{47}{2}$ = 4,324, because 
there are four ways to choose the five Royal flush cards (the four suits), and then $\binom{47}{2}$ 
ways to choose the other two cards from the remaining 52 - 5 = 47 cards in the deck. 
The probability is therefore 

\[
\frac{4 \cdot \binom{47}{2}}{\binom{52}{7}} = \frac{4{,}324}{133{,}784{,}560} \approx 0.0032\%.
\tag{2.40}
\]


This is larger than the result for five-card hands. In that case, only four of the $\binom{52}{5}$ = 
2,598,960 hands are Royal flushes, so the probability is 4/2,598,960 ≈ 0.00015%, 
which is about 20 times smaller than 0.0032%. As an exercise, you can show that the 
ratio happens to be exactly 21. 
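
For readers who want to check the arithmetic, the following Python sketch evaluates the binomial coefficients in this example and the ratio of the seven-card to the five-card probability. The variable names are ours.

```python
from fractions import Fraction
from math import comb

seven_card_hands = comb(52, 7)          # all possible seven-card hands
royal_in_seven = 4 * comb(47, 2)        # pick the suit, then the two other cards
p_seven = Fraction(royal_in_seven, seven_card_hands)

p_five = Fraction(4, comb(52, 5))       # only four five-card hands are Royal flushes

ratio = p_seven / p_five
print(f"seven cards: {float(p_seven):.6%}")
print(f"five cards:  {float(p_five):.6%}")
print("ratio:", ratio)
```

Using exact fractions rather than floating-point numbers makes it clear that the ratio is exactly 21 and not just approximately so.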


Example 2 (Suit full house): In a five-card poker hand, what is the probability of 
getting a “full house” of suits, that is, three cards of one suit and two of another? (This 
isn't an actual poker hand worth anything, but that won’t stop us from calculating the 
probability!) How does your answer compare with the probability of getting an actual 
full house, that is, three cards of one value and two of another? Feel free to use the 
result from part (a) of Problem 1.10. 

Solution: There are four ways to choose the suit that appears three times, and $\binom{13}{3}$ = 
286 ways to choose the specific three cards from the 13 of this suit. And then there 
are three ways to choose the suit that appears twice from the remaining three suits, 
and $\binom{13}{2}$ = 78 ways to choose the specific two cards from the 13 of this suit. The total 
number of suit-full-house hands is therefore $4 \cdot \binom{13}{3} \cdot 3 \cdot \binom{13}{2}$ = 267,696. Since there 
is a total of $\binom{52}{5}$ possible hands, the desired probability is 

\[
\frac{4 \cdot \binom{13}{3} \cdot 3 \cdot \binom{13}{2}}{\binom{52}{5}} = \frac{267{,}696}{2{,}598{,}960} \approx 10.3\%.
\tag{2.41}
\]


From part (a) of Problem 1.10, the total number of actual full-house hands is 3,744, 
which yields a probability of 3,744/2,598,960 ≈ 0.14%. It is therefore much more 
likely (by a factor of about 70) to get a full house of suits than an actual full house of 
values. (You can show that the exact ratio is 71.5.) This makes intuitive sense; there 
are more values than suits (13 compared with four), so it is harder to have all five cards 
involve only two values as opposed to only two suits. 
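
The comparison between the two kinds of full house can be checked numerically. The sketch below assumes the standard count for an actual full house (13 values for the triple, $\binom{4}{3}$ suits for it, 12 remaining values for the pair, $\binom{4}{2}$ suits for it); this decomposition reproduces the 3,744 quoted above but is not necessarily the route taken in Problem 1.10.

```python
from fractions import Fraction
from math import comb

hands = comb(52, 5)

# Suit full house: suit of the triple, its cards, suit of the pair, its cards.
suit_full_house = 4 * comb(13, 3) * 3 * comb(13, 2)

# Actual full house (standard count, assumed here): value and suits of the
# triple, then value and suits of the pair.
value_full_house = 13 * comb(4, 3) * 12 * comb(4, 2)

ratio = Fraction(suit_full_house, value_full_house)
print(f"suit full house:  {suit_full_house / hands:.2%}")
print(f"value full house: {value_full_house / hands:.2%}")
print("ratio:", float(ratio))
```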


Example 3 (Only two suits): In a five-card poker hand, what is the probability of 
having all of the cards be members of at most two suits? (A single suit falls into this 
category.) The suit full house in the previous example is a special case of “at most two 
suits.” This problem is a little tricky, at least if you solve it a certain way; be careful 
about double counting some of the hands! 

First solution: If two suits appear, then there are $\binom{4}{2}$ = 6 ways to pick them. For a 
given choice of two suits, there are $\binom{26}{5}$ ways to pick the five cards from the 2 · 13 = 26 
cards of these two suits. It therefore seems like there should be $\binom{4}{2} \cdot \binom{26}{5}$ = 394,680 
different hands that consist of cards from at most two suits. 

However, this isn’t correct, because we double (or actually triple) counted the hands 
that involve only one suit (the flushes). For example, if all five cards are hearts, then we 
counted such a hand in the heart/diamond set of $\binom{26}{5}$ hands, and also in the heart/spade 
set, and also in the heart/club set. We counted it three times when we should have 
counted it only once. Since there are $\binom{13}{5}$ hands that are heart flushes, we have included 
an extra $2 \cdot \binom{13}{5}$ hands, so we need to subtract these from our total. Likewise 
for the diamond, spade, and club flushes. The total number of hands that involve at 
most two suits is therefore 


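\[
\binom{4}{2} \binom{26}{5} - 4 \cdot 2 \cdot \binom{13}{5} = 394{,}680 - 10{,}296 = 384{,}384.
\tag{2.42}
\]

The desired probability is then 

\[
\frac{\binom{4}{2} \binom{26}{5} - 8 \binom{13}{5}}{\binom{52}{5}} = \frac{384{,}384}{2{,}598{,}960} \approx 14.8\%.
\tag{2.43}
\]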


This is larger than the result in Eq. (2.41), as it should be, because suit full houses are 
a subset of the hands that involve at most two suits. 

Second solution: There are three general ways that we can have at most two suits: 
(1) all five cards can be of the same suit (a flush), (2) four cards can be of one suit, and 
one card of another, or (3) three cards can be of one suit, and two cards of another; this 
is the suit full house from the previous example. We will denote these types of hands 
by (5,0), (4,1), and (3,2), respectively. How many hands of each type are there? 

There are $4 \cdot \binom{13}{5}$ = 5,148 hands of the (5,0) type, because there are $\binom{13}{5}$ ways to pick 
five cards from the 13 cards of a given suit, and there are four suits. From the previous 
example, there are $4 \cdot \binom{13}{3} \cdot 3 \cdot \binom{13}{2}$ = 267,696 hands of the (3,2) type. To figure out 
the number of hands of the (4,1) type, we can use exactly the same kind of reasoning 
as in the previous example. This gives $4 \cdot \binom{13}{4} \cdot 3 \cdot \binom{13}{1}$ = 111,540 hands. Adding up 
these three results gives the total number of “at most two suits” hands as 



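\[
\begin{aligned}
4 \binom{13}{5} + 4 \binom{13}{4} \cdot 3 \binom{13}{1} + 4 \binom{13}{3} \cdot 3 \binom{13}{2}
  &= 5{,}148 + 111{,}540 + 267{,}696 \\
  &= 384{,}384,
\end{aligned}
\tag{2.44}
\]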

in agreement with the first solution. (The repetition of the “384” here is due in part to 
the factors of 13 and 11 in all of the terms in the first line of Eq. (2.44). These numbers 
are factors of 1001.) The hands of the (3,2) type account for about 2/3 of the total, 
consistent with the fact that the 10.3% result in Eq. (2.41) is about 2/3 of the 14.8% 
result in Eq. (2.43). 
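
The agreement between the two solutions can be confirmed with a short Python sketch. In addition to the two formulas from the text, it includes a third, independent count (our own addition, not from the text) that sums over all ways of distributing five cards among the four suits with at most two suits represented.

```python
from itertools import product
from math import comb, prod

# First solution: choose two suits, then subtract the over-counted flushes.
first = comb(4, 2) * comb(26, 5) - 4 * 2 * comb(13, 5)

# Second solution: add up the (5,0), (4,1), and (3,2) types.
second = (4 * comb(13, 5)
          + 4 * comb(13, 4) * 3 * comb(13, 1)
          + 4 * comb(13, 3) * 3 * comb(13, 2))

# Independent check: sum over per-suit card counts (c1, c2, c3, c4) that
# total five cards and use at most two suits.
by_pattern = sum(
    prod(comb(13, c) for c in counts)
    for counts in product(range(6), repeat=4)
    if sum(counts) == 5 and sum(c > 0 for c in counts) <= 2
)

print(first, second, by_pattern)
print(f"probability: {first / comb(52, 5):.1%}")
```

The pattern-based count avoids the double-counting issue altogether, because every hand corresponds to exactly one tuple of per-suit counts.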


2.4 Four classic problems 

Let’s now look at four classic probability problems. No book on probability would 
be complete without a discussion of the “Birthday Problem” and the “Game-Show 
Problem.” Additionally, the “Prosecutor’s Fallacy” and the “Boy/Girl Problem” are 
two other classics that are instructive to study in detail. All four of these problems 
have answers that might seem counterintuitive at first, but they eventually make 
sense if you think about them long enough! 

After reading the statement of each problem, be sure to try solving it on your 
own before looking at the solution. If you can’t solve it on your first try, set it aside 
and come back to it later. There’s no hurry; the problem will still be there. There 
are only so many classic problems like these, so don’t waste them. If you look at 
a solution too soon, the opportunity to solve it is gone, and it’s never coming back. 
If you do eventually need to look at the solution, cover it up with a piece of paper 
and read one line at a time, to get a hint. That way, you can still (mostly) solve it on 
your own. 


2.4.1 The Birthday Problem 

We’ll present the Birthday Problem first. Aside from being a very interesting problem, 
its unexpected result allows you to take advantage of unsuspecting people and 
win money on bets at parties (as long as they’re large enough parties, as we’ll see!). 

Problem: How many people need to be in a room in order for there to be a greater 
than 1/2 probability that at least two of them have the same birthday? By “same 
birthday” we mean the same day of the year; the year may differ. Ignore leap years. 

(At this point, as with all of the problems in this section, don’t read any further until 
you’ve either solved the problem or thought hard about it for a long time.) 

Solution: If there was ever a problem that called for the “art of not” strategy in 
Section 2.3.1, this is it. There are many different ways for there to be at least 
one common birthday (one pair, two pairs, one triple, etc.), and it is completely 
intractable to add up all of these individual probabilities. It is much easier (and even 
with the italics, this is a vast understatement) to calculate the probability that there 
isn’t a common birthday, and then subtract this from 1 to obtain the probability that 
there is at least one common birthday. 

The calculation of the probability that there isn’t a common birthday proceeds 
as follows. Let there be n people in the room. We can imagine taking them one at a 
time and randomly plopping their names down on a calendar, with the (present) goal 
being that there are no common birthdays. The first name can go anywhere. But 
when we plop down the second name, there are only 364 “good” days left, because 
we don’t want the day to coincide with the first name’s day. The probability of success 
for the second name is therefore 364/365. Then, when we plop down the third 
name, there are only 363 “good” days left (assuming that the first two people have 
different birthdays), because we don’t want the day to coincide with either of the 
other two days. The probability of success for the third name is therefore 363/365. 
Similarly, when we plop down the fourth name, there are only 362 “good” days left 
(assuming that the first three people have different birthdays). The probability of 
success for the fourth name is therefore 362/365. And so on. 

If there are n people in the room, the probability that all n birthdays are distinct 
(that is, there isn’t a common birthday among any of the people; hence the 
superscript “no” below) therefore equals 

\[
P_n^{\text{no}} = 1 \cdot \frac{364}{365} \cdot \frac{363}{365} \cdot \frac{362}{365} \cdot \frac{361}{365} \cdots \frac{365 - (n - 1)}{365}.
\tag{2.45}
\]

If you want, you can write the initial 1 here as 365/365, to make things look nicer. 
Note that the last term involves (n - 1) and not n, because (n - 1) is the number 
of names that have already been plopped down. As a double check that this (n - 1) 
is correct, it works for small numbers like n = 2 and 3. You should always 
perform a simple check like this whenever you write down any expression involving 
a parameter such as n. 

We now just have to multiply out the product in Eq. (2.45) to the point where it 
becomes smaller than 1/2, so that the probability that there is a common birthday is 
larger than 1/2. With a calculator, this is tedious, but not horribly painful. We find 
that $P_{22}^{\text{no}}$ = 0.524 and $P_{23}^{\text{no}}$ = 0.493. If $P_n^{\text{yes}}$ is the probability that there is a common 
birthday among n people, then $P_n^{\text{yes}} = 1 - P_n^{\text{no}}$, so $P_{22}^{\text{yes}}$ = 0.476 and $P_{23}^{\text{yes}}$ = 0.507. 
Since our original goal was to have $P_n^{\text{yes}} > 1/2$ (or equivalently $P_n^{\text{no}} < 1/2$), we see 
that there must be at least 23 people in a room in order for there to be a greater than 
50% chance that at least two of them have the same birthday. The probability in the 
n = 23 case is 50.7%. 
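
Multiplying out the product in Eq. (2.45) is tedious by hand but trivial for a computer. The following Python sketch (function names are illustrative, and the 365-day year is the assumption stated in the problem) builds the product one factor at a time and finds the smallest n for which a common birthday is more likely than not.

```python
def prob_no_common_birthday(n, days=365):
    """Probability that n people all have distinct birthdays, Eq. (2.45)."""
    p = 1.0
    for k in range(n):
        # The (k+1)-th person must avoid the k days already taken.
        p *= (days - k) / days
    return p

def smallest_group(threshold=0.5, days=365):
    """Smallest n with P(common birthday) strictly greater than threshold."""
    n = 1
    while 1 - prob_no_common_birthday(n, days) <= threshold:
        n += 1
    return n

for n in (22, 23):
    p_no = prob_no_common_birthday(n)
    print(f"n = {n}: P_no = {p_no:.3f}, P_yes = {1 - p_no:.3f}")
print("smallest n:", smallest_group())
```

The loop index k plays the role of the "number of names already plopped down," so the last factor for n people is (365 - (n - 1))/365, as in the text.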

The task of Problem 2.14 is to calculate the probability that among 23 people, 
exactly two of them have a common birthday. That is, there aren’t two different 
pairs with common birthdays, or a triple with the same birthday, etc. 

Remark: The n = 23 answer to our problem is much smaller than most people would 
expect. As mentioned above, it therefore provides a nice betting opportunity. For n = 30, 
the probability of a common birthday increases to 70.6%, and most people would still find 
it hard to believe that among 30 people, there are probably two who have the same birthday. 
Table 2.4 lists various values of n and the probabilities, $P_n^{\text{yes}} = 1 - P_n^{\text{no}}$, that at least two 
people have a common birthday. 


n          10      20      23      30      50      60      70       100
P_n^yes    11.7%   41.1%   50.7%   70.6%   97.0%   99.4%   99.92%   99.99997%


Table 2.4: Probability of a common birthday among n people. 

Even for n = 50, most people would probably be happy to bet, at even odds, that no two 
people have the same birthday. But you’ll win the bet 97% of the time. 

One reason why many people can’t believe the n = 23 result is that they’re asking themselves 
a different question, namely, “How many people (in addition to me) need to be present 
in order for there to be at least a 1/2 chance that someone else has my birthday?” The answer 
to this question is indeed much larger than 23. The probability that no one out of n people has 
a birthday on a given day is simply $(364/365)^n$, because each person has a 364/365 chance 
of not having that particular birthday. For n = 252, this is just over 1/2. And for n = 253, 
it is just under 1/2; it equals 0.4995. Therefore, you need to come across 253 other people 
in order for the probability to be greater than 1/2 that at least one of them does have your 
birthday (or any other particular birthday). See Problem 2.16 for further discussion of this. ♣
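
The two questions contrasted in this remark ("any two people share a birthday" versus "someone shares my birthday") are easy to compare numerically. The sketch below recomputes the entries of Table 2.4 and the $(364/365)^n$ values quoted above; the function names are ours.

```python
def p_yes(n, days=365):
    """Probability that at least two of n people share a birthday."""
    p_no = 1.0
    for k in range(n):
        p_no *= (days - k) / days
    return 1 - p_no

def prob_nobody_has_given_day(n, days=365):
    """Probability that none of n people has one particular birthday."""
    return ((days - 1) / days) ** n

# Values of n taken from Table 2.4.
for n in (10, 20, 23, 30, 50, 60, 70, 100):
    print(f"n = {n:3d}: P_yes = {p_yes(n):.5%}")

# The "someone has MY birthday" question needs far more people.
for n in (252, 253):
    print(f"n = {n}: P(nobody matches) = {prob_nobody_has_given_day(n):.4f}")
```

The first probability grows quickly because the number of pairs of people grows like $n^2$, whereas the second involves only the n pairings with one fixed person.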


2.4.2 The Game-Show Problem 

We’ll now discuss the Game-Show Problem. In addition to having a variety of 
common incorrect solutions, this problem also has a long history of people arguing 
vehemently in favor of those incorrect solutions. 

Problem: A game-show host offers you the choice of three doors. Behind one 
of these doors is the grand prize, and behind the other two are goats. The host 
(who knows what is behind each of the doors) announces that after you select a 
door (without opening it), he will open one of the other two doors and purposefully 
reveal a goat. You select a door. The host then opens one of the other doors and 
reveals the promised goat. He then offers you the chance to switch your choice to 
the remaining door. To maximize the probability of winning the grand prize, should 
you switch or not? Or does it not matter? 

Solution: We’ll present three solutions, one right and two wrong. You should 
decide which one you think is correct before reading beyond the third solution. 
Cover up the page after the third solution with a piece of paper, so that you don’t 
inadvertently see which one is correct. 

• Reasoning 1: Once the host reveals a goat, the prize must be behind one of 
the two remaining doors. Since the prize was randomly located to begin with, 
there must be equal chances that the prize is behind each of the two remaining 
doors. The probabilities are therefore both 1/2, so it doesn’t matter if you 
switch. 

If you want, you can imagine a friend (who is aware of the whole procedure 
of the host announcing that he will open a door and reveal a goat) entering the 
room after the host opens the door. This person sees two identical unopened 
doors (he doesn’t know which one you initially picked) and a goat. So for him 
there must be a 1/2 chance that the prize is behind each unopened door. The 
probabilities for you and your friend can’t be any different, so you also say 
that each unopened door has a 1/2 chance of containing the prize. It therefore 
doesn’t matter if you switch. 

• Reasoning 2: There is initially a 1/3 chance that the prize is behind any of the 
three doors. So if you don’t switch, your probability of winning is 1/3. No 
actions taken by the host can change the fact that if you play a large number 
n of these games, then (roughly) n/3 of them will have the prize behind the 
door you initially pick. 

Likewise, if you switch to the other unopened door, there is a 1/3 chance that 
the prize is behind that door. (There is obviously a goat behind at least one 
of the other two doors, so the fact that the host reveals a goat doesn’t tell you 
anything new.) Therefore, since the probability is 1/3 whether or not you 
switch, it doesn’t matter if you switch. 

• Reasoning 3: As in the first paragraph of Reasoning 2, if you don’t switch, 
your probability of winning is 1/3. 

However, if you switch, your probability of winning is greater than 1/3. It 
increases to 2/3. This can be seen as follows. Without loss of generality, 
assume that you pick the first door. (You can repeat the following reasoning 
for the other doors if you wish. It gives the same result.) There are three 
equally likely possibilities for what is behind the three doors: PGG, GPG, and 
GGP, where P denotes the prize and G denotes a goat. If you don’t switch, 
then in only the first of these three cases do you win, so your odds of winning 
are 1/3 (consistent with the first paragraph of Reasoning 2). But if you do 
switch from the first door to the second or third, then in the first case PGG 
you lose, but in the other two cases you win, because the door not opened by 
the host has the prize. (The host has no choice but to reveal the G and leave 
the P unopened.) Therefore, since two out of the three equally likely cases 
yield success if you switch, your probability of winning if you switch is 2/3. 
So you do in fact want to switch. 

Which of these three solutions is correct? Don’t read any further until you’ve firmly 
decided which one you think is right. 

The third solution is correct. The error in the first solution is the statement, 
“there must be equal chances that the prize is behind each of the two remaining 
doors.” This is simply not true. The act of revealing a goat breaks the symmetry 
between the two remaining doors, as explained in the third solution. One door is the 
one you initially picked, while the other door is one of the two that you didn’t pick. 
The fact that there are two possibilities doesn’t mean that their probabilities have to 
be equal, of course! 

The error in the supporting reasoning with your friend (who enters the room after 
the host opens the door) is the following. While it is true that both probabilities are 
1/2 for your friend, they aren’t both 1/2 for you. The statement, “the probabilities 
for you and your friend can’t be any different,” is false. You have information that 
your friend doesn’t have; you know which of the two unopened doors is the one you 
initially picked and which is the door that the host chose to leave unopened. (And 
as seen in the third solution, this information yields probabilities of 1/3 and 2/3.) 
Your friend doesn’t have this critical information. Both doors look the same to him. 
Probabilities can certainly be different for different people. If I flip a coin and peek 
and see a Heads, but I don’t show you, then the probability of a Heads is 1/2 for 
you, but 1 for me. 

The error in the second solution is that the act of revealing a goat does give you 
new information, as we just noted. This information tells you that the prize isn’t 
behind that door, and it also distinguishes between the two remaining unopened 
doors. One is the door you initially picked, while the other is one of the two doors 
that you didn’t initially pick. As seen in the third solution, this information has the 
effect of increasing the probability that the prize is behind the other door. Note that 
another reason why the second solution can’t be correct is that the two probabilities 
of 1/3 don’t add up to 1. 

To sum up, it should be no surprise that the probabilities are different for the 
switching and non-switching strategies after the host opens a door (the probabilities 
are obviously the same, equal to 1/3, whether or not a switch is made before the host 
opens a door), because the host gave you some of the information he had about the 
locations of things. 
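
The case-by-case reasoning of the third solution can be carried out exhaustively in code. The Python sketch below sums over every prize location, every initial pick, and every door the host is allowed to open. One assumption is ours rather than the text's: when the host has two goats to choose from (that is, when you initially picked the prize), he opens either with equal probability. The result doesn't depend on this choice.

```python
from fractions import Fraction
from itertools import product

def monty_hall_exact():
    """Return the exact (stay, switch) win probabilities for three doors."""
    doors = range(3)
    stay = Fraction(0)
    switch = Fraction(0)
    # The prize location and your initial pick are independent and uniform.
    for prize, pick in product(doors, repeat=2):
        weight = Fraction(1, 9)
        # The host may open any door that is neither your pick nor the prize.
        openable = [d for d in doors if d != pick and d != prize]
        for opened in openable:
            # Assumption: the host chooses uniformly among the doors he may open.
            w = weight / len(openable)
            other = next(d for d in doors if d not in (pick, opened))
            if pick == prize:
                stay += w
            if other == prize:
                switch += w
    return stay, switch

stay, switch = monty_hall_exact()
print("stay:", stay, " switch:", switch)
```

The two probabilities come out as 1/3 and 2/3, and they add up to 1, as they must.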

Remarks: 

1. If you still doubt the validity of the third solution, imagine a situation with 1000 doors 
containing one prize and 999 goats. After you pick a door, the host opens 998 other 
doors and reveals 998 goats (and he said beforehand that he was going to do this). In 
this setup, if you don’t switch, your chances of winning are 1/1000. But if you do 
switch, your chances of winning are 999/1000, which can be seen by listing out (or 
imagining listing out) the 1000 cases, as we did with the three PGG, GPG, and GGP 
cases in the third solution. It is clear that the switch should be made, because the only 
case where you lose after you switch is the case where you had initially picked the 
prize, and this happens only 1/1000 of the time. 

In short, a huge amount of information is gained by the revealing of 998 goats. There 
is initially a 999/1000 chance that the prize is somewhere behind the other 999 doors, 
and the host is kindly giving you the information of exactly which door it is (in the 
highly likely event that it is in fact one of the other 999). 

2. The clause in the statement of the problem, “The host announces that after you select 
a door (without opening it), he will open one of the other two doors and purposefully 
reveal a goat,” is crucial. If it is omitted, and it is simply stated that, “The host then 
opens one of the other doors and reveals a goat,” then it is impossible to state a preferred 
strategy. If the host doesn’t announce his actions beforehand, then for all you 
know, he always reveals a goat (in which case you should switch, as we saw above). 
Or he randomly opens a door and just happened to pick a goat (in which case it doesn’t 
matter if you switch, as you can show in Problem 2.18). Or he opens a door and reveals 
a goat if and only if your initial door has the prize (in which case you definitely should 
not switch). Or he could have one procedure on Tuesdays and another on Fridays, 
each of which depends on the color of the socks he’s wearing. And so on. 

3. As mentioned above, this problem is infamous for the intense arguments it lends itself 
to. There’s nothing terrible about getting the wrong answer, nor is there anything 
terrible about not believing the correct answer for a while. But concerning arguments 
that drag on and on, it doesn’t make any sense to argue about this problem for more 
than, say, 20 minutes, because at that point everyone should stop and just play the 
game! You can play a number of times with the switching strategy, and then a number 
of times with the non-switching strategy. Three coins with a dot on the bottom of 
one of them are all you need.¹ Not only will the actual game yield the correct answer 
(if you play enough times so that things average out), but the patterns that form will 
undoubtedly convince you of the correct reasoning (or reinforce it, if you’re already 
comfortable with it). Arguing endlessly about an experiment, when you can actually 
do the experiment, is as silly as arguing endlessly about what’s behind a door, when 
you can simply open the door. 

4. For completeness, there is one subtlety we should mention here. In the second solution, 
we stated, “No actions taken by the host can change the fact that if you play 
a large number n of these games, then (roughly) n/3 of them will have the prize behind 
the door you initially pick.” This part of the reasoning was correct; it was the 
“switching” part of the second solution that was incorrect. After doing Problem 2.18 
(where the host randomly opens a door), you might disagree with the above statement, 
because it will turn out in that problem that the actions taken by the host do affect this 
n/3 result. However, the above statement is still correct for “these games” (the ones 
governed by the original statement of this problem). See the second remark in the 
solution to Problem 2.18 for further discussion. ♣
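
In the spirit of the third remark ("just play the game"), here is a small Python simulation of the game as stated in the problem, where the host always reveals goats. It also covers the 1000-door variant from the first remark. This is only a sketch; the function names, the number of trials, and the fixed random seed are choices made here for illustration.

```python
import random

def play(n_doors, switch, rng):
    """Play one game; return True if the contestant wins the prize."""
    prize = rng.randrange(n_doors)
    pick = rng.randrange(n_doors)
    # The host opens every other door except one, never revealing the prize.
    # If the pick is wrong, the door left closed must hide the prize;
    # otherwise it is a randomly chosen goat door.
    if pick == prize:
        others = [d for d in range(n_doors) if d != pick]
        left_closed = rng.choice(others)
    else:
        left_closed = prize
    final = left_closed if switch else pick
    return final == prize

def win_rate(n_doors, switch, trials=100_000, seed=0):
    """Fraction of simulated games won with the given strategy."""
    rng = random.Random(seed)
    wins = sum(play(n_doors, switch, rng) for _ in range(trials))
    return wins / trials

print("3 doors, stay:     ", win_rate(3, switch=False))
print("3 doors, switch:   ", win_rate(3, switch=True))
print("1000 doors, switch:", win_rate(1000, switch=True, trials=20_000))
```

With enough trials the win rates settle near 1/3 for staying and 2/3 for switching with three doors, and near 999/1000 for switching with 1000 doors.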


2.4.3 The Prosecutor’s Fallacy 

We now present one of the most classic problems/paradoxes in the subject of probability. 
This classic nature is due in no small part to the problem’s critical relevance to 
the real world. After reading the statement of the problem below, you should think 
carefully and settle on an answer before looking at the solution. The discussion of 
conditional probability in Section 2.2.4 gives a hint at the answer. 

Problem: Consider the following scenario. Detectives in a city, say, Boston (whose 
population we will assume to be one million), are working on a crime and have put 
together a description of the perpetrator, based on things such as height, a tattoo, a 
limp, an earring, etc. Let’s assume that only one person in 10,000 fits the description. 
On a routine patrol the next day, police officers see a person fitting the description. 
This person is arrested and brought to trial based solely on the fact that he fits the 
description. 

During the trial, the prosecutor tells the jury that since only one person in 10,000 
fits the description (a true statement), it is highly unlikely (far beyond a reasonable 
doubt) that an innocent person fits the description (again a true statement); it is 
therefore highly unlikely that the defendant is innocent. If you were a member of 
the jury, would you cast a “guilty” vote? If yes, what is your level of confidence? If 
no, what is wrong with the prosecutor’s reasoning? 

¹ You actually don’t need three objects. It’s hard to find three exactly identical coins anyway. The 
“host” can simply roll a die, without showing the “contestant” the result. Rolling a 1 or 2 can mean that 
the prize is located behind the first door, a 3 or 4 the second, and a 5 or 6 the third. The game then 
basically involves calling out door numbers. 

Solution: We’ll assume that we are concerned only with people living in Boston. 
There are one million such people, so if one person in 10,000 fits the description, 
this means that there are 100 people in Boston who fit it (one of whom is the perpetrator). 
When the police officers pick up someone fitting the description, this person 
could be any one of these 100 people. So the probability that the defendant in the 
courtroom is the actual perpetrator is only 1/100. In other words, there is a 99% 
chance that the person is innocent. A guilty verdict (based on the given evidence) 
would therefore be a horrible and tragic vote. 

The above (correct) reasoning is fairly cut and dry, but it contradicts the prosecutor’s 
reasoning. The prosecutor’s reasoning must therefore be incorrect. But what 
exactly is wrong with it? It seems quite plausible at every stage. To isolate the flaw 
in the logic, let’s list out the three separate statements the prosecutor made in his 
argument: 

1. Only one person in 10,000 fits the description. 

2. It is highly unlikely (far beyond a reasonable doubt) that an innocent person 
fits the description. 

3. It is therefore highly unlikely that the defendant is innocent. 

As we noted above when we posed the problem, the first two of these statements are 
true. Statement 1 is true by assumption, and Statement 2 is true basically because 
1/10,000 is a small number. Let’s be precise about this and work out the exact 
probability that an innocent person fits the description. Of the one million people 
in Boston, the number who fit the description is $(1/10{,}000)(10^6)$ = 100. Of these 
100 people, only one is guilty, so 99 are innocent. And the total number of innocent 
people is $10^6 - 1$ = 999,999. The probability that an innocent person fits the 
description is therefore 

\[
\frac{\text{innocent and fitting description}}{\text{innocent}} = \frac{99}{999{,}999} \approx \frac{1}{10{,}000}.
\tag{2.46}
\]

As expected, the probability is essentially equal to 1/10,000. 

Now let’s look at the third statement above. This is where the error is. This 
statement is false, because Statement 2 simply does not imply Statement 3. We 
know this because we have already calculated the probability that the defendant is 
innocent, namely 99%. This correct probability of 99% is vastly different from the 
incorrect probability of 1/10,000 that the prosecutor is trying to mislead you with. 
However, even though the correct result of 99% tells us that Statement 3 must be 
false, where exactly is the error? After all, at first glance Statement 3 seems to 
follow from Statement 2. The error is the confusion of conditional probabilities. In 
detail: 

• Statement 2 deals with the probability of fitting the description, given innocence. 
The (true) statement is equivalent to, “If a person is innocent, then 
there is a very small probability that he fits the description.” This probability 
is the conditional probability P(D|I), with D for description and I for innocence. 

• Statement 3 deals with the probability of innocence, given that the description 
is fit. The (false) statement is equivalent to, “If a person (such as the 
defendant) fits the description, then there is a very small probability that he is 
innocent.” This probability is the conditional probability P(I|D). 

These two conditional probabilities are not the same. The error is the assumption 
(or implication, on the prosecutor’s part) that they are. As we saw above, 
P(D|I) = 99/999,999 ≈ 0.0001, whereas P(I|D) = 0.99. These two probabilities 
are markedly different. 

Intuitively, P(D|I) is very small because a very small fraction of the population 
(in particular, a very small fraction of the innocent people) fit the description. And 
P(I|D) is very close to 1 because nearly everyone (in particular, nearly everyone 
who fits the description) is innocent. This state of affairs is indicated in Fig. 2.10. 
(This is just a rough figure; the areas aren’t actually in the proper proportions.) The 
large oval represents the 999,999 innocent people, and the small oval represents the 
100 people who fit the description. 



There are three basic types of people in the figure: There are A = 999,900 
innocent people who don’t fit the description, B = 99 innocent people who do 
fit the description, and C = 1 guilty person who fits the description. (The fourth 
possibility - a guilty person who doesn’t fit the description - doesn’t exist.) The 
two conditional probabilities that are relevant in the above discussion are then 


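\[
\begin{aligned}
P(D|I) &= \frac{B}{\text{innocent}} = \frac{B}{B + A} = \frac{99}{999{,}999}, \\
P(I|D) &= \frac{B}{\text{fit description}} = \frac{B}{B + C} = \frac{99}{100}.
\end{aligned}
\tag{2.47}
\]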


Both of these probabilities have B in the numerator, because B represents the people 
who are innocent and fit the description. But the A in the first denominator is much 
larger than the C in the second denominator. Or said in another way, B is a very small 
fraction of the innocent people (the large oval in Fig. 2.10), whereas it is a very large 
fraction of the people who fit the description (the small oval in Fig. 2.10). 
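
The two conditional probabilities in Eq. (2.47) can be computed directly from the head counts A, B, and C. The Python sketch below does this with exact fractions and then, as an extra consistency check that is our addition rather than part of the text, recovers P(I|D) from P(D|I) via Bayes' theorem.

```python
from fractions import Fraction

population = 1_000_000
fit = population // 10_000      # 100 people fit the description

C = 1                           # guilty and fits the description
B = fit - C                     # innocent and fits the description
A = population - fit            # innocent and doesn't fit the description

p_D_given_I = Fraction(B, A + B)   # fits the description, given innocent
p_I_given_D = Fraction(B, B + C)   # innocent, given fits the description

# Bayes' theorem: P(I|D) = P(D|I) P(I) / P(D).
p_I = Fraction(A + B, population)
p_D = Fraction(B + C, population)
p_I_given_D_bayes = p_D_given_I * p_I / p_D

print("P(D|I) =", p_D_given_I, "≈", float(p_D_given_I))
print("P(I|D) =", p_I_given_D, "≈", float(p_I_given_D))
print("via Bayes:", p_I_given_D_bayes)
```

The prosecutor's argument quotes the first number where the second is the one that matters; the two differ by a factor of roughly 10,000.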

The prosecutor’s faulty reasoning has been used countless times in actual court 
cases, with tragic consequences. Innocent people have been convicted, and guilty 
people have walked free (the argument can work in that direction too). These consequences 
can’t be blamed on the jury, of course. It is inevitable that many jurors will 
fail to spot the error in the reasoning. It would be silly to think that the entire population 
should be familiar with this issue in probability. Nor can the blame be put 
on the attorney making the argument. This person is either (1) overzealous and/or 
incompetent, or (2) entirely within his/her right to knowingly make an invalid argument 
(as distasteful as this may seem). In the end, the blame falls on either (1) the 
opposing attorney for failing to rebut the known logical fallacy, or (2) a legal system 
that in some cases doesn’t allow a final rebuttal. 

2.4.4 The Boy/Girl Problem 

The well-known Boy/Girl Problem can be stated in many different ways, with
answers that may or may not be the same. Three different formulations are presented
below, and a fourth is given in Problem 2.19. Assume in all of them that any
process involved in the scenario is completely random. That is, assume that any child
is equally likely to be a boy or a girl (even though this isn’t quite true in real life),
and assume that there is nothing special about the person you’re talking with, and
assume that there are no correlations between children (as there are with identical
twins), and so on.

Problem: 

(a) You bump into a random person on the street who says, “I have two children. 
At least one of them is a boy.” What is the probability that the other child is 
also a boy? 

(b) You bump into a random person on the street who says, “I have two children. 
The older one is a boy.” What is the probability that the other child is also a 
boy? 

(c) You bump into a random person on the street who says, “I have two children, 
one of whom is this boy standing next to me.” What is the probability that the 
other child is also a boy? 

Solution: 

(a) The key to all three of these formulations is to list out the various equally 
likely possibilities for the family’s children, while taking into account only 
the “I have two children” information, and not yet the information about the 
boy. With B for boy and G for girl, the family in the present scenario in part 
(a) can be of four types (at least before the parent gives you information about 
the boy), each with probability 1/4: 


[BB]     [BG]     [GB]     GG








Ignore the boxes for a moment. In each pair of letters, the first letter stands 
for the older child, and the second letter stands for the younger child. 

Note that there are indeed four equally likely possibilities (BB, BG, GB, GG), 
as opposed to just three equally likely possibilities (BB, BG, GG), because the 
older child has a 50-50 chance of being a boy or a girl, as does the younger 
child. The BG and GB cases each get counted once, just as the HT and TH 
cases each get counted once when flipping two coins, where the four equally 
likely possibilities are HH, HT, TH, TT. 

Under the assumption of general randomness stated in the problem, we are
assuming that you are equally likely (at least before the parent gives you
information about the boy) to bump into a parent of any one of the above four
types of two-child families.

Let us now invoke the information that at least one child is a boy. This
information tells us that you can’t be talking with a GG parent. The parent must be
a BB, BG, or GB parent, all equally likely. (They are equally likely, because
they are all equivalent with regard to the “at least one of them is a boy”
statement.) These are the boxed families in the above list. Of these three cases,
only the BB case has the other child being a boy. The desired probability that
the other child is a boy is therefore 1/3.

If you don’t trust the reasoning in the preceding paragraph, just imagine
performing many trials of the setup. This is always a good strategy when solving
probability problems. Imagine that you encounter 1000 random parents of
two children. You will encounter about 250 of each of the four types of
parent. The 250 GG parents have nothing to do with the given setup, so we must
discard them. Only the other 750 parents (BB, BG, GB) are able to provide
the given information that at least one child is a boy. Of these 750 parents,
250 are of the BB type and thereby have a boy as the other child. The desired
probability is therefore 250/750 = 1/3.
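The “many trials” strategy can also be handed to a computer. The following Python sketch is one possible way to do it (the function name, the number of families, and the seed are our own illustrative choices): it generates random two-child families, discards the GG ones, and counts the fraction of the remaining families that are BB.

```python
import random

def simulate_part_a(num_families=100_000, seed=0):
    """Estimate P(both boys | at least one boy) by simulating two-child families."""
    rng = random.Random(seed)
    at_least_one_boy = 0
    both_boys = 0
    for _ in range(num_families):
        # (older, younger), each independently "B" or "G" with probability 1/2.
        family = (rng.choice("BG"), rng.choice("BG"))
        if "B" in family:              # discard the GG families
            at_least_one_boy += 1
            if family == ("B", "B"):
                both_boys += 1
    return both_boys / at_least_one_boy

estimate = simulate_part_a()
print(estimate)  # should be close to 1/3
```

The estimate fluctuates slightly from run to run if the seed is changed, but it settles near 1/3 as the number of families grows.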

(b) As in part (a), before the information about the boy is taken into account, 
there are four equally likely possibilities for the children (again ignore the 
boxes for a moment): 


[BB]     [BG]     GB     GG


But once the parent tells you that the older child is a boy, the GB and GG 
cases are ruled out; remember that the first letter in each pair corresponds to 
the older child. So you must be talking with a BB or BG parent, both equally 
likely. Of these two cases, only the BB case has the other child being a boy. 
The desired probability that the other child is a boy is therefore 1/2. 

(c) This version of the problem is a little trickier, because there are now eight
equally likely possibilities (before the information about the boy is taken into
account), instead of just four. This is true because for each of the four types of
families in the above lists, the parent may choose to take either of the children
for a walk (with equal probabilities, as we are assuming for everything). The
eight equally likely possibilities are therefore shown in Table 2.5 (again ignore
the boxes for a moment). The bold letter indicates the child you encounter.


[**B**B]   [B**B**]   [**B**G]   B**G**   **G**B   [G**B**]   **G**G   G**G**


Table 2.5: The eight types of families, accounting for the child present. 


Once the parent tells you that one of the children is the boy standing there,
four of the eight possibilities are ruled out. Only the four boxed pairs in
Table 2.5 (the ones with a bold B) satisfy the condition that the child standing
there is a boy. Of these four (equally likely) possibilities, two of them have
the other child being a boy. The desired probability that the other child is a
boy is therefore 1/2.
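The counting in all three parts can be checked by listing the eight equally likely possibilities explicitly. The Python sketch below is one way to do this; the tuple layout (older child, younger child, index of the child who is present) and the helper name `conditional` are our own conventions, not anything taken from the text.

```python
from fractions import Fraction
from itertools import product

# Each outcome is (older, younger, index of the child who is present).
# All eight outcomes are equally likely under the randomness assumption.
outcomes = [(older, younger, present)
            for older, younger in product("BG", repeat=2)
            for present in (0, 1)]

def conditional(event, given):
    """P(event | given) for equally likely outcomes, found by counting."""
    kept = [o for o in outcomes if given(o)]
    return Fraction(sum(1 for o in kept if event(o)), len(kept))

def both_boys(o):
    return o[0] == "B" and o[1] == "B"

part_a = conditional(both_boys, lambda o: "B" in o[:2])    # at least one boy
part_b = conditional(both_boys, lambda o: o[0] == "B")     # older child is a boy
part_c = conditional(both_boys, lambda o: o[o[2]] == "B")  # present child is a boy

print(part_a, part_b, part_c)  # 1/3 1/2 1/2
```

In parts (a) and (b) the index of the present child plays no role, so each family type is simply counted twice, which doesn't change the ratios.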

Remarks: 

1. We used the given assumption of general randomness many times in the above
solutions. One way to make things nonrandom is to assume that the parent who is out for
a walk is chosen randomly with equal 1/3 probabilities of being from BB families, 
or GG families, or one-boy-and-one-girl families. This is an artificial construction, 
because it means that a given BG or GB family (which together make up half of all 
two-child families) is less likely to be chosen than a given BB or GG family. This 
violates our assumption of general randomness. In this scenario, you can show that 
the answers to parts (a), (b), and (c) are 1/2, 2/3, and 2/3. 

Another way to make things nonrandom is to assume that in part (c) a girl is always 
chosen to go on the walk if the family has at least one girl. The answer to part (c) is 
then 1, because the only way a boy will be standing there is if both children are boys. 
On the other hand, if we assume that a boy is always chosen to go on the walk if the 
family has at least one boy, then the answer to part (c) is 1/3. This is true because for 
BB, the other child is a boy; and for both BG and GB (for which the boy is always 
chosen to go on the walk), the other child is a girl. Basically, the middle four pairs in 
Table 2.5 will all have a bold B, so they will all be boxed. There are countless ways
to make things nonrandom, so unless we make an assumption of general randomness, 
there is no way to solve the problem. 
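The nonrandom variants in this remark can be verified with a short weighted calculation. In the Python sketch below, the function `answers` and its two parameters are hypothetical names introduced only for this check: `family_prob` gives the probability of meeting each type of parent, and `boy_walk_prob` is the probability that, in a one-boy-and-one-girl family, the boy is the child on the walk.

```python
from fractions import Fraction

def answers(family_prob, boy_walk_prob):
    """Return the answers to parts (a), (b), (c) for the given weights."""
    # Probability that the child on the walk is a boy, for each family type.
    present_is_boy = {"BB": 1, "BG": boy_walk_prob, "GB": boy_walk_prob, "GG": 0}
    p = family_prob
    part_a = p["BB"] / (p["BB"] + p["BG"] + p["GB"])
    part_b = p["BB"] / (p["BB"] + p["BG"])
    part_c = p["BB"] / sum(p[f] * present_is_boy[f] for f in p)
    return part_a, part_b, part_c

half = Fraction(1, 2)
uniform = {f: Fraction(1, 4) for f in ("BB", "BG", "GB", "GG")}
thirds = {"BB": Fraction(1, 3), "GG": Fraction(1, 3),
          "BG": Fraction(1, 6), "GB": Fraction(1, 6)}

print(answers(uniform, half))   # general randomness: values 1/3, 1/2, 1/2
print(answers(thirds, half))    # equal 1/3 weights:  values 1/2, 2/3, 2/3
print(answers(uniform, 0)[2])   # a girl always walks if there is one: 1
print(answers(uniform, 1)[2])   # a boy always walks if there is one: 1/3
```

Changing either set of weights changes the answers, which is the point of the remark: without a stated assumption about how the parent and the child are chosen, the problem has no definite answer.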

2. Let’s compare the scenarios in parts (a) and (b), to see exactly why the probabilities
differ. In part (a), the parent’s statement rules out the GG case. The BB, BG, and GB
cases survive, with the BB families representing 1/3 of all of the possibilities. If the
parent then changes the statement from “at least one of them is a boy” to “the older one
is a boy,” we are now in the realm of part (b). The GB case is now also ruled out (in
addition to the GG case). So only the BB and BG cases survive, with the BB families
representing 1/2 of all of the possibilities. This is why the probability jumps from 1/3
to 1/2 in going from part (a) to part (b). An additional group of families (GB) is ruled
out.

Let’s now compare the scenarios in parts (a) and (c), to see exactly why the
probabilities differ. As in the preceding paragraph, the parent’s statement in part (a) rules
out the GG case. If the parent then makes the additional statement, “... and there he
is over there next to that tree,” we are now in the realm of part (c). Which additional
families are ruled out? Well, in part (a), you could be talking with a parent in any of
the families in Table 2.5 except the two GG entries. So there are six valid possibilities.
But as soon as the parent adds the “and there he is” comment, the unboxed GB and
BG entries are ruled out. So a larger fraction of the valid possibilities (now two out of
four, instead of two out of six) have the other child being a boy.

3. Having gone through all of the above reasonings and the comparisons of the different
cases, we should note that there is actually a much quicker way of obtaining the
probabilities of 1/2 in parts (b) and (c). If the parent says that the older child is a boy, or
that one of the children is the boy standing next to her, then the parent is making a
statement solely about a particular child (the older one, or the present one). The
parent is saying nothing about the other child (the younger one, or the absent one). We
therefore know nothing about that child. So by our assumption of general
randomness, the other child is equally likely to be a boy or a girl. This should be contrasted
with part (a). In that scenario, when the parent says that at least one child is a boy, the
parent is not making a claim about a specific child, but rather about the collective set
of the two children together. We are therefore not able to uniquely define the “other
child” and simply say that the answer is 1/2. The answer depends on both children
together, and it turns out to be different from 1/2 (namely 1/3).

4. There is a subtlety in this problem that we should address: How does the parent decide 
what information to give you? A reasonable rule could be that in part (a) the parent 
says, “At least one child is a boy,” if she is able to; otherwise she says, “At least one 
child is a girl.” This is consistent with all of our above reasoning. But consider what 
happens if we tweak the rule so that now the parent says, “At least one child is a girl,” 
if she is able to; otherwise she says, “At least one child is a boy.” In this case, the 
answer to part (a) is 1, because the only parents making the “boy” statement are the 
BB parents. This minor tweak completely changes the problem. 

If you want to avoid this issue, you can rephrase part (a) as: You bump into a random 
person on the street and ask, “Do you have (exactly) two children? If so, is at least one 
of them a boy?” In the cases where the answers to both of these questions are “yes,” 
what is the probability that the other child is also a boy? Alternatively, you can just 
remove the parent and pose the problem as: Consider all two-child families that have 
at least one boy. What is the probability that both children are boys? This phrasing 
isn't as catchy as the original, but it gets rid of the above issue. 
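The effect of the parent’s rule for choosing a statement can be made concrete with a small count. The Python sketch below encodes the two rules described in this remark as two functions (the function names are our own); each returns whether a parent of a given family type would make the “At least one child is a boy” statement.

```python
from fractions import Fraction

families = ["BB", "BG", "GB", "GG"]  # equally likely, probability 1/4 each

def says_boy_rule_1(family):
    # Rule 1: say "at least one boy" whenever that is true.
    return "B" in family

def says_boy_rule_2(family):
    # Rule 2: say "at least one girl" whenever that is true; say "boy" only otherwise.
    return "G" not in family

def prob_both_boys(says_boy):
    """P(BB | parent makes the 'boy' statement), by counting family types."""
    speakers = [f for f in families if says_boy(f)]
    return Fraction(speakers.count("BB"), len(speakers))

print(prob_both_boys(says_boy_rule_1))  # 1/3
print(prob_both_boys(says_boy_rule_2))  # 1
```

The two rephrasings of part (a) given above correspond to the first rule, since every family with at least one boy is counted.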

5. In the various lists of types of families in the above solutions, only the boxed types
were applicable. The unboxed ones didn't satisfy the conditions given in the statement
of the problem, so we discarded them. This act of discarding the unboxed types is
equivalent to using the conditional-probability statement in Eq. (2.5), which can be
rearranged to say

$$P(B|A) = \frac{P(A \text{ and } B)}{P(A)}. \qquad (2.48)$$

For example, in part (a) if we let A = {at least 1 boy} and B = {2 boys}, then we
obtain

$$P\big(\{\text{2 boys}\} \,\big|\, \{\text{at least 1 boy}\}\big) = \frac{P\big(\{\text{at least 1 boy}\} \text{ and } \{\text{2 boys}\}\big)}{P\big(\{\text{at least 1 boy}\}\big)}. \qquad (2.49)$$

The lefthand side of this equation is the probability we’re trying to find. On the
righthand side, we can rewrite P({at least 1 boy} and {2 boys}) as just P({2 boys}), because
{2 boys} is a subset of {at least 1 boy}. So we have

$$P\big(\{\text{2 boys}\} \,\big|\, \{\text{at least 1 boy}\}\big) = \frac{P(\{\text{2 boys}\})}{P(\{\text{at least 1 boy}\})} = \frac{1/4}{3/4} = \frac{1}{3}. \qquad (2.50)$$

The preceding equations might look a bit intimidating, which is why we took a more
intuitive route in the above solution to part (a), where we imagined doing 1000 trials
and then discarding the 250 GG families. Discarding these families accomplishes
the same thing as having the P({at least 1 boy}) term in the denominator in Eq. (2.50);
namely, they both signify that we are concerned only with families that have at least
one boy. This remark leads us into the following section on Bayes’ theorem.

6. If you thought that some of the answers to this problem were counterintuitive, then, 
well, you haven’t seen anything yet! Tackle Problem 2.19 and you’ll see why. * 


2.5 Bayes’ theorem 

We now introduce Bayes’ theorem, which gives a relation between certain
conditional probabilities. The theorem is relevant to much of what we have been
discussing in this chapter, particularly Section 2.4. We have technically already
derived everything we need for the theorem (and we have actually already been using
the theorem without realizing it), so the proof will be very quick. There are three
common forms of the theorem. After we prove these, we’ll do an example and then
present a helpful way of thinking about the theorem in terms of pictures.

Theorem 2.1 (Bayes’ theorem) The “simple form” of Bayes’ theorem is

$$P(A|Z) = \frac{P(Z|A)\cdot P(A)}{P(Z)}. \qquad (2.51)$$

The “explicit form” is (with “~A” shorthand for “not A”)

$$P(A|Z) = \frac{P(Z|A)\cdot P(A)}{P(Z|A)\cdot P(A) + P(Z|{\sim}A)\cdot P({\sim}A)}. \qquad (2.52)$$

And the “general form” is

$$P(A_k|Z) = \frac{P(Z|A_k)\cdot P(A_k)}{\sum_i P(Z|A_i)\cdot P(A_i)}, \qquad (2.53)$$

where the A_i are a complete and mutually exclusive set of events. That is, every
possible outcome belongs to one (hence the “complete”) and only one (hence the
“mutually exclusive”) of the A_i.


Proof: The simple form of Bayes’ theorem in Eq. (2.51) follows from what we
noted back in Eq. (2.9). Since the order of A and Z doesn’t matter in P(A and Z),
we can write down two different expressions for this probability:

$$\begin{aligned} P(A \text{ and } Z) &= P(A|Z)\cdot P(Z) \\ &= P(Z|A)\cdot P(A). \end{aligned} \qquad (2.54)$$

If we equate the two righthand sides of these equations and divide through by P(Z),
we obtain Eq. (2.51).

The explicit form in Eq. (2.52) follows from the fact that the P(Z) in the
denominator of Eq. (2.51) can be written as

$$\begin{aligned} P(Z) &= P(Z \text{ and } A) + P(Z \text{ and } {\sim}A) \\ &= P(Z|A)\cdot P(A) + P(Z|{\sim}A)\cdot P({\sim}A). \end{aligned} \qquad (2.55)$$

The first line here comes from the fact that every outcome is a member of either A
or ~A, and the second line comes from two applications of Eq. (2.54).

The general form in Eq. (2.53) is obtained by replacing the A in Eq. (2.51) with
A_k and noting that

$$\begin{aligned} P(Z) &= \sum_i P(Z \text{ and } A_i) \\ &= \sum_i P(Z|A_i)\cdot P(A_i). \end{aligned} \qquad (2.56)$$

The first line here comes from the fact that every outcome is a member of exactly
one of the A_i, and the second line comes from n applications (where n is the number
of A_i) of Eq. (2.54). Note that Eq. (2.52) is a special case of Eq. (2.53), with A_1 = A
and A_2 = ~A, and with k = 1 (so A_k = A). Note also that all of the numerators on
the righthand sides of the three formulations of the theorem are equal to P(A and Z)
or P(A_k and Z), from Eq. (2.54). ■
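To see the general form in action, here is a minimal Python sketch of Eq. (2.53). The function name `bayes_posterior` and its list-based arguments are our own illustrative choices; note that the code indexes the events starting from 0, whereas the text labels them starting from 1.

```python
def bayes_posterior(priors, likelihoods, k):
    """General form of Bayes' theorem, Eq. (2.53).

    priors[i]      = P(A_i), for a complete and mutually exclusive set of A_i
    likelihoods[i] = P(Z | A_i)
    Returns P(A_k | Z), with the events indexed from 0.
    """
    if abs(sum(priors) - 1.0) > 1e-9:
        raise ValueError("the priors must add up to 1")
    # Denominator: P(Z) = sum_i P(Z|A_i) * P(A_i), as in Eq. (2.56).
    p_z = sum(l * p for l, p in zip(likelihoods, priors))
    return likelihoods[k] * priors[k] / p_z

# Two-event case (A and ~A), which is the explicit form in Eq. (2.52).
# The numbers here are made up purely for illustration.
print(bayes_posterior([0.5, 0.5], [0.8, 0.2], k=0))  # approximately 0.8
```

With only two events in the list, the sum in the denominator has two terms and the function reduces to Eq. (2.52).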

As promised, these proofs were very quick. All we needed was Eq. (2.54) and the
fact that P(Z) = Σ_i P(Z and A_i), which holds because the A_i are mutually
exclusive and complete. However, even though the proofs were quick, and even though
the theorem isn’t anything we didn’t already know (since we already knew the two
ingredients in the preceding sentence), the theorem can still be a bit intimidating,
especially the general form in Eq. (2.53). So we’ll do an example to get some practice.
But first some remarks.

Remarks: 

1. In Eq. (2.53) the P(A_i) are known as the prior probabilities, the P(Z|A_i) are known
as the conditional probabilities, and P(A_k|Z) is known as the posterior probability.
The prior and conditional probabilities are the ones you are given (at least in this book;
see the following remark), and the posterior probability is the one you are trying to
find.

2. Since Bayes’ theorem is simply a restatement of what we already know, you might 
be wondering what good it is and why it comes up so often when people talk about 
probability. Does it actually give us anything new? Well, yes and no. The theorem 
itself doesn’t give us anything new, but the way in which it is used does. 

It would take many pages to do justice to this topic, but in a nutshell, there are two main
types of probability reasoning. Frequentist reasoning (which is what we are using
in this book) defines probability by imagining a large number of trials. In contrast,
Bayesian reasoning doesn't require a large number of trials. The difference between
these two reasonings shows up when one gets into statistical inference, that is, when
one tries to estimate probabilities by gathering data (which we won't do in this book).
In the end, the difference comes down to how one treats the prior probabilities P(A_i)
in Eq. (2.53). A frequentist considers them to be definite quantities (based on the
frequencies obtained in large numbers of trials), whereas a Bayesian considers them
to be unknowns whose values are given by specified distributions (determined in some
manner). However, this difference is moot in this book, because we will always deal
with situations where the prior probabilities take on definite values that are given. In
this case, the frequentist and Bayesian reasonings are identical. They both boil down
to Eq. (2.54). *

Let’s now do an example. A common setup where Bayes’ theorem is relevant
involves false positives on a diagnostic test, so that’s the setup we’ll use here.
After working through the example, we’ll see how we can alternatively make use of
a particularly helpful type of picture. There are many different probabilities that
appear in Eq. (2.53), and it can be hard to remember what the theorem says or to get
an intuitive feel for what’s going on. In contrast, a quick glance at a figure such as
Fig. 2.14 below makes it easy to remember the theorem and understand it intuitively.


Example (False positives): A hospital administers a test to see if a patient has a 
certain disease. Assume that we know the following three things: 

• 2% of the overall population has the disease. 

• If a person does have the disease, then the test has a 95% chance of correctly 
indicating that the person has it. (So 5% of the time, the test incorrectly indicates 
that the person doesn’t have the disease.) 

• If a person does not have the disease, then the test has a 10% chance of
incorrectly indicating that the person has it; this is a “false positive” result. (So 90%
of the time, the test correctly indicates that the person doesn’t have the disease.)

The question we want to answer is: If a patient tests positive, what is the probability
that they² actually have the disease?

We’ll answer this question first by pretending that we haven't seen Bayes’ theorem,
and then by using the theorem. The reasoning will be exactly the same in both
solutions, because in the first solution we’ll actually be using Bayes’ theorem without
realizing it.

First solution: Imagine taking a large number of people (say, 1000) from the general
population and testing them for the disease. A given person either has the disease or
doesn’t (two possibilities), and their test is either positive or negative (two
possibilities). So there are 2·2 = 4 different types of people, with regard to the disease and the
test. Let’s make a probability tree to determine how many people of each type there
are; see Fig. 2.11. The three given facts correspond to the three forks in the tree:

• The first fact tells us that of the given 1000 people, 2% (which is 20 people)
have the disease (on average), while 98% (which is 980 people) don’t have the
disease.

• The second fact tells us that of the 20 people with the disease, 95% (which is 19
people) test positive, while 5% (which is 1 person) tests negative.

• The third fact tells us that of the 980 people without the disease, 10% (which is
98 people) test positive, while 90% (which is 882 people) test negative.

² I am using “they” as a gender-neutral singular pronoun, in protest of the present failing of the English
language.




Figure 2.11: The probability tree for yes/no disease and positive/negative test.


The answer to the above question (namely, “If a patient tests positive, what is the 
probability that they actually have the disease?”) can now simply be read off from 
the tree. The total number of people who test positive is the sum of the two circled 
numbers, which is 19 + 98 = 117. And of these 117 people, only 19 have the disease. 
So our answer is 


$$\frac{19}{19 + 98} = \frac{19}{117} \approx 16\%. \qquad (2.57)$$


If we want to write this directly in terms of the given probabilities, then if we recall 
how we arrived at the numbers 19 and 98, we obtain 


$$\frac{(0.95)(0.02)}{(0.95)(0.02) + (0.10)(0.98)} \approx 0.16. \qquad (2.58)$$


Second solution: We'll use the “explicit form” of Bayes’ theorem in Eq. (2.52),
which is a special case of the “general form” in Eq. (2.53). In the notation of Eq. (2.52)
we have

    A = have disease,
    ~A = don’t have disease,
    Z = test positive. (2.59)

Our goal is to calculate P(A|Z), that is, the probability of having the disease, given a
positive test. From the given facts in the three bullet points, we know that

    P(A) = 0.02,
    P(Z|A) = 0.95,
    P(Z|~A) = 0.10. (2.60)

Plugging these probabilities into Eq. (2.52) gives

$$P(A|Z) = \frac{P(Z|A)\cdot P(A)}{P(Z|A)\cdot P(A) + P(Z|{\sim}A)\cdot P({\sim}A)} = \frac{(0.95)(0.02)}{(0.95)(0.02) + (0.10)(0.98)} \approx 0.16, \qquad (2.61)$$

in agreement with the first solution. This is the same expression as in Eq. (2.58),
which is consistent with the fact that (as we mentioned above) our reasoning in the
first solution was equivalent to using Bayes’ theorem.
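Both solutions can be reproduced in a few lines of Python. The sketch below uses the three given percentages; the variable names are our own. It first follows the head counts along the tree and then plugs the probabilities into the explicit form of the theorem.

```python
population = 1000
p_disease = 0.02            # P(A)
p_pos_given_disease = 0.95  # P(Z|A)
p_pos_given_healthy = 0.10  # P(Z|~A)

# First solution: expected head counts along the branches of the tree.
true_positives = population * p_disease * p_pos_given_disease          # about 19
false_positives = population * (1 - p_disease) * p_pos_given_healthy   # about 98
from_counts = true_positives / (true_positives + false_positives)

# Second solution: the explicit form of Bayes' theorem, Eq. (2.52).
numerator = p_pos_given_disease * p_disease
from_bayes = numerator / (numerator + p_pos_given_healthy * (1 - p_disease))

print(round(from_counts, 4), round(from_bayes, 4))  # 0.1624 0.1624
```

The population size cancels in the first ratio, which is why the two results agree.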


Remark: We see that if a person tests positive, they have only a 16% chance of 
actually having the disease. This answer might seem surprisingly low. After all, the 
test seems fairly reliable; it gives the correct result 95% of the time if a person has the 
disease, and 90% of the time if a person doesn’t have the disease. So how did we end 
up with an answer that is much smaller than either of these two percentages? 

The explanation is that because the percentage of people with the disease is so tiny 
(2%), the small percentage (10%) of false positives among the non-disease people 
yields a number of false positives that is significantly larger than the number of true 
positives. Basically, 10% of 98% of 1000 (which is 98) is significantly larger than 95% 
of 2% of 1000 (which is 19). The 98 false positives dominate the 19 true positives. 
Although the 10% false-positive rate is small, it isn’t small enough to prevent the 
smallness of the 2% disease rate from controlling the outcome. A takeaway from this 
discussion is that one must be very careful when testing for rare diseases. If the disease 
is very rare, then the test must be extremely accurate, otherwise a positive test isn’t 
meaningful. 

If we decrease the 10% percentage (that is, reduce the percentage of false positives) 
and/or increase the 2% percentage (that is, increase the percentage of people with 
the disease), then the answer to our original question will increase. That is, a larger 
fraction of the people who test positive will actually have the disease. For example, if 
we assume that 40% of the population have the disease (so 60% don’t have it), and if 
we keep all the other percentages in the problem the same, then Eq. (2.58) becomes 


$$\frac{(0.95)(0.40)}{(0.95)(0.40) + (0.10)(0.60)} \approx 0.86. \qquad (2.62)$$


This probability is closer to 1 than in the original scenario, because if we have 1000 
people, then the 60 (instead of the earlier 98) false positives are dominated by the 380 
(instead of the earlier 19) true positives. You can verify these numbers. 

In the limit where the 10% false-positive percentage in the original scenario goes to
zero, or the 2% disease percentage goes to 100%, the number of false positives goes
to zero. This is true because if 10% → 0% then the test never incorrectly says that a
person has the disease when they don’t; and if 2% → 100% then the entire population
has the disease, so every positive test is a true one. In either of these limits, the answer
to our question goes to 1 (or 100%); a positive test always correctly indicates the
disease. *
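The dependence of the answer on the two percentages can be explored numerically. In the following Python sketch, the function name and the parameter names `prevalence`, `sensitivity`, and `false_positive_rate` are our own labels for the 2%, 95%, and 10% figures in the example.

```python
def p_disease_given_positive(prevalence, sensitivity=0.95, false_positive_rate=0.10):
    """Explicit form of Bayes' theorem, Eq. (2.52), for the testing example."""
    true_pos = sensitivity * prevalence
    false_pos = false_positive_rate * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Vary the fraction of the population with the disease.
for prevalence in (0.02, 0.40, 1.00):
    print(prevalence, round(p_disease_given_positive(prevalence), 2))
# 0.02 0.16
# 0.4 0.86
# 1.0 1.0

# Shrinking the false-positive rate also pushes the answer toward 1.
for fpr in (0.10, 0.01, 0.0):
    print(fpr, round(p_disease_given_positive(0.02, false_positive_rate=fpr), 2))
```

The first loop reproduces the 16% and 86% results above, and both loops end at 1, in line with the two limits just described.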


In the first solution above, we calculated the various numbers and probabilities 
by using a probability tree. We can alternatively use a figure along the lines of 
Fig. 2.4. In the following discussion we’ll pretend that we haven’t seen Bayes’ 
theorem, and then we’ll circle back to the theorem and show in Fig. 2.14 how the 
different ingredients in the theorem correspond to the different parts of the figure. 

Fig. 2.12 shows a pictorial representation of the probability tree in Fig. 2.11.
The overall square represents the given 1000 people.³ A vertical line divides the
square into two rectangles - a very thin one on the left representing the 20 people
with the disease, and a wide one on the right representing the 980 people without the
disease. These two rectangles are further divided into the people who test positive
(the shaded lower regions, with 19 and 98 people) or test negative (the unshaded
upper regions, with 1 and 882 people). The desired probability of a person having
the disease if they test positive equals the 19 true positives (the darkly shaded thin
rectangle) divided by the total number of positives, 19 + 98 = 117 (both shaded
regions).

³ When drawing a figure like this, the area of a region can represent either the probability of being in
that region, or the actual number of outcomes/people/etc. in that region. The usage should be clear from
the context. We’re using actual numbers here.


Figure 2.12: The probability square for yes/no disease and positive/negative test.
[Labels in the figure: disease (2%), split into negative (5%) and positive (95%, true positives); no disease (98%), split into negative (90%) and positive (10%, false positives).]


In Fig. 2.12 there are only two types of people in the population - those with the 
disease and those without it. As an example of a more general setup, let’s consider 
how people commute to work. We’ll assume that we are given the percentages of 
people who walk, bike, drive, take the bus, etc. And then for each of these types, 
we’ll assume that we are also given the percentage who have a particular attribute - 
for example, the ability to play the guitar. We can then ask questions such as, “If we 
pick a random person (among those who commute to work) from the set of people 
who can play the guitar, what is the probability that this person walks to work?” If 
we compare this question to our earlier one involving the disease testing, we see that 
guitar playing is analogous to testing positive, and walking to work is analogous to 
having the disease. It’s just that now we have many types of commuters instead of 
only two types of disease carriers (carriers or noncarriers).

To answer the above question, we can draw a figure analogous to Fig. 2.12; 
see Fig. 2.13 with some made-up percentages for the various types of commuters. 
These percentages are undoubtedly completely unrealistic, but they’re good enough 
for the sake of an example. 

Figure 2.13: The probability square for a hypothetical commuting example.
[Labels in the figure: columns walk, bike, drive, bus; regions no guitar and guitar.]

For simplicity, we’ll assume that there are only four possible ways to commute
to work. If the guitar players are represented by the shaded regions, then the answer
to our question is obtained by dividing the area of the darkly shaded region (which
represents the guitar players who are walkers) by the total area of all the shaded
regions (which represents all of the guitar players). Mathematically, the preceding
sentence is equivalent to dividing the first equality in Eq. (2.54) through by P(Z)
and then letting A = “walk” and Z = “guitar”:

$$P(\text{walk}\,|\,\text{guitar}) = \frac{P(\text{walk and guitar})}{P(\text{guitar})} = \frac{\text{dark shaded area}}{\text{total shaded area}}. \qquad (2.63)$$


Assuming that there are only four possible ways to commute to work, we need 
to be given eight pieces of information: 


• We need to be given the four percentages of people who walk, bike, drive, or
take the bus. (Actually, since these percentages must add up to 100%, there
are only three independent bits of information here.) These percentages
determine the relative widths of the vertical rectangles in Fig. 2.13. The analogous
information in the “False positives” example was contained in the first bullet
point of that example (the percentage of people who have the disease).


• For each of the four types of commuters, we need to be given the
percentage who play the guitar. These four percentages determine the heights of the
shaded areas within the vertical rectangles in Fig. 2.13. The analogous
information in the “False positives” example was contained in the second and third
bullet points of that example.

Of course, if we are simply given the area of the darkly shaded region (which 
represents the number of guitar players who are walkers), and also the total area of 
all the shaded regions (which represents the total number of guitar players), then 
we can just divide the first of these two pieces of information by the second, and 
we’re done. But in most situations, we’re given the above eight (or whatever the 
relevant number is) pieces of information instead of these two, and the main task is 
to determine these two. 

If you want to instead think in terms of a probability tree, as in Fig. 2.11,
then in the present commuting example, the initial fork has four branches (for the
walk/bike/drive/bus options), and then each of these four options splits into two
possibilities (guitar or no guitar). We therefore end up with four circled numbers (the
guitar players) instead of the two in Fig. 2.11, and we need to divide one of these
(the one in the walking branch) by the sum of all four.
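Since the text doesn’t list the percentages used in Fig. 2.13, the following Python sketch invents its own values (all eight numbers are hypothetical and chosen only so that the arithmetic is easy to follow). It computes the four shaded areas, or equivalently the four circled numbers in the tree, and then divides the walking one by the sum of all four.

```python
# Hypothetical percentages, invented for illustration only.
commute_share = {"walk": 0.10, "bike": 0.20, "drive": 0.50, "bus": 0.20}  # P(A_i)
guitar_given = {"walk": 0.30, "bike": 0.20, "drive": 0.10, "bus": 0.25}   # P(Z|A_i)

# Area of each shaded rectangle: width P(A_i) times height P(Z|A_i).
shaded = {mode: commute_share[mode] * guitar_given[mode] for mode in commute_share}

# P(walk | guitar) = dark shaded area / total shaded area, as in Eq. (2.63).
p_walk_given_guitar = shaded["walk"] / sum(shaded.values())
print(round(p_walk_given_guitar, 3))  # 0.176
```

With these made-up numbers the shaded areas are 0.03, 0.04, 0.05, and 0.05, so the answer is 0.03/0.17.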

The interpretation of Bayes’ theorem in terms of a figure like Fig. 2.13 is
summarized in Fig. 2.14. In this figure, we are considering areas to represent
probabilities instead of actual numbers (although either way is fine), because heights and
widths then represent the relevant probabilities. It is invariably much more intuitive
to think of the theorem in terms of a figure instead of algebraic manipulations, so
when you think of Bayes’ theorem, you’ll probably want to think of Fig. 2.14.


Figure 2.14: Pictorial representation of Bayes’ theorem.
[Labels in the figure: columns A_1 (walk), A_2 (bike), A_3 (drive), A_4 (bus); regions Z (guitar) and not Z (no guitar).]


Remarks: 

1. It is often the case that you aren’t given $P(Z)$ in the simple form of Bayes’ theorem in
Eq. (2.51), but instead need to calculate it via $\sum_i P(A_i)\cdot P(Z|A_i)$ or
$P(Z|A)\cdot P(A) + P(Z|{\sim}A)\cdot P({\sim}A)$, as we did in the “False positives” example. So the general form
of Bayes’ theorem in Eq. (2.53) or the explicit form in Eq. (2.52) is often the relevant
one.

2. When using Bayes’ theorem to calculate $P(A_1|Z)$, remember that in the notation of
Fig. 2.14, the first letter $A_1$ in $P(A_1|Z)$ is one of the many $A_i$ that divide up the
horizontal span of the square, while the second letter $Z$ is associated with the vertical
span of the shaded areas.

3. In setups involving Bayes’ theorem, there can be an arbitrary number $n$ of the $A_i$
columns in Fig. 2.14. (We’ve drawn the case with $n = 4$.) But each column is divided
into only two regions, namely the $Z$ region and the not-$Z$ region. Of course, the not-$Z$
region might very well be broken down into other regions, but that isn’t relevant here.
If you wish, you can think of there being only two columns, namely the $A_1$ column
and the “not-$A_1$” column, which consists of all the other $A_i$. However, if you are given
information for each of the $A_i$, then you will need to consider them separately. But
after calculating all the relevant numbers, it is certainly fine to lump all the other $A_i$
together into a single “not-$A_1$” column. Fig. 2.14 then becomes Fig. 2.15. The lightly
shaded area here is the same as the total lightly shaded area in Fig. 2.14. Fig. 2.15
corresponds to the explicit form of Bayes’ theorem in Eq. (2.52), while Fig. 2.14
corresponds to the general form in Eq. (2.53).

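[Figure: the square of Fig. 2.14 redrawn with only two columns, “walk” and “bike/drive/bus,” each split into a “no guitar” region and a “guitar” region.]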


Figure 2.15: Grouping all of the nonwalkers together. 


4. The essence of Bayes’ theorem comes down to the fact that $P(A \text{ and } Z)$ can be written
in the two different ways given in Eq. (2.54). In terms of Fig. 2.14, you can think of
$P(A_1 \text{ and } Z)$, which is the area of the darkly shaded rectangle, in two different ways.
It is a certain fraction (namely $P(A_1|Z)$) of the overall shaded area (namely $P(Z)$);
this leads to the first equality in Eq. (2.54). And $P(A_1 \text{ and } Z)$ is also a certain fraction
(namely $P(Z|A_1)$) of the leftmost (walking) rectangle area (namely $P(A_1)$); this leads
to the second equality in Eq. (2.54).

Said in another way, the number of guitar players who are walkers equals the number
of walkers who are guitar players. This common number equals the area of the
darkly shaded rectangle (which is the probability $P(A_1 \text{ and } Z)$) multiplied by the total
number of people. Note that the first sentence above is not true (in general) if the
word “number” is replaced by “fraction.” That is, it is not true that the fraction of
guitar players who are walkers equals the fraction of walkers who are guitar players.
Equivalently, it is not true that $P(A_1|Z) = P(Z|A_1)$. Instead, these two conditional
probabilities are related according to Eq. (2.51).

5. In Section 2.4 we solved the game-show problem, the prosecutor’s fallacy, and the
boy/girl problem without using Bayes’ theorem. However, if we had used the theorem,
the reasoning would have been basically the same, just as the reasoning that led to
Eq. (2.58) in the “False positives” example was basically the same as the reasoning that
led to Eq. (2.61). We chose to discuss the problems in Section 2.4 before discussing
Bayes’ theorem, so that it would be clear that the problems are still perfectly solvable
even if you’ve never heard of the theorem. If you want to solve the prosecutor’s fallacy
by explicitly using Bayes’ theorem, see Problem 2.21. ♣
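
Two of the remarks above lend themselves to a quick numerical check: lumping the other columns into a single “not-A_1” column (Remark 3) does not change the answer, and the two products in Remark 4 agree even though the two conditional probabilities do not. The sketch below uses hypothetical probabilities chosen only for illustration; none of the numbers come from the text.

```python
from fractions import Fraction as F

# Hypothetical inputs (assumptions): P(A_i) and P(Z|A_i) for the four columns.
prior = {"walk": F(1, 10), "bike": F(2, 10), "drive": F(4, 10), "bus": F(3, 10)}
like = {"walk": F(1, 2), "bike": F(3, 10), "drive": F(1, 10), "bus": F(1, 5)}

# General form (four columns, as in Fig. 2.14).
joint = {a: prior[a] * like[a] for a in prior}
p_z = sum(joint.values())
general = joint["walk"] / p_z

# Explicit form (two columns, as in Fig. 2.15): lump everything else into "not walk".
p_not_walk = 1 - prior["walk"]
p_z_given_not_walk = (p_z - joint["walk"]) / p_not_walk
explicit = joint["walk"] / (joint["walk"] + p_z_given_not_walk * p_not_walk)

# Remark 4: both products equal P(walk and Z), but the conditionals differ.
p_walk_given_z = general
p_z_given_walk = like["walk"]
print(general, explicit)
print(p_walk_given_z * p_z, p_z_given_walk * prior["walk"])
print(p_walk_given_z, p_z_given_walk)
```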


2.6 Stirling’s formula 

Stirling’s formula gives an approximation to $n!$ that is valid for large $n$, in the sense
that the larger $n$ is, the better the approximation is. By “better,” we mean that as $n$
gets large, the approximation gets closer and closer to $n!$ in a multiplicative sense
(as opposed to an additive sense). That is, the ratio of the approximation and $n!$
approaches 1. (The additive difference between the approximation and $n!$ gets larger
and larger as $n$ grows, but we don’t care about that.) Stirling’s formula is given by:

$$n! \approx n^n e^{-n}\sqrt{2\pi n} \qquad \text{(Stirling’s formula)} \tag{2.64}$$


Here $e$ is the base of the natural logarithm, equal to $e \approx 2.71828$. See Appendix B
for a discussion of $e$, often referred to as Euler’s number. There are various proofs
of Stirling’s formula, but they generally involve calculus, so we’ll just accept the
formula here. It does indeed give an accurate approximation to $n!$ (an extremely
accurate one, if $n$ is large), as you can see from Table 2.6, where $S(n)$ stands for the
$n^n e^{-n}\sqrt{2\pi n}$ Stirling approximation. Even if $n$ is just 10, the approximation is off
by only about 0.8%. And although there is never any need to use the formula for
small numbers like 1 or 5, it works surprisingly well in those cases too.


n       n!                   S(n)                 S(n)/n!
1       1                    0.922                0.922
5       120                  118.0                0.983
10      3.629 · 10^6         3.599 · 10^6         0.992
100     9.3326 · 10^157      9.3249 · 10^157      0.9992
1000    4.02387 · 10^2567    4.02354 · 10^2567    0.99992


Table 2.6: Showing the accuracy of Stirling’s formula. 
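
If you would like to check the ratios in Table 2.6 yourself, the following Python sketch is one way to do it. It works with logarithms (using the standard-library function `math.lgamma`, for which `lgamma(n + 1)` equals ln(n!)) so that large values such as n = 1000 do not overflow. The function names are our own choices for illustration.

```python
import math


def log_stirling(n):
    """Natural log of S(n) = n^n e^(-n) sqrt(2 pi n)."""
    return n * math.log(n) - n + 0.5 * math.log(2 * math.pi * n)


def stirling_ratio(n):
    """S(n)/n!, computed via logarithms to avoid overflow for large n."""
    # math.lgamma(n + 1) is ln(n!), so the difference is ln(S(n)/n!).
    return math.exp(log_stirling(n) - math.lgamma(n + 1))


for n in (1, 5, 10, 100, 1000):
    print(n, round(stirling_ratio(n), 5))
```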

You will note that for the powers of 10 in the table, the ratios of $S(n)$ to $n!$ all
take the same form, namely decimals with an increasing number of 9’s and then a 2.
It’s actually not a 2, because we rounded off, but it’s essentially the same rounding
off for all the numbers. This isn’t a coincidence. It follows from a more accurate
version of Stirling’s formula, but we won’t get into that here.
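
For readers who are curious anyway, the pattern can be made plausible with the standard next-order correction to Stirling’s formula. It is quoted here without proof and is not derived in this book:

```latex
n! \approx n^n e^{-n}\sqrt{2\pi n}\,\left(1 + \frac{1}{12n}\right)
\quad\Longrightarrow\quad
\frac{S(n)}{n!} \approx 1 - \frac{1}{12n}.
```

For n = 10, 100, and 1000 this gives roughly 0.99167, 0.999167, and 0.9999167, which round to the 0.992, 0.9992, and 0.99992 entries in Table 2.6.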

Stirling’s formula will be critical in Chapter 5 when we talk about approximations
to certain probability distributions. But for now, it is relevant when dealing
with binomial coefficients of large numbers, because these binomial coefficients
involve the factorials of large numbers. There are two main benefits to using Stirling’s
formula:

• Depending on the type of calculator you have, you might get an error message
when you plug in the factorial of a number that is too big. Stirling’s
formula allows you to avoid this problem if you first simplify the expression
that results from Stirling’s formula (using the letter n to stand for the specific
number you’re dealing with), and then plug the simplified result into your
calculator.

• If you use Stirling’s formula and arrive at a simplified answer in terms of n 
(we’ll call this a symbolic answer since it’s written in terms of the symbol n 
instead of specific numbers), you can then plug in your specific value of n. 
Or you can plug in any other value, for that matter. The benefit of having a 
symbolic answer in terms of n is that you don’t need to solve the problem 
from scratch every time you’re given a new value of n. You simply need to 
plug the new value of n into your symbolic answer. 

These two benefits are illustrated in the following example. 


Example (50 out of 100): A coin is flipped 100 times. Calculate the probability of 
obtaining exactly 50 Heads. 


Solution: In 100 flips, there are $2^{100}$ possible outcomes (all equally likely), of which
$\binom{100}{50}$ have exactly 50 Heads. The probability of obtaining exactly 50 Heads is therefore

$$\frac{1}{2^{100}}\binom{100}{50} = \frac{1}{2^{100}}\,\frac{100!}{50!\,50!}. \tag{2.65}$$


Now, although this is the correct answer, your calculator might not be able to handle
the large factorials. But even if it can, let’s use Stirling’s formula so that we can
produce a symbolic answer. To this end, we’ll replace the number 50 with the letter $n$
(and hence 100 with $2n$). In terms of $n$, we can write down the probability of obtaining
exactly $n$ Heads in $2n$ flips, and then we can use Stirling’s formula (applied to both $n$
and $2n$) to simplify the result. The first steps of this simplification will actually go in
the wrong direction and create a big mess, but nearly everything will cancel out in the
end. We obtain:

$$\begin{aligned}
P(n) &= \frac{1}{2^{2n}}\binom{2n}{n} = \frac{1}{2^{2n}}\,\frac{(2n)!}{n!\,n!} \\
&\approx \frac{1}{2^{2n}}\,\frac{(2n)^{2n} e^{-2n}\sqrt{2\pi(2n)}}{\left(n^n e^{-n}\sqrt{2\pi n}\right)^2} \\
&= \frac{1}{2^{2n}}\,\frac{2^{2n} n^{2n} e^{-2n}\cdot 2\sqrt{\pi n}}{n^{2n} e^{-2n}\cdot 2\pi n} \\
&= \frac{1}{\sqrt{\pi n}}.
\end{aligned} \tag{2.66}$$


A simple answer indeed! And the “π” is a nice touch, too. In our specific case with
$n = 50$, we have

$$P(50) \approx \frac{1}{\sqrt{\pi\cdot 50}} \approx 0.07979 \approx 8\%. \tag{2.67}$$

This is small, but not negligible. If we instead have $n = 500$, we obtain $P(500) \approx 2.5\%$.
This is the probability of obtaining exactly 500 Heads in 1000 coin flips. As
noted above, we can just plug in whatever number we want, and not have to redo the
entire calculation!
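
As a cross-check on this example, the exact probability and the Stirling-based approximation can be compared numerically. This is only a sketch: the function names are ours, and the exact value relies on Python’s arbitrary-precision integers and the standard-library function `math.comb` rather than on a calculator.

```python
import math


def exact_prob(n):
    """Exact probability of exactly n Heads in 2n fair coin flips."""
    # Python integers have arbitrary precision, so large factorials are not a problem.
    return math.comb(2 * n, n) / 2 ** (2 * n)


def approx_prob(n):
    """Stirling-based approximation 1/sqrt(pi n) from Eq. (2.66)."""
    return 1 / math.sqrt(math.pi * n)


for n in (50, 500):
    exact, approx = exact_prob(n), approx_prob(n)
    print(n, exact, approx, approx / exact)
```

The last printed column is the ratio of the approximate result to the exact one, which is the quantity used in the text to judge the accuracy.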












The $1/\sqrt{\pi n}$ result in Eq. (2.66) is extremely clean. It is much simpler than the
expression in Eq. (2.65), and much simpler than the expressions in the first two lines
of Eq. (2.66). True, it’s only an approximate result, but it’s a good one. The exact
result in Eq. (2.65) happens to be about 0.07959, so for $n = 50$ the ratio of the
approximate result in Eq. (2.67) to the exact result is 1.0025. In other words, the
approximation is off by only 0.25%. That’s plenty good for most purposes.

When you derive a symbolic approximation like Eq. (2.66), you gain something
and you lose something. You lose some truth, of course, because your answer technically
isn’t correct (although invariably its accuracy is quite sufficient). But you
gain a great deal of information about how the answer depends on your input number,
$n$. And along the same lines, you gain some aesthetics. The resulting symbolic
answer is invariably nice and concise, so it allows you to easily see how the answer
depends on $n$. For example, in our coin-flipping example, the expression in
Eq. (2.66) is proportional to $1/\sqrt{n}$. This means that if we increase $n$ by a factor
of, say, 100, then $P(n)$ decreases by a factor of $\sqrt{100} = 10$. So without doing any
work, we can quickly use the $P(50) \approx 8\%$ result to deduce that $P(5000) \approx 0.8\%$.
In short, there is far more information contained in the symbolic result in Eq. (2.66)
than in the numerical 8% result obtained directly from Eq. (2.65).
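
The scaling argument can be verified in a couple of lines. The sketch below (with a function name of our own choosing) evaluates the symbolic result at n = 50, rescales it by the factor of 10, and compares with a direct evaluation at n = 5000.

```python
import math


def approx_prob(n):
    """P(n) ~ 1/sqrt(pi n): approximate chance of n Heads in 2n flips."""
    return 1 / math.sqrt(math.pi * n)


p50 = approx_prob(50)
# Increasing n by a factor of 100 divides the probability by sqrt(100) = 10.
p5000_scaled = p50 / math.sqrt(100)
print(p50, p5000_scaled, approx_prob(5000))
```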


2.7 Summary 


In this chapter we learned about probability. In particular, we learned: 


• The probability of an event is defined to be the fraction of the time the event
occurs in a very large number of identical trials. In many situations the possible
outcomes are all equally likely, in which case the probability of a certain
class of outcomes occurring is

$$p = \frac{\text{number of desired outcomes}}{\text{total number of possible outcomes}} \qquad \text{(for equally likely outcomes)} \tag{2.68}$$


• The various “and” and “or” rules of probability are:

1. For any two (possibly dependent) events,

$$P(A \text{ and } B) = P(A)\cdot P(B|A). \tag{2.69}$$

2. In the special case of independent events, we have $P(B|A) = P(B)$, so
Eq. (2.69) reduces to

$$P(A \text{ and } B) = P(A)\cdot P(B). \tag{2.70}$$

3. For any two (possibly nonexclusive) events,

$$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B). \tag{2.71}$$

4. In the special case of exclusive events, we have $P(A \text{ and } B) = 0$, so
Eq. (2.71) reduces to

$$P(A \text{ or } B) = P(A) + P(B). \tag{2.72}$$

• A and B are independent events if any one of the following relations is true:

$$\begin{aligned}
P(B|A) &= P(B), \\
P(A|B) &= P(A), \\
P(A \text{ and } B) &= P(A)\cdot P(B).
\end{aligned} \tag{2.73}$$

• The conditional probabilities $P(A|B)$ and $P(B|A)$ are not equal, in general.

• Two common ways to calculate probabilities are: (1) count up the number of 
desired outcomes, along with the total number of possible outcomes, and use 
Eq. (2.68) (assuming that the outcomes are equally likely), and (2) imagine 
things happening in succession (for example, picking seats or rolling dice), 
and then multiply the relevant probabilities. The results for some problems, 
in particular the Birthday Problem and the Game-Show Problem, might seem 
surprising at first, but you can avoid confusion by methodically using one (or 
both) of these strategies. 

• Bayes’ theorem takes a variety of forms; see Eqs. (2.51)-(2.53). The last of
these is the “general form” of the theorem:

$$P(A_k|Z) = \frac{P(Z|A_k)\cdot P(A_k)}{\sum_i P(Z|A_i)\cdot P(A_i)}. \tag{2.74}$$

The theorem tells us how the conditional probability $P(A_k|Z)$ is obtained
from the set of conditional probabilities $P(Z|A_i)$.

• Stirling’s formula, which gives an approximation to $n!$, takes the form

$$n! \approx n^n e^{-n}\sqrt{2\pi n} \qquad \text{(Stirling’s formula)} \tag{2.75}$$


This approximation is very helpful for simplifying binomial coefficients. We 
will use it a great deal in Chapter 5. 


2.8 Exercises 

See www.people.fas.harvard.edu/~djmorin/book.html for a supply of problems 
without included solutions. 


2.9 Problems 

Section 2.1: Definition of probability 

2.1. Odds * 

If an event occurs with probability p, then the odds in favor of the event
occurring are defined to be “p to (1 - p).” (And similarly, the odds against
the event occurring are defined to be “(1 - p) to p.”) In other words, the odds
are simply the ratio of the probabilities of the event occurring (namely p) and
not occurring (namely 1 - p). It is customary to write “p:(1 - p)” as shorthand
for “p to (1 - p).” (The odds are sometimes also written as the ratio p/(1 - p).
But this fraction can look like a probability, which may cause confusion, so
we’ll avoid this notation.) In practice, the probabilities p and 1 - p are usually
multiplied through by the smallest number that turns them into integers. For
example, odds of 1/3:2/3 are generally written as 1:2. Find the odds of the
following events:

(a) Getting a Heads on a coin toss. 

(b) Rolling a 5 on a die. 

(c) Rolling a multiple of 2 or 3 on a die. 

(d) Randomly picking a day of the week with more than six letters. 

Section 2.2: The rules of probability 

2.2. Rules for three events ** 

(a) Consider three events, A, B, and C. If they are all independent of each
other, show that

$$P(A \text{ and } B \text{ and } C) = P(A)\cdot P(B)\cdot P(C). \tag{2.76}$$

(b) If they are (possibly) dependent, show that

$$P(A \text{ and } B \text{ and } C) = P(A)\cdot P(B|A)\cdot P(C|A \text{ and } B). \tag{2.77}$$

(c) If they are all mutually exclusive, show that

$$P(A \text{ or } B \text{ or } C) = P(A) + P(B) + P(C). \tag{2.78}$$

(d) If they are (possibly) nonexclusive, show that

$$\begin{aligned}
P(A \text{ or } B \text{ or } C) = {} & P(A) + P(B) + P(C) \\
& - P(A \text{ and } B) - P(A \text{ and } C) - P(B \text{ and } C) \\
& + P(A \text{ and } B \text{ and } C).
\end{aligned} \tag{2.79}$$

2.3. “Or” rule for four events *** 

Parts (a), (b), and (c) of Problem 2.2 generalize quickly to more than three
events, but part (d) is trickier. Derive the “or” rule for four (possibly) nonexclusive
events. That is, derive the rule analogous to Eq. (2.79).

2.4. Red and blue balls *

Show that the second expression in Eq. (2.9), with $A = \text{Red}_1$ and $B = \text{Blue}_2$,
gives the correct result of 3/10 for $P(\text{Red}_1 \text{ and } \text{Blue}_2)$ in the “balls in a box”
example on page 64.

2.5. Dependent events *

Calculate the overall probability of B occurring in the scenario described by 
Fig. 2.16. 


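[Figure: a square split by a single vertical line into an “A” column (20% of the width) and a “not A” column. In the A column, the upper “A and B” region spans 40% of the height and the lower region is “A and not B.” In the not-A column, the upper “B and not A” region spans 70% of the height and the lower region is “not A and not B.”]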


Figure 2.16: A hypothetical probability square. 


2.6. A single horizontal line * 

There is an asymmetry in Fig. 2.16. Because there is a single vertical line but
two horizontal lines, it is easy to read off the $P(A)$ and $P(\text{not } A)$ probabilities,
but not easy to read off the $P(B)$ and $P(\text{not } B)$ probabilities. Hence the
calculation in Problem 2.5. Redraw Fig. 2.16 with a single horizontal line
and two vertical lines (while keeping the areas (probabilities) of the four
subrectangles the same, of course).

2.7. Proofreading ** 

Two people each proofread the same book. One person finds 100 errors, and 
the other finds 60. There are 20 errors common to both people. Assume that 
all errors are equally likely to be found (which is undoubtedly not true in 
practice), and also that the discovery of an error by one person is independent 
of the discovery of that error by the other person. Given these assumptions, 
roughly how many errors does the book have? Hint: Draw a picture similar 
to Fig. 2.1, and then find the probability of each person finding a given error. 

Section 2.3: Examples 

2.8. Red balls, blue balls ** 

Three boxes sit on a table. One box contains two red balls, another contains 
two blue balls, and the third contains one red ball and one blue ball. You 
choose one of the boxes at random, and then you draw a ball from that box. 
If it turns out to be a red ball, what is the probability that the other ball in the
box is also red?

2.9. Sock pairs **

(a) Four red socks and four blue socks are in a drawer. You reach in and 
pull out two socks at random. What is the probability that you obtain a 
matching pair? 

(b) Answer the same question, but now in the general case with n red socks
and n blue socks.

(c) Presumably you answered the above questions by counting the relevant 
pairs of socks. Can you think of a quick probability argument, requiring 
no counting, that gives the answer to part (b) (and part (a))? 

2.10. Sock pairs, again ** 

(a) As in Problem 2.9, four red socks and four blue socks are in a drawer. 
You reach in and pull out two socks at random. You then reach in and 
pull out two more socks (without looking at the socks in the first pair). 
What is the probability that the second pair you pull out is a matching 
pair? Answer this by calculating the probabilities, given that the first 
pair is (or is not) a matching pair. 

(b) You should find that the answer to part (a) is the same as the answer to 
part (a) of Problem 2.9. Can you think of a quick probability argument, 
requiring no counting, that explains why this is the case? The reasoning 
will work in the general case with n red socks and n blue socks. And 
it will also work if you draw a third pair, or a fourth pair, etc. (without 
looking at any of the other pairs). 

2.11. At least one 6 ** 

Three dice are rolled. What is the probability of obtaining at least one 6? We 
solved this in Section 2.3.1, but your task here is to solve it the long way, by 
adding up the probabilities of obtaining exactly one, two, or three 6’s. 

2.12. At least one 6, by the rules ** 

Three dice are rolled. What is the probability of obtaining at least one 6? We 
solved this in Section 2.3.1, and again in Problem 2.11. But your task here is 
to solve it by using Eq. (2.79) from Problem 2.2, with each of the three letters 
in that formula standing for a 6 on each of the three dice. 

2.13. Rolling sixes ** 

This problem was posed by Samuel Pepys to Isaac Newton in 1693 and is 
therefore known as the Newton-Pepys problem. 

(a) 6 dice are rolled. What is the probability of obtaining at least one 6? 

(b) 12 dice are rolled. What is the probability of obtaining at least two 6’s? 

(c) 18 dice are rolled. What is the probability of obtaining at least three 6’s?
Which of the above three probabilities is the largest?

Section 2.4: Four classic problems

2.14. Exactly one pair ** 

If there are 23 people in a room, what is the probability that exactly two of 
them have a common birthday? That is, we don’t want two different pairs 
with common birthdays, or three people with a common birthday, etc. 

2.15. My birthday ** 

(a) You are in a room with 100 other people. Let p be the probability that 
at least one of these 100 people has your birthday. Without doing any 
calculations, state whether p is larger, smaller, or equal to, 100/365. 

(b) Now calculate the exact value of p. 

2.16. My birthday, again ** 

We saw at the end of Section 2.4.1 that 253 is the answer to the question,
“How many people (in addition to me) need to be present in order for there
to be at least a 1/2 chance that someone else has my birthday?” We solved
this by finding the smallest n for which $(364/365)^n$ is less than 1/2. Answer
this question again, by making use of the approximation in Eq. (7.14) in Appendix C.
What is the answer in the general case where there are N days in a
year instead of 365? Assume that N is large.

2.17. My birthday, yet again ** 

With 253 other people in a room, what is the probability that exactly one of 
these people has your birthday? Exactly two? Exactly three? 

2.18. A random game-show host ** 

Consider the following variation of the Game-Show Problem we discussed 
in Section 2.4.2. A game-show host offers you the choice of three doors. 
Behind one of these doors is the grand prize, and behind the other two are 
goats. The host announces that after you select a door (without opening it), 
he will randomly open one of the other two doors. You select a door. The 
host then randomly opens one of the other doors, and the result happens to be 
a goat. He then offers you the chance to switch your choice to the remaining 
door. Should you switch or not? Or does it not matter? 

2.19. Boy/girl problem with general information *** 

This problem is an extension of the Boy/Girl Problem from Section 2.4.4. 
You should study that problem thoroughly before tackling this one. As in 
the original versions of the problem, assume that all processes are completely 
random. The new variation is the following: 

You bump into a random person on the street who says, “I have two children.
At least one of them is a boy whose birthday is in the summer.” What is the
probability that the other child is also a boy?
What if the clause is changed to, “whose birthday is on August 11th”? Or
“who was born during a particular minute on August 11th”? Or more generally,
“who has a particular characteristic that occurs with probability p”?
Hint: Make a table of all of the various possibilities, analogous to the tables
in Section 2.4.4.

Section 2.5: Bayes’ theorem 

2.20. A second test ** 

Consider the setup in the “False positives” example in Section 2.5. If we 
instead perform two successive tests on each person, what is the probability 
that a person who tests positive both times actually has the disease? 

2.21. Bayes’ theorem for the prosecutor’s fallacy **

In Section 2.4.3 we discussed the prosecutor’s fallacy. Explain the fallacy
again here, but now by using Bayes’ theorem. In particular, determine P(I|D)
(the probability of being innocent, given that the description is satisfied) by
drawing a figure analogous to Fig. 2.14.

2.22. Black balls and white balls ** 

One box contains two black balls, and another box contains one black ball 
and one white ball. You pick one of the boxes at random and draw a ball n 
times, with replacement after each draw. If a black ball is drawn all n times, 
what is the probability that you picked the box with two black balls? 

2.10 Solutions 

2.1. Odds 

(a) The probability of getting a Heads is 1/2, as is the probability of not getting a
Heads. So the desired odds are 1/2:1/2, or equivalently 1:1. These are known
as “even odds.”

(b) The probability of rolling a 5 is 1/6, and the probability of not rolling a 5 is 5/6.
So the desired odds are 1/6:5/6, or equivalently 1:5.

(c) There are four desired outcomes (2, 3, 4, 6), so the “for” and “against” probabilities
are 4/6 and 2/6, respectively. The desired odds are therefore 4/6:2/6, or
equivalently 2:1.

(d) Tuesday, Wednesday, Thursday, and Saturday all have more than six letters, so
the “for” and “against” probabilities are 4/7 and 3/7, respectively. The desired
odds are therefore 4/7:3/7, or equivalently 4:3.

Note that to convert from odds to probability, the odds of a:b in favor of an event
occurring are equivalent to a probability of a/(a + b) that the event occurs.
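
The two conversions used in this solution can be sketched with Python’s `fractions` module. The function names below are our own, chosen for illustration, and the first function assumes p is strictly less than 1 (otherwise the “against” probability is zero).

```python
from fractions import Fraction


def odds_in_favor(p):
    """Return the odds 'a:b' in lowest integer terms for a probability p < 1."""
    p = Fraction(p)
    # Dividing p by (1 - p) and reducing to lowest terms gives the integers a and b.
    ratio = p / (1 - p)
    return f"{ratio.numerator}:{ratio.denominator}"


def odds_to_probability(a, b):
    """Convert odds of a:b in favor into the probability a/(a + b)."""
    return Fraction(a, a + b)


for p in (Fraction(1, 2), Fraction(1, 6), Fraction(4, 6), Fraction(4, 7)):
    print(p, "->", odds_in_favor(p))
```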

2.2. Rules for three events 

(a) We can use the same type of reasoning that we used in Section 2.2.1. If we
perform a large number of trials, then A occurs in a fraction $P(A)$ of them. (It is
understood here that the words “on average” follow all statements of this form.)
And then B occurs in a fraction $P(B)$ of these trials, because the events are
independent, which means that the occurrence of A doesn’t affect the probability
of B. So the fraction of the total number of trials where A and B both occur is
$P(A)\cdot P(B)$. And then C occurs in a fraction $P(C)$ of these trials, because C
is independent of A and B. So the fraction of the total number of trials where
all three of A, B, and C occur is $P(A)\cdot P(B)\cdot P(C)$. The desired probability is
therefore $P(A)\cdot P(B)\cdot P(C)$. If you want to visualize this geometrically, you’ll
need to use a cube instead of the square in Fig. 2.1.

This reasoning can easily be extended to an arbitrary number of independent 
events. The probability of all of the events occurring is simply the product of all 
of the individual probabilities. 

(b) The reasoning in part (a) works again, with only slight modifications. If we
perform a large number of trials, then A occurs in a fraction $P(A)$ of them.
And then B occurs in a fraction $P(B|A)$ of these trials, by definition. So the
fraction of the total number of trials where A and B both occur is $P(A)\cdot P(B|A)$.
And then C occurs in a fraction $P(C|A \text{ and } B)$ of these trials, by definition.
So the fraction of the total number of trials where all three of A, B, and C
occur is $P(A)\cdot P(B|A)\cdot P(C|A \text{ and } B)$. The desired probability is therefore
$P(A)\cdot P(B|A)\cdot P(C|A \text{ and } B)$.

Again, this reasoning can easily be extended to an arbitrary number of (possibly)
dependent events. For four events, we just need to tack on the factor
$P(D|A \text{ and } B \text{ and } C)$, and so on.

(c) Since the events are all mutually exclusive, we don’t have to worry about any
double counting. The total number of trials where A or B or C occurs is simply
the sum of the number of trials where A occurs, plus the number where B occurs,
plus the number where C occurs. The same statement must be true if we
substitute the word “fraction” for “number,” because the fractions are related
to the numbers via division by the total number of trials. And since the fractions
are the probabilities, we end up with the desired result,
$P(A \text{ or } B \text{ or } C) = P(A) + P(B) + P(C)$. If there are more events, we simply have more terms in
the sum.

(d) This rule is more involved than the preceding three. Let’s think of the probabilities
in terms of areas, as we did in Section 2.2.2. The generic situation for
three events is shown in Fig. 2.17. For simplicity, we’ve chosen the three regions
to be circles with the same size, but this of course isn’t necessary. The
various overlap regions are shown, with the juxtaposition of two letters standing
for their intersection. So AB means “A and B.” The labels might appear to
suggest otherwise, but remember that A includes the whole circle, and not just
the white part. Similarly, AB includes the dark ABC region too, and not just the
lighter region where the AB label is.

Our goal is to determine the total area contained in the three circles, because
this represents the probability of “A or B or C.” We can add up the areas of the
A, B, and C circles, but then we need to subtract off the areas that we double
counted. These areas are the pairwise overlaps of the circles, that is, AB, AC,
and BC (remember that each of these regions includes the dark ABC region in
the middle). At this point, we’ve correctly counted all of the white and light
gray regions exactly once. But what about the ABC region in the middle? We
counted it three times in the A, B, and C regions, but then we subtracted it off
three times in the AB, AC, and BC regions. So at the moment, we haven’t
counted it at all. We therefore need to add it on once. Then every part of the
union of the circles will be counted exactly once. The total area is therefore

$$\text{Total area} = A + B + C - AB - AC - BC + ABC, \tag{2.80}$$

where we are using the regions’ labels to stand for their areas. Translating this
from a statement about areas to a statement about probabilities yields the desired
result,

$$\begin{aligned}
P(A \text{ or } B \text{ or } C) = {} & P(A) + P(B) + P(C) \\
& - P(A \text{ and } B) - P(A \text{ and } C) - P(B \text{ and } C) \\
& + P(A \text{ and } B \text{ and } C).
\end{aligned} \tag{2.81}$$

Figure 2.17: Venn diagram for three nonexclusive events.


2.3. “Or” rule for four events 

As in Problem 2.2(d), we’ll discuss things in terms of areas. If we add up the areas of
four regions, A, B, C, and D, then we have double counted the pairwise overlaps, so
we need to subtract these off. There are six of these regions: AB, AC, AD, BC, BD,
and CD. But then what about the triple overlaps, such as ABC? We counted ABC
three times in the A, B, and C regions, but then we subtracted it off three times in the
AB, AC, and BC regions. So at the moment, we haven’t counted it at all. We therefore
need to add it on once. (This is the same reasoning as in Problem 2.2(d).) Likewise
for ABD, ACD, and BCD. Finally, what about the quadruple overlap region, ABCD?
We counted this four times in the single regions (like A), then we subtracted it off six
times in the double regions (like AB), and then we added it on four times in the triple
regions (like ABC). So at the moment, we have counted it $4 - 6 + 4 = 2$ times. Since
we want to count it only one time, we need to subtract it off once. The total area is
therefore


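$$\begin{aligned}
\text{Total area} = {} & A + B + C + D \\
& - AB - AC - AD - BC - BD - CD \\
& + ABC + ABD + ACD + BCD \\
& - ABCD.
\end{aligned} \tag{2.82}$$

Writing this in terms of probabilities gives the result,

$$\begin{aligned}
P(A \text{ or } B \text{ or } C \text{ or } D) = {} & P(A) + P(B) + P(C) + P(D) \\
& - P(A \text{ and } B) - P(A \text{ and } C) - P(A \text{ and } D) \\
& - P(B \text{ and } C) - P(B \text{ and } D) - P(C \text{ and } D) \\
& + P(A \text{ and } B \text{ and } C) + P(A \text{ and } B \text{ and } D) \\
& + P(A \text{ and } C \text{ and } D) + P(B \text{ and } C \text{ and } D) \\
& - P(A \text{ and } B \text{ and } C \text{ and } D).
\end{aligned} \tag{2.83}$$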


Remark: You might think that it’s a bit of a coincidence that at every stage, we either
overcounted or undercounted each region once. Equivalently, the coefficient of every
term in Eqs. (2.82) and (2.83) is ±1. The same thing is true in the case of three events
in Eqs. (2.80) and (2.81). Likewise in the case of two events in Eq. (2.18), and trivially
in the case of one event. Is it also true for larger numbers of events? Indeed it is, and
the binomial expansion is the key to understanding why.

We won’t go through every step, but if you want to think about it, the main points
to realize are: First, the numbers 4, 6, and 4 in the above counting in the four-event
case are actually the binomial coefficients $\binom{4}{1}$, $\binom{4}{2}$, $\binom{4}{3}$. This makes sense because, for
example, the number of regions of double overlap (like AB) that contain the region
ABCD is simply the number of ways to pick two letters from four letters, which is
$\binom{4}{2}$. Second, the “alternating sum” $\binom{4}{1} - \binom{4}{2} + \binom{4}{3}$ equals 2 (which means that we have
overcounted the ABCD region by one time), because this is what you obtain when
you expand the righthand side of $0 = (1 - 1)^4$ with the binomial expansion. (This is a
nice little trick.) And third, you can show how this generalizes to a larger number n of
events. For even n, the alternating sum of the relevant binomial coefficients is 2, as we
just saw for n = 4. For odd n, the alternating sum is zero, which means that we have
undercounted by one time. (The relevant binomial coefficients are all but the first and
last in the expansion of $(1 - 1)^n$, and these two coefficients are either 1 and 1 for even
n, or 1 and -1 for odd n.) For example, $\binom{5}{1} - \binom{5}{2} + \binom{5}{3} - \binom{5}{4} = 0$. This “alternating
sum” rule for counting is known as the inclusion-exclusion principle. ♣
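
The inclusion-exclusion rule and the alternating sums in the remark can both be checked by brute force. In the sketch below, the four sets are hypothetical stand-ins for the regions A, B, C, and D (any finite sets would do), and the function names are our own.

```python
from itertools import combinations
from math import comb


def union_size(sets):
    """Size of the union of the given sets via the inclusion-exclusion principle."""
    total = 0
    for k in range(1, len(sets) + 1):
        sign = 1 if k % 2 == 1 else -1  # add odd-order overlaps, subtract even-order ones
        for group in combinations(sets, k):
            total += sign * len(set.intersection(*group))
    return total


def alternating_sum(n):
    """Alternating sum of all binomial coefficients of n except the first and last."""
    return sum((-1) ** (k + 1) * comb(n, k) for k in range(1, n))


# Hypothetical example sets standing in for the regions A, B, C, D.
A, B, C, D = {1, 2, 3, 4}, {3, 4, 5, 6}, {1, 4, 6, 7}, {2, 4, 7, 8}
print(union_size([A, B, C, D]), len(A | B | C | D))
print(alternating_sum(4), alternating_sum(5))
```

The first line of output compares the inclusion-exclusion count with a direct count of the union; the second evaluates the alternating sums for n = 4 and n = 5.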

2.4. Red and blue balls 

By counting the various kinds of pairs in Table 2.1, we find $P(\text{Blue}_2) = 12/20 = 3/5$
(by looking at all 20 pairs), and $P(\text{Red}_1|\text{Blue}_2) = 6/12 = 1/2$ (by looking at only the
12 pairs below the horizontal line). So we have

$$P(\text{Red}_1 \text{ and } \text{Blue}_2) = P(\text{Blue}_2) \cdot P(\text{Red}_1|\text{Blue}_2) = \frac{3}{5} \cdot \frac{1}{2} = \frac{3}{10}, \tag{2.84}$$

in agreement with Eq. (2.10). As mentioned in the third remark on page 66, it still
makes sense to talk about $P(\text{Red}_1|\text{Blue}_2)$, even though the second pick happens after
the first pick.
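The counting can be reproduced by listing the ordered pairs directly. The sketch below assumes a box with two red and three blue balls, drawn without replacement; this is the setup implied by the counts 20, 12, and 6 quoted above, and the variable names are ours.

```python
from fractions import Fraction
from itertools import permutations

balls = ["R", "R", "B", "B", "B"]  # assumed contents: 2 red, 3 blue
pairs = list(permutations(range(len(balls)), 2))  # 20 ordered (first, second) picks

blue_second = [(i, j) for i, j in pairs if balls[j] == "B"]          # 12 pairs
red_then_blue = [(i, j) for i, j in blue_second if balls[i] == "R"]  # 6 pairs

p_blue2 = Fraction(len(blue_second), len(pairs))
p_red1_given_blue2 = Fraction(len(red_then_blue), len(blue_second))
p_joint = Fraction(len(red_then_blue), len(pairs))

print(p_blue2, p_red1_given_blue2, p_joint)  # 3/5 1/2 3/10
```

The joint probability obtained by direct counting equals the product of the other two, as in Eq. (2.84).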

2.5. Dependent events 

First solution: This problem is equivalent to finding the fraction of the total area 
that lies above the horizontal line segments in Fig. 2.16. The upper left region is 
40% = 2/5 of the area that lies to the left of the vertical line, which itself is 20% = 1/5 
of the total area. And the upper right region is 70% = 7/10 of the area that lies to the 
right of the vertical line, which itself is 80% = 4/5 of the total area. The fraction of 
the total area that lies above the horizontal line segments is therefore 


$$\frac{1}{5} \cdot \frac{2}{5} + \frac{4}{5} \cdot \frac{7}{10} = \frac{2}{25} + \frac{14}{25} = \frac{16}{25} = 64\%. \tag{2.85}$$

Second solution: We'll use the rule in Eq. (2.5) twice. First, note that

P(B) = P(A and B) + P((not A) and B). (2.86) 

This is true because either A happens or it doesn’t. We can apply Eq. (2.5) to each of 
the two terms in Eq. (2.86) to obtain 


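$$\begin{aligned} P(B) &= P(A) \cdot P(B|A) + P(\text{not } A) \cdot P(B|\text{not } A) \\ &= \frac{1}{5} \cdot \frac{2}{5} + \frac{4}{5} \cdot \frac{7}{10} = \frac{2}{25} + \frac{14}{25} = \frac{16}{25} = 64\%, \end{aligned} \tag{2.87}$$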


which is exactly the same equation as in the first solution. This is no surprise, of
course, because the two solutions are actually the same. They are simply presented in a
different language. Comparing the solutions makes it clear how conditional probabilities
like P(B|A) are related to fractional areas.

2.6. A single horizontal line 

As usual, let the total area of the square in Fig. 2.16 be 1. Then from the given lengths 
along the sides of the square, we find that the upper two areas (probabilities) are 0.08 
and 0.56, for a total of 0.64; this is P(B). And the lower two areas are 0.12 and 
0.24, for a total of 0.36; this is P(not B). The single horizontal line in Fig. 2.18 must
therefore be 64% of the way down from the top of the square. And the two vertical 
lines must be 0.08/0.64 = 12.5% and 0.12/0.36 = 33.3% of the way from the left 
side. The four areas are the same (by construction) as in Fig. 2.16. It’s just that in 
Fig. 2.18, the P(B) = 0.64 probability is clear by simply looking at the figure. If we 
wanted to calculate P(A) from Fig. 2.18, we would have to do a calculation analogous 
to the one we did in Problem 2.5. 
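The four areas, the total P(B), and the positions of the two vertical lines can all be computed from the three numbers given in Problem 2.5. The following is a small Python sketch using exact fractions; the variable names are our own.

```python
from fractions import Fraction

# Values quoted in the solutions to Problem 2.5.
p_A = Fraction(1, 5)
p_B_given_A = Fraction(2, 5)
p_B_given_notA = Fraction(7, 10)

# The four areas of the unit square.
area = {
    ("A", "B"): p_A * p_B_given_A,                          # 0.08
    ("A", "not B"): p_A * (1 - p_B_given_A),                # 0.12
    ("not A", "B"): (1 - p_A) * p_B_given_notA,             # 0.56
    ("not A", "not B"): (1 - p_A) * (1 - p_B_given_notA),   # 0.24
}

# Height of the single horizontal line: P(B) = P(A and B) + P((not A) and B).
p_B = area[("A", "B")] + area[("not A", "B")]

# Positions of the two vertical lines, as fractions of the width.
p_A_given_B = area[("A", "B")] / p_B
p_A_given_notB = area[("A", "not B")] / (1 - p_B)

print(p_B, p_A_given_B, p_A_given_notB)  # 16/25 1/8 1/3
```

The values 1/8 and 1/3 are the 12.5% and 33.3% quoted above.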


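[Figure: a unit square divided by a single horizontal line 64% of the height down from
the top, with B above the line and not B below it. In the upper strip a vertical line
12.5% of the width from the left separates A from not A; in the lower strip the
corresponding line is 33.3% of the width from the left.]

Figure 2.18: Redrawing Fig. 2.16 with a single horizontal line.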


2.7. Proofreading 

The breakdown of the errors is shown in Fig. 2.19. If the two people are labeled A and 
B , then 20 errors are found by both A and B, 80 are found by A but not B, and 40 are 
found by B but not A. 

[Figure: a rectangle divided into columns A and not A by a vertical line, and divided
again by a horizontal line. Three of the four regions are labeled 20, 40, and 80; the
fourth region is unshaded.]

Figure 2.19: Breakdown of errors found by A and B.

If we consider only the 100 errors found by A, we see that 20 of them are found by
B, which is a 1/5 fraction. Since we are assuming that B finding a given error is
independent of A finding it, we see that if B finds 1/5 of the errors found by A, then he
must find 1/5 of the complete set of errors (on average). So 1/5 is the probability that
B finds any given error. Therefore, since we know that B found a total of 60 errors,
the total number N of errors in the book must be given by $60/N = 1/5 \Longrightarrow N = 300$.
The unshaded region in Fig. 2.19 therefore represents 300 - 80 - 20 - 40 = 160 errors.
This is the number that both people missed.

We can also do things the other way around. If we consider only the 60 errors found 
by B, we see that 20 of them are found by A, which is a 1/3 fraction. By the same 
reasoning as above, this 1/3 is the probability that A finds any given error. And since 
we know that A found a total of 100 errors, the total number N must be given by 
$100/N = 1/3 \Longrightarrow N = 300$, as above.

Another method (although in the end it’s the same as the above methods) is the
following. Let the area of the unshaded region in Fig. 2.19 be x. Then if we look at how
the areas of the two vertical rectangles are divided by the horizontal line, we see that 
the ratio of x to 40 must equal the ratio of 80 to 20. So x = 160, as we found above. 
Alternatively, if we look at how the areas of the two horizontal rectangles are divided 
by the vertical line, we see that the ratio of x to 80 must equal the ratio of 40 to 20. So 
again, x = 160. 

It is quite fascinating that you can get a sense of the total number of errors just by 
comparing the results of two readers’ independent proofreadings. There is no need 
to actually find all the errors and count them up, if you only want to make a rough 
estimate. The larger the numbers involved, the better the estimate, in a multiplicative 
sense. 
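The estimate can be written as a one-line formula, N = (number found by A) · (number found by B) / (number found by both). Here is a minimal Python sketch of that calculation; the function and argument names are hypothetical, and the independence of the two readers is assumed, as in the solution above.

```python
from fractions import Fraction

def estimate_total_errors(found_by_a, found_by_b, found_by_both):
    """Estimate the total number of errors from two independent proofreadings."""
    # B's detection probability, estimated from the errors A found.
    p_b = Fraction(found_by_both, found_by_a)
    # On average found_by_b = p_b * N, so N = found_by_b / p_b.
    return found_by_b / p_b

total = estimate_total_errors(found_by_a=100, found_by_b=60, found_by_both=20)
found_by_someone = 100 + 60 - 20      # errors found by at least one reader
missed_by_both = total - found_by_someone

print(total, missed_by_both)  # 300 160
```

Swapping the roles of A and B gives the same N, because the formula is symmetric in the two readers.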

2.8. Red balls, blue balls 

Let’s ignore for a moment the fact that you happen to draw a red ball. Without this 
condition, there are six equally likely results of the process; you are equally likely to 
draw any of the six balls in the boxes. This fact can be argued by symmetry (there is 
nothing special about any of the balls). Or you can break down the probabilities: you 
have a 1/3 chance of drawing a given box, and then a 1/2 chance of drawing a given 
ball in that box. So all of the probabilities are equal to (1/3) (1/2) = 1/6. 

Let's use the numbers 1 through 6 to label the balls: 1 and 2 are the two red balls in the 
first box, 3 and 4 are the two blue balls in the second box, and 5 and 6 are, respectively, 
the red and blue balls in the third box. If you play n games, where n is large, you will 
obtain approximately n /6 of each of the numbers 1 through 6. 

Let’s now invoke the fact that you draw a red ball. This means that the n/2 games
where you draw a blue ball (3, 4, and 6) aren't relevant. Only the n/2 games where
you draw a red ball (1, 2, and 5) are relevant. And of these games, 2/3 have the
ball coming from the first box (in which case the other ball is red), and 1/3 have the
ball coming from the third box (in which case the other ball is blue). The desired
probability that the other ball is red is therefore 2/3.

Remarks: 

1. The statement of the problem asks for the probability that the other ball in the 
box is also red, given that you draw a red ball. Since the word “probability” is 
used, it is understood that we must consider a large number of trials and look 
at what happens, on average, in these trials. Although the setup in the problem 
mentions only one trial, we must consider many. The given question, namely 
“If it turns out to be a red ball, what is the probability that the other ball in the 
box is also red?,” is really just shorthand for the question, “If you run a large 
number of trials and look only at the ones where the drawn ball is red, in what 
fraction of these trials is the other ball in the box also red?” 

2. In the statement of the problem, the clause, “You choose one of the boxes at 
random,” is critical. Consider the alternative question: “Someone gives you a 
box containing either two red balls, two blue balls, or one of each. You draw a 
ball from this box. If it turns out to be a red ball, what is the probability that the 
other ball in the box is also red?” This question is unanswerable, because for 
all you know, the person always gives you a box with two red balls. Or perhaps 
she always gives you a box with one ball of each color, and you just happened 
to pick the red ball. Maybe it’s 90% the former and 10% the latter, or maybe it 
depends on the day of the week. There is no way to tell what happens in a large 
number of trials. Even if you do perform a large number of trials and throw 
away the ones where you pick a blue ball, there is still no way to determine the 
probability associated with a future trial, because at any point the person might 
change her rules for the type of box she gives you. 

3. What if, instead of three equally likely boxes sitting on the table, we have a 
single box and we color each of the two balls red or blue, based on coin tosses? 
There are then four equally likely possibilities for the contents of the box: RR, 
RB, BR, and BB. We therefore effectively have four equally likely boxes instead
of three. You can show, with a quick modification of our original reasoning, that 
the answer is now 1/2 instead of 2/3. 

This result of 1/2 makes intuitive sense, due to the following alternative
reasoning. Imagine picking a ball, without looking at it. The other ball has a 1/2
chance of being red, because its color is determined by a coin flip. Now look at 
the ball you picked. The other ball still has a 1/2 chance of being red, because 
your act of looking at the ball you picked can’t change the color of the other 
ball. Therefore, if the ball you picked is red, then the other ball has a 1/2 chance 
of being red. Of course, the same thing is true if the ball you picked is blue, but 
those trials don’t have anything to do with the given setup where you pick a red 
ball. * 
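Both the original answer of 2/3 and the coin-toss answer of 1/2 in the third remark can be checked by listing every equally likely (box, ball) choice and keeping only the ones where the drawn ball is red. The following Python sketch does this; the function name and the string encoding of the boxes are our own choices.

```python
from fractions import Fraction

def prob_other_is_red(boxes):
    """boxes: equally likely boxes, each a 2-character string of 'R' and 'B'."""
    red_draws = 0
    red_draws_with_red_partner = 0
    for box in boxes:
        for i in (0, 1):  # each ball in the chosen box is equally likely to be drawn
            if box[i] == "R":
                red_draws += 1
                if box[1 - i] == "R":
                    red_draws_with_red_partner += 1
    return Fraction(red_draws_with_red_partner, red_draws)

print(prob_other_is_red(["RR", "BB", "RB"]))        # 2/3
print(prob_other_is_red(["RR", "RB", "BR", "BB"]))  # 1/2
```

Simple counting suffices here because every (box, ball) combination has the same probability.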

2.9. Sock pairs 

(a) The total number of possible pairs that you can draw from the eight socks in
the drawer is $\binom{8}{2} = 28$. The number of ways that you can draw a red pair from
the four red socks is $\binom{4}{2} = 6$. Likewise for the four blue socks. So there are
12 ways in all that you can draw a matching pair. The desired probability is
therefore 12/28 = 3/7.

(b) If there are now n red and n blue socks in the drawer, the total number of possible
pairs that you can draw is $\binom{2n}{2} = 2n(2n - 1)/2$. The number of ways that you
can draw a red pair from the n red socks is $\binom{n}{2} = n(n - 1)/2$. Likewise for the n
blue socks. So there are n(n - 1) ways in all that you can draw a matching pair.
The desired probability is therefore

$$\frac{n(n-1)}{2n(2n-1)/2} = \frac{n-1}{2n-1}. \tag{2.88}$$

If n = 4, this yields the probability of 3/7 that we obtained in part (a).

(c) For the quick probability argument, imagine drawing the two socks in
succession. The first sock is either red or blue. Whichever color it is, there are now
n - 1 socks remaining of that color. And there are 2n - 1 socks remaining in
all. So the probability that the second sock has the same color as the first is
(n - 1)/(2n - 1).

For large n, this result approaches 1/2. This makes sense because if n is large,
the removal of the first sock from the drawer only negligibly changes the
distribution of socks from 50-50. So you’re basically flipping a coin with the second
sock.
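The counting in parts (a) and (b) and the closed form (n - 1)/(2n - 1) can be compared directly. Below is a minimal Python sketch; the function name is ours.

```python
from fractions import Fraction
from math import comb

def matching_pair_probability(n):
    """Probability of a matching pair when drawing 2 socks from n red and n blue."""
    total_pairs = comb(2 * n, 2)      # all ways to pick two socks
    matching_pairs = 2 * comb(n, 2)   # red pairs plus blue pairs
    return Fraction(matching_pairs, total_pairs)

print(matching_pair_probability(4))  # 3/7
for n in (10, 100, 1000):
    # The counting result agrees with (n - 1)/(2n - 1) and approaches 1/2.
    print(n, float(matching_pair_probability(n)), float(Fraction(n - 1, 2 * n - 1)))
```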


2.10. Sock pairs, again 


(a) We know from Problem 2.9 that there is a 3/7 probability of obtaining a matching
first pair, and hence a 4/7 probability of obtaining a non-matching first pair.
So there is a 3/7 probability that we are left with two socks of one color and
four of the other, and there is a 4/7 probability that we are left with three socks
of each color.

In the first of these two cases, there are $\binom{6}{2} = 15$ possible pairs we can draw
for our second pair, of which $\binom{2}{2} + \binom{4}{2} = 1 + 6 = 7$ are matching pairs. The
probability that the second pair is matching, given that the first pair is matching
(which happens with probability 3/7), is therefore 7/15.

Similarly, in the second of the two cases, there are again $\binom{6}{2} = 15$ possible pairs
we can draw for our second pair, of which $\binom{3}{2} + \binom{3}{2} = 3 + 3 = 6$ are matching
pairs. The probability that the second pair is matching, given that the first pair
isn't matching (which happens with probability 4/7), is therefore 6/15.

The desired probability (that the second pair is matching) is therefore

$$\frac{3}{7} \cdot \frac{7}{15} + \frac{4}{7} \cdot \frac{6}{15} = \frac{21 + 24}{105} = \frac{3}{7}. \tag{2.89}$$


You can apply the same reasoning to the general case with n red and n blue 
socks, but it gets a bit messy. In any event, there is no need to work through the 
algebra, because there is a much quicker line of reasoning in part (b) below. 

(b) We’ll be general from the start here. That is, we’ll assume that we have n socks
of each color, and that we successively draw n pairs until there are no socks left
in the drawer. We claim that all n pairs have the same (n - 1)/(2n - 1) probability
of matching, assuming that we haven’t looked at any of the other pairs yet. This
assumption is important; we must not have any knowledge of the other pairs. If
we do have knowledge, then this affects the probabilities for future pairs. For
example, in part (a) above, we saw that if the first pair is matching, the second
pair has a 7/15 chance of matching. But if the first pair isn’t matching, the
second pair has a 6/15 chance of matching.

Imagine drawing the 2n socks in succession and lining them up on a table. We
can label them as $s_1, s_2, s_3, \ldots, s_{2n}$. We can then divide them into n pairs,
$(s_1, s_2), (s_3, s_4), \ldots, (s_{2n-1}, s_{2n})$. If we ask for the probability that, say, the
third pair (socks $s_5$ and $s_6$) is matching (assuming we haven’t looked at any
of the other pairs), we can now imagine looking at this particular pair. And
if we look at $s_5$ first and then at $s_6$, we can use the reasoning in part (c) of
Problem 2.9 to say that the probability of a matching pair is (n - 1)/(2n - 1).
This reasoning works for any of the n pairs; there is nothing special about a
specific pair (assuming we haven’t looked at any of the other pairs). All pairs
therefore have equal (n - 1)/(2n - 1) probabilities of being matching pairs.

The point here is that if you don’t look at the pairs you’ve already picked, then
for all practical purposes the present pair you’re picking is the first pair. The
order in which you draw the pairs therefore doesn’t matter, so the desired
probabilities are all equal.
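The claim in part (b) can be verified exactly for small n by listing every equally likely arrangement of the socks on the table. The Python sketch below does this by choosing which positions hold the red socks; the function name is our own.

```python
from fractions import Fraction
from itertools import combinations

def pair_match_probabilities(n):
    """Exact probability that each of the n successive pairs is matching."""
    total = 2 * n
    matches = [0] * n
    arrangements = 0
    # Every choice of which n of the 2n positions hold red socks is equally likely.
    for red_positions in combinations(range(total), n):
        red = set(red_positions)
        arrangements += 1
        for k in range(n):
            first, second = 2 * k, 2 * k + 1
            if (first in red) == (second in red):  # both red or both blue
                matches[k] += 1
    return [Fraction(m, arrangements) for m in matches]

print(pair_match_probabilities(4))  # every entry equals 3/7
```

For n = 4 there are only 70 arrangements, so the enumeration is immediate.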


2.11. At least one 6 

The probability of obtaining exactly one 6 equals $\binom{3}{1} \cdot (1/6)(5/6)^2$, because there are
$\binom{3}{1} = 3$ ways to pick which die is the 6. And then given this choice, there is a 1/6
chance that the die is in fact a 6, and a $(5/6)^2$ chance that both of the other dice are
not 6’s.

The probability of obtaining exactly two 6’s equals $\binom{3}{2} \cdot (1/6)^2(5/6)$, because there
are $\binom{3}{2} = 3$ ways to pick which two dice are the 6’s. And then given this choice, there
is a $(1/6)^2$ chance that they are in fact both 6’s, and a 5/6 chance that the other die is
not a 6.

The probability of obtaining exactly three 6’s equals $\binom{3}{3} \cdot (1/6)^3$, because there is just
$\binom{3}{3} = 1$ way for all three dice to be 6’s. And then there is a $(1/6)^3$ chance that they
are in fact all 6’s.

The total probability of obtaining at least one six is therefore

$$\binom{3}{1}\frac{1}{6}\left(\frac{5}{6}\right)^2 + \binom{3}{2}\left(\frac{1}{6}\right)^2\frac{5}{6} + \binom{3}{3}\left(\frac{1}{6}\right)^3 = \frac{75}{216} + \frac{15}{216} + \frac{1}{216} = \frac{91}{216}, \tag{2.90}$$


in agreement with the result in Section 2.3.1. 


Remark: If we add this result to the probability of obtaining zero 6’s, which is $(5/6)^3$,
the sum is 1, because we have now taken into account every possible outcome. This
fact was what we used to solve the problem the quick way in Section 2.3.1, after all.
But let’s pretend that we don’t know the sum is 1, and let’s verify this explicitly. If we
write $(5/6)^3$ suggestively as $\binom{3}{0} \cdot (5/6)^3$, then our goal is to show that

$$\binom{3}{0}\left(\frac{5}{6}\right)^3 + \binom{3}{1}\frac{1}{6}\left(\frac{5}{6}\right)^2 + \binom{3}{2}\left(\frac{1}{6}\right)^2\frac{5}{6} + \binom{3}{3}\left(\frac{1}{6}\right)^3 = 1. \tag{2.91}$$

This is indeed a true statement, because the lefthand side is simply the binomial
expansion of $(5/6 + 1/6)^3 = 1$. This makes it clear why the sum of the probabilities of
the various outcomes will still be 1, even if we have, say, an eight-sided die (again,
forgetting that we know intuitively that the sum must be 1). The only difference is
that we now have the expression $(7/8 + 1/8)^3 = 1$, which is still true. And any other
exponent (that is, any other number of rolls) will also yield a sum of 1, as we know it
must. *
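The individual terms and their sum can be checked with exact fractions. The following is a short Python sketch; the function `prob_k_hits` and its default arguments are our own, with the `sides` and `rolls` parameters added so that the eight-sided case in the remark can be tried too.

```python
from fractions import Fraction
from math import comb

def prob_k_hits(k, rolls=3, sides=6):
    """Probability that exactly k of the rolls show one particular face."""
    p = Fraction(1, sides)
    return comb(rolls, k) * p**k * (1 - p) ** (rolls - k)

print([prob_k_hits(k) for k in (1, 2, 3)])                      # 25/72 (= 75/216), 5/72 (= 15/216), 1/216
print(sum(prob_k_hits(k) for k in range(1, 4)))                 # 91/216
print(sum(prob_k_hits(k) for k in range(4)))                    # 1
print(sum(prob_k_hits(k, rolls=5, sides=8) for k in range(6)))  # 1
```

Note that `Fraction` reduces 75/216 and 15/216 to lowest terms when printing.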

2.12. At least one 6, by the rules 

We'll copy Eq. (2.79) here: 

$$\begin{aligned} P(A \text{ or } B \text{ or } C) = {} & P(A) + P(B) + P(C) \\ & - P(A \text{ and } B) - P(A \text{ and } C) - P(B \text{ and } C) \\ & + P(A \text{ and } B \text{ and } C). \end{aligned} \tag{2.92}$$

The lefthand side of this equation is the probability of obtaining at least one 6.
(Remember that the “or” is the “inclusive or.”) So our task is to evaluate the righthand
side, which involves three different types of terms.

The probability of obtaining a 6 on any given die (without caring what happens with
the other two dice) is 1/6, so

$$P(A) = P(B) = P(C) = \frac{1}{6}. \tag{2.93}$$

The probability of obtaining 6's on two given dice (without caring what happens with
the third die) is $(1/6)^2$, so

$$P(A \text{ and } B) = P(A \text{ and } C) = P(B \text{ and } C) = \frac{1}{36}. \tag{2.94}$$

The probability of obtaining 6’s on all three dice is $(1/6)^3$, so

$$P(A \text{ and } B \text{ and } C) = \frac{1}{216}. \tag{2.95}$$

Eq. (2.92) therefore gives the probability of obtaining at least one 6 as

$$3 \cdot \frac{1}{6} - 3 \cdot \frac{1}{36} + \frac{1}{216} = \frac{108 - 18 + 1}{216} = \frac{91}{216}, \tag{2.96}$$

in agreement with the result in Section 2.3.1 and Problem 2.11. 
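With only 216 outcomes, the inclusion-exclusion sum can be compared against a direct count. The Python sketch below does so; the helper `prob` and the variable names are our own.

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=3))  # all 216 equally likely outcomes

def prob(event):
    """Fraction of the 216 outcomes for which event(roll) is true."""
    return Fraction(sum(1 for r in rolls if event(r)), len(rolls))

# The three types of terms on the righthand side of Eq. (2.92).
singles = sum(prob(lambda r, i=i: r[i] == 6) for i in range(3))
doubles = sum(prob(lambda r, i=i, j=j: r[i] == 6 and r[j] == 6)
              for i in range(3) for j in range(i + 1, 3))
triple = prob(lambda r: r == (6, 6, 6))

inclusion_exclusion = singles - doubles + triple
direct = prob(lambda r: 6 in r)  # at least one 6, counted directly

print(inclusion_exclusion, direct)  # 91/216 91/216
```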

2.13. Rolling sixes 


(a) In all three parts of this problem, there are far fewer ways to fail to obtain the 
specified number of 6’s than to succeed. So we’ll calculate the probability of 
failure and then subtract that from 1 to obtain the probability of success. 

If 6 dice are rolled, the probability of obtaining zero 6's is $(5/6)^6$. The
probability of obtaining at least one 6 is therefore

$$1 - \left(\frac{5}{6}\right)^6 = 0.665. \tag{2.97}$$


(b) If 12 dice are rolled, the probability of obtaining zero 6’s is $(5/6)^{12}$, and the
probability of obtaining exactly one 6 is $\binom{12}{1}(1/6)^1(5/6)^{11}$, because there are
$\binom{12}{1}$ possibilities for the one die that shows a 6. The probability of obtaining at
least two 6's is therefore

$$1 - \left(\frac{5}{6}\right)^{12} - \binom{12}{1}\left(\frac{1}{6}\right)^1\left(\frac{5}{6}\right)^{11} = 0.619. \tag{2.98}$$

(c) Similarly, if 18 dice are rolled, the probability of obtaining zero 6’s is $(5/6)^{18}$,
the probability of obtaining exactly one 6 is $\binom{18}{1}(1/6)^1(5/6)^{17}$, and the
probability of obtaining exactly two 6’s is $\binom{18}{2}(1/6)^2(5/6)^{16}$. The probability of
obtaining at least three 6’s is therefore

$$1 - \left(\frac{5}{6}\right)^{18} - \binom{18}{1}\left(\frac{1}{6}\right)^1\left(\frac{5}{6}\right)^{17} - \binom{18}{2}\left(\frac{1}{6}\right)^2\left(\frac{5}{6}\right)^{16} = 0.597. \tag{2.99}$$


We see that the probability in part (a) is the largest. 


Remark: We can also pose the problem with larger numbers of rolls. For
example, if 600 dice are rolled, what is the probability of obtaining at least
100 6’s? Or more generally, if 6n dice are rolled, what is the probability of
obtaining at least n 6’s? From the same type of reasoning as above, the answer
in the general case is

$$1 - \sum_{k=0}^{n-1} \binom{6n}{k}\left(\frac{1}{6}\right)^k\left(\frac{5}{6}\right)^{6n-k}. \tag{2.100}$$

For large n, it is intractable to evaluate this sum by hand. But it’s easy to use a
computer to evaluate it for any n. For n = 10, 100, and 1000 we obtain
probabilities of, respectively, 0.554, 0.517, and 0.505. These probabilities decrease
with n, and they appear to approach the nice simple answer of 1/2 in the $n \to \infty$
limit. See Problem 5.2 for an explanation of where this 1/2 comes from. *
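As one way of carrying out the computer evaluation mentioned in the remark, here is a minimal Python sketch. It uses exact integer arithmetic to avoid overflow for large n; the function name `prob_at_least` is our own.

```python
from fractions import Fraction
from math import comb

def prob_at_least(m, dice):
    """Exact probability of at least m sixes when `dice` fair dice are rolled."""
    # Number of the 6**dice outcomes with fewer than m sixes:
    # choose which k dice show a 6, and give the others any of 5 faces.
    failures = sum(comb(dice, k) * 5 ** (dice - k) for k in range(m))
    return 1 - Fraction(failures, 6 ** dice)

for n in (1, 2, 3, 10, 100, 1000):
    print(n, round(float(prob_at_least(n, 6 * n)), 3))
```

The first three lines of output correspond to parts (a), (b), and (c), and the remaining ones to the n = 10, 100, and 1000 cases in the remark.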


2.14. Exactly one pair 

There are $\binom{23}{2}$ possible pairs that can have the common birthday. Let's look at one
particular pair and calculate the probability that these two people have a common
birthday, while everyone else has a unique birthday. We’ll then multiply this result by
$\binom{23}{2}$ to account for all the possible pairs.

The probability that a given pair has a common birthday is 1/365, because the first 
person's birthday can be chosen to be any day, and then the second person has a 1/365 
chance of matching that day. We then need the 21 other people to have 21 different 
birthdays, none of which is the same as the pair’s birthday. The first of these people 
can end up in any of the remaining 364 days; this happens with probability 364/365. 
The second of these people can end up in any of the remaining 363 days; this happens 
with probability 363/365. And so on, until the 21st of these people can end up in any 
of the remaining 344 days; this happens with probability 344/365. 

The total probability that exactly one pair has a common birthday is therefore 


$$\binom{23}{2} \cdot \frac{1}{365} \cdot \frac{364}{365} \cdot \frac{363}{365} \cdot \frac{362}{365} \cdots \frac{344}{365}. \tag{2.101}$$

Multiplying this out gives 0.363 = 36.3%. This is smaller than the “at least one common
birthday” result of 50.7% that we found in Section 2.4.1 for 23 people, as it must
be. The remaining 50.7% - 36.3% = 14.4% probability corresponds to occurrences of
two different pairs with common birthdays, or three people with a common birthday,
etc.
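The product in Eq. (2.101) is tedious by hand but short in code. The following Python sketch multiplies it out; the function name and its default arguments are our own.

```python
from math import comb

def prob_exactly_one_pair(people=23, days=365):
    """Probability that exactly one pair shares a birthday and no one else matches."""
    # Choose the pair; the second member matches the first with probability 1/days.
    p = comb(people, 2) / days
    # The other people - 2 persons need distinct birthdays that also avoid the
    # pair's day: (days - 1)/days, (days - 2)/days, and so on.
    for i in range(1, people - 1):
        p *= (days - i) / days
    return p

print(round(prob_exactly_one_pair(), 3))  # 0.363
```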

2.15. My birthday 


(a) p is smaller than 100/365. If the events “Person A having your birthday” and 
“Person B having your birthday,” etc., were all mutually exclusive, then p would 
be equal to 100/365. But these events are not mutually exclusive, because it is
certainly possible for two (or more) of the people to have your birthday. These 
multiple-event probabilities are counted twice (or more) in the naive 100/365 
result. So they must be subtracted off in order to obtain the correct probability. 
The correct probability is therefore smaller than 100/365. 

Note that if we replace the number 100 here by 365 (or anything larger), then 
the “smaller” answer is obvious, because the probability p is certainly smaller 
than 365/365 = 1. This suggests (although it doesn't prove) that the answer for 
the number 100 (or any other number) is “smaller.” The one exception is where 
100 is replaced by 1, that is, where there is only one other person in the room. 
In this case we don't have to worry about double counting any probabilities, so 
the answer is exactly 1/365. 

(b) The probability that no one out of the 100 people has your birthday equals
$(364/365)^{100}$. The probability that at least one of them does have your birthday
is therefore

$$1 - \left(\frac{364}{365}\right)^{100} = 0.24. \tag{2.102}$$

This is indeed smaller than 100/365 = 0.27. It is only slightly smaller, though, 
because the multiple-event probabilities are small. 

2.16. My birthday, again 

We may as well be general right from the start and assume that there are N days
in a year. We can eventually set N = 365. If there are N days in a year, then the
probability that no one out of n people has your birthday equals $(1 - 1/N)^n$. This
is an exact expression, but we can simplify it by making use of the approximation in
Eq. (7.14), namely $(1 + a)^n \approx e^{na}$. With $a = -1/N$ here, $(1 - 1/N)^n$ becomes

$$\left(1 - \frac{1}{N}\right)^n \approx e^{-n/N}. \tag{2.103}$$

Our goal is to have this probability be smaller than 1/2, so that the probability that
someone does have your birthday is larger than 1/2. Taking the log of both sides of
$e^{-n/N} < 1/2$ gives

$$-\frac{n}{N} < \ln\left(\frac{1}{2}\right) \Longrightarrow -\frac{n}{N} < -\ln 2 \Longrightarrow \frac{n}{N} > \ln 2 \Longrightarrow n > N \ln 2 \approx (0.693)N. \tag{2.104}$$

Therefore, if $n > N \ln 2$, it is more likely than not that at least one of the n people
has your birthday. For N = 365, we find that $N \ln 2$ is slightly less than 253, so this
agrees with the (exact) result we obtained by simply taking the nth power of 364/365.
Since ln 2 is very close to 0.7, a quick approximation to the answer to this problem is
(0.7)N.
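The exact threshold and the N ln 2 approximation can be compared numerically. Below is a minimal Python sketch; both function names are our own, and the first one also reproduces the 0.24 of Problem 2.15.

```python
from math import log

def prob_at_least_one(n, days=365):
    """Probability that at least one of n people has your birthday."""
    return 1 - (1 - 1 / days) ** n

def smallest_n_exact(days=365):
    """Smallest n for which prob_at_least_one(n, days) exceeds 1/2."""
    n = 0
    p_none = 1.0
    while p_none >= 0.5:
        n += 1
        p_none *= 1 - 1 / days  # probability that none of the first n people match
    return n

print(round(prob_at_least_one(100), 2))  # 0.24
print(smallest_n_exact())                # 253
print(365 * log(2))                      # slightly less than 253
```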

2.17. My birthday, yet again 

One person: The probability that a specific person has your birthday is 1/365. Since
we want exactly one person to have your birthday, we want none of the other 252
people to have it; this occurs with probability $(364/365)^{252}$. There are 253 ways to
pick the specific person who has your birthday, so the total probability that exactly
one of the 253 people has your birthday is

$$\binom{253}{1}\left(\frac{1}{365}\right)^1\left(\frac{364}{365}\right)^{252} = 0.347. \tag{2.105}$$

Two people: The probability that two specific people have your birthday is $(1/365)^2$.
The probability that none of the other 251 people have your birthday is $(364/365)^{251}$.
There are $\binom{253}{2}$ ways to pick the two specific people who have your birthday, so the
total probability that exactly two of the 253 people have your birthday is

$$\binom{253}{2}\left(\frac{1}{365}\right)^2\left(\frac{364}{365}\right)^{251} = 0.120. \tag{2.106}$$

Three people: By similar reasoning, the probability that exactly three of the 253
people have your birthday is

$$\binom{253}{3}\left(\frac{1}{365}\right)^3\left(\frac{364}{365}\right)^{250} = 0.0276. \tag{2.107}$$


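The pattern is clear. The probability that exactly k people have your birthday is

$$P(k) = \binom{253}{k}\left(\frac{1}{365}\right)^k\left(\frac{364}{365}\right)^{253-k}. \tag{2.108}$$

For k = 0, this gives the $(364/365)^{253} \approx 1/2$ probability (obtained at the end of
Section 2.4.1 and in Problem 2.16) that no one has your birthday. Note that the P(k)
probabilities are simply the terms in the binomial expansion:

$$\left(\frac{1}{365} + \frac{364}{365}\right)^{253} = \sum_{k=0}^{253} \binom{253}{k}\left(\frac{1}{365}\right)^k\left(\frac{364}{365}\right)^{253-k}. \tag{2.109}$$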


Since the lefthand side of this equation equals 1, we see that the sum of the P(k) also 
equals 1. This must be the case, of course, because the number of other people who 
have your birthday has to be something. 
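The P(k) values are straightforward to evaluate numerically. The following Python sketch uses ordinary floating-point arithmetic, which is accurate enough here; the function name is our own.

```python
from math import comb

def prob_exactly_k(k, people=253, days=365):
    """P(k): probability that exactly k of the people have your birthday."""
    p = 1 / days
    return comb(people, k) * p**k * (1 - p) ** (people - k)

for k in range(4):
    print(k, round(prob_exactly_k(k), 4))

# The P(k) are the terms of a binomial expansion of 1, so they sum to 1.
print(sum(prob_exactly_k(k) for k in range(254)))
```

The k = 1, 2, and 3 lines correspond to Eqs. (2.105) through (2.107), and the final sum agrees with 1 up to rounding error.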

2.18. A random game-show host 

We'll solve this problem by listing out the various possibilities. Without loss of
generality, assume that you pick the first door. (You can repeat the following reasoning for
the other doors if you wish. It gives the same result.) There are three equally likely
possibilities for what is behind the three doors: PGG, GPG, and GGP, where P denotes
the prize and G denotes a goat. For each of these three possibilities, since you picked
the first door, the host opens either the second or third door (with equal probabilities).
So there are six equally likely results of his actions. These are shown in Table 2.7, with
brackets marking the object revealed.

                    PGG      GPG      GGP
  open 2nd door    P[G]G    G[P]G    G[G]P
  open 3rd door    PG[G]    GP[G]    GG[P]

Table 2.7: There are six equally likely scenarios with a randomly opened door,
assuming that you pick the first door.

We now note that the two results where the prize is revealed (the G[P]G and GG[P]
results) are not relevant to this problem, because we are told that the host
happens to reveal a goat. Only the four other results are relevant:

  P[G]G    PG[G]    GP[G]    G[G]P

They are all still equally likely, so their probabilities must each be 1/4. We see that
if you don’t switch from the first door, you win on the first two of these results and
lose on the second two. And if you do switch, you lose on the first two and win on
the second two. So either way, your probability of winning is 1/2. It therefore doesn’t
matter if you switch.

Remarks: 

1. In the original version of the problem in Section 2.4.2, the probability of winning 
was 2/3 if you switched. How can it possibly decrease to 1/2 in the present 
random version, when in both versions the exact same thing happened, namely 
the host revealed a goat? 

The difference is due to the two cases where the host reveals the prize in the 
random version (the GPG and GGP cases). You don’t benefit from these cases 
in the random version, because we are told in the statement of the problem that 
they don't exist. But in the original version, they represent guaranteed success 
if you switch, because the host is forced to open the other door, which is a goat. 
But still you may say, “If there are two setups, and if I pick, say, the first door 
in each, and if the host reveals a goat in each (by prediction in one case, and by 
random pick in the other), then exactly the same thing happens in both setups. 
How can the resulting probabilities (for winning on a switch) be different?” 
The answer is that although the two outcomes are the same, probabilities have 
nothing to do with two setups. Probabilities are defined only for a large number 
of setups. And if you play a large number of these pairs of games (prediction 
in one, random pick in the other), then in 1/3 of the pairs the host will reveal
different things (a goat in the prediction version and the prize in the random 
version). These cases yield success in the original prediction version, but they 
are irrelevant in the random version. They are effectively thrown away there. 

2. We will now address the issue mentioned in the fourth remark in Section 2.4.2.
We correctly stated in Section 2.4.2 that in the original version of the problem,
“No actions taken by the host can change the fact that if you play a large number
n of these games, then (roughly) n/3 of them will have the prize behind the
door you initially pick.” However, in the present random version of the problem,
something does affect the probability that the prize is behind the door you
initially pick. It is now 1/2 instead of 1/3. So can something affect this probability
or not?

Well, yes and no. If all of the n games are considered (as in the original version), 
then n/3 of them have the prize behind the initial door, and that’s that. However, 
the random version of the problem involves throwing away 1/3 of the games (the 
ones where the host reveals the prize), because it is assumed in the statement of 
the problem that the host happens to reveal a goat. So for the remaining games 
(which are 2/3 of the initial total, hence 2n/3), 1/2 of them now have the prize 
behind your initial door. 

If you play a large number n of games of each version (including the n/3 games 
that are thrown away in the random version), then the actual number of games 
that have the prize behind your initial door is the same, namely n/3. It’s just 
that in the original version this number can be thought of as 1/3 of n, whereas in 
the random version it can be thought of as 1/2 of 2n/3. So in the end, the thing 
that influences the probability (that the initial door you pick has the prize) and 
changes it from 1/3 to 1/2 isn’t the opening of a door, but rather the throwing 
away of 1/3 of the games. Since no games are thrown away in the original 
version, the above statement in quotes is correct (with the key phrase being
“these games”). 

3. As with the original version of the problem, if you find yourself arguing about 
the answer for an excessive amount of time, you should just play the game 
a bunch of times (at least a few dozen, to get good enough statistics). The 
randomness can be determined by a coin toss. As mentioned above, you will 
end up throwing away 1/3 of the games (the ones where the host reveals the 
prize). * 
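Short of playing the game by hand, the two versions can also be compared by listing the weighted scenarios exactly. The Python sketch below assumes you always pick door 0, and that in the original version a host with two goats to choose from opens either with equal probability (an assumption that doesn't affect the result). The function name and the door numbering are our own.

```python
from fractions import Fraction

def win_probabilities(host_is_random):
    """Return (P(win if you stay), P(win if you switch)), given a goat is revealed."""
    stay = switch = total = Fraction(0)
    for prize in range(3):  # each prize location has probability 1/3
        if host_is_random:
            openable = [1, 2]  # host opens either other door, 50/50
        else:
            openable = [d for d in (1, 2) if d != prize]  # host avoids the prize
        for door in openable:
            weight = Fraction(1, 3) / len(openable)
            if door == prize:
                continue  # prize revealed: this game is thrown away
            total += weight
            if prize == 0:
                stay += weight    # your initial door has the prize
            else:
                switch += weight  # the remaining closed door has the prize
    return stay / total, switch / total

print(win_probabilities(host_is_random=True))   # stay 1/2, switch 1/2
print(win_probabilities(host_is_random=False))  # stay 1/3, switch 2/3
```

In the random version the discarded weight is 1/3, which is the fraction of games thrown away in the remarks above.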

2.19. Boy/girl problem with general information 

Let's be general right from the start and consider the case where the boy has a
particular characteristic that occurs with probability p. (So p = 1/4 if the characteristic is a
summer birthday.) As in all of the versions of this problem in Section 2.4.4, we’ll list
out the various possibilities in a table, before the parent’s additional information
(beyond “I have two children”) is taken into account. It is still the case that the BB, BG,
GB, and GG types of two-child families are all equally likely, with a 1/4 probability 
for each. We are again ordering the children in a given pair by age; the first letter is 
associated with the older child. But we could just as well order them by, say, height or 
shoe size. 

In the present version of the problem, there are now various different subtypes within
each type of family, depending on whether or not the children have the given
characteristic (which occurs with probability p). For example, if we look at the BB types, there
are four possibilities for the occurrence(s) of the characteristic. With “y” standing
for “yes, the child has the characteristic,” and “n” standing for “no, the child doesn’t
have the characteristic,” the four possibilities are $B_yB_y$, $B_yB_n$, $B_nB_y$, and $B_nB_n$. (In
the second possibility here, for example, the older boy has the characteristic, and the
younger boy doesn’t.) Since y occurs with probability p, we know that n occurs with
probability 1 - p. The probabilities associated with each of the four possibilities are
therefore equal to the 1/4 probability that BB occurs, multiplied by, respectively, $p^2$,
$p(1-p)$, $(1-p)p$, and $(1-p)^2$.

The same reasoning holds with the BG, GB, and GG types, so we obtain a total of 
4 · 4 = 16 distinct possibilities. These are listed in Table 2.8 (ignore the boxes for a
moment). The four subtypes in any given row all have the same occurrence(s) of the 
characteristic, so they all have the same probability; this probability is listed on the 
right. The subtypes in the middle two rows all have equal probabilities. As mentioned 
above, in the case where the given characteristic is “having a birthday in the summer,” 
p equals 1/4. So the probabilities associated with the four rows in that case are equal 
to 1/4 multiplied by, respectively, 1/16, 3/16, 3/16, and 9/16. 

Before the parent gives you the additional information, all 16 of the subtypes in the 
table are possible. But after the statement is made that there is at least one boy with 
the given characteristic (that is, there is at least one B y in the pair of children), only 
seven subtypes remain. These are indicated with boxes. The other nine subtypes are
ruled out. 

We now simply observe that the three boxes in the left-most column in the table have 
the other child being a boy, while the four other boxes in the second and third columns 
have the other child being a girl. The desired probability that the other child is a boy 
is therefore equal to the sum of the probabilities of the left three boxes, divided by the 
sum of the probabilities of all seven boxes. This gives (ignoring the common factor of 
1/4 in all of the probabilities) 

P_{BB} = \frac{1 \cdot p^2 + 2 \cdot p(1 - p)}{3 \cdot p^2 + 4 \cdot p(1 - p)}
       = \frac{2p - p^2}{4p - p^2} = \frac{2 - p}{4 - p}.    (2.110)


         BB           BG           GB           GG          Probability
yy    [B_y B_y]    [B_y G_y]    [G_y B_y]     G_y G_y     (1/4) · p^2
yn    [B_y B_n]    [B_y G_n]     G_y B_n      G_y G_n     (1/4) · p(1 - p)
ny    [B_n B_y]     B_n G_y     [G_n B_y]     G_n G_y     (1/4) · p(1 - p)
nn     B_n B_n      B_n G_n      G_n B_n      G_n G_n     (1/4) · (1 - p)^2

Table 2.8: The 16 types of families. (The boxed subtypes are shown in square brackets.)


In the case where the given characteristic is “having a birthday in the summer,” p
equals 1/4. Plugging this into Eq. (2.110) gives the probability that the other child is
also a boy as P_{BB} = 7/15 ≈ 0.467.

If the given characteristic is “having a birthday on August 11th,” then p = 1/365,
which yields P_{BB} = 729/1459 ≈ 0.4997 ≈ 1/2.

If the given characteristic is “being born during a particular minute on August 11th,”
then p is essentially equal to zero, so Eq. (2.110) tells us that P_{BB} is essentially equal
to 1/2. This makes sense, because if p ≈ 0, then the p(1 - p) probability for the
middle two rows in Table 2.8 is much larger than the p^2 probability for the top row.
Of course, all of these probabilities are very small in the small-p limit, but p^2 is much
smaller than p(1 - p) ≈ p when p is small. So we can ignore the top row. We are then
left with four boxes, two of which are BB and two of which are BG/GB. The desired
probability therefore equals 1/2.

Another somewhat special case is p = 1/2. (You can imagine that every child flips
a coin, and we're concerned with the children who get Heads.) In this case we have
p = 1 - p, so all of the probabilities in the righthand column in Table 2.8 are equal. All
16 entries in the table therefore have equal probabilities (namely 1/16). Determining
probabilities is then just a matter of counting boxes, so the answer to the problem is
3/7, because three of the seven boxes are of the BB type.
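
As a cross-check of Eq. (2.110), the 16 subtypes in Table 2.8 can be enumerated by brute force. The following Python sketch is only an illustration (the function name prob_other_is_boy is our own choice, not part of the problem). It assumes, as in the solution above, that the sexes and the characteristics of the two children are all independent, and it uses exact fractions so that the result can be compared directly with (2 - p)/(4 - p).

```python
from fractions import Fraction
from itertools import product

def prob_other_is_boy(p):
    """P(both children are boys | at least one boy has the characteristic)."""
    p = Fraction(p)
    weight = {"y": p, "n": 1 - p}
    both_boys = Fraction(0)
    at_least_one_By = Fraction(0)
    # Loop over the 16 subtypes: (older sex, older y/n, younger sex, younger y/n).
    for s1, c1, s2, c2 in product("BG", "yn", "BG", "yn"):
        prob = Fraction(1, 4) * weight[c1] * weight[c2]
        if (s1, c1) == ("B", "y") or (s2, c2) == ("B", "y"):
            at_least_one_By += prob          # one of the seven boxed subtypes
            if s1 == "B" and s2 == "B":
                both_boys += prob            # one of the three boxed BB subtypes
    return both_boys / at_least_one_By

for p in (Fraction(1, 4), Fraction(1, 365), Fraction(1, 2), Fraction(1)):
    # Second and third columns should agree: enumeration vs. closed form.
    print(p, prob_other_is_boy(p), (2 - p) / (4 - p))
```

For p = 1/4, 1/365, and 1/2, the enumeration should reproduce the values 7/15, 729/1459, and 3/7 found above. The sketch does not handle p = 0, where the conditioning event has zero probability.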

Remarks: 

1. The above P_{BB} ≈ 1/2 result in the p ≈ 0 case leads to the following puzzle.
Let’s say that you bump into a random person on the street who says, “I have 
two children. At least one of them is a boy.” At this stage, you know that the 
probability that the other child is also a boy is 1/3, from part (a) of the original 
problem in Section 2.4.4. But if the parent then adds, “... who was born during 
a particular minute on August 11th,” then we just found that the probability that 
the other child is also a boy jumps to (essentially) 1/2. Why exactly did this 
jump take place? 

In the original scenario in Section 2.4.4, there were three equally likely possibilities
after the parent gave the additional information, namely BB, BG, and
GB. Only 1/3 of these cases (namely BB) had the other child being a boy. In
the new scenario (with p ≈ 0), there are four equally likely possibilities after the
parent gives the additional information, namely B_y B_n, B_n B_y, B_y G_n, and G_n B_y.
(As mentioned above, we're ignoring the top row in Table 2.8 since p ≈ 0.)
So in the new scenario, 1/2 of these cases (the two BB cases) have the other
child being a boy. The critical point here is that BB now counts twice, whereas
it counted only once in the original scenario. This is due to the fact that a BB
parent is twice as likely (compared with a BG or GB parent) to be able to say
that a boy was born during a particular minute on August 11th, because with
two boys there are two chances to achieve this highly improbable characteristic.
In contrast, a BB parent is no more likely (compared with a BG or GB parent)
to be able to say simply that at least one child is a boy.

2. In the other extreme where the given characteristic is “being born on any day,” 
we have p = 1. (This clearly isn't much of a characteristic, since it is satisfied by 
everyone.) So Eq. (2.110) gives P_{BB} = 1/3. In this p = 1 case, only the entries
in the top row in Table 2.8 have nonzero probabilities. We are therefore in the
realm of the first scenario in Section 2.4.4, where we started off with the four
types of families (BB, BG, GB, GG) and then ruled out the GG type, yielding a
probability of 1/3. It makes sense that the 1/3 answer in the p = 1 case is the
same as the 1/3 answer in the first scenario in Section 2.4.4, because the “being
born on any day” statement provides no additional information. So the setup is 
equivalent to the first scenario in Section 2.4.4, where the parent provided no 
additional information (beyond the fact that one child was a boy). * 

2.20. A second test 

The relevant probability tree is obtained by simply tacking on one more iteration of 
branches to Fig. 2.11. The result is shown in Fig. 2.20. (We’ve again arbitrarily 
started with 1000 people.) We are concerned only with the two numbers 18.05 and 
9.8, because these are the only numbers associated with positive results for both tests 
(labeled as “++”). The desired probability is therefore 


p = \frac{18.05}{18.05 + 9.8} = 64.8\%.    (2.111)


This is significantly larger than the result of 16% in the original example in Section 2.5. 

Note that since we are concerned only with two of the final eight numbers, there was 
actually no need to draw the entire probability tree. The two relevant numbers are 
obtained from the products,

(1000)(0.02)(0.95)(0.95) = 18.05,
(1000)(0.98)(0.1)(0.1) = 9.8.    (2.112)


These products make it clear how to proceed in the general case of n tests. If we 
perform n successive tests on each person, then the probability that a person who tests 
positive all n times actually has the disease is 


p = \frac{(0.02)(0.95)^n}{(0.02)(0.95)^n + (0.98)(0.1)^n}.    (2.113)


If n = 1 then p = 0.16, as we found in the original example. If, say, n = 4, then
p = 99.4%. Here the smallness of the (0.1)^n factor in Eq. (2.113) wins out over the
smallness of the 0.02 factor. In this case, although not many people have the disease,
the number of people who falsely test positive all four times is even smaller. If n is
large, then p is essentially equal to 1.

Figure 2.20: The probability tree for two tests.
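
The general n-test result in Eq. (2.113) is easy to evaluate numerically. The short Python sketch below is an illustration only; the function name and the parameter names (prior, true_pos, false_pos) are our own labels for the 0.02, 0.95, and 0.1 values used in this solution.

```python
def prob_disease_given_all_positive(n, prior=0.02, true_pos=0.95, false_pos=0.1):
    """Probability of having the disease, given n positive tests in a row.

    prior     : fraction of people with the disease (0.02 in this problem)
    true_pos  : chance that a sick person tests positive (0.95)
    false_pos : chance that a healthy person tests positive (0.1)
    """
    sick_and_all_positive = prior * true_pos**n
    healthy_and_all_positive = (1 - prior) * false_pos**n
    return sick_and_all_positive / (sick_and_all_positive + healthy_and_all_positive)

for n in (1, 2, 4, 10):
    print(n, prob_disease_given_all_positive(n))
```

The n = 2 value is the ratio 18.05/(18.05 + 9.8) from Eq. (2.111), since the factor of 1000 cancels.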




2.21. Bayes’ theorem for the prosecutor’s fallacy 

A given person is either innocent or guilty, and either fits the description or doesn’t. 
Our goal is to find P(I|D). From the second remark at the end of Section 2.5, we want
the horizontal span of our square to be associated with the innocent and guilty possibilities,
and the vertical span to be associated with the description or not-description
possibilities. The result is shown in Fig. 2.21. This figure contains the same information
as Fig. 2.10, but in rectangular instead of oval form.


Figure 2.21: The probability square for the prosecutor's fallacy. (Labels in the figure:
horizontal span: innocent (999,999 in 10^6), guilty (1 in 10^6); vertical span for the
innocent: not fit description (999,900 in 999,999), fit description (99 in 999,999); vertical
span for the guilty: fit description (1 in 1).)


We haven’t drawn things to scale, because if we did, both of the shaded rectangles
would be too thin to see. The thin vertical rectangle represents the single guilty person,
and the rest of the square represents the 999,999 innocent people. The guilty person
fits the description, so the entire thin vertical rectangle is shaded. (As in Fig. 2.10,
the fourth possible group of people - guilty and not fitting the description - has zero
people in it.) Only about 0.01% of the innocent people fit the description, so the darkly
shaded rectangle is very squat. The desired probability P(I|D) equals the number of
people in the darkly shaded region (namely 99) divided by the total number of people
in both shaded regions (namely 99 + 1). So the desired probability P(I|D) of being
innocent, given that the description is satisfied, equals 99/100 = 0.99.

As we mentioned in Section 2.4.3, P(I|D) (which is close to 1) is not equal to P(D|I)
(which is close to 0). The former is the ratio of the darkly shaded area to the total 
shaded area, while the latter is the ratio of the darkly shaded area to the area of the 
entire left vertical rectangle (the whole square minus the one guilty person). 

If you want to use the simple form of Bayes’ theorem in Eq. (2.51), instead of using
the probability square in Fig. 2.21, you can write

P(I|D) = \frac{P(D|I) \cdot P(I)}{P(D)} = \frac{\frac{99}{999,999} \cdot \frac{999,999}{10^6}}{\frac{1}{10^4}} = \frac{99}{100},    (2.114)

as desired. You can verify that the various probabilities we used here are correct.
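
One way to verify the probabilities is to build P(D) from its two pieces (innocent people who fit the description, plus the guilty person) and then apply Bayes' theorem with exact fractions. The Python sketch below is an illustration under the numbers used in this solution; the variable names are our own.

```python
from fractions import Fraction

P_I = Fraction(999_999, 10**6)          # probability of being innocent
P_G = Fraction(1, 10**6)                # probability of being guilty
P_D_given_I = Fraction(99, 999_999)     # innocent and fits the description
P_D_given_G = Fraction(1)               # the guilty person fits the description

# Total probability of fitting the description (both shaded regions).
P_D = P_D_given_I * P_I + P_D_given_G * P_G

# Simple form of Bayes' theorem.
P_I_given_D = P_D_given_I * P_I / P_D

print(P_D, P_I_given_D)
```

The first printed value should be 1/10000 (that is, 100 people in 10^6), and the second should be 99/100.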

2.22. Black balls and white balls 

Our goal is to calculate the probability that the box you pick is the one with two black
balls, given that all n draws are black. We’ll denote this probability by P(B_2|nB). The
two possibilities for the box you pick are the B_2 box with two black balls, and the B_1
box with one black ball. So with this notation, the general form (which is the same as
the explicit form in this case) of Bayes’ theorem in Eq. (2.53) gives

P(B_2|nB) = \frac{P(nB|B_2) \cdot P(B_2)}{P(nB|B_2) \cdot P(B_2) + P(nB|B_1) \cdot P(B_1)}.    (2.115)

We are given that P(B_1) = P(B_2) = 1/2, and we also know that P(nB|B_2) = 1 and
P(nB|B_1) = (1/2)^n. So Bayes’ theorem gives

P(B_2|nB) = \frac{1 \cdot (1/2)}{1 \cdot (1/2) + (1/2)^n \cdot (1/2)}
          = \frac{1}{1 + 1/2^n} = \frac{2^n}{2^n + 1}.    (2.116)

If n = 1, then P(B_2|nB) = 2/3. And if n = 10, then P(B_2|nB) = 1024/1025 ≈
99.9%.

If you want to solve the problem without explicitly using Bayes’ theorem, the math 
turns out to be essentially the same. Imagine doing a large number N of trials of the 
given process. On average, you will pick each of the two boxes N/2 times. All n 
draws will be black in all of the N/2 cases where you pick B_2. But all n draws will be
black in only 1/2^n of the N/2 cases where you pick B_1. The other 1 - 1/2^n fraction
of the cases (where you draw at least one white ball) aren't relevant here. You are
therefore dealing with the B_2 box in N/2 of the N/2 + (N/2)/2^n times that you draw
n black balls. The desired probability is then

P(B_2|nB) = \frac{N/2}{N/2 + (N/2)/2^n} = \frac{1}{1 + 1/2^n},    (2.117)

as above.
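
The Bayes' theorem calculation for this problem can also be checked with exact arithmetic. The following Python sketch is an illustration only (the function name prob_two_black_box is our own); it plugs the given probabilities into Bayes' theorem and compares the result with the closed form 2^n/(2^n + 1).

```python
from fractions import Fraction

def prob_two_black_box(n):
    """P(B_2 | all n draws are black), from Bayes' theorem."""
    prior = Fraction(1, 2)               # P(B_1) = P(B_2) = 1/2
    likelihood_B2 = Fraction(1)          # P(nB | B_2): every draw is black
    likelihood_B1 = Fraction(1, 2) ** n  # P(nB | B_1) = (1/2)^n
    numerator = likelihood_B2 * prior
    return numerator / (numerator + likelihood_B1 * prior)

for n in (1, 2, 5, 10):
    # Second and third columns should agree: Bayes' theorem vs. closed form.
    print(n, prob_two_black_box(n), Fraction(2**n, 2**n + 1))
```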



Chapter 3 


Expectation values 


We begin this chapter by introducing in Section 3.1 the important concept of an 
expectation value. Roughly speaking, the expectation value is a fancy name for the 
average. In Section 3.2 we discuss the variance, which is a particular type of expectation
value related to the square of the result of a random process. Section 3.3
covers the standard deviation, which is defined to be the square root of the variance.
The standard deviation gives a rough measure of the spread of the outcomes
of a random process. A special kind of standard deviation is the standard deviation
of the mean, discussed in Section 3.4. This is the standard deviation of the average
of a particular number of trials of a random process. We will see that the standard
deviation of the mean is smaller than the standard deviation of just one trial of a
random process. This fact leads to the law of large numbers, which we will discuss
in detail in Chapter 5. Section 3.5 covers the sample variance, which gives a
proper estimate (based on a sample set of numbers) of the true variance of a probability
distribution. This section is rather mathematical and can be skipped on a first
reading. 


3.1 Expectation value 

Consider a variable that can take on certain numerical values with certain probabilities.
Such a variable is appropriately called a random variable. For example, the
number of Heads that can arise in two coin tosses is a random variable, and it can
take on the values of 0, 1, and 2. A random variable is usually denoted with an
uppercase letter, such as X, while the actual values that the variable can take on are
denoted with lowercase letters, such as x. So we say, “The number of Heads that
can arise in two coin tosses is a random variable X, and the values that X can take
on are x_1 = 0, x_2 = 1, and x_3 = 2.” Note that the subscript here is an index starting
with 1, and not the number of Heads.

The possible outcomes of a random process must be numerical if we are to 
use the term “random variable.” So, for example, we don’t use the term “random 
variable” to describe the possible outcomes of a coin toss if these outcomes are 
Heads and Tails. But we do use this term if we assign, say, the number 1 to Heads
and the number 0 to Tails. Many of the examples in Chapter 2 involved random
variables (for example, rolling dice or counting the number of Heads in a given 
number of coin tosses), even though we waited until now to define what a random 
variable is. 

The probabilities of the three possible outcomes for X in the above example
of two coin tosses are P(x_1) = 1/4, P(x_2) = 1/2, and P(x_3) = 1/4, because the
four possible outcomes (HH, HT, TH, TT) are all equally likely. The collection of
these probabilities is called the probability distribution for X. We’ll talk at length
about probability distributions in Chapter 4, but for now all you need to know is that
a probability distribution is simply the collective information about how the total
probability (which is always 1) is distributed among the various possible outcomes.

The expectation value (or expected value) of a random variable X is the expected 
average obtained in a large number of trials of the process. So in some sense, the 
expectation value is just the average. However, these two terms have different usages.
The average is generally associated with trials that have already taken place,
for example: the average number of points per game a player scored in last year’s
basketball season was 14. In contrast, the expectation value refers to the average
that you would expect to obtain in trials yet to be carried out, for example: the expectation
value of the number of Heads you will obtain in 10 coin tosses is 5. A
third word meaning roughly the same thing is the mean. This can be used in either 
of the above two contexts (past or future trials). 

The expectation value (or the mean) of a random variable X is denoted by either
E(X) or μ_X. However, if there is only one random variable at hand (and hence no
possibility of confusion), we often don’t bother writing the subscript X in μ_X. So
the various notations we’ll use are:

Expectation value: E(X) = \mu_X = \mu.    (3.1)

As an example of an expectation value, consider the roll of a die. Since the 
numbers 1 through 6 are all equally probable, the expectation value is just their 
average, which is (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5. Of course, if you roll one die,
there is no chance that you will actually obtain a 3.5, because you can roll only
the integers 1 through 6. But this is irrelevant as far as the expectation value goes,
because we’re concerned only with the expected average value of a large number 
of trials. An expectation value of 3.5 is simply a way of saying that if you roll a die 
1000 times and add up all the results, you should get a total of about 3500. Again, 
it is extremely unlikely (but not impossible in this case) that you will get a total of 
exactly 3500, but this doesn’t matter when dealing with the expectation value. 

The colloquial use of the word “expected” can cause some confusion, because 
you might think that the expected value is the value that is most likely to occur. This 
is not the case. If we have a process with four equally likely outcomes, 1, 2, 2, 7,
then even though 2 is the most likely value, the “expected value” is the average of 
the numbers, which is 3, which never occurs. 

In order for an expectation value to exist, we need each possible outcome to be 
associated with a number, as is always the case for a random variable, by definition. 
If there are no actual numbers involved, then it is impossible to form the average 
(or actually the weighted average; see Eq. (3.4) below). For example, let’s say we
draw a card from a deck, and let’s assume that we’re concerned only with its suit.
It makes no sense to talk about the expected value of the suit, because it makes no
sense to take an average of a heart, diamond, spade, and club. If, however, we assign
“suit values” of 1 through 4, respectively, to these suits (so that we’re now dealing
with an actual random variable - the suit value), then it does make sense to talk
about the expected value of the suit value, and it happens to be 2.5 (the average of 1
through 4).

The above example with the rolled die consisted of six equally likely outcomes, 
so we found the expectation value by simply taking the average of the six outcomes. 
But what if the outcomes have different probabilities? For example, what if we have 
three balls in a box, two labeled with a “1” and one labeled with a “4”? If we pick 
a ball, what is the expectation value of the resulting number? (We’ll denote this 
number by the random variable X.) 

To answer this, imagine performing a large number of trials of the process. Let’s
be general and denote this large number by n. Since the probability of picking a 1
is 2/3, we expect about (2/3)n of the numbers to be a 1. Likewise, about (1/3)n
of the numbers should be a 4. The total sum of all the numbers should therefore be
about (2/3)n · 1 + (1/3)n · 4. To obtain the expected average, we just divide this
result by n, which gives

E(X) = \frac{(2/3)n \cdot 1 + (1/3)n \cdot 4}{n} = \frac{2}{3} \cdot 1 + \frac{1}{3} \cdot 4 = 2.    (3.2)


Note that the n’s canceled out, so the result is independent of n. This is how it should
be, because the expected average value shouldn’t depend on the exact hypothetical 
number of trials you do. 

In general, if a random variable X has two possible outcomes x_1 and x_2 instead
of 1 and 4, and if the associated probabilities are p_1 and p_2 instead of 2/3 and 1/3,
then the same reasoning as above gives the expectation value as

E(X) = \frac{(p_1 n) \cdot x_1 + (p_2 n) \cdot x_2}{n} = p_1 x_1 + p_2 x_2.    (3.3)


What if we have more than two possible outcomes? The same reasoning works
again, but now with more terms in the sum. You can quickly verify (by again imagining
a large number of trials, n) that if the outcomes are x_1, x_2, ..., x_m, and if the
associated probabilities are p_1, p_2, ..., p_m, then the expectation value is

E(X) = p_1 x_1 + p_2 x_2 + \cdots + p_m x_m.    (3.4)


This is called the weighted average of the outcomes, because each outcome is 
weighted (that is, multiplied) by its probability. This weighting has the effect of 
making outcomes with larger probabilities contribute more to the expectation value. 
This makes sense, because these outcomes occur more often, so they should influence
the average more than outcomes that occur less often.
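
The weighted average in Eq. (3.4) translates directly into a few lines of code. The Python sketch below is an illustration only (the function name expectation is our own choice). It uses exact fractions, so the check that the probabilities add up to 1 is exact; with floating-point probabilities that check would need a tolerance.

```python
from fractions import Fraction

def expectation(outcomes, probs):
    """Weighted average of the outcomes, as in Eq. (3.4)."""
    if len(outcomes) != len(probs):
        raise ValueError("need one probability per outcome")
    if sum(probs) != 1:
        raise ValueError("probabilities must add up to 1")
    return sum(p * x for x, p in zip(outcomes, probs))

# Three balls in a box: two labeled "1" and one labeled "4".
balls = expectation([1, 4], [Fraction(2, 3), Fraction(1, 3)])

# One roll of a die: six equally likely outcomes.
die = expectation(list(range(1, 7)), [Fraction(1, 6)] * 6)

# Four equally likely outcomes 1, 2, 2, 7, written with 2 listed once.
four = expectation([1, 2, 7], [Fraction(1, 4), Fraction(2, 4), Fraction(1, 4)])

print(balls, die, four)
```

The three results should match the values 2, 3.5 (printed as 7/2), and 3 found in the text.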

Eq. (3.4) involves a discrete sum, because we’re assuming here that our random
variable takes on a discrete set of values. If we have a continuous random variable
(we’ll discuss these in Chapter 4), then the sum in Eq. (3.4) is replaced by an
integral.

Example 1 (Expected number of Heads): If you flip a coin four times, what is the
expected value of the number of Heads you obtain?


Solution: Without doing any work, we know that the expected number of Heads is 2, 
because half of the coins will be Heads and half will be Tails, on average. 

We can also solve the problem by using Eq. (3.4). By looking at the 16 equally likely 
outcomes in Table 1.6 in Section 1.3, the probabilities of obtaining 0, 1, 2, 3, or 4 
Heads are, respectively, 1/16, 4/16, 6/16, 4/16, and 1/16. So Eq. (3.4) gives the 
expectation value of the number of Heads as 


\frac{1}{16} \cdot 0 + \frac{4}{16} \cdot 1 + \frac{6}{16} \cdot 2 + \frac{4}{16} \cdot 3 + \frac{1}{16} \cdot 4 = \frac{32}{16} = 2.    (3.5)


Example 2 (Flip until Heads): If you flip a coin until you get a Heads, what is the 
expected total number of coins you flip? 


Solution: There is a 1/2 chance that you immediately get a Heads, in which case you
flip only one coin. There is a 1/2 · 1/2 = 1/4 chance that you get a Tails and then a
Heads, in which case you flip two coins. There is a 1/2 · 1/2 · 1/2 = 1/8 chance that
you get a Tails, then another Tails, then a Heads, in which case you flip three coins.
And so on. The expectation value of the total number of coins is therefore

\frac{1}{2} \cdot 1 + \frac{1}{4} \cdot 2 + \frac{1}{8} \cdot 3 + \frac{1}{16} \cdot 4 + \frac{1}{32} \cdot 5 + \cdots.    (3.6)

This sum has an infinite number of terms, although they eventually become negligibly
small. The sum is a little tricky to calculate; see Problem 3.1. However, if you use
a calculator to add up the first dozen or so terms, it becomes clear that the sum approaches
2. You are encouraged to convince yourself of this result experimentally, by
doing a reasonably large number of trials, say, 50.
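
Both suggestions in this solution (adding up the first dozen terms, and doing trials) can be carried out on a computer instead of by hand. The Python sketch below is an illustration only; the function names, the fixed seed, and the choice of 100,000 simulated trials are our own assumptions, made so that the simulated average is reproducible and has little scatter.

```python
import random

def partial_sum(terms):
    """Sum of the first `terms` terms of Eq. (3.6): the sum of k / 2^k."""
    return sum(k / 2**k for k in range(1, terms + 1))

def flips_until_heads(rng):
    """Flip a fair coin until Heads appears; return the number of flips."""
    flips = 1
    while rng.random() < 0.5:   # treat this outcome as Tails, so flip again
        flips += 1
    return flips

rng = random.Random(0)          # fixed seed so that the run is reproducible
trials = 100_000
average = sum(flips_until_heads(rng) for _ in range(trials)) / trials

print(partial_sum(12))          # first dozen terms of the sum
print(average)                  # simulated average number of flips
```

Both printed numbers should be close to 2. The partial sum is exact up to rounding, whereas the simulated average fluctuates from seed to seed.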


Let’s now prove a handy theorem involving the sum of two random variables, 
although you might think the theorem is so obvious that there’s no need to prove it. 

Theorem 3.1 The expectation value of the sum of two random variables equals the 
sum of the expectation values of the two variables. That is, 


E(X + Y) = E(X) + E(Y) 


(3.7) 


Proof: Imagine performing a large number n of trials to experimentally determine
E(X + Y). Each trial involves picking values of X and Y and then forming their sum
X + Y. That is, you pick values x_1 and y_1 and form the sum x_1 + y_1. Then you pick
values x_2 and y_2 and form the sum x_2 + y_2. You keep doing this a total of n times,
where n is large. In the n → ∞ limit, the average value that you obtain for X + Y
equals the expectation value of X + Y. So (with the n → ∞ limit understood)

E(X + Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i + y_i)
         = \frac{1}{n} \sum_{i=1}^{n} x_i + \frac{1}{n} \sum_{i=1}^{n} y_i
         = E(X) + E(Y),    (3.8)

as desired. ■

This theorem is intuitive. X simply contributes E(X) to the average, and Y contributes
E(Y). Note that we made no assumption about the independence of X and
Y in the proof. They can be independent or dependent, and the theorem still holds. 

Having just used the word “independent,” we should define what we mean by 
this. Two variables are independent random variables if the value of one variable 
doesn’t affect the probability distribution of the other. For example, if X and Y are 
the results of the rolls of two dice, then X and Y are independent. If you know that 
the left die shows a 5, then the probability distribution for the right die still consists 
of six equal probabilities of 1/6. Mathematically, the random variables X and Y are
independent if

P(x|y) = P(x)    (independent random variables),    (3.9)

for any values of x and y. Likewise with X and Y switched. (More formally,
Eq. (3.9) can be written as P(X = x|Y = y) = P(X = x).) This definition of independent
random variables is similar to the definition of independent events given
near the start of Section 2.2.1 and in Eq. (2.12). But the definition for random
variables is more general. Two variables are independent if any event (that is, any
outcome or set of outcomes) associated with one variable is independent of any
event associated with the other variable. Alternatively, we can say that two random
variables X and Y are independent if

P(x, y) = P(x)P(y)    (independent random variables),    (3.10)

for any values of x and y. (More formally, Eq. (3.10) can be written as P(X =
x and Y = y) = P(X = x) · P(Y = y).) The equivalence of Eqs. (3.9) and (3.10) is
exactly analogous to the equivalence of Eqs. (2.12) and (2.13).


Example: Let X take on the values 1 and 2 with equal probabilities of 1/2, and let
Y take on the values 1, 2, and 3 with equal probabilities of 1/3. Assume that X and
Y are independent. Find E(X + Y) by explicitly using Eq. (3.4), and then verify that
Eq. (3.7) holds.

Solution: We first quickly note that E(X) = 1.5 and E(Y) = 2. To use Eq. (3.4) to
calculate E(X + Y), we must first determine the various p_i probabilities. If X = 1,
the three possible values of X + Y are 2, 3, and 4. And if X = 2, the three possible
values of X + Y are 3, 4, and 5. Because X and Y are independent, all six of these
combinations have probabilities of (1/2)(1/3) = 1/6, from Eq. (3.10). So we have

P(2) = \frac{1}{6},  P(3) = \frac{2}{6},  P(4) = \frac{2}{6},  P(5) = \frac{1}{6}.    (3.11)

Eq. (3.4) then gives the expectation value of X + Y as 

E(X + Y) = \frac{1}{6} \cdot 2 + \frac{2}{6} \cdot 3 + \frac{2}{6} \cdot 4 + \frac{1}{6} \cdot 5 = \frac{21}{6} = 3.5.    (3.12)

This is indeed equal to E(X) + E(Y) = 1.5 + 2 = 3.5, as Eq. (3.7) claims. 

If X and Y are instead dependent, then we can’t apply Eq. (3.4) without being told
what the dependence is, because there is no way to determine the p_i’s in Eq. (3.4)
without knowing the specific dependence. But E(X + Y) = E(X) + E(Y) will still
hold in any case. See Problem 3.3 for an example.
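
To see Eq. (3.7) at work in both the independent and the dependent case, we can list the joint probabilities explicitly and compute the expectation values by brute force. The Python sketch below is an illustration only. The independent table is the one from this example. The dependent table is a hypothetical one that we made up for illustration (it is not the setup in Problem 3.3); it keeps the same individual distributions for X and Y but ties the value of Y to the value of X.

```python
from fractions import Fraction

def expect(joint, f):
    """Expectation of f(x, y) over a joint distribution {(x, y): probability}."""
    return sum(prob * f(x, y) for (x, y), prob in joint.items())

# Independent case from the example: all six (x, y) pairs have probability 1/6.
independent = {(x, y): Fraction(1, 6) for x in (1, 2) for y in (1, 2, 3)}

# Hypothetical dependent case (our own assumption): X = 1 favors small Y,
# and X = 2 favors large Y, but X and Y each keep their original distributions.
dependent = {(1, 1): Fraction(1, 3), (1, 2): Fraction(1, 6),
             (2, 2): Fraction(1, 6), (2, 3): Fraction(1, 3)}

for name, joint in (("independent", independent), ("dependent", dependent)):
    E_X = expect(joint, lambda x, y: x)
    E_Y = expect(joint, lambda x, y: y)
    E_sum = expect(joint, lambda x, y: x + y)
    print(name, E_X, E_Y, E_sum, E_X + E_Y)
```

In both cases the last two printed values should agree, and both should equal 7/2.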


Using the same kind of reasoning as in the proof of Theorem 3.1, you can 
quickly show that 


E(aX + bY + c) = aE(X) + bE(Y) + c, (3.13) 

where a, b, and c are numerical constants. The result in Theorem 3.1 is the special
case where a = 1, b = 1, and c = 0. Likewise, similar reasoning in the case of many
random variables gives

E(a_1 X_1 + a_2 X_2 + \cdots + a_n X_n) = a_1 E(X_1) + a_2 E(X_2) + \cdots + a_n E(X_n),    (3.14)

as you would expect. You can add on a constant c here, too. 

A special case of Eq. (3.14) arises when we perform n trials of the same process.
In this case, the n random variables X_i are all associated with the same probability
distribution. That is, the X_i are identically distributed random variables. For example,
the X_i might all refer to the rolling of a die. With each a_i chosen to be 1,
Eq. (3.14) then implies

E(X_1 + X_2 + \cdots + X_n) = nE(X).    (3.15)

We could just as well pick any particular i and write E(X_i) on the righthand side
here. The expectation values E(X_i) are all equal, because the X_i are all associated
with the same probability distribution. But for simplicity we are using the generic
letter X to stand for the random variable associated with the given probability distribution.

Remark: A word on notation: Since the X_i all come from the same distribution (that is, they
are identically distributed), it is tempting to replace all of them with the same letter (say, X)
and write the lefthand side of Eq. (3.15) as E(nX). This is incorrect. The random variable
nX is not the same as the random variable X_1 + X_2 + ··· + X_n. The former involves picking
one value from the given distribution and multiplying it by n, whereas the latter involves
picking n different (or at least generally different) values and adding them up. The results
of these processes are not the same. They do happen to have the same expectation value, so
Eq. (3.15) would still be true with E(nX) on the lefthand side. But the two processes have
different spreads of the values around the common expectation value nE(X), as we’ll see in
Section 3.2. Also, if you roll ten dice, for example, then nX must be a multiple of 10 (from
10 to 60), whereas the sum of ten X_i values can be any integer from 10 to 60. *
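
The difference between nX and a sum of n separate rolls can be seen by listing both distributions. The Python sketch below is an illustration only; it uses n = 2 dice instead of ten so that the enumeration stays small, and the names in it are our own.

```python
from fractions import Fraction
from itertools import product

FACES = range(1, 7)
N = 2  # two dice keep the enumeration small; the remark above uses ten

def distribution_of_scaled():
    """Distribution of N*X: roll one die and multiply the result by N."""
    return {N * x: Fraction(1, 6) for x in FACES}

def distribution_of_sum():
    """Distribution of X_1 + ... + X_N: roll N dice and add them up."""
    dist = {}
    for rolls in product(FACES, repeat=N):
        total = sum(rolls)
        dist[total] = dist.get(total, Fraction(0)) + Fraction(1, 6**N)
    return dist

def mean(dist):
    return sum(value * prob for value, prob in dist.items())

scaled = distribution_of_scaled()
summed = distribution_of_sum()
print(sorted(scaled), mean(scaled))   # possible values of 2X, and E(2X)
print(sorted(summed), mean(summed))   # possible values of X_1 + X_2, and its mean
```

The two means should be equal (both are 7 for two dice), but 2X takes on only the even values from 2 to 12, whereas the sum takes on every integer from 2 to 12.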

You often apply the result in Eq. (3.15) without even knowing it. For example, 
let’s say we flip a coin and define our random variable X to be 1 if we get Heads 
and 0 if we get Tails. These occur with equal probabilities, so the expectation value 
of X is E(X) = (1/2) · 1 + (1/2) · 0 = 1/2. If we then flip 100 coins, Eq. (3.15) tells
us that the expectation value of X_1 + X_2 + ··· + X_100 (that is, the expected number
of Heads in the 100 flips) is 100 · E(X) = 50, which is probably what you would
have thought anyway, without using Eq. (3.15). 

However, you shouldn’t get carried away with this type of reasoning, because 
Eq. (3.14) holds only for linear combinations of the random variables. It is not true, 
for example, that E(1/X) = 1/E(X) or that E(X^2) = (E(X))^2. You can verify
these non-equalities in the case where X is the result of a die roll. You can show
that E(1/X) ≈ 0.41, whereas 1/E(X) = 1/(3.5) ≈ 0.29. Similarly, you can show
that E(X^2) ≈ 15.2, whereas (E(X))^2 = 3.5^2 = 12.25.
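
The die-roll numbers quoted above follow from applying the weighted average in Eq. (3.4) to 1/X and X^2. The Python sketch below is an illustration only (the variable names are our own); it computes the four quantities with exact fractions.

```python
from fractions import Fraction

FACES = range(1, 7)
p = Fraction(1, 6)                    # each face of the die is equally likely

E_X = sum(p * x for x in FACES)                       # E(X)
E_inverse = sum(p * Fraction(1, x) for x in FACES)    # E(1/X)
E_square = sum(p * x**2 for x in FACES)               # E(X^2)

print(float(E_inverse), float(1 / E_X))    # E(1/X) compared with 1/E(X)
print(float(E_square), float(E_X**2))      # E(X^2) compared with (E(X))^2
```

The exact values are E(1/X) = 49/120 and E(X^2) = 91/6, compared with 1/E(X) = 2/7 and (E(X))^2 = 49/4.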

Theorem 3.1 and its corollaries deal with sums of random variables. Let’s now 
prove a theorem involving the product of random variables. 


Theorem 3.2 The expectation value of the product of two independent random variables
equals the product of the expectation values of the two variables. That is,

E(XY) = E(X) \cdot E(Y)    (independent variables)    (3.16)


Note that this theorem (concerning the product XY) requires that X and Y be independent,
unlike Theorem 3.1 (concerning the sum X + Y).

Proof: The product XY is itself a random variable, and it takes on the values x_i y_j,
where i runs through the n_X possible values of X, and j runs through the n_Y possible
values of Y. There are therefore n_X n_Y possible values of the product x_i y_j. Starting
with Eq. (3.4) and then applying Eq. (3.10), the expectation value of XY is

E(XY) = \sum_{i=1}^{n_X} \sum_{j=1}^{n_Y} P(x_i, y_j) \cdot x_i y_j
      = \sum_{i=1}^{n_X} \sum_{j=1}^{n_Y} P(x_i) P(y_j) \cdot x_i y_j
      = \left( \sum_{i=1}^{n_X} P(x_i) \cdot x_i \right) \left( \sum_{j=1}^{n_Y} P(y_j) \cdot y_j \right)
      = E(X) \cdot E(Y).    (3.17)

The use of Eq. (3.10), which is valid only for independent random variables, is what
allowed us to break up the sum in the first line here into the product of two separate
sums. ■

Example: Let X be the result of a coin flip where we assign the value 2 to Heads and
1 to Tails. And let Y be the result of another (independent) coin flip where we assign
the value 4 to Heads and 3 to Tails. Then E(X) = 3/2 and E(Y) = 7/2.

Let's explicitly calculate E(XY), to show that it equals E(X)E(Y). There are four
equally likely outcomes for the random variable XY:

2 · 4 = 8,  2 · 3 = 6,  1 · 4 = 4,  1 · 3 = 3.    (3.18)

E(XY) is the average of these numbers, so E(XY) = 21/4. And this is indeed equal
to the product E(X)E(Y), as Eq. (3.16) claims.

As an example of a setup involving dependent random variables, where Eq. (3.16) 
does not hold, consider again the above two coins. But let's now stipulate that the 
second coin always shows the same side as the first coin. So the values of 2 and 4 are 
always paired together, as are the values 1 and 3. There are now only two (equally 
likely) outcomes for XY, namely 2-4 = 8 and 1-3 = 3. The expectation value of XY 
is then 11 /2, which is not equal to E{X)E(Y) = 21/4. 
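
The two coin setups above can be checked by brute-force enumeration. The following is a minimal Python sketch (Python and the helper name `mean` are our own choices for illustration, not part of the text); it lists the equally likely products in each case and averages them exactly.

```python
from fractions import Fraction
from itertools import product

# Values assigned to (Heads, Tails) for each coin, as in the example.
X_VALUES = (2, 1)
Y_VALUES = (4, 3)

def mean(values):
    """Average of equally likely values, as an exact fraction."""
    return Fraction(sum(values), len(values))

# Independent flips: all four (x, y) pairs are equally likely.
independent_products = [x * y for x, y in product(X_VALUES, Y_VALUES)]

# Fully dependent flips: the second coin copies the first, so only
# (Heads, Heads) and (Tails, Tails) can occur.
dependent_products = [x * y for x, y in zip(X_VALUES, Y_VALUES)]

e_x = mean(X_VALUES)
e_y = mean(Y_VALUES)
e_xy_independent = mean(independent_products)
e_xy_dependent = mean(dependent_products)

print(e_x * e_y)          # 21/4
print(e_xy_independent)   # 21/4
print(e_xy_dependent)     # 11/2
```

Only the independent case reproduces the product E(X)E(Y).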


The expectation value plays an important role in betting and decision making, 
because it is the amount of money you should be willing to pay up front in order 
to have a “fair game.” By this we mean the following. Consider a game in which 
you can win various amounts of money, based on the various possible outcomes. 
For example, let’s say that you roll a die and that your winnings equal the resulting 
number (in dollars). How much money should you be willing to pay to play this 
game? Also, how much money should the “house” (the people running the game) be 
willing to charge you for the opportunity to play the game? You certainly shouldn’t 
pay, say, $6 each time you play it, because at best you will break even, and most of 
the time you will lose money. On average, you will win the average of the numbers 
1 through 6, which is $3.50. So this is the most that you should be willing to pay 
for each trial of the game. If you pay more than this, then you will lose money on 
average. Conversely, the “house” should charge you at least $3.50 to play the game 
each time, because otherwise it will lose money on average. 

Putting these two results together, we see that $3.50 is the amount the game 
should cost if the goal is to have a fair game, that is, a game where neither side wins 
any money on average. Of course, in games run by casinos and such, things are 
arranged so that you pay more than the expectation value. So on average the house 
wins, which is consistent with the fact that casinos stay in business. 
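
To make the fair-game idea concrete, here is a small Python sketch (the function names `expected_winnings` and `average_net_gain` are hypothetical, chosen only for this illustration). It computes the player's average gain per play of the die game for a few different fees.

```python
from fractions import Fraction

def expected_winnings(payouts):
    """Expectation value of equally likely payouts (in dollars)."""
    return Fraction(sum(payouts), len(payouts))

def average_net_gain(payouts, fee):
    """Player's average gain per play after paying the fee up front."""
    return expected_winnings(payouts) - fee

die_payouts = range(1, 7)
fair_fee = expected_winnings(die_payouts)

print(float(fair_fee))                                     # 3.5
print(float(average_net_gain(die_payouts, fee=6)))         # -2.5
print(float(average_net_gain(die_payouts, fee=3)))         # 0.5
print(float(average_net_gain(die_payouts, fee=fair_fee)))  # 0.0
```

A negative result means the player loses money on average, and a positive result means the house does.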

Note the italics in the previous paragraph. These are important, because when 
real-life considerations are taken into account, there might very well be goals that 
supersede the goal of having a fair game. The above discussion should therefore 
not be taken to imply that you should always play a game if the fee is smaller than 
the expectation value, or that you should never play a game if the fee is larger than 
the expectation value. It depends on the circumstances. See Problem 3.4 for a 
discussion of this.


3.2 Variance

In the preceding section, we defined the expectation value E(X) as the expected
average value obtained in many trials of a random variable X. In addition to E(X),
there are other expectation values that are associated with a random variable X.
For example, we can calculate E(X^2), which is the expectation value of the square
of the value of X. If we’re rolling a die, the square of the outcome can take on
the values of 1^2, 2^2, ..., 6^2 (all equally likely). E(X^2) is the average of these six
values, which is 91/6 = 15.17. We can also calculate other expectation values, such
as E(X^7) or E(2X^3 - 8X^5), although arbitrary ones like these aren’t of much use.

A slight modification of E(X^2) that turns out to be extremely useful in probability
and statistics is the variance. It is denoted by Var(X) and defined to be

$$\mathrm{Var}(X) = E\big[(X - \mu)^2\big] \qquad \text{(where } \mu = E(X)\text{)} \tag{3.19}$$


In words: the variance of a random variable X is the expectation value of the square
of the difference between X and the mean μ (which itself is the expectation value of
X). We’re using μ here (without bothering with the subscript X) instead of E(X),
to make the above equation and future ones less cluttered.

When calculating the variance E[(X - μ)^2], Eq. (3.4) still applies. It’s just that
the X values are replaced with the (X - μ)^2 values. E[(X - μ)^2] is the same type
of quantity as E(X^2), except that we’re measuring the values of X relative to the
expectation value μ. That’s what we’re doing when we take the difference X - μ.
The examples below should make things clear.

In addition to “Var(X),” the variance is also denoted by σ_X^2 (or just σ^2), due
to the definition of the standard deviation, σ, below in Section 3.3. When talking
about the variance, sometimes people say “the variance of a random variable,” and
sometimes they say “the variance of a probability distribution.” These mean the
same thing.


Example 1 (Die roll): The expectation value of the six equally likely outcomes of a
die roll is μ = 3.5. The variance is therefore

$$\begin{aligned}
\mathrm{Var}(X) &= E\big[(X - 3.5)^2\big] \\
&= \frac{1}{6} \big[ (1 - 3.5)^2 + (2 - 3.5)^2 + (3 - 3.5)^2 \\
&\qquad\quad + (4 - 3.5)^2 + (5 - 3.5)^2 + (6 - 3.5)^2 \big] \\
&= \frac{1}{6} \big[ 6.25 + 2.25 + 0.25 + 0.25 + 2.25 + 6.25 \big] \\
&= 2.92.
\end{aligned} \tag{3.20}$$


Example 2 (Coin flip): Consider a coin flip where we assign the value 1 to Heads and
0 to Tails. The expectation value of these two equally likely outcomes is μ = 1/2, so
the variance is

$$\begin{aligned}
\mathrm{Var}(X) &= E\big[(X - 1/2)^2\big] \\
&= \frac{1}{2} \big[ (1 - 1/2)^2 + (0 - 1/2)^2 \big] = \frac{1}{4}.
\end{aligned} \tag{3.21}$$


Example 3 (Biased coin): Consider a biased coin, where the probability of getting
Heads is p and the probability of getting Tails is 1 - p = q. If we again assign the
value 1 to Heads and 0 to Tails, then the expectation value is μ = p·1 + (1 - p)·0 = p.
The variance is therefore

$$\begin{aligned}
\mathrm{Var}(X) &= E\big[(X - \mu)^2\big] \\
&= p \cdot (1 - p)^2 + (1 - p) \cdot (0 - p)^2 \\
&= p(1 - p)\big[(1 - p) + p\big] \\
&= p(1 - p) = pq.
\end{aligned} \tag{3.22}$$


As you can see in the above examples, the steps in finding the variance are: 

1. Find the mean. 

2. Find all the differences from the mean. 

3. Square each of these differences. 

4. Find the expectation value of these squares. 
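
The four steps above translate directly into a few lines of code. Here is a minimal Python sketch (the function names and the dictionary representation of a distribution are our own assumptions, used only for illustration); the bias p = 1/10 is an arbitrary example value.

```python
from fractions import Fraction

def expectation(dist):
    """Expectation value of a distribution given as {value: probability}."""
    return sum(prob * value for value, prob in dist.items())

def variance(dist):
    """Variance via the four steps: mean, differences, squares, expectation."""
    mu = expectation(dist)                                 # step 1
    squared_diffs = {v: (v - mu) ** 2 for v in dist}       # steps 2 and 3
    return sum(dist[v] * squared_diffs[v] for v in dist)   # step 4

die = {face: Fraction(1, 6) for face in range(1, 7)}
fair_coin = {1: Fraction(1, 2), 0: Fraction(1, 2)}
p = Fraction(1, 10)   # illustrative bias
biased_coin = {1: p, 0: 1 - p}

print(variance(die))          # 35/12, about 2.92
print(variance(fair_coin))    # 1/4
print(variance(biased_coin))  # 9/100, which is p*(1 - p)
```

Using exact fractions avoids rounding, so the die result appears as 35/12 rather than 2.92.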

The variance of a random variable is related to how much the outcomes are 
spread out away from the mean. Note well that the variance in Eq. (3.19) involves 
first squaring the differences from the mean, and then finding the expectation value 
of these squares. If instead you first find the expectation value of the differences 
from the mean, and then square the result, you will obtain zero. This is true because 

$$\big(E(X - \mu)\big)^2 = \big(E(X) - \mu\big)^2 = (\mu - \mu)^2 = 0. \tag{3.23}$$


We would obtain zero here even without the squaring operation, of course. 

The variance depends only on the spread of the outcomes relative to the mean, 
and not on the mean itself. For example, if we relabel the faces on a die by adding
100, so that they are now 101 through 106, then the mean changes significantly to
103.5. But the variance remains at the 2.92 value we found in the first example 
above, because all of the differences from the mean are the same as for a normal 
die. 

If a is a numerical constant, then the variance of aX equals a^2 Var(X). This
follows from the definition of the variance in Eq. (3.19), along with the result in
Eq. (3.13). The latter tells us that E(aX) = aE(X) = aμ, so the former (along with
another application of the latter) gives

$$\begin{aligned}
\mathrm{Var}(aX) &= E\big[\big((aX) - (a\mu)\big)^2\big] = E\big[a^2 (X - \mu)^2\big] \\
&= a^2 E\big[(X - \mu)^2\big] = a^2\, \mathrm{Var}(X),
\end{aligned} \tag{3.24}$$

as desired.
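
Both properties (no change under a shift, and a factor of a^2 under scaling) can be checked on the die. The sketch below is a minimal Python illustration; the helper `variance` and the choice a = 3 are our own assumptions.

```python
from fractions import Fraction

def variance(dist):
    """Variance of a distribution given as {value: probability}."""
    mu = sum(prob * value for value, prob in dist.items())
    return sum(prob * (value - mu) ** 2 for value, prob in dist.items())

die = {face: Fraction(1, 6) for face in range(1, 7)}

# Relabel the faces by adding 100 (the shift discussed in the text).
shifted_die = {face + 100: prob for face, prob in die.items()}

# Multiply every face by a constant a (a = 3 is an arbitrary choice).
a = 3
scaled_die = {a * face: prob for face, prob in die.items()}

print(variance(die))          # 35/12
print(variance(shifted_die))  # 35/12, unchanged by the shift
print(variance(scaled_die))   # 105/4, which is a**2 * 35/12
```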

The variance of the sum of two independent variables turns out to be the sum 
of the variances of the two variables, as we show in the following theorem. Due to 
the nonlinearity of X in E[(X - μ)^2], it isn’t so obvious that the variances should
simply add linearly. But they indeed do.


Theorem 3.3 Let X and Y be two independent random variables. Then 


$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) \qquad \text{(independent variables)} \tag{3.25}$$


Proof: We know from Eq. (3.7) that the mean of X + Y is μ_X + μ_Y. So

$$\begin{aligned}
\mathrm{Var}(X + Y) &= E\Big[\big((X + Y) - (\mu_X + \mu_Y)\big)^2\Big] \\
&= E\Big[\big((X - \mu_X) + (Y - \mu_Y)\big)^2\Big] \\
&= E\big[(X - \mu_X)^2\big] + 2E\big[(X - \mu_X)(Y - \mu_Y)\big] + E\big[(Y - \mu_Y)^2\big] \\
&= \mathrm{Var}(X) + 0 + \mathrm{Var}(Y).
\end{aligned} \tag{3.26}$$

The zero here arises from the fact that X and Y (and hence X - μ_X and Y - μ_Y) are
independent variables, which from Eq. (3.16) implies that the expectation value of
the product equals the product of the expectation values. That is,

$$\begin{aligned}
E\big[(X - \mu_X)(Y - \mu_Y)\big] &= E(X - \mu_X) \cdot E(Y - \mu_Y) \\
&= \big(E(X) - \mu_X\big) \cdot \big(E(Y) - \mu_Y\big) \\
&= (\mu_X - \mu_X) \cdot (\mu_Y - \mu_Y) \\
&= 0. \quad \blacksquare
\end{aligned} \tag{3.27}$$


Example (Two coins): Let’s verify that Eq. (3.25) holds if we define X and Y to each
be the result of independent coin flips where we assign the value 1 to Heads and 0 to
Tails. The random variable X + Y takes on the values of 0, 1, and 2 with probabilities
1/4, 1/2, and 1/4, respectively. The expectation value of X + Y is 1, so the variance is

$$\mathrm{Var}(X + Y) = \frac{1}{4}\big[(0 - 1)^2\big] + \frac{1}{2}\big[(1 - 1)^2\big] + \frac{1}{4}\big[(2 - 1)^2\big] = \frac{1}{2}. \tag{3.28}$$

And we know from Eq. (3.21) that the variance of each single coin flip is Var(X) =
Var(Y) = 1/4. So it is indeed true that Var(X + Y) = Var(X) + Var(Y).

As an example of a setup involving dependent random variables, where Eq. (3.25)
does not hold, consider again the above two coins. But let’s now stipulate that the
second coin always shows the same side as the first coin. So the 1’s are always paired
together, as are the 0’s. There are then only two (equally likely) outcomes for X + Y,
namely 0 + 0 = 0 and 1 + 1 = 2. The expectation value of X + Y is 1, so the variance is

$$\mathrm{Var}(X + Y) = \frac{1}{2}\big[(0 - 1)^2\big] + \frac{1}{2}\big[(2 - 1)^2\big] = 1, \tag{3.29}$$

which is not equal to Var(X) + Var(Y) = 1/2.
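
The independent and dependent two-coin setups can be compared by listing the equally likely sums. This is a minimal Python sketch (the helper `variance` is our own, written for a list of equally likely outcomes).

```python
from fractions import Fraction
from itertools import product

def variance(outcomes):
    """Variance of a list of equally likely outcomes."""
    mu = Fraction(sum(outcomes), len(outcomes))
    return sum((x - mu) ** 2 for x in outcomes) / len(outcomes)

coin = (0, 1)  # Tails = 0, Heads = 1

independent_sums = [x + y for x, y in product(coin, coin)]  # 0, 1, 1, 2
dependent_sums = [x + x for x in coin]                      # 0, 2

var_single = variance(list(coin))
print(var_single)                   # 1/4
print(variance(independent_sums))   # 1/2
print(variance(dependent_sums))     # 1
```

Only in the independent case does the variance of the sum equal twice the single-flip variance.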


Repeated application of Eq. (3.25) gives the variance of the sum of an arbitrary
number of independent variables as

$$\mathrm{Var}(X_1 + X_2 + \cdots + X_n) = \mathrm{Var}(X_1) + \mathrm{Var}(X_2) + \cdots + \mathrm{Var}(X_n). \tag{3.30}$$

By “repeated application” we mean the following. Let the Y in Eq. (3.25) be equal
to X_n, and let the X be the sum of X_1 through X_{n-1}. This gives

$$\mathrm{Var}(X_1 + X_2 + \cdots + X_n) = \mathrm{Var}(X_1 + X_2 + \cdots + X_{n-1}) + \mathrm{Var}(X_n). \tag{3.31}$$


Then repeat the process with Y = X_{n-1} and with X equal to the sum of X_1 through
X_{n-2}. And so on. This eventually yields Eq. (3.30).

If all of the X_i are independent and identically distributed random variables
(i.i.d. variables, for short), then Eq. (3.30) gives

$$\mathrm{Var}(X_1 + X_2 + \cdots + X_n) = n\, \mathrm{Var}(X) \qquad \text{(i.i.d. variables)} \tag{3.32}$$

where X represents any one of the X_i. For example, we can flip a coin n times
and write down the total number of Heads obtained. (In doing this, we’re effectively
assigning the value 1 to Heads and 0 to Tails.) This sum of n independent
and identically distributed coin flips is the binomial process we discussed in
Section 1.8. Since we know the “1/4” result in Eq. (3.21) for the variance of a single
flip, Eq. (3.32) gives the variance of the binomial process as nVar(X) = n/4. More
generally, if we have a biased coin with P(Heads) = p and P(Tails) = 1 - p = q,
then the combination of Eqs. (3.22) and (3.32) tells us that the variance of the
number of Heads in n flips is

$$\mathrm{Var}(\text{Heads in } n \text{ flips}) = npq \qquad \text{(biased coin)} \tag{3.33}$$
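
The npq result can be checked without using the addition rule for variances, by computing the variance directly from the binomial probabilities. The following minimal Python sketch does this; the function name `binomial_variance` and the values n = 10 and p = 1/3 are assumptions made only for illustration.

```python
from fractions import Fraction
from math import comb

def binomial_variance(n, p):
    """Variance of the number of Heads in n flips, computed from the
    binomial probabilities rather than from the npq shortcut."""
    q = 1 - p
    probs = {k: comb(n, k) * p**k * q**(n - k) for k in range(n + 1)}
    mu = sum(prob * k for k, prob in probs.items())
    return sum(prob * (k - mu) ** 2 for k, prob in probs.items())

n = 10
p = Fraction(1, 3)   # illustrative bias
print(binomial_variance(n, p))               # 20/9
print(n * p * (1 - p))                       # 20/9
print(binomial_variance(n, Fraction(1, 2)))  # 5/2, which is n/4
```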


Remark: As mentioned in the remark following Eq. (3.15), the sum X_1 + X_2 + ··· + X_n
in Eq. (3.32) is not the same as nX. Although the random variables X_i are all identically
distributed, that certainly doesn’t mean that their values are identical. The values of the X_i
will generally be different. So when forming the sum, we can’t just take one of the values and
multiply it by n. Although the expectation-value statement in Eq. (3.15) happens to remain
true if we replace the sum X_1 + X_2 + ··· + X_n with nX, the variance statement in Eq. (3.32)
does not remain true. From Eq. (3.24), the variance of nX equals n^2 Var(X), which isn’t the
same as the nVar(X) result in Eq. (3.32). ♣
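
The difference between a sum of n independent trials and n times a single trial shows up clearly for dice. Here is a minimal Python sketch (the helper `variance` and the choice n = 3 are ours, for illustration only) that enumerates both cases exactly.

```python
from fractions import Fraction
from itertools import product

def variance(outcomes):
    """Variance of a list of equally likely outcomes."""
    mu = Fraction(sum(outcomes), len(outcomes))
    return sum((x - mu) ** 2 for x in outcomes) / len(outcomes)

n = 3
die = range(1, 7)

# Sum of n independent rolls: enumerate all 6**n equally likely sequences.
sums_of_n_rolls = [sum(rolls) for rolls in product(die, repeat=n)]

# n times a single roll: only six equally likely values.
n_times_one_roll = [n * face for face in die]

print(variance(sums_of_n_rolls))    # 35/4, which is n * 35/12
print(variance(n_times_one_roll))   # 105/4, which is n**2 * 35/12
```

The two lists have the same mean (10.5), but their variances differ by a factor of n.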


When dealing with the product of two random variables, it turns out that the
equation analogous to Eq. (3.16) for expectation values does not hold for variances,
even if X and Y are independent. That is, it is not true that Var(XY) =
Var(X)Var(Y). See Problem 3.6 for an example showing that this equality doesn’t
hold.

It is often useful to write the variance, which we defined in Eq. (3.19), in the 
following alternative form: 


$$\mathrm{Var}(X) = E(X^2) - \mu^2 \tag{3.34}$$


That is, the variance equals the expectation value of the square, minus the square of 
the expectation value. This can be demonstrated as follows. Starting with Eq. (3.19), 
we have 


$$\begin{aligned}
\mathrm{Var}(X) &= E\big[(X - \mu)^2\big] \\
&= E\big[X^2 - 2\mu \cdot X + \mu^2\big] \\
&= E(X^2) - 2\mu \cdot E(X) + \mu^2 \\
&= E(X^2) - 2\mu^2 + \mu^2 \\
&= E(X^2) - \mu^2,
\end{aligned} \tag{3.35}$$

as desired. We have used the fact that E(X) means the same thing as μ. And we
have used Eq. (3.13) (which says that the expectation value of the sum equals the
sum of the expectation values) to go from the second line to the third line. You
can quickly verify that this expression for Var(X) gives the same variances that we
found in the three examples near the beginning of this section; see Problem 3.7.
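
For readers who like to check such identities numerically, the sketch below compares the definition in Eq. (3.19) with the alternative form in Eq. (3.34) for the die. It is a minimal Python illustration; the function names are our own.

```python
from fractions import Fraction

def expectation(dist, func=lambda x: x):
    """E[func(X)] for a distribution given as {value: probability}."""
    return sum(prob * func(value) for value, prob in dist.items())

def variance_by_definition(dist):
    """E[(X - mu)^2], as in Eq. (3.19)."""
    mu = expectation(dist)
    return expectation(dist, lambda x: (x - mu) ** 2)

def variance_alternative(dist):
    """E(X^2) minus the square of E(X), as in Eq. (3.34)."""
    return expectation(dist, lambda x: x ** 2) - expectation(dist) ** 2

die = {face: Fraction(1, 6) for face in range(1, 7)}

print(expectation(die, lambda x: x ** 2))  # 91/6
print(variance_by_definition(die))         # 35/12
print(variance_alternative(die))           # 35/12
```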

Variance of a set of numbers 

In the above discussion of the variance, the definition in Eq. (3.19) was based on
a random variable X with a given probability distribution. We can, however, also
define the variance for an arbitrary set of numbers, even if they don’t have anything
to do with a probability distribution. Given an arbitrary set S of n numbers,
x_1, ..., x_n, let their average (or mean) be denoted by x̄. We’ll also occasionally use
⟨x⟩ to denote the average:

$$\text{Average:} \quad \bar{x} = \langle x \rangle = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{3.36}$$

Then the variance of the set S is defined to be

$$\mathrm{Var}(S) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad \text{(for a set } S \text{ of numbers)} \tag{3.37}$$

In words: the variance of the set S is the average value of the square of the difference 
from the mean. Note the slight difference between the preceding sentence and the 
sentence following Eq. (3.19). That sentence involved the “expectation value of 
the square...,” whereas the present sentence involves the “average value of the 
square....” This distinction is due to the fact that (as we noted near the beginning of
Section 3.1) the term “expectation value” is relevant to a probability distribution for
a random variable X. If you are instead simply given a set S of numbers, then you 
can take their average, but it doesn’t make sense to talk about an expectation value, 
because there are no future trials for which you can expect anything. (Technically, 
if you are imagining that the set S of numbers came from a probability distribution, 
then you can talk about the best guess for the expectation value of the distribution. 
But we won’t get into that here.) 

As an example, if we have the set S of four numbers, 2.3, 5.6, 3.8, and 4.7, then 
the average is 4.1, so the variance is 

$$\begin{aligned}
\mathrm{Var}(S) &= \frac{1}{4} \big[ (2.3 - 4.1)^2 + (5.6 - 4.1)^2 + (3.8 - 4.1)^2 + (4.7 - 4.1)^2 \big] \\
&= 1.485.
\end{aligned} \tag{3.38}$$

Note that all of the numbers are weighted equally here. This isn’t the case (in 
general) when calculating the variance in Eq. (3.19). 

Later on in Section 3.5 we’ll encounter a slightly modified version of Eq. (3.37)
called the “sample variance,” which has an n - 1 instead of an n in the denominator.
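
The four-number example can be reproduced in a few lines. This is a minimal Python sketch; `set_variance` is our own helper, while `statistics.pvariance` and `statistics.variance` are the standard-library functions that use n and n - 1 in the denominator, respectively.

```python
import statistics

numbers = [2.3, 5.6, 3.8, 4.7]

def set_variance(values):
    """Average squared difference from the mean, with n in the denominator."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

print(round(sum(numbers) / len(numbers), 3))    # 4.1
print(round(set_variance(numbers), 3))          # 1.485
print(round(statistics.pvariance(numbers), 3))  # 1.485, same quantity
print(round(statistics.variance(numbers), 3))   # 1.98, the n - 1 version
```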


3.3 Standard deviation 

The standard deviation of a random variable (or equivalently, of a probability
distribution) is defined to be the square root of the variance:

$$\sigma_X = \sqrt{\mathrm{Var}(X)} \tag{3.39}$$

As with the mean μ, the subscript X is usually dropped if there is no ambiguity
about which random variable we are referring to. With the definition in Eq. (3.39),
we can write the variance as σ_X^2. You will often see this notation for the variance,
since it is quicker to write than Var(X), and even quicker if you drop the subscript
X. Like the variance, the standard deviation gives a rough measure of how much
the outcomes are spread out away from the mean. We’ll draw some pictures below
that demonstrate this.

From Eqs. (3.19) and (3.34), we can write the standard deviation in two equivalent
ways:

$$\sigma = \sqrt{E\big[(X - \mu)^2\big]} = \sqrt{E(X^2) - \mu^2} \tag{3.40}$$


Using the first of these forms, the steps in finding the standard deviation are the 
same as in finding the variance, with a square root tacked on the end: 


1. Find the mean. 

2. Find all the differences from the mean. 

3. Square each of these differences.

4. Find the expectation value of these squares.

5. Take the square root of this expectation value.

As with the variance, the standard deviation depends only on the spread relative 
to the mean, and not on the mean itself. If we relabel the faces on a die by adding 
100, so that they are now 101 through 106, then the mean changes significantly to 
103.5, but the standard deviation remains at √2.92 = 1.71 (using the 2.92 value for
the variance in Eq. (3.20)).

Since the standard deviation is simply the square root of the variance, we can 
quickly translate all of the statements we made about the variance in Section 3.2 
into statements about the standard deviation. Let’s list them out. 


• From Eq. (3.24) the standard deviation of aX is just a times the standard
deviation of X (for a positive constant a; for a negative constant the factor is |a|):

$$\sigma_{aX} = a\, \sigma_X. \tag{3.41}$$


• If X and Y are two independent random variables, then Eq. (3.25) becomes

$$\sigma_{X+Y}^2 = \sigma_X^2 + \sigma_Y^2 \qquad \text{(independent variables)} \tag{3.42}$$

This is the statement that standard deviations “add in quadrature” for independent
variables.


• The more general statement in Eq. (3.30) can similarly be rewritten as (again
only for independent variables)

$$\sigma_{X_1+X_2+\cdots+X_n}^2 = \sigma_{X_1}^2 + \sigma_{X_2}^2 + \cdots + \sigma_{X_n}^2 \tag{3.43}$$

Taking the square root of Eq. (3.43) gives (again only for independent variables):

$$\sigma_{X_1+X_2+\cdots+X_n} = \sqrt{\sigma_{X_1}^2 + \sigma_{X_2}^2 + \cdots + \sigma_{X_n}^2} \tag{3.44}$$


• If all of the X_i are independent and identically distributed random variables,
then Eq. (3.44) becomes

$$\sigma_{X_1+X_2+\cdots+X_n} = \sqrt{n}\, \sigma_X \qquad \text{(i.i.d. variables)} \tag{3.45}$$


• From Eq. (3.22) the standard deviation of a single flip of a biased coin (with
Heads equalling 1 and Tails equalling 0) is

$$\sigma = \sqrt{pq}. \tag{3.46}$$


• If we flip the biased coin n times, then from either Eq. (3.33) or Eq. (3.45),
the standard deviation of the number of Heads is

$$\sigma = \sqrt{npq} \qquad \text{($n$ biased coins)} \tag{3.47}$$

For a fair coin (p = q = 1/2), this equals

$$\sigma = \sqrt{\frac{n}{4}} \qquad \text{($n$ fair coins)} \tag{3.48}$$

For example, the standard deviation of the number of Heads in n = 100 fair
coin flips is σ = √(100/4) = 5. This is a handy fact to remember. The standard
deviations for other numbers of flips can then quickly be determined by using
the fact that σ is proportional to √n. For example, 1000 is 10 times 100, and
√10 ≈ 3, so the σ for n = 1000 flips is about 3·5 = 15. (It’s actually more
like 16.) Similarly, 10,000 is 100 times 100, and √100 = 10, so the σ for
n = 10,000 flips is 10·5 = 50.
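
The square-root scaling in the example above is easy to tabulate. Here is a minimal Python sketch (the function name `heads_std` is our own choice) that evaluates √(npq) for the three values of n just mentioned.

```python
from math import sqrt

def heads_std(n, p=0.5):
    """Standard deviation of the number of Heads in n flips: sqrt(n*p*q)."""
    return sqrt(n * p * (1 - p))

for n in (100, 1000, 10_000):
    print(n, round(heads_std(n), 2))
# 100 5.0
# 1000 15.81
# 10000 50.0
```

The middle value confirms that the σ for 1000 flips is closer to 16 than to 15.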


• In terms of σ, Eq. (3.34) becomes

$$\sigma^2 = E(X^2) - \mu^2. \tag{3.49}$$

If we solve for E(X^2) here, we see that the expectation value of the square of
a random variable X is

$$E(X^2) = \sigma^2 + \mu^2 \tag{3.50}$$

This result checks in two limits. First, if μ = 0 then Eq. (3.50) says that σ^2
(which is the variance) equals E(X^2). This agrees with what Eq. (3.19) says
when μ equals zero. Second, if σ = 0 then Eq. (3.50) says that E(X^2) equals
μ^2. This makes sense, because if σ = 0 then there is no spread in the possible
outcomes. That is, there is only one possible outcome, which must then be μ,
by definition; the expectation value of one number is simply that number. So
E(X^2) = μ^2.


As mentioned above, the standard deviation (like the variance) gives a rough
measure of how much the outcomes are spread out away from the mean. This measure
is actually a much more appropriate one than the variance’s measure, because
whereas the units of the variance are the same as X^2, the units of the standard
deviation are the same as X. It therefore makes sense to draw the standard deviation in
the same figure as the plot of the probability distribution for the various outcomes
(with the X values lying on the horizontal axis). We’ll talk much more about plots
of probability distributions in Chapter 4, but for now we’re concerned only with
what the standard deviation looks like when superimposed on the plot.


Example: Fig. 3.1 shows four examples of the standard deviation superimposed on 
the probability distribution. The commentary on each plot is as follows. 

• First plot: For a die roll, the probability of each of the six numbers is 1/6. And
since the variance in Eq. (3.20) is 2.92, the standard deviation is σ = √2.92 =
1.71. This is the rough spread of the outcomes, relative to the mean (which is
3.5). Some outcomes lie inside the range of ±σ around the mean, and some lie
outside.

Figure 3.1: Four different probability distributions and standard deviations.

• Second plot: For a fair coin flip (with Heads = 1 and Tails = 0), Eq. (3.48)
gives the standard deviation as σ = 1/2. Both outcomes therefore lie right at
the ±σ locations relative to the mean (which is 1/2). This makes sense; all of
the outcomes (there are only two of them) are a distance of 1/2 from the mean.

• Third plot: For a biased coin flip (again with Heads = 1 and Tails = 0), we have
assumed that the probabilities are p = 1/10 for Heads and 1 - p = 9/10 = q for
Tails. So Eq. (3.46) gives the standard deviation as σ = √((1/10)(9/10)) = 3/10.
As noted prior to Eq. (3.22), the mean of the flip is p, which is 1/10 here. The
outcome of 0 lies inside the range of ±σ around the mean, while the outcome
of 1 lies (far) outside.


• Fourth plot: For n flips of a fair coin, Eq. (3.48) gives the standard deviation of
the number of Heads as

$$\sigma = \sqrt{\frac{n}{4}} = \frac{\sqrt{n}}{2}. \tag{3.51}$$

If we pick n to be 20, then we have σ = √20/2 = 2.24. Five outcomes lie
inside the range of ±σ around the mean (which is 10), while the other 16 lie
outside. Although there are more outcomes outside, and additionally some of
them are far away from the mean, their probabilities are small, so they don’t
have an overwhelming influence on σ = √(E[(X - μ)^2]).
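
The counting in the fourth plot can be confirmed with a short calculation. The sketch below is a minimal Python illustration (variable names are ours); it uses the binomial probabilities for 20 fair flips to list the outcomes within ±σ of the mean and to add up their probabilities.

```python
from math import comb, sqrt

n = 20
mean = n / 2
sigma = sqrt(n) / 2   # about 2.24

# Probability of each possible number of Heads in n fair flips.
probs = {k: comb(n, k) / 2**n for k in range(n + 1)}

inside = [k for k in probs if abs(k - mean) <= sigma]
outside = [k for k in probs if abs(k - mean) > sigma]

print(inside)                                   # [8, 9, 10, 11, 12]
print(len(outside))                             # 16
print(round(sum(probs[k] for k in inside), 3))  # 0.737
```

So although only 5 of the 21 outcomes lie within ±σ, they carry roughly three quarters of the total probability.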
In all of the above cases, σ gives a rough measure of the spread of the outcomes.
More precisely, Eq. (3.40) tells us that σ is the square root of the expectation value
of the square of the distance from the mean. This is a mouthful, so you might be
wondering: if we want to get a rough idea of the spread of the various outcomes,
why don’t we use something simpler? For example, we could just calculate the
expected distance from the mean, that is, E(|X - μ|). The absolute value bars
here produce the various distances (which are nonnegative quantities, by definition).
Although this is a perfectly reasonable definition, it is also a messy one. Quantities
involving absolute values are somewhat artificial, because if x - μ is negative, then
we have to throw in a minus sign by hand and say that |x - μ| = -(x - μ). In
contrast, the square of a quantity (which always yields a nonnegative number) is a
very natural thing. Additionally, the standard deviation defined by Eq. (3.39) has
some nice properties, one of which is Eq. (3.42). An analogous statement (with or
without the squares) wouldn’t hold in general if we defined σ as E(|X - μ|). A
quick counterexample is provided by two independent coin flips (with Heads = 1
and Tails = 0), as you can verify. The “σ” for each flip would be 1/2, and the “σ”
for the sum of the flips would also be 1/2.
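
The two-coin counterexample can be verified by enumeration. Here is a minimal Python sketch (the helper name `mean_abs_deviation` is ours) that computes E(|X - μ|) for one flip and for the sum of two independent flips.

```python
from fractions import Fraction
from itertools import product

def mean_abs_deviation(outcomes):
    """E(|X - mu|) for a list of equally likely outcomes."""
    mu = Fraction(sum(outcomes), len(outcomes))
    return sum(abs(x - mu) for x in outcomes) / len(outcomes)

coin = (0, 1)  # Tails = 0, Heads = 1
sums = [x + y for x, y in product(coin, coin)]  # 0, 1, 1, 2

d_single = mean_abs_deviation(list(coin))
d_sum = mean_abs_deviation(sums)

print(d_single)          # 1/2
print(d_sum)             # 1/2
print(2 * d_single)      # 1, so adding linearly fails
print(2 * d_single**2)   # 1/2, but d_sum**2 is 1/4, so quadrature fails too
```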


3.4 Standard deviation of the mean 

Consider the fourth plot in Fig. 3.1, which shows the probability distribution and 
the standard deviation for the number of Heads in 20 fair coin tosses. What if we 
are instead concerned not with the total number of Heads in 20 coin tosses, but 
rather with the average number of Heads per toss (averaged over the 20 tosses)? 
For example, if we happen to get 12 Heads in the 20 tosses (which, by looking at 
the plot, has a probability of about 12%), then the average number of Heads per toss 
is obtained by dividing 12 by 20, yielding 12/20 = 0.6. 

In the same manner, to obtain the entire probability distribution for the average 
number of Heads per toss, we simply need to keep the same dots in the plot in 
Fig. 3.1, but divide the numbers on the x axis by 20. This gives the probability 
distribution shown in Fig. 3.2(a). An average of 0.5 Heads per toss is of course the 
most likely average, and it occurs with a probability of about 18% (the same as the 
probability of getting a total of 10 Heads in 20 tosses). 

The standard deviation of the total number of Heads that appear in n tosses is
the σ_tot = √n/2 result in Eq. (3.51). The standard deviation of the average number
of Heads per toss is therefore

$$\sigma_{\rm avg} = \frac{\sigma_{\rm tot}}{n} = \frac{\sqrt{n}/2}{n} = \frac{1}{2\sqrt{n}}. \tag{3.52}$$

This is true because if X_tot represents the total number of Heads in n flips, and if
X_avg represents the average number of Heads per flip, then X_avg = X_tot/n. Eq. (3.41)
then gives σ_avg = σ_tot/n. Equivalently, a given span of the x axis in Fig. 3.2(a) has
only 1/n (with n = 20) the length of the corresponding span in the fourth plot in
Fig. 3.1. So the spread in Fig. 3.2(a) is 1/n times the spread in Fig. 3.1. With n = 20,
Eq. (3.52) gives the standard deviation of the average (which is usually called the
standard deviation of the mean) as σ_avg = 1/(2√20) = 0.11. This is indicated in
Fig. 3.2(a); the bar shown has a total length of 2σ_avg.

Figure 3.2: The probability distribution for the average number of Heads per toss, for 20 or
2000 coin tosses.

Let’s repeat the above analysis, but now with 2000 tosses. The probability distribution
for the total number of Heads is peaked around 1000, of course. From
Eq. (3.51) the standard deviation of the total number of Heads that appear in 2000
tosses is σ_tot = √2000/2 = 22.4. To obtain the probability distribution for the
average number of Heads per toss (which is peaked around 0.5), we just need to
divide all the numbers on the x axis (associated with the total number of Heads)
by 2000, analogous to what we did with 20 tosses, in going from the fourth plot in
Fig. 3.1 to Fig. 3.2(a). The resulting probability distribution for the average number
of Heads per toss is shown in Fig. 3.2(b). From Eq. (3.52) the standard deviation
of the average number of Heads per toss is σ_avg = 1/(2√2000) = 0.011. This is
indicated in the figure; the (very short) bar shown has a total length of 2σ_avg.

Since 2000 and 20 differ by a factor of 100, the σ_tot for 2000 tosses is 10 times
larger than the σ_tot for 20 tosses, because the result in Eq. (3.51) is proportional to
√n. But σ_avg is 10 times smaller, because the σ_avg in Eq. (3.52) is proportional to
1/√n. The latter of these two facts is why the bump of points in Fig. 3.2(b) is much
thinner than the bump in Fig. 3.2(a). The σ_avg that we have drawn in Fig. 3.2(b) is
barely long enough to be noticeable. But even if you don’t calculate and compare
the standard deviations of the two plots in Fig. 3.2, it is obvious that the bump is
much thinner in Fig. 3.2(b).

Let’s recap what we’ve learned. Although σ_tot is larger (by a factor of 10) in the
n = 2000 case, σ_avg is smaller (by a factor of 10) in the n = 2000 case. The first of
these results deals with the absolute (or additive) deviation σ_tot from the expected
value of n/2, while the second deals with the fractional (or multiplicative) deviation
σ_avg from the expected value of 1/2. The point here is that although the absolute
deviation σ_tot grows with n (which is intuitive), it does so in a manner that is only
proportional to √n. So when this deviation is divided by n when calculating the
average number of Heads, the fractional deviation σ_avg ends up being proportional
to 1/√n, which decreases with n (which might not be so intuitive).

Note that although the expectation value of the average number of Heads per
toss is independent of the number of tosses (it is always 0.5 for a fair coin, as it is
in the two plots in Fig. 3.2), the distribution of the average number of Heads per
toss does depend on the number of tosses. That is, the shapes of the two curves in
Fig. 3.2 are different (on the same scale from 0 to 1 on the x axis). For example,
in the case of 20 tosses, you have a reasonable chance of obtaining an average that
is 0.6 or more. But in the case of 2000 tosses, you are extremely unlikely to obtain
such an average.
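
The contrast between 20 and 2000 tosses can be made quantitative with exact binomial probabilities. The following is a minimal Python sketch (the function name `prob_average_at_least` is hypothetical, chosen for this illustration).

```python
from fractions import Fraction
from math import comb

def prob_average_at_least(n, threshold):
    """Exact probability that the fraction of Heads in n fair flips is
    at least `threshold` (given as a Fraction)."""
    favorable = sum(comb(n, k) for k in range(n + 1)
                    if Fraction(k, n) >= threshold)
    return Fraction(favorable, 2**n)

threshold = Fraction(6, 10)
p_20 = prob_average_at_least(20, threshold)
p_2000 = prob_average_at_least(2000, threshold)

print(float(p_20))     # about 0.25
print(float(p_2000))   # vanishingly small
```

Exact fractions are used because the numbers involved for n = 2000 are far too large for ordinary floating-point arithmetic until the final division.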

Let us formalize the above results with the following theorem. 


Theorem 3.4 Consider a random variable X with standard deviation σ. We make
no assumptions about the shape of the probability distribution. Let X̄ be the random
variable formed by taking the average of n independent trials of the random variable
X. Then the standard deviation of X̄ is given by σ_X̄ = σ_X/√n, which is often
written in the slightly more succinct form,

$$\sigma_{\rm avg} = \frac{\sigma}{\sqrt{n}} \qquad \text{(standard deviation of the mean)} \tag{3.53}$$

This is the standard notation, although technically the letter n should appear as a
label somewhere on the lefthand side of the equation, because the standard deviation
of X̄ depends on the number n of trials that you are averaging over.


Proof: Let the n independent trials of the variable X be labeled X_1, X_2, ..., X_n.
(So the X_i are independent and identically distributed random variables.) Then X̄ is
given by

$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}. \tag{3.54}$$

From Eq. (3.41) the standard deviation of X̄ equals 1/n times the standard deviation
of X_1 + X_2 + ··· + X_n. But from Eq. (3.45) the latter is √n σ. The standard deviation
of X̄ is therefore √n σ/n = σ/√n, as desired. ■


In short (as we’ve mentioned a number of times), the above proof comes down to 
the fact that Eq. (3.45) says that the standard deviation of the sum of n independent 
and identical trials grows with n, but only like √n. When we take the average and
divide by n, we obtain a standard deviation of the mean that is smaller than the
original σ by a factor of √n.
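
Since Theorem 3.4 makes no assumption about the shape of the distribution, we can test it on something other than a coin. The sketch below is a minimal Python illustration (the helper names `std` and `std_of_mean` are our own); it enumerates every sequence of n die rolls and compares the standard deviation of the average with σ/√n.

```python
from itertools import product
from math import sqrt

def std(outcomes):
    """Standard deviation of a list of equally likely outcomes."""
    mu = sum(outcomes) / len(outcomes)
    return sqrt(sum((x - mu) ** 2 for x in outcomes) / len(outcomes))

def std_of_mean(values, n):
    """Standard deviation of the average of n independent trials, found by
    enumerating all len(values)**n equally likely sequences."""
    averages = [sum(seq) / n for seq in product(values, repeat=n)]
    return std(averages)

die = range(1, 7)
sigma_single = std(list(die))   # about 1.71

for n in (1, 2, 3, 4):
    # The two printed columns should agree up to rounding.
    print(n, round(std_of_mean(die, n), 4), round(sigma_single / sqrt(n), 4))
```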

More generally, if we are concerned with the average of n different random
variables with different standard deviations, we can use Eqs. (3.41) and (3.44) to
say

$$\sigma_{\rm avg} = \frac{1}{n}\, \sigma_{X_1+X_2+\cdots+X_n} = \frac{1}{n} \sqrt{\sigma_{X_1}^2 + \sigma_{X_2}^2 + \cdots + \sigma_{X_n}^2}. \tag{3.55}$$

This reduces to Eq. (3.53) when all of the σ_{X_i} are equal (even if the distributions
aren’t the same).

The thinness of the curve in Fig. 3.2(b), which is a consequence of the √n in
the denominator in Eq. (3.53), is consistent with the “law of large numbers.” This
law says that if you perform a very large number of trials, the observed average will
likely be very close to the theoretically predicted average. In a little more detail:
many probability distributions, such as the ones in Fig. 3.2, are essentially Gaussian
(or “normal” or “bell-curve”) in shape. And it can be shown numerically that for
a Gaussian distribution, the probability of lying within one standard deviation from
the mean (that is, in the range μ ± σ) is 68%, the probability of lying within two
standard deviations from the mean is 95%, and the probability of lying within three
standard deviations from the mean is 99.7%. For wider ranges, the probability is
effectively 1, for most practical purposes. This is why we mentioned above that for
2000 coin tosses, the average number of Heads per toss is extremely unlikely to be
0.6 or larger. Since 0.6 exceeds the mean 0.5 by 0.1, and since σ_avg = 0.011 in
this case, we see that 0.6 is about nine standard deviations above the mean. The
probability of being more than nine standard deviations above the mean is utterly
negligible (it’s about 10^-19).
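
The Gaussian percentages quoted above can be reproduced with the error function from Python's standard `math` module. This is a minimal sketch under the assumption of an exactly Gaussian distribution; the function names `prob_within` and `prob_above` are ours.

```python
from math import erf, erfc, sqrt

def prob_within(k):
    """Probability that a Gaussian variable lies within k standard
    deviations of its mean."""
    return erf(k / sqrt(2))

def prob_above(k):
    """Probability of lying more than k standard deviations above the mean."""
    return 0.5 * erfc(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973

print(prob_above(9))   # roughly 1e-19
```

The complementary function `erfc` is used for the tail because subtracting a number extremely close to 1 from 1 would lose all precision.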

We threw around a number of terms and results in the preceding paragraph. 
We’ll eventually get to these. Section 4.8 covers the Gaussian distribution, and
Chapter 5 covers the law of large numbers and the central limit theorem. This 
theorem explains why many probability distributions are approximately Gaussian. 


Example (Rolling 10,000 dice): 

(a) 10,000 dice are rolled. What is the expectation value of the total number of 6’s 
that appear? What is the standard deviation of this number? 

(b) What is the expectation value of the average number of 6’s that appear per roll? 
What is the standard deviation of this average? 

(c) Do you think you have a reasonable chance of getting a 6 on at least 20% of the 
rolls? 

Solution: 

(a) The probability of getting a 6 on a given roll is p = 1/6, so the expected total 
number of 6’s that appear in the 10,000 rolls is (1/6) • (10,000) = 1667. To find 
the standard deviation of the total number of 6's, we can assign the value 1 to a 
roll of 6, and a value of 0 to the five other rolls. Since p = 1/6, we’re effectively 
flipping a biased coin that has a p = 1/6 chance of success. From Eq. (3.47) the 
standard deviation of the total number of 6’s that come up in 10,000 rolls is 

$$\sigma_{\rm tot} = \sqrt{npq} = \sqrt{(10{,}000)(1/6)(5/6)} = 37. \tag{3.56}$$

(b) The expectation value of the average number of 6’s that appear per roll equals
1667/10,000 = 1/6, of course. The standard deviation of the average is obtained
from the standard deviation of the total number of 6’s (given in Eq. (3.56))
by dividing by 10,000 (just as we divided by n in the discussion of Fig. 3.2). So
we obtain $\sigma_{\rm avg} = 37/10{,}000 = 0.0037$.

Alternatively, the standard deviation of the average (mean) is obtained from the
standard deviation of a single roll by using Eq. (3.53). Eq. (3.46) gives the
standard deviation of a single roll as $\sigma_{\rm single} = \sqrt{pq} = \sqrt{(1/6)(5/6)} = 0.37$. So
Eq. (3.53) gives

$$\sigma_{\rm avg} = \frac{\sigma_{\rm single}}{\sqrt{n}} = \frac{0.37}{\sqrt{10{,}000}} = 0.0037. \tag{3.57}$$


Note that three different $\sigma$’s have appeared in this problem:

$$\sigma_{\rm single} = 0.37, \qquad \sigma_{\rm tot} = 37, \qquad \sigma_{\rm avg} = 0.0037. \tag{3.58}$$

$\sigma_{\rm tot}$ is obtained from $\sigma_{\rm single}$ by multiplying by $\sqrt{n}$ (see Eq. (3.45) or Eq. (3.47)),
while $\sigma_{\rm avg}$ is obtained from $\sigma_{\rm single}$ by dividing by $\sqrt{n}$ (see Eq. (3.53)). Consistent
with these relations, $\sigma_{\rm avg}$ is obtained from $\sigma_{\rm tot}$ by dividing by n (see
Eq. (3.52)), because averaging involves dividing by n.

(c) You do not have a reasonable chance of getting a 6 on at least 20% of the rolls.
This is true because 20% of the rolls corresponds to 2000 6’s, which is 333
more than the expected number 1667. And 333 is 9 times the standard deviation
$\sigma_{\rm tot} = 37$. The probability of a random process ending up at least nine
standard deviations above the mean is utterly negligible, as we noted in the
discussion preceding this example. Fig. 3.3 shows the probability distribution for
the range of $\pm 4\sigma_{\rm tot}$ around the mean. It is clear from the figure that even if we
had posed the question with 18% (which corresponds to 1800 6’s, which is
about $3.6\,\sigma_{\rm tot}$ above the mean) in place of 20%, the answer would still be that
you do not have a reasonable chance of getting a 6 on at least 18% of the rolls.
The probability is about 0.016%.



Figure 3.3: The probability distribution for the number of 6’s in 10,000 dice 
rolls. 


Alternatively, we can answer the question by working in terms of percentages.
The standard deviation of $\sigma_{\rm avg} = 0.0037$ is equivalent to 0.37%. The difference
between 20% and 16.7% is 3.3%, which is 9 times the standard deviation of
0.37%. The probability is therefore negligibly small.
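
The three standard deviations in this example, and the number of standard deviations by which 2000 or 1800 sixes exceed the mean, can be reproduced with a few lines of Python. This is a sketch only: the variable and function names are ours, and the tail probability uses the Gaussian approximation to the binomial distribution, so it should be read as a rough estimate rather than an exact binomial result.

```python
import math

n = 10_000      # number of dice rolled
p = 1 / 6       # probability of a 6 on a single roll
q = 1 - p

sigma_single = math.sqrt(p * q)             # one roll, Eq. (3.46)
sigma_tot = math.sqrt(n) * sigma_single     # total number of 6's, Eq. (3.47)
sigma_avg = sigma_single / math.sqrt(n)     # average number of 6's per roll, Eq. (3.53)
mean_tot = n * p

def sigmas_above_mean(count):
    """How many standard deviations (of the total) a given count of 6's lies above the mean."""
    return (count - mean_tot) / sigma_tot

def gaussian_tail(z):
    """Gaussian approximation to the probability of ending up at least z standard deviations up."""
    return 0.5 * math.erfc(z / math.sqrt(2))

print(f"sigma_single = {sigma_single:.2f}, sigma_tot = {sigma_tot:.0f}, sigma_avg = {sigma_avg:.4f}")
for count in (2000, 1800):
    z = sigmas_above_mean(count)
    print(f"{count} sixes: {z:.2f} standard deviations above the mean, tail ~ {gaussian_tail(z):.1e}")
```

Because the unrounded values of the mean and of $\sigma_{\rm tot}$ are used, the printed numbers differ slightly from the rounded ones in the text (for example, 2000 sixes comes out a little below nine standard deviations).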

Remark: Interestingly, the probability curve in Fig. 3.3 looks quite symmetric
around the maximum. You might find this surprising, given that the probabilities
of rolling a 6 or not a 6 (namely 1/6 and 5/6) aren’t equal. In the special
case where the two probabilities in the binomial process are both equal to 1/2
(as they are in Fig. 3.2), it is clear that the probability curve should be symmetric.
But when they aren’t equal, there isn’t a simple argument that explains the
symmetry. And indeed, for small numbers of dice rolls, the curve definitely isn’t
symmetric. And also when the whole range (from 0 to 10,000) is included, the
curve definitely isn’t symmetric (the bump is off to the left at the 1667 mark).
But for large numbers of rolls, the curve is approximately symmetric in the
region near the maximum (where the probability is nonnegligible). We’ll show in
Section 5.1 why this is the case. ♣


3.5 Sample variance 

[This section is rather mathematical and can be skipped on a first reading.] 

Our goal in this section is to produce an estimate for the standard deviation of a
probability distribution, given only a collection of randomly chosen values from the
distribution. Up to this point in the book, we have been answering questions involving
known distributions. In this section we’ll switch things up by starting with some
data and then trying to determine the probability distribution (or at least one aspect
of it, namely the standard deviation). We are foraying into the subject of statistics
here, so this section technically belongs more in a statistics book than a probability
book.¹ However, we are including it here partly because it provides a nice excuse
to get some practice with expectation values, and partly because students often find
the factor of $n-1$ in the “sample variance” below in Eq. (3.73) mysterious. We
hope to remedy this.

Recall that for a given probability distribution (or equivalently, for a given random
variable X), the variance is defined in Eq. (3.19) as

$$\mathrm{Var}(X) = E\big[(X-\mu)^2\big], \quad \text{where } \mu = E[X]. \tag{3.59}$$

We usually write Var(X) as $\sigma^2$, because the standard deviation $\sigma$ is defined to be
the square root of the variance. As we noted in Eq. (3.35), the variance also takes
the (often more convenient) form of $\sigma^2 = E(X^2) - \mu^2$. Since $\mu$ is a constant, we
were able to take it outside the E operation in the middle term when going from
the second to third line in Eq. (3.35). In all of our past calculations, the probability
distribution was assumed to be given, so both $\mu$ and $\sigma$ were known quantities.
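
As a quick check that the two forms of the variance agree, here is a small sketch for a known distribution. We assume a fair six-sided die purely for illustration, and we use exact fractions so that no rounding enters.

```python
from fractions import Fraction

# Illustrative known distribution: a fair die, outcomes 1..6, each with probability 1/6.
outcomes = range(1, 7)
prob = Fraction(1, 6)

mu = sum(prob * x for x in outcomes)                            # E[X]
var_definition = sum(prob * (x - mu) ** 2 for x in outcomes)    # E[(X - mu)^2]
var_shortcut = sum(prob * x ** 2 for x in outcomes) - mu ** 2   # E[X^2] - mu^2

print("mu =", mu)
print("E[(X - mu)^2] =", var_definition)
print("E[X^2] - mu^2 =", var_shortcut)
```

Both expressions give the same fraction, as they must, because $\mu$ is a known constant here. The rest of this section deals with the case where it is not.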

¹Although the words “probability” and “statistics” are often used interchangeably in a colloquial
sense, there is a difference between the two. In a nutshell, the difference comes down to what you are
given and what you are trying to find. Probability involves using (perhaps after deriving theoretically)
a probability distribution to predict the likelihood of future outcomes, whereas statistics involves using
observed outcomes to deduce the properties of the underlying probability distribution (invariably with
the eventual goal of using probability to predict the likelihood of future outcomes). Said in a slightly
different way, probability takes you from theory to experiment, and statistics takes you from experiment
to theory.

Consider now a setup where we are working with a probability distribution P(x)
that we don’t know anything about. It may be discrete, or it may be continuous. We
don’t know the functional form of P(x), the mean $\mu$, the standard deviation $\sigma$, or
anything else. These quantities do exist, of course; there is a definite distribution
with definite properties. It’s just that we don’t know what they are. Let’s say that
we try to calculate the variance $\sigma^2$ by picking a random sample of n numbers (call
them $x_1, x_2, \ldots, x_n$) and finding their average $\bar{x}$, and then finding the average value
of $(x_i - \bar{x})^2$. That is, we calculate

$$\tilde{s}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2, \quad \text{where} \quad \bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i. \tag{3.60}$$


Note that we cannot use the mean $\mu$ of the distribution in place of the average $\bar{x}$ of
our n numbers, because we don’t know what $\mu$ is. Although $\bar{x}$ is likely to be close to
$\mu$ (if n is large), it is unlikely to be exactly equal to $\mu$, because there will be random
effects due to the finite size of n.

A word on notation: The $\tilde{s}^2$ in Eq. (3.60) means exactly the same thing as Var(S)
in Eq. (3.37). We have switched notation to $\tilde{s}^2$ simply because it is quicker to write.
We are using $\tilde{s}$ instead of $\sigma$, so that we don’t confuse the sum $(1/n)\sum (x_i - \bar{x})^2$
with the actual variance $\sigma^2$ of the distribution. $\sigma^2$ involves a theoretical expectation
value over the entire distribution, not just a particular set of n numbers. As with
$\bar{x}$ and $\mu$, although $\tilde{s}^2$ is likely to be close to $\sigma^2$ (if n is large), it is unlikely to be
exactly equal to $\sigma^2$. We are using a tilde in $\tilde{s}$ to distinguish it from the plain letter
s, which is reserved for the “sample variance” in Eq. (3.73) below. (Some people
make this distinction by using an uppercase S for $\tilde{s}$.) We should technically be
putting a subscript n on both $\tilde{s}$ and s (and $\bar{x}$), because these quantities depend on
the specific set of n numbers. But we have omitted it to keep the calculations below
from getting too cluttered.

If we want to reduce the effects of the finite size of n, in order to make $\tilde{s}^2$ be
as close as possible to the actual $\sigma^2$ of the distribution, there are two reasonable
things we can do. First, we can take the $n \to \infty$ limit. This will in fact give the
actual $\sigma^2$ of the distribution, as we will show below. But let’s leave this option
aside for now. A second strategy is to imagine picking a huge number N of sets of
n numbers from the distribution, and then taking the average of the N values of $\tilde{s}^2$,
each of which is itself an average of the n numbers $(x_i - \bar{x})^2$. Will this average in
the $N \to \infty$ limit get rid of any effects of the finite size of n and yield the actual $\sigma^2$
for the distribution? It turns out, somewhat surprisingly, that it will not. Instead it
will yield, as we will show below in Theorem 3.5, an average value of $\tilde{s}^2$ equal to

$$\tilde{s}^2_{\rm avg} = \frac{(n-1)\sigma^2}{n}. \tag{3.61}$$

For any finite value of n, this expression for $\tilde{s}^2_{\rm avg}$ is smaller than the actual variance
$\sigma^2$ of the distribution. But $\tilde{s}^2_{\rm avg}$ does approach $\sigma^2$ in the $n \to \infty$ limit.

Note that when we talk about taking the average of $\tilde{s}^2$ over a huge number
$N \to \infty$ of trials, we are equivalently talking about the expectation value of $\tilde{s}^2$. This
is how an expectation value is defined. This expectation value is (using the fact
that the expectation value of the sum equals the sum of the expectation values; see
Eq. (3.7))

$$E[\tilde{s}^2] = E\left[\frac{1}{n}\sum_{i=1}^{n}(X_i-\bar{X})^2\right] = \frac{1}{n}\sum_{i=1}^{n}E\big[(X_i-\bar{X})^2\big], \tag{3.62}$$

where $\bar{X} = (1/n)\sum_{i=1}^{n} X_i$.

In going from Eq. (3.60) to Eq. (3.62) and taking the expectation value of $\tilde{s}^2$, we
have made an important change in notation from the lowercase $x_i$ to the uppercase
$X_i$. A lowercase $x_i$ refers to a specific value of the random variable $X_i$ (where
the $X_i$ are independent and identically distributed random variables, all associated
with the given probability distribution). There is nothing random about $x_i$; it has
a definite value. It would therefore be of no use to take the expectation value of
$(1/n)\sum (x_i - \bar{x})^2$. More precisely, we could take the expectation value if we
wanted to, but it would simply yield the same definite value of $(1/n)\sum (x_i - \bar{x})^2$,
just as the expectation value of the specific number 173.92 is 173.92. In contrast,
the random variable $X_i$ can take on many values, and when taking an expectation
value of an expression involving $X_i$, it is understood that we are averaging over a
large number N of trials involving (generally) many different $x_i$ values of $X_i$.

Before we present the general proof that $E[\tilde{s}^2] = (n-1)\sigma^2/n$, let’s demonstrate
that this result holds in the special case of n = 2, just to make it more believable.


Example: In the n = 2 case, show that $E[\tilde{s}^2] = \sigma^2/2$.


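Solution: If n = 2, then we have two independent and identically distributed random
variables, $X_1$ and $X_2$. The sum in Eq. (3.62) therefore has only two terms in it, so we
obtain

$$E[\tilde{s}^2] = \frac{1}{2}\sum_{i=1}^{2}E\big[(X_i-\bar{X})^2\big] = \frac{1}{2}\left(E\left[\left(X_1-\frac{X_1+X_2}{2}\right)^2\right] + E\left[\left(X_2-\frac{X_1+X_2}{2}\right)^2\right]\right). \tag{3.63}$$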


The terms in parentheses are equal to $\pm(X_1 - X_2)/2$. The overall sign doesn’t matter,
because these quantities are squared. The two expectation values are therefore the
same, so we end up with (using the fact that the expectation value of the sum equals
the sum of the expectation values)

$$E[\tilde{s}^2] = \frac{1}{4}E\big[(X_1-X_2)^2\big] = \frac{1}{4}\Big(E[X_1^2] + E[X_2^2] - 2E[X_1X_2]\Big). \tag{3.64}$$

Let’s look at the two types of terms here. Since $X_1$ and $X_2$ are identically distributed,
the $E[X_1^2]$ and $E[X_2^2]$ terms are equal. And from Eq. (3.50) this common value is
$E[X^2] = \sigma^2 + \mu^2$. (Remember that $\sigma$ and $\mu$ exist and have definite values, even
though we don’t know what they are.) For the $E[X_1X_2]$ term, since $X_1$ and $X_2$ are
independent variables, Eq. (3.16) tells us that $E[X_1X_2] = E[X_1]E[X_2] = \mu\cdot\mu = \mu^2$.
Plugging these results into Eq. (3.64) gives

$$E[\tilde{s}^2] = \frac{1}{4}\Big(2\cdot(\sigma^2+\mu^2) - 2\mu^2\Big) = \frac{\sigma^2}{2}. \tag{3.65}$$

This is consistent with the $E[\tilde{s}^2] = (n-1)\sigma^2/n$ result (with n = 2) that we will
show below. We therefore see that if you want to use $E[\tilde{s}^2]$ (when n = 2) as an
approximation for the actual variance $\sigma^2$ of the given distribution, you will be off by
a factor of 1/2. This isn’t a very good approximation! We will discuss below why
$E[\tilde{s}^2]$ is always an underestimate of $\sigma^2$, for any finite value of n.

Note that the $\mu$ terms canceled out in Eq. (3.65). In retrospect, we know that this must
be the case (for any value of n, not just n = 2), because the original sum in Eq. (3.62)
is independent of $\mu$. This is true because if we shift all of the $X_i$ values by the same
amount, then the average $\bar{X}$ also shifts by this amount, so the differences $X_i - \bar{X}$ are
unchanged.
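
The n = 2 result can also be made believable with a quick simulation. The sketch below is only illustrative: we assume a uniform distribution on [0, 1) (for which $\sigma^2 = 1/12$), an arbitrary number of trials, and a helper name `tilde_s_squared` of our own choosing. Any other distribution could be substituted.

```python
import random

def tilde_s_squared(values):
    """Average squared deviation from the sample average (n in the denominator), as in Eq. (3.60)."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / n

random.seed(0)      # fixed seed so that the run is repeatable
N = 200_000         # number of pairs drawn (an arbitrary illustrative choice)
sigma2 = 1 / 12     # variance of the uniform distribution on [0, 1)

# Average tilde-s^2 over N independent sets, each containing n = 2 numbers.
average = sum(tilde_s_squared([random.random(), random.random()])
              for _ in range(N)) / N

print(f"average of tilde-s^2 over {N} pairs: {average:.5f}")
print(f"sigma^2 / 2:                          {sigma2 / 2:.5f}")
print(f"sigma^2:                              {sigma2:.5f}")
```

The simulated average should land close to $\sigma^2/2$ and well away from $\sigma^2$ itself, up to random fluctuations that shrink as N grows.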


Let’s now prove the general result. The proof is a bit mathematical, but the final 
result will be well worth it. As mentioned at the beginning of this section, we’ll get 
some good practice with expectation values here. 


Theorem 3.5 The expectation value of $\tilde{s}^2$ (where $\tilde{s}^2$ is given in Eq. (3.62)) equals
$(n-1)/n$ times the actual variance $\sigma^2$ of the distribution. That is,

$$E[\tilde{s}^2] = E\left[\frac{1}{n}\sum_{i=1}^{n}(X_i-\bar{X})^2\right] = \frac{(n-1)\sigma^2}{n}. \tag{3.66}$$


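Proof: If we expand the square in Eq. (3.66), we obtain

$$E[\tilde{s}^2] = \frac{1}{n}E\left[\sum_{i=1}^{n}X_i^2 - 2\Big(\sum_{i=1}^{n}X_i\Big)\bar{X} + n\bar{X}^2\right]. \tag{3.67}$$

But $\sum_{i=1}^{n} X_i$ equals $n\bar{X}$, by the definition of $\bar{X}$. We therefore have (using the fact that
the expectation value of the sum equals the sum of the expectation values)

$$E[\tilde{s}^2] = \frac{1}{n}\left(\sum_{i=1}^{n}E[X_i^2] - 2E\big[(n\bar{X})\bar{X}\big] + nE\big[\bar{X}^2\big]\right). \tag{3.68}$$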


As in the above example with n = 2, the $E[X_i^2]$ terms are all equal, because the $X_i$
are identically distributed variables. We’ll label the common value as $E[X^2]$. We
have n such terms, so

$$E[\tilde{s}^2] = \frac{1}{n}\Big(nE[X^2] - 2nE\big[\bar{X}^2\big] + nE\big[\bar{X}^2\big]\Big) = E[X^2] - E\big[\bar{X}^2\big]. \tag{3.69}$$


This result is similar to the result in Eq. (3.35). There is, however, a critical difference.
$\bar{X}$ is now a random variable (being the average of the n random variables $X_i$),
whereas the $\mu$ in Eq. (3.35) was a constant.

Eq. (3.69) contains two terms that we need to evaluate. The $E[X^2]$ term is
simple. From Eq. (3.50) we have

$$E[X^2] = \sigma^2 + \mu^2. \tag{3.70}$$

The $E[\bar{X}^2]$ term is a bit more involved. $\bar{X}$ equals $(1/n)\sum X_i$, so $\bar{X}^2$ equals $1/n^2$
times the square of the sum of the $X_i$. When the sum $(X_1 + X_2 + \cdots + X_n)$ is squared,
there will be n terms like $X_1^2$, $X_2^2$, etc., which are all identically distributed with a
common expectation value of $E[X^2]$. And there will be $\binom{n}{2} = n(n-1)/2$ cross
terms like $2X_1X_2$, $2X_1X_3$, $2X_2X_3$, etc., which are again all identically distributed,
with a common expectation value of, say, $E[2X_1X_2]$. We therefore have

$$\begin{aligned}
E\big[\bar{X}^2\big] &= \frac{1}{n^2}\left(nE[X^2] + \frac{n(n-1)}{2}E[2X_1X_2]\right) \\
&= \frac{1}{n^2}\Big(nE[X^2] + n(n-1)E[X_1]E[X_2]\Big) \\
&= \frac{1}{n^2}\Big(n(\sigma^2+\mu^2) + n(n-1)\mu^2\Big) \\
&= \frac{\sigma^2}{n} + \mu^2.
\end{aligned} \tag{3.71}$$

As in the above example, we have used the fact that the $X_i$’s are independent random
variables, which allows us to write $E[X_1X_2] = E[X_1]E[X_2] = \mu^2$. But $X_1$ isn’t
independent of itself, of course. That is why $E[X_1^2]$ isn’t equal to $E[X_1]E[X_1] = \mu^2$.
Instead, it is equal to $\sigma^2 + \mu^2$.

Substituting Eqs. (3.70) and (3.71) into Eq. (3.69) gives

$$E[\tilde{s}^2] = (\sigma^2+\mu^2) - \left(\frac{\sigma^2}{n}+\mu^2\right) = \left(\frac{n-1}{n}\right)\sigma^2, \tag{3.72}$$

as desired. As noted in the n = 2 example above, the $\mu$ dependence drops out. ■

In Eq. (3.71) we chose to derive the value of $E[\bar{X}^2]$ from scratch by working
through some math, because this type of calculation will be helpful if you want
to tackle Problem 3.12. However, there is a much quicker way to find $E[\bar{X}^2]$.
From Eq. (3.50) we know that the expectation value of the square of a random
variable equals the square of the mean plus the square of the standard deviation.
With $\bar{X} = (X_1 + X_2 + \cdots + X_n)/n$ as our random variable, the mean is $\mu$, of course.
And from Eq. (3.53) the standard deviation is $\sigma/\sqrt{n}$. The $\sigma^2/n + \mu^2$ result in
Eq. (3.71) then immediately follows.
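
For a small discrete distribution, Theorem 3.5 can be verified exactly by listing every possible set of n values. The sketch below assumes, purely for illustration, a fair six-sided die and n = 3; the function name `tilde_s_squared` is ours. Exact fractions are used, so the agreement is not a matter of rounding.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)     # fair die, each face with probability 1/6 (illustrative choice)
n = 3                   # size of each set of numbers (illustrative choice)

mu = Fraction(sum(faces), 6)
sigma2 = sum(Fraction(1, 6) * (x - mu) ** 2 for x in faces)   # actual variance of one roll

def tilde_s_squared(values):
    """(1/n) * sum of (x_i - xbar)^2 for one specific set of values, computed exactly."""
    xbar = Fraction(sum(values), len(values))
    return sum((x - xbar) ** 2 for x in values) / len(values)

# All 6^n outcomes are equally likely, so the expectation value is a plain average over them.
outcomes = list(product(faces, repeat=n))
expected = sum(tilde_s_squared(v) for v in outcomes) / len(outcomes)

print("E[tilde-s^2]       =", expected)
print("(n-1) sigma^2 / n  =", Fraction(n - 1, n) * sigma2)
print("sigma^2            =", sigma2)
```

Changing `n` (keeping it small enough that $6^n$ outcomes can be enumerated) lets you watch the factor $(n-1)/n$ approach 1.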

Let’s recap what the above theorem implies. If you want to determine the true
variance $\sigma^2$ of an unknown distribution by picking numbers, you have two main
options:

• You can pick a huge set of numbers, because in the $n \to \infty$ limit, $\tilde{s}^2$
approaches $\sigma^2$. This is due to two effects. First, the $(n-1)/n$ factor in Eq. (3.72)
approaches 1 in the $n \to \infty$ limit, so $E[\tilde{s}^2]$ equals $\sigma^2$. And second, the
result from Problem 3.12 tells us that the spread of the values of $\tilde{s}^2$ around its
expected value (which is $\sigma^2$ in the $n \to \infty$ limit) goes to zero in the $n \to \infty$
limit. So $\tilde{s}^2$ is essentially guaranteed to be equal to $\sigma^2$.

• You can pick a set with a “normal” size n (say, 20, although a small number
like 2 will work fine too) and calculate the variance $\tilde{s}^2$ of the set of n numbers.
You can then repeat this process a huge number $N \to \infty$ of times and take the
average of the N variances you have calculated. From Eq. (3.53), the standard
deviation of this average will be proportional to $1/\sqrt{N}$ and will therefore be
very small. The average will therefore be very close to the expected value of
$\tilde{s}^2$, which from Eq. (3.72) is $(n-1)\sigma^2/n$. This is always an underestimate of
the actual $\sigma^2$ of the distribution. But if you multiply by $n/(n-1)$, then you
will obtain $\sigma^2$.

In the above proof, we proved mathematically that $E[\tilde{s}^2] = (n-1)\sigma^2/n$. But
is there an intuitive way of at least seeing why $E[\tilde{s}^2]$ is smaller than $\sigma^2$, leaving
aside the exact $(n-1)/n$ factor? Indeed there is, and in the end it comes down to the
fact that $\bar{X}$ is a random variable instead of a constant. In Eq. (3.66) the consequence
of this is that if we look at specific $x_i$ values of the $X_i$ random variables, then the
$(x_1 - \bar{x})^2$ term, for example, is smaller (on average) than $(x_1 - \mu)^2$. This is true
because $\bar{x}$ involves $x_1$, which implies that if $x_1$ is, say, large, then the mean will
be shifted upward slightly toward $x_1$. (The average of the other $n-1$ numbers
equals $\mu$, on average. So the average of all of the n numbers including $x_1$ must lie
a little closer to $x_1$, on average.) This effect is most pronounced for small n, such
as n = 2. Another line of reasoning involves looking at Eqs. (3.71) and (3.72).
$E[\bar{X}^2]$ is larger than $\mu^2$ (by an amount $\sigma^2/n$), due to the fact that the value of $\bar{X}$
generally differs slightly from $\mu$. The square of this difference contributes to $E[\bar{X}^2]$;
see the paragraph immediately following the proof. So a number larger than $\mu^2$ is
subtracted off in Eq. (3.72).

As mentioned above, a quick corollary of Theorem 3.5 is that if we multiply
$E[\tilde{s}^2]$ by $n/(n-1)$, we obtain the actual variance $\sigma^2$ of the distribution. This
suggests that we might want to define a new quantity that is a slight modification of
the $\tilde{s}^2$ in Eq. (3.60). We’ll label it as $s^2$:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 \qquad \text{(sample variance)} \tag{3.73}$$


This quantity $s^2$ is called the sample variance. We’ll discuss this terminology below.
$s^2$ is a function of a particular set of n numbers, $x_1$ through $x_n$, just as $\tilde{s}^2$ is. But the
expectation value of $s^2$ doesn’t depend on n. The combination of Eqs. (3.72) and
(3.73) tells us that the expectation value of $s^2$ is simply $\sigma^2$:

$$E[s^2] = \sigma^2. \tag{3.74}$$


Our original quantity $\tilde{s}^2$ is a biased estimator of $\sigma^2$, in that its expectation value
$E[\tilde{s}^2]$ depends on n and is smaller than $\sigma^2$ by the factor $(n-1)/n$. Our new quantity
$s^2$ is an unbiased estimator of $\sigma^2$, in that its expectation value $E[s^2]$ is independent
of n and equals $\sigma^2$. To summarize, the two quantities

$$\tilde{s}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2 \qquad \text{and} \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 \tag{3.75}$$

have expectation values of

$$E[\tilde{s}^2] = \frac{(n-1)\sigma^2}{n} \qquad \text{and} \qquad E[s^2] = \sigma^2. \tag{3.76}$$

A word on terminology: The quantity $\tilde{s}^2$ is called the “variance” of a particular
set of n numbers, while the quantity $s^2$ is called the “sample variance” of the set.
When talking about the sample variance, it is understood that you are concerned
with producing an estimate of the actual variance of the underlying distribution.
(This variance is often called the population variance, in view of the fact that it
takes into account the entire population of possible outcomes, as opposed to just
a sample of them.) The sample variance $s^2$ has the correct expectation value of
$\sigma^2$. However, this terminology can get a little tricky. What if someone asks you to
compute the variance of a sample of n numbers? Even though the word “sample”
is used here, you should calculate $\tilde{s}^2$, because you are being asked to compute the
variance of a set/sample of numbers, and the variance is defined via Eq. (3.37) or
Eq. (3.60), with an n in the denominator. If someone actually wants you to compute
the sample variance, then they should use this specific term, which is defined to be
$s^2$, with an $(n-1)$ in the denominator. Of course, any ambiguity in terminology
can be eliminated by simply using the appropriate symbol ($\tilde{s}^2$ or $s^2$) in addition to
words.

Terminology aside, which of $\tilde{s}^2$ or $s^2$ should you be concerned with if you are
given a set of n numbers? Well, if you are concerned only with these particular
n numbers and nothing else (in particular, the underlying distribution, if the numbers
came from one), then you should calculate $\tilde{s}^2$. This is the variance of these
numbers.² But if the n numbers come from a distribution or a larger population,
and if you are concerned with making a statement about this distribution or population,
then you should calculate $s^2$, because this gives an unbiased estimate of $\sigma^2$.
However, having said this, it is often the case that n is large enough so that the
distinction between the n and the $n-1$ in the denominators of $\tilde{s}^2$ and $s^2$ doesn’t matter.
To summarize, the three related quantities we have encountered are:


$\sigma^2$: Distribution variance, or population variance.

$\tilde{s}^2$: Variance of a set of n numbers (a biased estimator of $\sigma^2$).

$s^2$: Sample variance of a set of n numbers (an unbiased estimator of $\sigma^2$).
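
In practice, software libraries offer both denominators, and it pays to know which one you are calling. As a small illustration (with made-up data), Python's standard `statistics` module provides `pvariance`, which divides by n and therefore corresponds to $\tilde{s}^2$, and `variance`, which divides by $n-1$ and therefore corresponds to $s^2$.

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up numbers, for illustration only
n = len(data)

variance_of_set = statistics.pvariance(data)   # n in the denominator: tilde-s^2
sample_variance = statistics.variance(data)    # n - 1 in the denominator: s^2

print("tilde-s^2 (variance of the set):", variance_of_set)
print("s^2 (sample variance):          ", sample_variance)
print("ratio s^2 / tilde-s^2:          ", sample_variance / variance_of_set)
```

The ratio of the two results is always $n/(n-1)$, in agreement with Eq. (3.75).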


²If you are concerned only with this set of numbers and nothing else, then you can rightly call the set
a “population,” in which case you can rightly call $\tilde{s}^2$ a “population” variance. But we’ll just call it $\tilde{s}^2$.


Example (n = 100): We proved Theorem 3.5 mathematically, but let’s now give some
numerical evidence that $E[s^2]$ is in fact equal to $\sigma^2$. We’ll arbitrarily choose n = 100.
To demonstrate $E[s^2] = \sigma^2$, we’ll numerically generate $N = 10^5$ sets of n = 100
values from a Gaussian (normal) distribution with $\mu = 0$ and $\sigma = 1$. (Eq. (3.74) holds
for any type of distribution, so our Gaussian choice isn’t important. We’ll discuss the
Gaussian distribution in Section 4.8.) The $\mu$ value here is irrelevant, as we have noted.
The $N = 10^5$ number is large enough so that we’ll pretty much obtain the expectation
value $E[s^2]$; see the second remark below.

The results of a numerical run are shown in Fig. 3.4. For each of the $N = 10^5$ sets
of n = 100 values, we calculated the $s^2$ given by Eq. (3.73). The histogram gives the
distribution of the N values of $s^2$. The average of these N values is 1.00062, which is
very close to $\sigma^2 = 1$, consistent with Eq. (3.74).



Figure 3.4: A histogram of the sample variances $s^2$ of $N = 10^5$ sets of numbers, with
each set consisting of n = 100 numbers chosen from a Gaussian distribution with
$\sigma = 1$.


Remarks:

1. If you are interested in calculating the spread of the histogram, see Problem 3.12.
A corollary to that problem is that if the underlying probability distribution
is Gaussian (so now the Gaussian assumption matters), and if n is large, then
$\mathrm{Var}(s^2) \approx 2\sigma^4/n$. In the present setup with $\sigma = 1$ and n = 100, this gives
$\mathrm{Var}(s^2) \approx 0.02$. The standard deviation of the $s^2$ values is therefore about
$\sqrt{0.02} \approx 0.14$. This is consistent with a visual inspection of the histogram.

2. If we make N larger (say, $10^6$ or $10^7$), the spread of the histogram remains
the same. The standard deviation is still 0.14, because the variance $\mathrm{Var}(s^2) \approx
2\sigma^4/n$ depends only on n, not on N. So the histogram will look the same. As
far as the average value of $s^2$ (which is 1.00062 for the data in Fig. 3.4) goes, a
larger N means that it is more likely to be very close to $\sigma^2 = 1$, due to the result
in Eq. (3.53) for the standard deviation of the mean. (This effect is too small
to see, so the histogram will still look the same.) Remember that the standard
deviation of the average of N independent and identically distributed variables
(the N values of $s^2$ here) is always smaller than the standard deviation of each
of the variables (which is $\sqrt{2\sigma^4/n}$ here), by a factor of $1/\sqrt{N}$.

In the present case, the $\sigma$ in Eq. (3.53) is 0.14, and the n there is now N. If
$N = 10^5$, then Eq. (3.53) says that the standard deviation of the average of the N
values of $s^2$ is $(0.14)/\sqrt{10^5} = 4.4 \cdot 10^{-4}$. Our above numerical result of 1.00062
for the average of $s^2$ is therefore about one and a half standard deviations from
the expected value ($\sigma^2 = 1$), which is quite reasonable. Larger values of N will
cause the average of the N values of $s^2$ to be even closer to 1 (on average). ♣
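
A numerical run of this kind takes only a few lines. The sketch below is not the code behind Fig. 3.4; it is an illustrative reconstruction that uses Python's standard `random` module and a smaller number of sets (N = 5000 instead of $10^5$, an assumption made only to keep the run short). The helper name `sample_variance` is ours.

```python
import math
import random

def sample_variance(values):
    """s^2 of Eq. (3.73): squared deviations from the sample average, divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)

random.seed(1)   # fixed seed so that the run is repeatable
n = 100          # numbers per set, as in the example
N = 5_000        # number of sets; smaller than the 10^5 used in the text

# One value of s^2 for each of the N sets of Gaussian numbers (mu = 0, sigma = 1).
s2_values = [sample_variance([random.gauss(0.0, 1.0) for _ in range(n)])
             for _ in range(N)]

mean_s2 = sum(s2_values) / N
spread_s2 = math.sqrt(sum((v - mean_s2) ** 2 for v in s2_values) / (N - 1))

print(f"average of the N values of s^2:      {mean_s2:.4f}   (compare with sigma^2 = 1)")
print(f"standard deviation of the s^2 values: {spread_s2:.4f}   (compare with sqrt(2/n) = {math.sqrt(2 / n):.4f})")
```

With this smaller N, the average of $s^2$ fluctuates around 1 somewhat more than in the text's run, but the spread of the individual $s^2$ values should still be close to 0.14, because that spread depends only on n.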





























If you want to produce an estimate of the standard deviation $\sigma$ of a distribution,
there are two things you might want to do. You can pick a value of n (say, 10) and
then take the average of a large number N (say, a million) of values of the sample
variance $s^2$ of n numbers. For very large N, this will essentially give you $\sigma^2$, by
Eq. (3.74). You can then take the square root to obtain $\sigma$. This is a valid method
for obtaining $\sigma$. Or, you can take the average of a large number N of values of
the sample standard deviation s, each of which is the square root of the $s^2$ given in
Eq. (3.73). However, this second method will not give you $\sigma$. Your calculated average
will be smaller than $\sigma$; see Problem 3.11. Therefore, although Eq. (3.74) tells
us that the sample variance $s^2$ is an unbiased estimator of the distribution variance
$\sigma^2$, it is not true that the sample standard deviation s is an unbiased estimator of the
distribution standard deviation $\sigma$.
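
The difference between the two methods shows up clearly in a simulation. The following sketch assumes Gaussian numbers with $\sigma = 1$, sets of size n = 10, and N = 20,000 sets (all illustrative choices, with N far smaller than a million to keep the run short). The helper name `sample_variance` is ours.

```python
import math
import random

def sample_variance(values):
    """s^2 of Eq. (3.73): squared deviations from the sample average, divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)

random.seed(2)   # fixed seed so that the run is repeatable
n = 10           # numbers per set
N = 20_000       # number of sets (illustrative; the text suggests a much larger N)

s2_values = [sample_variance([random.gauss(0.0, 1.0) for _ in range(n)])
             for _ in range(N)]

# First method: average the sample variances, then take the square root.
sigma_from_mean_s2 = math.sqrt(sum(s2_values) / N)

# Second method: take the square root of each sample variance, then average.
mean_of_s = sum(math.sqrt(v) for v in s2_values) / N

print(f"sqrt(average of s^2): {sigma_from_mean_s2:.4f}")
print(f"average of s:         {mean_of_s:.4f}")
```

The first number should come out close to $\sigma = 1$, while the second should be a few percent smaller. For any list of positive numbers that are not all equal, the average of the square roots is less than the square root of the average, so the second method always falls short of the first.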


3.6 Summary 

• A random variable is a variable that can take on certain numerical values with 
certain probabilities. A random variable is denoted with an uppercase letter, 
such as X, while the actual values that the variable can take on are denoted 
with lowercase letters, such as x. 

• The expectation value of a random variable X is the expected average value
of the variable, over a large number of trials. It is given by

$$E(X) = p_1x_1 + p_2x_2 + \cdots + p_mx_m, \tag{3.77}$$

where the x’s are the possible outcomes and the p’s are the associated probabilities.
The expectation value of the sum of two variables equals the sum of
the individual expectation values:

$$E(X+Y) = E(X) + E(Y). \tag{3.78}$$

The expectation value of the product of two independent variables equals the
product of the individual expectation values:

$$E(XY) = E(X)\cdot E(Y) \qquad \text{(independent variables)} \tag{3.79}$$

• The variance of a random variable is related to the spread of the possible
outcomes of the variable. It is given by

$$\mathrm{Var}(X) = E\big[(X-\mu)^2\big]. \tag{3.80}$$

It can also be written as $\mathrm{Var}(X) = E(X^2) - \mu^2$. The variance of the sum of
two independent variables equals the sum of the individual variances:

$$\mathrm{Var}(X+Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) \qquad \text{(independent variables)} \tag{3.81}$$

The variance of the number of Heads in n tosses of a biased coin (involving
probabilities p and $1-p = q$) is

$$\mathrm{Var}(\text{Heads in } n \text{ flips}) = npq \qquad \text{(biased coin)} \tag{3.82}$$

The variance of a set S of n numbers $x_i$ is

$$\tilde{s}^2 = \mathrm{Var}(S) = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2. \tag{3.83}$$


• The standard deviation of a random variable gives a rough measure of the
spread of the possible outcomes of the variable. It is defined as the square
root of the variance, so we can write it in two ways:

$$\sigma_X = \sqrt{\mathrm{Var}(X)} = \sqrt{E\big[(X-\mu)^2\big]} = \sqrt{E(X^2)-\mu^2}. \tag{3.84}$$

If X and Y are two independent random variables, then

$$\sigma_{X+Y}^2 = \sigma_X^2 + \sigma_Y^2 \qquad \text{(independent variables)} \tag{3.85}$$

This is the statement that standard deviations “add in quadrature” for independent
variables. The standard deviation of the number of Heads in n tosses
of a biased coin (involving probabilities p and $1-p = q$) is

$$\sigma = \sqrt{npq} \qquad \text{(biased coin)} \tag{3.86}$$


• The standard deviation of the mean is the standard deviation of the average
of a set of n random variables. If each of the random variables has the same
standard deviation $\sigma$, then the standard deviation of their average equals

$$\sigma_{\rm avg} = \frac{\sigma}{\sqrt{n}} \qquad \text{(standard deviation of the mean)} \tag{3.87}$$


• The sample variance $s^2$ of a set of n numbers $x_i$ chosen from a given distribution
is defined as

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2. \tag{3.88}$$

The sample variance has the property that its expected value equals the actual
variance $\sigma^2$ of the distribution.


3.7 Exercises 

See www.people.fas.harvard.edu/~djmorin/book.html for a supply of problems
without included solutions.


3.8 Problems


Section 3.1: Expectation value 


3.1. Flip until Heads *

In Example 2 on page 136, we found that if you flip a coin until you get a
Heads, the expectation value of the total number of flips is

$$\frac{1}{2}\cdot 1 + \frac{1}{4}\cdot 2 + \frac{1}{8}\cdot 3 + \frac{1}{16}\cdot 4 + \frac{1}{32}\cdot 5 + \cdots. \tag{3.89}$$

We claimed that this sum equals 2. Demonstrate this by writing the sum as a
geometric series starting with 1/2, plus another geometric series starting with
1/4, and so on. You can use the fact that the sum of a geometric series with
first term a and ratio r is $a/(1-r)$.


3.2. HT waiting time ** 

We know from Example 2 on page 136 that the expected number of flips 
required to obtain a Heads is 2. What is the expected number of flips required 
to obtain a Heads and a Tails in succession (in that order)? 


3.3. Sum of dependent variables ** 

Consider the example on page 137, but now let X and Y be dependent in 
the following manner: If Y = 1, then it is always the case that X = 1. If 
Y = 2, then it is always the case that X = 2. If Y = 3, then there are equal 
chances of X being 1 or 2. If we assume that Y takes on the values 1, 2, and 
3 with equal probabilities of 1/3, then you can quickly show that X takes on 
the values 1 and 2 with equal probabilities of 1/2. So we have reproduced 
the probabilities in the original example. Show (by explicitly calculating the 
probabilities of the various outcomes) that in the present scenario where X 
and Y are dependent, the relation E(X + Y) = E(X) + E(Y) still holds.

3.4. Playing “unfair” games ** 


(a) Assume that later on in life, things work out so that you have more than 
enough money in your retirement savings to take care of your needs 
and beyond, and that you truly don’t have a need for any more money. 
Someone offers you the chance to play a one-time game where you have 
a 3/4 chance of doubling your money, and a 1/4 chance of losing it all. 
If you initially have N dollars, what is the expectation value of your 
resulting amount of money if you play the game? Would you want to 
play it? 

(b) Assume that you are stranded somewhere, and that you have only $10 
for a $20 bus ticket. Someone offers you the chance to play a one-time 
game where you have a 1/4 chance of doubling your money, and a 3/4 
chance of losing it all. What is the expectation value of your resulting 
amount of money if you play the game? Would you want to play it? 
3.5. Simpson’s paradox **

During the baseball season in a particular year, player A has a higher batting 
average than player B. In the following year, A again has a higher average 
than B. But to your great surprise when you calculate the batting averages 
over the combined span of the two years, you find that A’s average is lower 
than B’s! Explain, by giving a concrete example, how this is possible. 

Section 3.2: Variance 

3.6. Variance of a product * 

Let X and Y each be the result of independent (and fair) coin flips where we 
assign the value 1 to Heads and 0 to Tails. Show that Var(XY) is not equal to
Var(X)Var(Y).

3.7. Variances * 

For each of the three examples near the beginning of Section 3.2, show that 
the alternative \(E(X^2) - \mu^2\) form of the variance given in Eq. (3.34) leads to
the same results we obtained in the examples. 

Section 3.3: Standard deviation 

3.8. Random walk ** 

Consider the following one-dimensional random walk. A person starts at the 
origin and then takes n successive steps. Each step is equally likely to be to 
the right or to the left. All steps have the same length. 

(a) What is the probability that the person is located back at the origin after 
the nth step? 

(b) After n steps, what is the standard deviation of the person’s position 
relative to the origin? (Assume that the length of each step is, say, one 
foot.) 

Section 3.4: Standard deviation of the mean 

3.9. Expected product, without replacement ** 

Consider a set of N given numbers, \(a_1, a_2, \ldots, a_N\). Let the mean of these N
numbers be \(\mu\), and let the standard deviation be \(\sigma\). Draw two numbers \(X_1\) and
\(X_2\) randomly without replacement. Show that the expectation value of their
product is

\[ E[X_1 X_2] = \mu^2 - \frac{\sigma^2}{N-1}. \tag{3.90} \]

Hint: All of the \(a_i a_j\) possibilities (with \(i \neq j\)) are equally likely.

3.10. Standard deviation of the mean, without replacement *** 

Consider a set of N given numbers, \(a_1, a_2, \ldots, a_N\). Let the mean of these
N numbers be \(\mu\), and let the standard deviation be \(\sigma\). Draw a sample of n
numbers \(X_i\) randomly without replacement, and calculate their sample mean,
\(\sum X_i/n\). (n must be less than or equal to N, of course.) The variance of the
sample mean is \(E[(\sum X_i/n - \mu)^2]\). Show that this variance is given by

\[ E\left[\left(\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right)^2\right] = \frac{\sigma^2}{n}\left(1 - \frac{n-1}{N-1}\right). \tag{3.91} \]

The standard deviation of the sample mean is the square root of this. The
result from Problem 3.9 will come in handy.


Section 3.5: Sample variance 

3.11. Biased sample standard deviation ** 

We mentioned on page 163 that the sample standard deviation s is a biased
estimator of the distribution standard deviation \(\sigma\). The basic reason for this is
that the square root operation is nonlinear, which means that the square root
of the average of a set of numbers isn’t equal to the average of their square
roots. For example, the average of 1.1 and 0.9 is 1, but the average of \(\sqrt{1.1}\)
and \(\sqrt{0.9}\) isn’t 1. It is smaller than 1. Let’s give a general proof that \(E[s] \le \sigma\)
(unlike \(E[s^2] = \sigma^2\)).

If we calculate the sample variances for a large number N of sets of n numbers,
then the \(E[s^2] = \sigma^2\) equality in Eq. (3.74) tells us that in the \(N \to \infty\)
limit, we have

\[ \frac{s_1^2 + s_2^2 + \cdots + s_N^2}{N} = \sigma^2. \tag{3.92} \]

Our goal is to show that

\[ \frac{s_1 + s_2 + \cdots + s_N}{N} \le \sigma, \tag{3.93} \]

in the \(N \to \infty\) limit. To demonstrate this, square both sides of Eq. (3.93)
and make copious use of the arithmetic-geometric-mean inequality, \(\sqrt{ab} \le (a + b)/2\).

3.12. Variance of the sample variance 

Consider the sample variance \(s^2\) (given in Eq. (3.73)) of a sample of n values,
\(X_1\) through \(X_n\), chosen from a distribution with standard deviation \(\sigma\) and
mean \(\mu\). We know from Eq. (3.74) that the expectation value of \(s^2\) is \(\sigma^2\), so
the variance of \(s^2\) (that is, the variance of the sample variance) is
\(\mathrm{Var}(s^2) = E[(s^2 - \sigma^2)^2]\). The square root of this variance gives a measure of the spread
of the results if you calculate \(s^2\) for many different sets of n numbers (as we
did in Fig. 3.4). Show that \(\mathrm{Var}(s^2)\) equals

\[ \mathrm{Var}(s^2) = \frac{1}{n}\left(\mu_4 - \sigma^4\,\frac{n-3}{n-1}\right), \tag{3.94} \]

where \(\mu_4\) is the distribution’s fourth moment relative to the mean, that is,
\(\mu_4 = E[(X - \mu)^4]\). The math here is extremely tedious, so you should attempt
this problem only if you really enjoyed the proof of Theorem 3.5. Whatever
adjective comes to mind for that proof, multiply it by 10 for this problem!
3.13. Sample variance for two dice rolls **


(a) We know from the first example in Section 3.2 that the variance of a
single die roll is \(\sigma^2 = 2.92\). If you use Eq. (3.73) to calculate the
sample variance \(s^2\) for n = 2 dice rolls, the expected value of \(s^2\) should
be \(\sigma^2 = 2.92\), according to Eq. (3.74). By considering the 36 equally
likely pairs of dice in Table 1.5, verify that this is indeed the case.

(b) Using the information you generated from Table 1.5, calculate \(\mathrm{Var}(s^2)\).
Then show that the result agrees with the expression for \(\mathrm{Var}(s^2)\) in
Eq. (3.94), with n = 2.


3.9 Solutions 


3.1. Flip until Heads 

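The given sum equals

\[
\begin{array}{ccccccccccl}
\dfrac{1}{2} & + & \dfrac{1}{4} & + & \dfrac{1}{8} & + & \dfrac{1}{16} & + & \dfrac{1}{32} & + & \cdots \\[4pt]
 & & \dfrac{1}{4} & + & \dfrac{1}{8} & + & \dfrac{1}{16} & + & \dfrac{1}{32} & + & \cdots \\[4pt]
 & & & & \dfrac{1}{8} & + & \dfrac{1}{16} & + & \dfrac{1}{32} & + & \cdots \\[4pt]
 & & & & & & \dfrac{1}{16} & + & \dfrac{1}{32} & + & \cdots \\[4pt]
 & & & & & & & & \vdots & &
\end{array}
\tag{3.95}
\]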


This has the correct number of each type of term. For example, a “1/16” appears four
times. The first line is a geometric series that sums to a/(1 - r) = (1/2)/(1 - 1/2) = 1.
The second line is also a geometric series, and it sums to (1/4)/(1 - 1/2) = 1/2.
Likewise the third line sums to (1/8)/(1 - 1/2) = 1/4. And so on. The sum of the
infinite number of lines in Eq. (3.95) therefore equals

\[ 1 + \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \frac{1}{16} + \frac{1}{32} + \cdots. \tag{3.96} \]

But this itself is a geometric series, and it sums to a/(1 - r) = 1/(1 - 1/2) = 2, as
desired.
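
As a numerical sanity check on the claim that the sum in Eq. (3.89) equals 2, the following short Python sketch adds up the first several terms exactly. It is only an illustration; the function name partial_sum is our own choice and does not appear elsewhere in the book.

```python
from fractions import Fraction


def partial_sum(terms: int) -> Fraction:
    """Exact sum of k / 2**k for k = 1, 2, ..., terms."""
    return sum(Fraction(k, 2**k) for k in range(1, terms + 1))


# The partial sums creep up toward 2 as more terms are included.
for terms in (5, 10, 20, 40):
    print(terms, float(partial_sum(terms)))
```

The partial sums approach 2 from below, and the shortfall after k terms shrinks roughly like k/2^k, so a few dozen terms already agree with 2 to many decimal places.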


3.2. HT waiting time 

Our goal is to find the average number of flips to obtain an HT pair (including these
two flips). We know that the average number of flips to obtain an H is 2. The important
point to now realize is that once we obtain our first H, the game ends when we
eventually obtain a T. This is true because if we obtain a T on the following flip, then
we have obtained our HT, so we're done. If, on the other hand, we obtain a T, say,
four flips later (that is, if we obtain three more H’s and then a T), then our string looks
like ... HHHHT, so we have obtained our HT pair. Basically, in any scenario, once
we’ve obtained our first H, the first subsequent appearance of a T, whenever that may
be, must necessarily follow an H, which means that we have obtained our HT pair.
We can therefore answer the original HT question if we can answer the question: How
many flips on average does it take to obtain a T, following an H? Now, since H and T
are interchangeable, this is exactly the same question as: How many flips on average
does it take to obtain an H, starting at the beginning? (This is true because future flips
can't depend on past flips. So we can imagine starting the process whenever we want.
Starting after the first H is as valid a place to start as the actual beginning.) We already
know that the answer to this question is 2. The average number of flips to obtain an
HT string is therefore 2 + 2 = 4. It takes an average of two flips to obtain an H, and then
an average of two more flips to obtain a T, at which point we necessarily have our HT
sequence, as we noted above.
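
If you would like to see the answer of 4 emerge from data, a simulation along the following lines should do the job. This is a rough sketch rather than part of the solution; the function names, the number of trials, and the random seed are arbitrary choices of ours.

```python
import random


def flips_until_ht(rng: random.Random) -> int:
    """Flip a fair coin until an H is immediately followed by a T.

    Returns the total number of flips, including the final H and T.
    """
    flips = 0
    previous = None
    while True:
        current = rng.choice("HT")
        flips += 1
        if previous == "H" and current == "T":
            return flips
        previous = current


def average_ht_wait(trials: int, seed: int = 0) -> float:
    """Average waiting time for HT over the given number of simulated games."""
    rng = random.Random(seed)
    return sum(flips_until_ht(rng) for _ in range(trials)) / trials


print(average_ht_wait(100_000))
```

With 100,000 games the average should land within a few hundredths of 4. The exact digits depend on the seed.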

3.3. Sum of dependent variables 

In the (X, Y) notation, the given information tells us that there is a 1/3 chance of
obtaining (1,1), a 1/3 chance of obtaining (2,2), a 1/6 chance of obtaining (1,3),
and a 1/6 chance of obtaining (2,3). Both (2,2) and (1,3) yield a sum of 4, so the
probabilities of the various values of X + Y are

\[ P(2) = \frac{1}{3}, \quad P(3) = 0, \quad P(4) = \frac{1}{3} + \frac{1}{6} = \frac{1}{2}, \quad P(5) = \frac{1}{6}. \tag{3.97} \]

Eq. (3.4) then gives the expectation value of X + Y as

\[ E(X + Y) = \frac{1}{3}\cdot 2 + 0\cdot 3 + \frac{1}{2}\cdot 4 + \frac{1}{6}\cdot 5 = \frac{21}{6} = 3.5. \tag{3.98} \]

This equals E(X) + E(Y) = 1.5 + 2 = 3.5, as Eq. (3.7) claims.
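
The bookkeeping above can also be handed to a computer. The sketch below stores the four (X, Y) outcomes with their probabilities and computes the three expectation values exactly. The dictionary joint and the helper expectation are names we made up for this illustration.

```python
from fractions import Fraction

# Joint probabilities P(X = x, Y = y), read off from the problem statement.
joint = {
    (1, 1): Fraction(1, 3),
    (2, 2): Fraction(1, 3),
    (1, 3): Fraction(1, 6),
    (2, 3): Fraction(1, 6),
}


def expectation(f):
    """Expectation value of f(X, Y) under the joint distribution above."""
    return sum(p * f(x, y) for (x, y), p in joint.items())


e_x = expectation(lambda x, y: x)
e_y = expectation(lambda x, y: y)
e_sum = expectation(lambda x, y: x + y)

print(e_x, e_y, e_sum)
```

Note that nothing in the calculation of e_sum requires X and Y to be independent, which is the point of the problem.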

3.4. Playing “unfair” games 

(a) The expectation value of your money after you play the game is (3/4) • 2N +
(1/4) • 0 = 3N/2. So you will gain N/2 dollars, on average. It therefore seems
like it would be a good idea to play the game. However, further thought shows 
that it would actually be a bad idea. There is basically no upside; you already 
have plenty of money, so twice the money won't help much. But there is a huge 
downside; you might lose all your money, and that would certainly be a bad 
thing. 

The point here is that the important issue is your happiness, not the exact amount 
of money you have. On the happiness scale (from 0 to 1), you stand to gain 
nothing (or perhaps a tiny bit). Your happiness starts pretty much at 1, and even 
if you win the game, you can't climb any higher than 1. But you stand to lose 
a huge amount. This isn't to say that you can’t be happy without money. But 
if you lose your entire savings, there's no doubt that it would put a damper on 
things. Let's assume that if you lose the game, your happiness decreases roughly 
to 0. Then if you play the game, the expectation value of your happiness is 
essentially (3/4) • 1 + (1/4) • 0 = 3/4. This is less than the starting value of 1, 
so it suggests that you shouldn't play the game. However, there is still another 
thing to consider; see the remark below. 

(b) The expectation value of your money after you play the game is (3/4) • 0 + 
(1/4) • 20 = 5. So you will lose $5, on average. It therefore seems like it 
would be a bad idea to play the game. However, the $10 in your pocket is just 
as useless as $0, because either way, you’re guaranteed to be stuck at the bus 
station. You therefore should play the game. That way, at least there’s a 1/4
chance that you'll make it home. (We’ll assume that the overall money you
have back home washes out any effect of gaining or losing $10, in the long run.)
The same argument we used above with the happiness level holds here. $0 and
$10 yield the same level of happiness (or perhaps we should say misery), so
there is basically no downside. But there is definitely an upside with the $20,
because you can then buy a ticket. The expectation value of your happiness (on
a scale from 0 to 1) is essentially (3/4) • 0 + (1/4) • 1 = 1/4. This is greater than
the starting value of 0, so it suggests that you should play the game. But see the
following remark.

Remark: There is another consideration with these sorts of situations, in that 
they are one-time events. Even if we rig things so that the expectation value 
of your happiness level (or whatever measure you deem to be the important 
one) increases, it’s still not obvious that you should play the game. Just as with 
any other probabilistic quantity, the expectation value has meaning only in the 
context of a large number of identical trials. You could imagine a situation 
where a group of many people play a particular game and the average happiness 
level increases. But you are only one person, and the increase in the overall 
happiness level of the group is of little comfort to you if you lose your shirt. 
Since you play the game only once, the expectation value is irrelevant to you. 
The decision mainly comes down to an assessment of the risk. Different people’s 
reactions to risk are different, and you could imagine someone being very risk- 
averse and never playing a game with a significant downside, no matter what 
the upside is. * 

3.5. Simpson’s paradox 

The two tables in Table 3.1 show an extreme scenario that gets to the heart of the 
matter. In the first year, player A has a small number of at-bats (6), while player B 
has a large number (600). In the second year, these numbers are reversed. You should 
examine these tables for a minute to see what’s going on, before reading the next 
paragraph. 



              First year          Second year
Player A      3/6 (.500)          150/600 (.250)
Player B      200/600 (.333)      1/6 (.167)

              Combined years
Player A      153/606 (.252)
Player B      201/606 (.332)


Table 3.1: Yearly and overall batting averages. The years with the large numbers of 
at-bats dominate the overall averages. 

The main point to realize is that in the combined span of the two years, A’s average is
dominated by the .250 average coming from the large number of at-bats in the second 
year (yielding an overall average of .252, very close to .250), whereas B’s average is 
dominated by the .333 average coming from the large number of at-bats in the first 
year (yielding an overall average of .332, very close to .333). B’s .333 is lower than 
A’s .500 in the first year, but that is irrelevant because A’s very small number of at- 
bats that year hardly affects his overall average. Similarly, B's .167 is lower than A’s 
.250 in the second year, but again, that is irrelevant because B’s very small number of 
at-bats that year hardly affects his overall average. What matters is that B’s .333 in the 
first year is higher than A’s .250 in the second year. The large numbers of associated
at-bats dominate the overall averages. 
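
The numbers in Table 3.1 are easy to check by machine. The following sketch recomputes the yearly and combined averages as exact fractions. The variable and function names are our own, and the (hits, at-bats) pairs are simply copied from the table.

```python
from fractions import Fraction

# (hits, at-bats) for the first and second years, copied from Table 3.1.
player_a = [(3, 6), (150, 600)]
player_b = [(200, 600), (1, 6)]


def yearly_averages(seasons):
    """Batting average in each individual year."""
    return [Fraction(hits, at_bats) for hits, at_bats in seasons]


def combined_average(seasons):
    """Batting average over all years: total hits divided by total at-bats."""
    total_hits = sum(hits for hits, _ in seasons)
    total_at_bats = sum(at_bats for _, at_bats in seasons)
    return Fraction(total_hits, total_at_bats)


a_wins_each_year = all(
    a > b for a, b in zip(yearly_averages(player_a), yearly_averages(player_b))
)
b_wins_overall = combined_average(player_b) > combined_average(player_a)

print(a_wins_each_year, b_wins_overall)
```

Both flags come out true, which is the paradox in a nutshell. The combined average is a weighted average of the yearly ones, with the at-bats as weights, and not the plain average of the two yearly averages.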

Fig. 3.5 shows a visual representation of the effect of the number of at-bats. The size 
of a data point in the figure gives a measure of the number of at-bats. So although B’s 
average is lower than A’s in each year, the large B data point in the first year is higher 
than the large A data point in the second year. These data points are what dominate 
the overall averages. 


[Plot not reproduced; vertical axis: avg]

Figure 3.5: Visual representation of Simpson’s paradox. The large data points dominate
the overall averages.


Remarks: 

1. To generate the paradox where B’s overall average surprisingly ends up being 
higher than A’s overall average, the higher of B’s two yearly averages must be 
higher than the lower of A’s two yearly averages. If this weren't the case (that 
is, if the large B data point in Fig. 3.5 were lower than the large A data point), 
then A’s overall average would necessarily be higher than B’s overall average 
(as you can verify). So the paradox wouldn’t be realized. 

2. To generate the paradox, we must also have a disparity in the number of at-bats. 
If all four of the yearly at-bats in the first of the tables in Table 3.1 were the same, 
then A’s overall average would necessarily be higher than B’s overall average 
(as you can verify). The main point of the paradox is that when calculating the 
overall average for a given player, we can’t just take the averages of the two 
averages. A year with more at-bats influences the average more than a year with 
fewer at-bats, as we saw above. 

The paradox can certainly be explained with at-bats that don’t have values as 
extreme as 6 and 600, but we chose these in order to make the effect as clear as 
possible. Also, we chose the total number of at-bats in the above example to be 
the same for A and B over the two years, but this of course isn't necessary. 

3. The paradox can also be phrased in terms of averages on exams, for example: 
For 10th graders taking a particular test, boys have a higher average than girls. 
For 11th graders taking the same test, boys again have a higher average than 
girls. But for the 10th and 11th graders combined, girls have a higher average 
than boys. Another real-life example deals with college admissions rates. The 
paradox can arise when looking at male/female acceptance rates to individual 
departments, and then looking at the male/female acceptance rates to the college
as a whole. (The departments are analogous to the different baseball years.) 

4. One shouldn’t get carried away with Simpson’s paradox. There are plenty of
scenarios where it doesn’t apply, for example: In a particular school, the percentage
of soccer players in the 10th grade is larger than the percentage of musicians.
And the percentage of soccer players in the 11th grade is again larger
than the percentage of musicians. Can the overall percentage of soccer players
(in the combined grades) be smaller than the overall percentage of musicians?
The answer to this question is a definite “No.” One way to see why is to consider 
the numbers of soccer players and musicians, instead of the percentages. Since 
there are more soccer players than musicians in each grade, the total number 
(and hence percentage) of soccer players must be larger than the total number 
(and hence percentage) of musicians. 

Another way to understand the “No” answer is to note that when calculating the 
percentages of soccer players and musicians in a given grade, we’re dividing the 
number of students in each group by the same denominator (namely, the total 
number of students in the grade). We therefore can’t take advantage of the effect 
in the baseball scenario above, where B’s average was dominated by one year 
while A’s was dominated by a different year, due to the different numbers of at- 
bats in a given year. Instead of the data points in Fig. 3.5, the present setup might 
yield something like the data points in Fig. 3.6. The critical feature here is that 
the dots in each year have the same size. The dots for the 11th grade happen to 
be larger because we’re arbitrarily assuming that there are more students in that 
grade. The total percentage of soccer players in the two years is the weighted 
average of the two soccer dots (weighted by the size of the dots, or equivalently 
by the number of students in each grade). Likewise for the two music dots. The 
soccer weighted average is necessarily larger than the music weighted average. 
(This is fairly clear intuitively, but as an exercise you can prove it rigorously if 
you have your doubts.) * 


[Plot not reproduced; vertical axis: percentage (10% to 50%); horizontal axis: grade (10th, 11th); data points: soccer and music for each grade]

Figure 3.6: Simpson’s paradox doesn’t apply in this case. The overall percentage
of soccer players is necessarily larger than the overall percentage of
musicians.
3.6. Variance of a product

The random variable XY takes on the values 1 and 0 with probabilities 1/4 and
3/4, respectively, because only HH yields an XY value of 1. The other three outcomes (HT, TH,
TT) all yield 0. We therefore effectively have a single biased coin with probability
p = 1/4 of obtaining a value of 1, and q = 3/4 of obtaining a value of 0. Eq. (3.33)
then tells us that the variance of XY is npq = 1 • (1/4) • (3/4) = 3/16. And we
know from Eq. (3.21) (or Eq. (3.33)) that the variance of each single (fair) coin flip is
Var(X) = Var(Y) = 1/4. So Var(X)Var(Y) = (1/4)(1/4) = 1/16, which is not equal
to Var(XY) = 3/16.
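
Since there are only four equally likely outcomes, the three variances can also be computed by brute force. The sketch below does this directly from the definition of the variance. The helper name variance and the representation of the outcomes as (x, y) pairs are our own choices.

```python
from fractions import Fraction
from itertools import product

# The four equally likely (X, Y) outcomes, with Heads = 1 and Tails = 0.
outcomes = list(product((0, 1), repeat=2))
prob = Fraction(1, len(outcomes))


def variance(f):
    """Variance of f(X, Y), computed as the expectation of (f - mean)^2."""
    mean = sum(prob * f(x, y) for x, y in outcomes)
    return sum(prob * (f(x, y) - mean) ** 2 for x, y in outcomes)


var_x = variance(lambda x, y: x)
var_y = variance(lambda x, y: y)
var_xy = variance(lambda x, y: x * y)

print(var_x * var_y, var_xy)
```

The two printed values are 1/16 and 3/16, in agreement with the result above.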

3.7. Variances 

For a die roll, we have

\[ E(X^2) = \frac{1}{6}\left(1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2\right) = \frac{91}{6} = 15.17. \tag{3.99} \]

And \(\mu = 3.5\), so the variance is \(E(X^2) - \mu^2 = 15.17 - 3.5^2 = 2.92\), as desired.

For a fair coin flip, we have

\[ E(X^2) = \frac{1}{2}\left(1^2 + 0^2\right) = \frac{1}{2}. \tag{3.100} \]

And \(\mu = 1/2\), so the variance is \(E(X^2) - \mu^2 = 1/2 - (1/2)^2 = 1/4\), as desired.

For a biased coin flip, we have

\[ E(X^2) = p\cdot 1^2 + (1-p)\cdot 0^2 = p. \tag{3.101} \]

And \(\mu = p\), so the variance is \(E(X^2) - \mu^2 = p - p^2 = p(1 - p) = pq\), as desired.

3.8. Random walk 


(a) If the person ends up back at the origin after n steps, then it must be the case 
that n/2 of the steps were to the right and n/2 were to the left. (Note that this 
immediately tells us that n must be even, if there is to be any chance of ending up 
at the origin.) You can imagine the person flipping a coin, with Heads meaning 
a step to the right and Tails meaning a step to the left. So the given problem is 
equivalent to finding the probability that you obtain equal numbers n/2 of Heads 
and Tails in n coin flips. There is a total of \(2^n\) possible outcomes (all equally
likely) for the collection of n flips, and \(\binom{n}{n/2}\) of these have exactly n/2 each of
Heads and Tails. So the probability of obtaining exactly n/2 Heads is

\[ P = \frac{1}{2^n}\binom{n}{n/2} = \frac{1}{2^n}\,\frac{n!}{\left((n/2)!\right)^2}. \tag{3.102} \]

If n is even, this is the desired probability of ending up back at the origin. If n
is odd, the probability is zero.

If n is large, then Eq. (2.66) gives an approximation to Eq. (3.102). The n in
Eq. (2.66) corresponds to n/2 here, so the above result becomes \(1/\sqrt{\pi(n/2)} = \sqrt{2/(\pi n)}\).
For example, after n = 100 steps, the probability of being at the origin
is about 8%.
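
To see how good the large-n approximation is, we can compare it with the exact expression in Eq. (3.102). The sketch below is one way to do this; the two function names are our own, and math.comb is Python's built-in binomial coefficient.

```python
import math


def prob_at_origin(n: int) -> float:
    """Exact probability of being back at the origin after n steps, Eq. (3.102)."""
    if n % 2 == 1:
        return 0.0  # An odd number of steps can never return to the origin.
    return math.comb(n, n // 2) / 2**n


def prob_at_origin_approx(n: int) -> float:
    """Large-n approximation sqrt(2 / (pi n)), intended for even n."""
    return math.sqrt(2 / (math.pi * n))


for n in (10, 100, 1000):
    print(n, prob_at_origin(n), prob_at_origin_approx(n))
```

For n = 100 both expressions give a probability of about 0.08, consistent with the 8% quoted above, and the agreement improves as n grows.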

(b) First solution: Consider a single step, with the two possible outcomes of +1
and -1 (in feet). The mean displacement during the step is \(\mu = 0\), so the first (or
second) expression in Eq. (3.40) gives the standard deviation of a single step as

\[ \sigma_1 = \sqrt{E[(X-\mu)^2]} = \sqrt{(1/2)\cdot(1)^2 + (1/2)\cdot(-1)^2} = 1. \tag{3.103} \]

This makes sense, because the square of the length of a single step is guaranteed
to be 1. The standard deviation of n independent steps (involving identical 50-50
processes) is then given by Eq. (3.45) as

\[ \sigma_n = \sqrt{n}\,\sigma_1 = \sqrt{n}. \tag{3.104} \]

Remark: Since our random walk is basically the same as a series of coin flips
(with Heads and Tails corresponding to 1 and -1 instead of the usual 1 and 0
that we have used in the past), the probability distribution for where the person
is after n steps has the same basic shape as the binomial distribution in, say, the
fourth plot in Fig. 3.1. In particular, we can make the same types of statements
we made right before the example in Section 3.4. For example, assuming that
the number of steps is large, there is a 99.7% chance that after n steps the person
is within \(3\sigma_n = 3\sqrt{n}\) of the origin. So for n = 10,000 steps, the person is 99.7%
likely to be within \(3\sqrt{n} = 300\) steps of the origin. *

Second solution: We can solve the problem from scratch, without invoking
Eq. (3.45). This method allows us to see intuitively where the \(\sigma_n = \sqrt{n}\) result
comes from. Let the n steps be represented by the independent and identically
distributed random variables \(X_i\). Each \(X_i\) can take on the value of +1 or -1,
with equal probabilities. Let Z be the sum of the \(X_i\). So Z is the position after
the n steps. We then have (see below for an explanation of these equations)

\[
\begin{aligned}
& Z = X_1 + X_2 + X_3 + \cdots + X_n \\
\Longrightarrow\quad & Z^2 = (X_1 + X_2 + X_3 + \cdots + X_n)^2 \\
\Longrightarrow\quad & Z^2 = (X_1^2 + X_2^2 + \cdots + X_n^2) + (\text{cross terms, like } 2X_1X_2) \\
\Longrightarrow\quad & Z^2 = (1 + 1 + \cdots + 1) + (\text{cross terms}) \\
\Longrightarrow\quad & E[Z^2] = n + E[\text{cross terms}] \\
\Longrightarrow\quad & \sigma_n^2 = n + 0 \\
\Longrightarrow\quad & \sigma_n = \sqrt{n}.
\end{aligned}
\tag{3.105}
\]

The second line is the square of the first. In the third line we expanded the
square to obtain n “diagonal” terms \(X_i^2\), along with the \(n(n-1)/2\) cross terms \(2X_iX_j\).
In the fourth line we used the fact that since \(X_i = \pm 1\), its square is always 1.
The fifth line is the expectation value of the fourth line. To obtain the sixth
line, we used the fact that since the mean value of Z is \(\mu = 0\), Eq. (3.50) gives
\(E[Z^2] = \sigma_n^2\). And we also used the fact that the expectation value of the
product \(X_iX_j\) (with \(i \neq j\)) equals zero. This is true because \(X_i\) and \(X_j\) are
independent variables, so Eq. (3.16) tells us that \(E[X_iX_j] = E[X_i]E[X_j]\). And
these individual expectation values are zero. The standard deviation is then
\(\sigma_n = \sqrt{n}\), as the seventh line states.

Whether we find \(\sigma_n\) in this manner or in the manner of the first solution above
which used Eq. (3.45) (which can be traced back to Eq. (3.27) in Theorem 3.3),
everything boils down to the fact that the cross terms in the squared expression
have zero expectation value. So we are left with only the diagonal terms.
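
A third, more pedestrian check is to compute the variance of the position directly from the binomial probabilities, without any reference to cross terms. The sketch below does this exactly for a few values of n. The function name walk_variance is our own, and we assume (as in the problem) steps of +1 and -1 with equal probabilities.

```python
import math
from fractions import Fraction


def walk_variance(n: int) -> Fraction:
    """Exact variance of the position after n steps of +1 or -1."""
    total = Fraction(0)
    for heads in range(n + 1):
        # heads steps to the right and (n - heads) steps to the left.
        position = 2 * heads - n
        probability = Fraction(math.comb(n, heads), 2**n)
        # The mean position is 0, so the variance is just E[position^2].
        total += probability * position**2
    return total


for n in (1, 4, 25, 100):
    print(n, walk_variance(n), math.sqrt(walk_variance(n)))
```

In each case the variance comes out to be exactly n, so the standard deviation is the square root of n, as in Eqs. (3.104) and (3.105).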

3.9. Expected product, without replacement 

When drawing two numbers without replacing the first one, all of the \(\binom{N}{2} = N(N-1)/2\)
possibilities for the product \(a_i a_j\) are equally likely to be the value that \(X_1 X_2\)
takes. The expectation value \(E[X_1 X_2]\) is therefore simply the average of all the different
\(a_i a_j\) values. That is,³

\[ E[X_1 X_2] = \frac{\sum_{i<j} a_i a_j}{N(N-1)/2}. \tag{3.106} \]

Now, if we square the sum \(a_1 + a_2 + \cdots + a_N\) and then subtract off the “diagonal” \(a_i^2\)
terms, we will be left with only the cross terms \(2a_ia_j\). So we can rewrite the above
numerator to obtain

\[ E[X_1 X_2] = \frac{\left[\left(\sum a_i\right)^2 - \sum a_i^2\right]/2}{N(N-1)/2}, \tag{3.107} \]

where the sums run from 1 to N. By the definition of the mean \(\mu\), we have
\(\mu = \sum a_i/N \Longrightarrow \sum a_i = N\mu\). And by the definition of \(E[X^2]\), we have
\(E[X^2] = \sum a_i^2/N \Longrightarrow \sum a_i^2 = N\cdot E[X^2]\). (We are using X to denote a random draw from
the complete set.) But \(E[X^2] = \mu^2 + \sigma^2\) from Eq. (3.50), which is true for an arbitrary
distribution, in particular the present one involving N equally likely outcomes.
So \(\sum a_i^2 = N(\mu^2 + \sigma^2)\), and Eq. (3.107) becomes

\[ E[X_1 X_2] = \frac{(N\mu)^2 - N(\mu^2 + \sigma^2)}{N(N-1)} = \frac{N\mu^2 - \mu^2 - \sigma^2}{N-1} = \mu^2 - \frac{\sigma^2}{N-1}, \tag{3.108} \]

as desired.
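
It can be reassuring to test Eq. (3.108) on a concrete set of numbers. The sketch below averages the product over all ordered pairs of distinct entries and compares the result with the formula. The list of numbers is an arbitrary choice of ours, and any other list with at least two entries should work equally well.

```python
from fractions import Fraction
from itertools import permutations

numbers = [2, 3, 5, 7, 11]  # Arbitrary illustrative values.
N = len(numbers)

# Mean and variance of the complete set (dividing by N, not N - 1).
mu = Fraction(sum(numbers), N)
sigma_sq = sum((a - mu) ** 2 for a in numbers) / N

# Average of the product over all ordered draws of two different entries.
pairs = list(permutations(numbers, 2))
expected_product = Fraction(sum(a * b for a, b in pairs), len(pairs))

print(expected_product, mu**2 - sigma_sq / (N - 1))
```

The two printed values agree exactly. Using ordered pairs here doubles both the numerator and the denominator relative to Eq. (3.106), so the average is unchanged.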


Remarks: 

1. This result for \(E[X_1X_2]\) is smaller than the \(E[X_1X_2] = E[X_1]E[X_2] = \mu^2\)
result in the case where the \(X_i\) are independent, as they would be if we drew
the numbers with replacement. This makes sense for the following reason. The
expectation value of the product of two independently drawn numbers (which
could be the same or different) is \(E[X]\cdot E[X] = \mu^2\), whereas Eq. (3.50) tells us
that the expectation value of the product of two identical numbers is
\(E[X^2] = \mu^2 + \sigma^2\), which is larger than \(\mu^2\). Therefore, if we remove these identical
cases, then the expectation value of the product of two different numbers must
be smaller than \(\mu^2\), so that the expectation value of all of the products is \(\mu^2\).
This reasoning is basically just Eq. (3.108) described in words.

2. A quick corollary to Eq. (3.108) is the following. Consider a set of N given
numbers, \(a_1, a_2, \ldots, a_N\). Draw n numbers randomly without replacement.
(So we must have \(n \le N\), of course.) Let \(X_i\) (with \(1 \le i \le n\)) be the random
variables for these n draws. Then the expectation value of the product of any two
of the \(X_i\) (that is, not just the first two, \(X_1\) and \(X_2\)) is the same as in Eq. (3.108):

\[ E[X_iX_j] = \mu^2 - \frac{\sigma^2}{N-1} \quad (i \neq j). \tag{3.109} \]

This is true because the temporal ordering of the first draw through the nth draw
is irrelevant. We will end up with the same setup if we imagine labeling n boxes
1 through n and then throwing (simultaneously, or in whatever temporal order
we wish) n of the N given numbers into the n boxes, with one number in each.
All of the \({}_N P_n\) ordered subgroups of n numbers are equally likely, so all of the
expectation values \(E[X_iX_j]\) (with \(i \neq j\)) are the same. They therefore all have
the common value of, say, \(E[X_1X_2]\). And this value is given in Eq. (3.108). *

³Instead of writing i < j here, we can alternatively write \(i \neq j\), as long as we divide by N(N - 1)
instead of N(N - 1)/2. The denominator is modified because the sum now includes, for example, both
\(a_3 a_5\) and \(a_5 a_3\).
3.10. Standard deviation of the mean, without replacement

Note that we are calculating the variance here with respect to the known mean \(\mu\) of
all N numbers, as opposed to the sample mean \(\bar{x}\) of the n numbers we draw (as we
did for the sample variance in Section 3.5). The latter would make the lefthand side
of Eq. (3.91) identically zero, of course.

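If we expand the square on the lefthand side of Eq. (3.91) and also expand the square
of \(\sum X_i\), we obtain

\[
\begin{aligned}
E\left[\left(\frac{1}{n}\sum X_i - \mu\right)^2\right]
&= E\left[\frac{\sum X_i^2 + 2\sum_{i<j} X_iX_j}{n^2} - 2\mu\,\frac{\sum X_i}{n} + \mu^2\right] \\
&= \frac{\sum E[X_i^2] + 2\sum_{i<j} E[X_iX_j]}{n^2} - 2\mu\,\frac{\sum E[X_i]}{n} + \mu^2,
\end{aligned}
\tag{3.110}
\]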


where the sums run from 1 to n. All of the \(E[X_iX_j]\) terms are equal, with their
common value being \(\mu^2 - \sigma^2/(N-1)\) from the second remark in Problem 3.9; there
are n(n - 1)/2 of these terms. Likewise, all of the \(E[X_i^2]\) terms are equal, with their
common value being \(\mu^2 + \sigma^2\) from Eq. (3.50); there are n of these terms. (They are
indeed all equal, by the same type of reasoning as in the second remark in Problem 3.9.
The temporal ordering is irrelevant.) And \(E[X_i] = \mu\), of course; there are n of these
terms. (Again, the temporal ordering is irrelevant.) So we have

\[
\begin{aligned}
E\left[\left(\frac{1}{n}\sum X_i - \mu\right)^2\right]
&= \frac{n(\mu^2+\sigma^2)}{n^2} + \frac{2}{n^2}\cdot\frac{n(n-1)}{2}\left(\mu^2 - \frac{\sigma^2}{N-1}\right) - 2\mu\,\frac{n\mu}{n} + \mu^2 \\
&= \sigma^2\left(\frac{1}{n} - \frac{1}{n}\cdot\frac{n-1}{N-1}\right) + \mu^2\left(\frac{1}{n} + \frac{n-1}{n} - 2 + 1\right) \\
&= \frac{\sigma^2}{n}\left(1 - \frac{n-1}{N-1}\right),
\end{aligned}
\tag{3.111}
\]


as desired. The \(\mu\)’s all cancel here, leaving us with only \(\sigma\)’s. It makes sense that the
variance shouldn’t depend on \(\mu\), because if we increase all of the N given numbers
by a particular value b, then b will cancel out in the difference \((\sum X_i)/n - \mu\) on the
lefthand side of Eq. (3.111), because both \((\sum X_i)/n\) and \(\mu\) increase by b.
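
As with Problem 3.9, Eq. (3.111) can be checked by brute force on a small set of numbers. The sketch below enumerates every possible sample of size n (drawn without replacement), computes the variance of the sample mean relative to the mean of the full set, and compares it with the formula. The list of numbers and the function names are arbitrary choices of ours.

```python
from fractions import Fraction
from itertools import combinations

numbers = [2, 3, 5, 7, 11]  # Arbitrary illustrative values.
N = len(numbers)

mu = Fraction(sum(numbers), N)
sigma_sq = sum((a - mu) ** 2 for a in numbers) / N


def variance_of_sample_mean(n: int) -> Fraction:
    """Average of (sample mean - mu)^2 over all equally likely samples of size n."""
    samples = list(combinations(numbers, n))
    return sum((Fraction(sum(s), n) - mu) ** 2 for s in samples) / len(samples)


def formula(n: int) -> Fraction:
    """Righthand side of Eq. (3.111)."""
    return sigma_sq / n * (1 - Fraction(n - 1, N - 1))


for n in range(1, N + 1):
    print(n, variance_of_sample_mean(n), formula(n))
```

We use unordered samples here because the sample mean doesn't depend on the order of the draws, and every unordered sample is equally likely. The two columns agree for every n, starting at the full variance for n = 1 and dropping to zero for n = N.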


Remarks:

1. We can check some limiting cases of Eq. (3.111). If n = 1, then the variance
reduces to \(\sigma^2\). This is correct, because if n = 1 then we’re drawing only one
number X. The sample mean of one number is simply itself, so the variance on
the lefthand side of Eq. (3.111) is \(E[(X - \mu)^2]\), which is \(\sigma^2\) by definition. In
this n = 1 case, the “without replacement” qualifier is irrelevant. We’re drawing
only one number, so it doesn't matter if we replace it or not.

If n = N, then the variance in Eq. (3.111) reduces to 0. This is correct, because
we're drawing without replacement, so at the end of the n = N drawings, we
must have chosen all of the N given numbers exactly once. The sample mean
\((\sum X_i)/N\) of the n = N numbers is therefore guaranteed to be the mean \(\mu\) of the
entire set, so the variance of the sample mean is zero.

2. If we instead draw n numbers with replacement, then all of the n draws are
identical processes. At all stages (from the first draw to the nth draw) each of
the N numbers in the complete set has a definite probability of being drawn; it
doesn’t matter what has already happened. (There is an equal probability of 1/N
of picking any number on a given draw, but this equality isn’t important here.)
We therefore simply have a distribution consisting of N possible outcomes, in
which case the result in Eq. (3.53) for the standard deviation of the mean is
applicable. (Eq. (3.53) holds for independent and identical trials.) The variance
of the sample mean is therefore \(\sigma^2/n\).

Returning to the without-replacement case, we see that (except for n = 1) the
variance in Eq. (3.111) is smaller than the with-replacement variance \(\sigma^2/n\), due
to the nonzero (n-1)/(N-1) term that is subtracted off in Eq. (3.111). It makes
intuitive sense that the without-replacement variance is smaller than the with-replacement
variance, because the drawings are more constrained if there is no
replacement; there are fewer possibilities for future draws. There is therefore
less variance in the sample mean the larger n is, to the point where there is zero
variance if n = N. *


3.11. Biased sample standard deviation 

Let's label the lefthand side of Eq. (3.93) as K (in the \(N \to \infty\) limit). If we square
both sides of that equation, the numerator of \(K^2\) contains N terms of the form \(s_i^2\),
along with \(\binom{N}{2} = N(N-1)/2\) cross terms of the form \(2s_is_j\). That is,

\[ K^2 = \frac{(s_1^2 + s_2^2 + \cdots + s_N^2) + (2s_1s_2 + 2s_1s_3 + \cdots + 2s_{N-1}s_N)}{N^2}. \tag{3.112} \]

If we let \(a = s_i^2\) and \(b = s_j^2\) in the \(\sqrt{ab} \le (a + b)/2\) arithmetic-geometric-mean
inequality, we obtain

\[ 2s_is_j \le s_i^2 + s_j^2. \tag{3.113} \]

Therefore, in the above expression for \(K^2\), if we replace each of the \(\binom{N}{2}\) cross terms
\(2s_is_j\) with \(s_i^2 + s_j^2\), we obtain a result that is larger than (or equal to) \(K^2\). In this
modified expression for \(K^2\), a particular \(s_i^2\) term such as \(s_1^2\) appears N - 1 times (once
with each of the other \(s_j^2\) terms). Hence,

\[
\begin{aligned}
K^2 &\le \frac{(s_1^2 + s_2^2 + \cdots + s_N^2) + (N-1)(s_1^2 + s_2^2 + \cdots + s_N^2)}{N^2} \\
&= \frac{N(s_1^2 + s_2^2 + \cdots + s_N^2)}{N^2} \\
&= \frac{s_1^2 + s_2^2 + \cdots + s_N^2}{N} \\
&= \sigma^2,
\end{aligned}
\tag{3.114}
\]

in the \(N \to \infty\) limit, where the last equality is Eq. (3.92). Therefore, \(K \le \sigma\), and we have
demonstrated Eq. (3.93), as
desired.


Remarks: The arithmetic-geometric-mean inequality, $\sqrt{ab} \le (a+b)/2$, is very easy
to prove. In fact, the ratio of its usefulness to proof-length is perhaps the largest of
any mathematical result! Since the square of a number is necessarily nonnegative, we
have

$$(\sqrt{a} - \sqrt{b})^2 \ge 0 \implies a - 2\sqrt{ab} + b \ge 0 \implies \frac{a+b}{2} \ge \sqrt{ab}, \tag{3.115}$$

as desired.

How much smaller is $E[s]$ than $\sigma$? Consider the n = 100 case in the example near the
end of Section 3.5. Fig. 3.4 showed the histogram of $N = 10^5$ values of $s^2$. We can
take the square root of each of these N values of $s^2$ to obtain N values of $s$. And then
we can average these N values to obtain the average value of $s$. The result is 0.9975,
give or take a little, depending on the numerical run. So $E[s]$ must be about 0.9975,
which is 0.0025 smaller than $\sigma = 1$. This 0.9975 result is reasonable, based on the
following (extremely hand-wavy!) argument. We're just trying to get the correct order
of magnitude here, so we won’t be concerned with factors of order 1.

If two values of $s^2$ take the form of $1 + a$ and $1 - a$, then their average equals 1, of
course. But the average value of $s$, which is $(\sqrt{1+a} + \sqrt{1-a})/2$, is not equal to 1. It
is smaller than 1, and this is the basic idea behind the fact that $E[s]$ is smaller than
$\sigma$. To produce some actual numbers, let’s pretend that the whole right half of the
histogram in Fig. 3.4 is lumped together at the one-standard-deviation mark. (This is
the hand-wavy part!) We found in the discussion of Fig. 3.4 that the standard deviation
of $s^2$ is $\sqrt{2\sigma^4/n} = 0.14$. So we’ll lump the whole right half of the histogram at the
1.14 mark. Similarly, we’ll lump the whole left half at the 0.86 mark. We then have
just two values of $s^2$, so the average value of $s$ is $(\sqrt{1.14} + \sqrt{0.86})/2 = 0.9975$.
This result agrees with the above numerical 0.9975 result a little too well. We had
no right to expect such good agreement. But in any case, it is clear that the $\sigma - E[s]$
difference decreases with n, because the above 0.14 standard-deviation value came
from $\sqrt{2\sigma^4/n}$, which decreases with n. ♣
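
Both numbers in this remark can be reproduced with a few lines of code. The following is a rough sketch rather than the numerical run behind Fig. 3.4: it assumes that the example in Section 3.5 draws n = 100 values from a Gaussian with σ = 1, and it uses far fewer trials than the $10^5$ quoted above, so the Monte Carlo figure is only good to a couple of decimal places. The seed and the trial count are arbitrary choices.

```python
import math
import random
import statistics

# Two-point "lumping" estimate from the text: s^2 is either 1.14 or 0.86.
lumped = (math.sqrt(1.14) + math.sqrt(0.86)) / 2

# Monte Carlo estimate of E[s], assuming Gaussian draws with sigma = 1.
random.seed(0)            # fixed seed, purely for reproducibility
n, trials = 100, 2000     # fewer trials than the text's 10^5, to keep this quick
s_values = [
    # statistics.stdev uses the n - 1 denominator, like the sample variance s^2
    statistics.stdev([random.gauss(0.0, 1.0) for _ in range(n)])
    for _ in range(trials)
]
monte_carlo = statistics.fmean(s_values)

print(round(lumped, 4))
print(round(monte_carlo, 3))
```

Increasing `trials` should tighten the Monte Carlo value around the 0.9975 figure quoted in the text.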

3.12. Variance of the sample variance 

Starting with $\mathrm{Var}(s^2) = E[(s^2 - \sigma^2)^2]$, we can rewrite this variance in the same
manner as in Eq. (3.35):

$$\begin{aligned}
\mathrm{Var}(s^2) &= E[(s^2 - \sigma^2)^2] \\
&= E[s^4] - 2E[s^2]\cdot\sigma^2 + \sigma^4 \\
&= E[s^4] - 2\sigma^2\cdot\sigma^2 + \sigma^4 \\
&= E[s^4] - \sigma^4,
\end{aligned} \tag{3.116}$$

where we have used the fact that $E[s^2] = \sigma^2$. Our task is therefore to calculate $E[s^4]$.
Let's rewrite the sample variance in Eq. (3.73), again in the manner of Eq. (3.35):


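$$\begin{aligned}
s^2 &= \frac{1}{n-1}\sum (x_i - \bar{x})^2 \\
&= \frac{1}{n-1}\Big(\big(\textstyle\sum x_i^2\big) - 2\big(\textstyle\sum x_i\big)\bar{x} + n\bar{x}^2\Big) \\
&= \frac{1}{n-1}\Big(\big(\textstyle\sum x_i^2\big) - n\bar{x}^2\Big),
\end{aligned} \tag{3.117}$$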


where the sum runs from 1 to n. We have used the fact that $\sum x_i = n\bar{x}$, by the
definition of $\bar{x}$. Squaring the above $s^2$ and plugging the resulting expression for $s^4$
into Eq. (3.116), we find that the variance of $s^2$ is (switching from definite values x to
random variables X)

$$\mathrm{Var}(s^2) = E[s^4] - \sigma^4
= \frac{1}{(n-1)^2}\Big(E\big[(\textstyle\sum X_i^2)^2\big] - 2nE\big[(\textstyle\sum X_i^2)\,\bar{X}^2\big] + n^2E\big[\bar{X}^4\big]\Big) - \sigma^4. \tag{3.118}$$

Note that we can't combine the second and third terms of the expansion of the square
here, as we did in Eq. (3.117), because the expression analogous to $\sum X_i = n\bar{X}$ isn’t
valid when dealing with the squares of the $X_i$. That is, $\sum X_i^2 \ne n\bar{X}^2$. We therefore have to
treat the second and third terms separately.

When calculating the three expectation values that appear in Eq. (3.118), it is much
easier to work with random variables that are measured relative to the distribution’s
mean μ. So let’s define the random variable Z by $Z = X - \mu$. That is, $Z_i = X_i - \mu$,
etc. The $Z_i$’s then all have the property that $E[Z_i] = 0$. We're effectively just shifting
the distribution so that its mean is zero. This will greatly simplify the following
calculations of the expectation values in Eq. (3.118).⁴ Since the $s^2$ in Eq. (3.73) is
independent of μ, the $\mathrm{Var}(s^2)$ in Eq. (3.118) is also independent of μ. We are therefore
free to replace all the $X_i$’s with $Z_i$’s in Eq. (3.118), without changing the value of
$\mathrm{Var}(s^2)$.

Look at the first term in Eq. (3.118). With $X_i \to Z_i$, this term is $E[(\sum Z_i^2)^2]$. When
$(Z_1^2 + \cdots + Z_n^2)^2$ is multiplied out, there will be n terms of the form $Z_i^4$, which all have
the same expectation value; call it $E[Z^4]$. And there will be $\binom{n}{2} = n(n-1)/2$ terms of
the form $2Z_i^2Z_j^2$, which again all have the same expectation value; call it $2E[Z_1^2Z_2^2]$.
So we obtain

$$\begin{aligned}
E\big[(\textstyle\sum Z_i^2)^2\big] &= nE[Z^4] + \frac{n(n-1)}{2}\cdot 2E[Z_1^2Z_2^2] \\
&= nE[Z^4] + n(n-1)E[Z_1^2]E[Z_2^2] \\
&= nE[Z^4] + n(n-1)\sigma^4,
\end{aligned} \tag{3.119}$$

where we have used the fact that $E[Z_i^2] = \sigma^2$ for any $Z_i$, which is just Eq. (3.50) with
μ = 0. We have also used Eq. (3.16), which holds here because the $Z_i$ are independent
variables.

Now look at the second term in Eq. (3.118), with $X_i \to Z_i$. When $\bar{Z}^2 = (1/n^2)(Z_1 +
\cdots + Z_n)^2$ is expanded, there will be terms of the form $Z_i^2$ and $2Z_iZ_j$. When the
latter is multiplied by $(Z_1^2 + \cdots + Z_n^2)$, it will produce terms of the form $Z_iZ_jZ_k^2$ and
$Z_iZ_j^3$. Both of these contain a $Z_i$ raised to the first power, so from Eq. (3.16) the
expectation value will involve a factor of $E[Z_i]$, which is zero. We therefore need
concern ourselves only with the $Z_i^2$ terms in $\bar{Z}^2$. The second term in the parentheses
in Eq. (3.118) then becomes

$$-2nE\Big[\big(\textstyle\sum Z_i^2\big)\cdot\frac{1}{n^2}\big(\textstyle\sum Z_i^2\big)\Big]
= -\frac{2}{n}E\big[(\textstyle\sum Z_i^2)^2\big]
= -2E[Z^4] - 2(n-1)\sigma^4, \tag{3.120}$$

where we have used the fact that we ended up with the same form as the first term in
Eq. (3.118), which we already calculated in Eq. (3.119).

⁴ We could have used this strategy in the proof of Theorem 3.5, but it wouldn’t have saved a huge
amount of time. The μ’s that appeared in Eqs. (3.70) and (3.71) didn’t cause much of a headache. But
in the present solution they definitely would.

Now for the third term in Eq. (3.118), with $X_i \to Z_i$. When we multiply out $\bar{Z}^4 =
(1/n^4)(Z_1 + \cdots + Z_n)^4$, we obtain five different types of terms, as you can verify.
They are $Z_i^4$, $Z_i^3Z_j$, $Z_i^2Z_j^2$, $Z_i^2Z_jZ_k$, and $Z_iZ_jZ_kZ_l$. The second, fourth, and fifth of
these terms involve a single power of at least one $Z_i$, so their expectation values are
zero. We therefore care only about the $Z_i^4$ and $Z_i^2Z_j^2$ terms. There are n of the former
type (all with the same expectation value). And there are $\binom{n}{2}\binom{4}{2} = 3n(n-1)$ of the
latter type (again all with the same expectation value). This is true because there are
$\binom{n}{2}$ ways to pick a particular pair of $(i,j)$ indices, and for each of these pairs there are
$\binom{4}{2} = 6$ ways to pick the two $Z_i$’s from the four factors of $(Z_1 + \cdots + Z_n)$ in $\bar{Z}^4$. The
third term in the parentheses in Eq. (3.118) is therefore

$$\begin{aligned}
n^2E\big[\bar{Z}^4\big] &= n^2\cdot\frac{1}{n^4}\Big(nE[Z^4] + 3n(n-1)E[Z_1^2Z_2^2]\Big) \\
&= \frac{1}{n}E[Z^4] + \frac{3(n-1)}{n}E[Z_1^2]E[Z_2^2] \\
&= \frac{1}{n}E[Z^4] + \frac{3(n-1)}{n}\sigma^4,
\end{aligned} \tag{3.121}$$

where we have used the fact that $E[Z_i^2] = \sigma^2$. Plugging the results from Eqs. (3.119),
(3.120), and (3.121) into Eq. (3.118), and grouping the $E[Z^4]$ and $\sigma^4$ terms together,
gives

$$\mathrm{Var}(s^2) = \frac{1}{(n-1)^2}\left[E[Z^4]\left(n - 2 + \frac{1}{n}\right) + \sigma^4(n-1)\left(n - 2 + \frac{3}{n}\right)\right] - \sigma^4. \tag{3.122}$$


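If we factor out a 1/n, the coefficient of $E[Z^4]$ in the parentheses becomes $(n-1)^2$,
so we obtain

$$\begin{aligned}
\mathrm{Var}(s^2) &= \frac{1}{n}\left(E[Z^4] + \sigma^4\,\frac{n^2 - 2n + 3}{n-1}\right) - \sigma^4 \\
&= \frac{1}{n}\left(E[Z^4] - \sigma^4\,\frac{n-3}{n-1}\right),
\end{aligned} \tag{3.123}$$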


which agrees with Eq. (3.94) because $\mu_4 \equiv E[(X - \mu)^4] = E[Z^4]$.

In the case where Z has a Gaussian distribution, you can use Eq. (4.123) in Problem 4.23
to show that $\mu_4 = 3\sigma^4$. $\mathrm{Var}(s^2)$ then simplifies to $\mathrm{Var}(s^2) = 2\sigma^4/(n-1)$. If n is
large, then this is essentially equal to $2\sigma^4/n$, as we claimed in the example near the
end of Section 3.5. Remember that this $2\sigma^4/n$ result holds only in the case of a
Gaussian distribution and large n. In contrast, the result in Eq. (3.123) is valid for any
distribution and for any n.
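
Since Eq. (3.123) is claimed to hold for any distribution and any n, it can be spot-checked by exhaustive enumeration. The sketch below is one such check under stated assumptions: the distribution (a fair six-sided die) and the sample sizes n = 3 and n = 4 are illustrative choices, and the helper names are hypothetical. Exact fractions are used so that the comparison is free of rounding.

```python
from fractions import Fraction
from itertools import product

def sample_variance(xs):
    """Sample variance with the n - 1 denominator, as in Eq. (3.73)."""
    n = len(xs)
    mean = Fraction(sum(xs), n)
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def exact_var_of_s2(values, n):
    """Var(s^2) by enumerating all equally likely samples of size n."""
    s2 = [sample_variance(xs) for xs in product(values, repeat=n)]
    mean_s2 = sum(s2) / len(s2)
    return sum((v - mean_s2) ** 2 for v in s2) / len(s2)

def formula_var_of_s2(values, n):
    """Var(s^2) from Eq. (3.123): (1/n) * (mu_4 - sigma^4 * (n - 3)/(n - 1))."""
    size = len(values)
    mu = Fraction(sum(values), size)
    sigma2 = sum((v - mu) ** 2 for v in values) / size
    mu4 = sum((v - mu) ** 4 for v in values) / size
    return (mu4 - sigma2 ** 2 * Fraction(n - 3, n - 1)) / n

die = [1, 2, 3, 4, 5, 6]   # illustrative distribution: one fair die
for n in (3, 4):
    print(n, exact_var_of_s2(die, n), formula_var_of_s2(die, n))
```

For each n, the two printed fractions should coincide.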

3.13. Sample variance for two dice rolls 


(a) When n = 2, the n − 1 factor in the denominator of Eq. (3.73) equals 1, so the
sample variance of two given dice rolls with values $x_1$ and $x_2$ is

$$s^2 = (x_1 - \bar{x})^2 + (x_2 - \bar{x})^2, \tag{3.124}$$

where $\bar{x} = (x_1 + x_2)/2$. Let's determine the values of $s^2$ that Table 1.5 produces.
If $x_1$ equals $x_2$, then $s^2 = 0$. In the table, there are six such pairs (along the main
diagonal). If $x_1$ and $x_2$ differ by 1, then they each differ from $\bar{x}$ by ±1/2, so
$s^2 = (1/2)^2 + (1/2)^2 = 1/2$. There are ten such pairs (along the two diagonals
adjacent to the main diagonal). Continuing in this manner, if $x_1$ and $x_2$ differ
by 2, then $s^2 = 1^2 + 1^2 = 2$; there are eight such pairs. If $x_1$ and $x_2$ differ by 3,
then $s^2 = (3/2)^2 + (3/2)^2 = 9/2$; there are six such pairs. If $x_1$ and $x_2$ differ by
4, then $s^2 = 2^2 + 2^2 = 8$; there are four such pairs. Finally, if $x_1$ and $x_2$ differ by
5, then $s^2 = (5/2)^2 + (5/2)^2 = 25/2$; there are two such pairs. The expectation
value of $s^2$ is therefore

$$E[s^2] = \frac{1}{36}\left(6\cdot 0 + 10\cdot\frac{1}{2} + 8\cdot 2 + 6\cdot\frac{9}{2} + 4\cdot 8 + 2\cdot\frac{25}{2}\right) = 2.92, \tag{3.125}$$

which correctly equals $\sigma^2$, as Eq. (3.74) states. If we want to instead calculate the
biased sample variance $\tilde{s}^2$ (with n instead of n − 1 in the denominator), we simply
need to tack on a factor of n = 2 in the denominator. We then end
up with $E[\tilde{s}^2] = \sigma^2/2 = 1.46$, in agreement with Eq. (3.65) for the n = 2 case.

(b) Using the above results, the variance of $s^2$ is

$$\begin{aligned}
\mathrm{Var}(s^2) &= E[(s^2 - \sigma^2)^2] \\
&= \frac{1}{36}\Big(6\cdot(0 - 2.92)^2 + 10\cdot(0.5 - 2.92)^2 + 8\cdot(2 - 2.92)^2 \\
&\qquad\quad + 6\cdot(4.5 - 2.92)^2 + 4\cdot(8 - 2.92)^2 + 2\cdot(12.5 - 2.92)^2\Big) \\
&= 11.6.
\end{aligned} \tag{3.126}$$

We’ll now show that this agrees with Eq. (3.94) when n = 2. The calculation
of $\mu_4 \equiv E[(X - \mu)^4]$ is similar to the calculation of Var(X) in Eq. (3.20). The
only difference is that we now have fourth powers instead of squares. So

$$\begin{aligned}
\mu_4 &= E[(X - 3.5)^4] \\
&= \frac{1}{6}\Big[(1 - 3.5)^4 + (2 - 3.5)^4 + (3 - 3.5)^4 + (4 - 3.5)^4 + (5 - 3.5)^4 + (6 - 3.5)^4\Big] \\
&= 14.73.
\end{aligned} \tag{3.127}$$

When n = 2, the $(n-3)/(n-1)$ factor in Eq. (3.94) equals −1, so Eq. (3.94)
gives

$$\mathrm{Var}(s^2) = \frac{1}{2}\left(\mu_4 + \sigma^4\right) = \frac{1}{2}\left(14.73 + 2.92^2\right) = 11.6, \tag{3.128}$$

in agreement with Eq. (3.126). The standard deviation of $s^2$ is then $\sqrt{11.6} =
3.4$, which seems reasonable, considering that the six possible values of $s^2$ range
from 0 to 12.5.
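
The counting in parts (a) and (b) can be reproduced by listing all 36 rolls. This is a small sketch, not part of the original solution; it relies on the observation that for n = 2 the two terms in Eq. (3.124) are equal, so that $s^2 = (x_1 - x_2)^2/2$. The variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

faces = range(1, 7)
pairs = list(product(faces, repeat=2))      # the 36 equally likely rolls

# For n = 2, s^2 = (x1 - xbar)^2 + (x2 - xbar)^2 = (x1 - x2)^2 / 2.
s2_values = [Fraction((x1 - x2) ** 2, 2) for x1, x2 in pairs]
counts = Counter(s2_values)                 # number of pairs giving each s^2

mean_s2 = sum(s2_values) / 36
var_s2 = sum((v - mean_s2) ** 2 for v in s2_values) / 36

mu = Fraction(7, 2)
sigma2 = sum((x - mu) ** 2 for x in faces) / 6
mu4 = sum((x - mu) ** 4 for x in faces) / 6

for value in sorted(counts):
    print(f"s^2 = {float(value):4.1f}: {counts[value]} pairs")
print("E[s^2]   =", round(float(mean_s2), 2))
print("Var(s^2) =", round(float(var_s2), 1))
print("mu_4     =", round(float(mu4), 2))
print("(mu_4 + sigma^4)/2 =", round(float((mu4 + sigma2 ** 2) / 2), 1))
```

The pair counts should come out as 6, 10, 8, 6, 4, 2, and the last two variance figures should agree, as in Eqs. (3.126) and (3.128).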
Distributions


At the beginning of Section 3.1, we introduced the concepts of random variables
and probability distributions. A random variable is a variable that can take on certain
numerical values with certain probabilities. The collection of these probabilities
is called the probability distribution for the random variable. A probability distribution
specifies how the total probability (which is always 1) is distributed among
the various possible outcomes.

In this chapter, we will discuss probability distributions in detail. In Section 4.1 
we warm up with some examples of discrete distributions, and then in Section 4.2 
we discuss continuous distributions. These involve the probability density, which is 
the main new concept in this chapter. It takes some getting used to, but we’ll have 
plenty of practice with it. In Sections 4.3—4.8 we derive and discuss a number of 
the more common and important distributions. They are, respectively, the uniform, 
Bernoulli, binomial, exponential, Poisson, and Gaussian (or normal) distributions. 

Parts of this chapter are a bit mathematical, but there’s no way around this if we 
want to do things properly. However, we’ve relegated some of the more technical 
issues to Appendices B and C. If you want to skip those and just accept the results 
that we derive there, that’s fine. But you are strongly encouraged to at least take a 
look at Appendix B, where we derive many properties of the number e, which is the 
most important number in probability and statistics. 


4.1 Discrete distributions 

In this section we’ll give a few simple examples of discrete distributions. To start 
off, consider the results from Example 3 in Section 2.3.4, where we calculated the 
probabilities of obtaining the various possible numbers of Heads in five coin flips. 
We found: 


$$\begin{aligned}
P(0) &= \frac{1}{32}, & P(1) &= \frac{5}{32}, & P(2) &= \frac{10}{32}, \\
P(3) &= \frac{10}{32}, & P(4) &= \frac{5}{32}, & P(5) &= \frac{1}{32}.
\end{aligned} \tag{4.1}$$

These probabilities add up to 1, as they should. Fig. 4.1 shows a plot of P(n) versus
n. The random variable here is the number of Heads, and it can take on the values 
of 0 through 5, with the above probabilities. 
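
The values in Eq. (4.1) can be regenerated by counting. The sketch below assumes a fair coin, so that all $2^5 = 32$ flip sequences are equally likely and the number of sequences with k Heads is the binomial coefficient "5 choose k"; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

flips = 5
total = 2 ** flips   # number of equally likely Heads/Tails sequences

# Exact probability of each possible number of Heads.
probs = {k: Fraction(comb(flips, k), total) for k in range(flips + 1)}

for k in probs:
    print(f"P({k}) = {comb(flips, k)}/{total}")
print("sum =", sum(probs.values()))
```

The printed sum should be 1, in line with the remark that the probabilities add up to 1.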


Figure 4.1: The probability distribution for the number of Heads in five coin flips.
(Plot of P(n) versus n for n = 0 through 5, with the vertical axis marked at 5/32 and 10/32.)


As we’ve done in Fig. 4.1, the convention is to plot the random variable on the 
horizontal axis and the probability on the vertical axis. The collective information, 
given either visually in Fig. 4.1 or explicitly in Eq. (4.1), is the probability distribution.
A probability distribution simply tells you what all the probabilities are for
the values that the random variable can take. Note that P(n) in the present example 
is nonzero only if n takes on one of the discrete values, 0, 1, 2, 3, 4, or 5. It’s a 
silly question to ask for the probability of getting 4.27 Heads, because n must of 
course be an integer. The probability of getting 4.27 Heads is trivially zero. Hence 
the word “discrete” in the title of this section. 

Another simple example of a discrete probability distribution is the one for the 
six possible outcomes of the roll of one die. The random variable in this setup is the 
number on the top face of the die. If the die is fair, then all six numbers have equal 
probabilities, so the probability for each is 1/6, as shown in Fig. 4.2. 

Figure 4.2: The probability distribution for the roll of one die.
(Plot of P(n) versus n; the six points n = 1 through 6 all sit at height 1/6.)

What if the die isn’t fair? For example, what if we make the “1” face heavier 
than the others by embedding a small piece of lead in the center of that face, just 
below the surface? The die is then more likely to land with the “1” face pointing 
down. The “6” face is opposite the “1,” so the die is more likely to land with the “6” 
pointing up. Fig. 4.2 will therefore be modified by raising the “6” dot and lowering 
the other five dots; the sum of the probabilities must still be 1, of course. P₂ through
P₅ are all equal, by symmetry. The exact values of all the probabilities depend in a
complicated way on how the mass of the lead weight compares with the mass of the 
die, and also on the nature of both the die and the table on which the die is rolled 
(how much friction, how bouncy, etc.). 

As mentioned at the beginning of Section 3.1, a random variable is assumed to 
take on numerical values, by definition. So the outcomes of Heads and Tails for a 
single coin flip technically aren’t random variables. But it still makes sense to plot 
the probabilities as shown in Fig. 4.3, even though the outcomes on the horizontal 
axis aren’t associated with a random variable. Of course, if we define a random 
variable to be the number of Heads, then the “Heads” in the figure turns into a 1, 
and the “Tails” turns into a 0. In most situations, however, the outcomes take on 
numerical values right from the start, so we can officially label them as random 
variables. But even if they don’t, we’ll often take the liberty of still referring to the 
thing being plotted on the horizontal axis of a probability distribution as a random 
variable. 


Figure 4.3: The probability distribution for a single coin flip.
(Plot of P(face) versus face; Tails and Heads both sit at height 1/2.)


4.2 Continuous distributions 

4.2.1 Motivation 

Probability distributions are fairly straightforward when the random variable is discrete.
You just list (or plot) the probabilities for each of the possible values of the
random variable. These probabilities will always add up to 1. However, not everything
comes in discrete quantities. For example, the temperature outside your house
takes on a continuous set of values, as does the amount of water in a glass. (We’ll
ignore the atomic nature of matter!)

In finding the probability distribution for a continuous random variable, you 
might think that the procedure should be exactly the same as in the discrete case. 
That is, if our random variable is the temperature at a particular location at noon 
tomorrow, then you might think that you simply have to answer questions of the 
form: What is the probability that the temperature at noon tomorrow will be 70°
Fahrenheit? 
Unfortunately, there is something wrong with this question, because it is too
easy to answer. The answer is that the probability is zero, because there is simply
no chance that the temperature at a specific time (and a specific location) will be
exactly 70°. If it’s 70.1°, that’s not good enough. And neither is 70.01°, nor even
70.00000001°. Basically, since the temperature takes on a continuous set of values
(and hence an infinite number of possible values), the probability of a specific value
occurring is 1/∞, which is zero.¹

However, even though the above question (“What is the probability that the 
temperature at noon tomorrow will be 70°?”) is a poor one, that doesn’t mean we 
should throw in the towel and conclude that probability distributions don’t exist for
continuous random variables. They do in fact exist, because there are some useful 
questions we can ask. These useful questions take the general form of: What is 
the probability that the temperature at a particular location at noon tomorrow lies 
somewhere between 69° and 71°? This question has a nontrivial answer, in the 
sense that it isn’t automatically zero. And depending on what the forecast is for 
tomorrow, the answer might be something like 20%. 

We can also ask: What is the probability that the temperature at noon lies somewhere
between 69.5° and 70.5°? The answer to this question is smaller than the
answer to the previous one, because it involves a range of only one degree instead 
of two degrees. If we assume that inside the range of 69° to 71° the temperature is 
equally likely to be found anywhere (which is a reasonable approximation although 
undoubtedly not exactly correct), and if the previous answer was 20%, then the 
present answer is (roughly) 10%, because the range is half the size. 

The point here is that the smaller the range, the smaller the chance that the temperature
lies in that range. Conversely, the larger the range, the larger the chance
that the temperature lies in that range. Taken to an extreme, if we ask for the probability
that the temperature at noon lies somewhere between -100° and 200°, then
the answer is exactly equal to 1 (ignoring liquid nitrogen spills, forest fires, and such 
things!). 

In addition to depending on the size of the range, the probability also of course 
depends on where the range is located on the temperature scale. For example, the 
probability that the temperature at noon lies somewhere between 69° and 71° is 
undoubtedly different from the probability that it lies somewhere between 11° and 
13°. Both ranges have a span of two degrees, but if the given day happens to be 
in late summer, the temperature is much more likely to be around 70° than to be 
sub-freezing (let’s assume we’re in, say, Boston). To actually figure out the probabilities,
many different pieces of data would have to be considered. In the present
temperature example, the data would be of the meteorological type. But if we were 
interested in the probability that a random person is between 69 and 71 inches tall, 
then we’d need to consider a whole different set of data. 

The lesson to take away from all this is that if we’re looking at a random variable 
that can take on a continuous set of values, the probability that this random variable 
falls into a given range depends on three things. It depends on: 

1. the location of the range,

2. the size of the range,

3. the specifics of the situation we’re dealing with.

The third of these is what determines the probability density, which is a function
whose argument is the location of the range. We’ll now discuss probability densities.

¹ Of course, if you’re using a digital thermometer that measures the temperature to the nearest tenth of
a degree, then it does make sense to ask for the probability that the thermometer reads, say, 70.0 degrees.
This probability is generally nonzero. This is due to the fact that the reading on the digital thermometer
is a discrete random variable, whereas the actual temperature is a continuous random variable.


4.2.2 Probability density 

Consider the plot in Fig. 4.4, which gives a hypothetical probability distribution for 
the temperature example we’ve been discussing. This plot shows the probability 
distribution on the vertical axis, as a function of the temperature T (the random 
variable) on the horizontal axis. We have chosen to measure the temperature in 
Fahrenheit. We’re denoting the probability distribution by p(T) instead of P(T),
to distinguish it from the type of probability distribution we’ve been talking about
for discrete variables.² The reason for this new notation is that p(T) is a probability
density and not an actual probability. We’ll talk about this below. When writing
the functional form of a probability distribution, we’ll denote probability densities
with lowercase letters, like the p in p(T) or the f in f(x). And we’ll denote actual
probabilities with uppercase letters, like the P in P(n).

Figure 4.4: A hypothetical probability distribution for the temperature.
(Plot of p(T) versus T.)


We haven’t yet said exactly what we mean by p(T). But in any case, it’s clear 
from Fig. 4.4 that the temperature is more likely to be near 70° than near 60°. The 
following definition of p(T) allows us to be precise about what we mean by this. 


² As mentioned at the beginning of Section 3.1, a random variable is usually denoted with an uppercase
letter, while the actual values are denoted with lowercase letters. So we should technically be
writing p(t) here. But since an uppercase T is the accepted notation for temperature, we’ll use T for the
actual value.

• Definition of the probability density function, p(T):

p(T) is the function of T that, when multiplied by a small interval ΔT, gives
the probability that the temperature lies between T and T + ΔT. That is,

P(temp lies between T and T + ΔT) = p(T) · ΔT. (4.2)

Note that the lefthand side contains an actual probability P, whereas the righthand 
side contains a probability density, p(T). The latter needs to be multiplied by a 
range of T (or whatever quantity we’re dealing with) in order to obtain an actual 
probability. The above definition is relevant to any continuous random variable, of 
course, not just temperature. 

Eq. (4.2) might look a little scary, but a few examples should clear things up.
From Fig. 4.4, it looks like p(70°) is about 0.07. So if we pick ΔT = 1°, we find
that the probability of the temperature lying between 70° and 71° is about

p(T) · ΔT = (0.07)(1) = 0.07 = 7%. (4.3)

If we instead pick a smaller ΔT, say 0.5°, we find that the probability of the temperature
lying between 70° and 70.5° is about (0.07)(0.5) = 3.5%. And if we pick
an even smaller ΔT, say 0.1°, we find that the probability of the temperature lying
between 70° and 70.1° is about (0.07)(0.1) = 0.7%.

Similarly, we can apply Eq. (4.2) to any other value of T. For example, it looks
like p(60°) is about 0.02. So if we pick ΔT = 1°, we find that the probability of the
temperature lying between 60° and 61° is about (0.02)(1) = 2%. And as above, we
can pick other values of ΔT too.
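
The arithmetic in these examples is just Eq. (4.2) applied repeatedly, which a few lines of code make explicit. In this sketch the density values 0.07 and 0.02 are the rough readings from Fig. 4.4 quoted above, and the function name is a hypothetical helper.

```python
def approx_probability(density, width):
    """Eq. (4.2): probability is roughly the density times a small interval width."""
    return density * width

p_70, p_60 = 0.07, 0.02   # approximate densities read off Fig. 4.4 in the text

for width in (1.0, 0.5, 0.1):
    prob = approx_probability(p_70, width)
    print(f"between 70 and {70 + width:g} degrees: about {100 * prob:.1f}%")

print(f"between 60 and 61 degrees: about {100 * approx_probability(p_60, 1.0):.1f}%")
```

Halving the interval halves the probability, while the density itself stays fixed.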

Note that, in accordance with Eq. (4.2), we have been using the value of p at the
lower end of the given temperature interval. That is, when the interval was 70° to
71°, we used p(70°) and then multiplied this by ΔT. But couldn’t we just as well
use the value of p at the upper end of the interval? That is, couldn’t the righthand
side of Eq. (4.2) just as well be p(T + ΔT) · ΔT? Indeed it could. But as long as
ΔT is small, it doesn’t matter much which value of p we use. They will both give
essentially the same answer. See the second remark below.

Remember that three inputs are necessary when finding the probability that the 
temperature lies in a specified range. As we noted at the end of Section 4.2.1, the 
first input is the value of T we’re concerned with, the second is the range ΔT, and
the third is the information encapsulated in the probability density function, p(T), 
evaluated at the given value of T. The latter two of these three quantities are the two 
quantities that are multiplied together on the righthand side of Eq. (4.2). Knowing 
only one of these isn’t enough to give you a probability. 

To recap, there is a very important difference between the probability distribu¬ 
tion for a continuous random variable and that for a discrete random variable. For 
a continuous variable, the probability distribution consists of a probability density. 
But for a discrete variable, it consists of actual probabilities. We plot a density for a 
continuous distribution, because it wouldn’t make sense to plot actual probabilities, 
since they’re all zero. This is true because the probability of obtaining exactly a 
particular value is zero, since there is an infinite number of possible values. 

Conversely, we plot actual probabilities for a discrete distribution, because it 
wouldn’t make sense to plot a density, since it consists of a collection of infinite 
spikes. This is true because on a die roll, for example, there is a 1/6 chance of
obtaining a number between, say, 4.9999999 and 5.0000001. The probability density
at the outcome of 5, which from Eq. (4.2) equals the probability divided by the
interval length, is then (1/6)/(0.0000002), which is huge. And the interval can be
made arbitrarily small, which means that the density is arbitrarily large. To sum up,
the term “probability distribution” applies to both continuous and discrete variables,
whereas the term “probability density” applies only to continuous variables.
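
To see how quickly the would-be density of a discrete outcome blows up, the following sketch divides the 1/6 probability of rolling a 5 by narrower and narrower intervals around 5. The interval widths other than 0.0000002 are arbitrary illustrative choices, and the function name is hypothetical.

```python
def die_density_near_5(width):
    """Probability (1/6) of rolling a 5, divided by the width of an interval around 5."""
    return (1 / 6) / width

for width in (2e-1, 2e-4, 2e-7):
    print(f"width {width:g}: density about {die_density_near_5(width):,.0f}")
```

The last line corresponds to the (1/6)/(0.0000002) value in the text; shrinking the width further makes the result grow without bound.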

Remarks: 

1. p(T) is a function of T, so it depends on what units we’re using to measure T. We used 
Fahrenheit above, but what if we instead want to use Celsius? Problem 4.1 addresses 
this issue (but you will need to read Section 4.2.3 first). 

2. Note the inclusion of the word “small” in the definition of the probability density in
Eq. (4.2). The reason for this word is that we want p(T) to be (roughly) constant over
the specified range. If ΔT is small enough, then this is approximately true. If p(T)
varied greatly over the range of ΔT, then it wouldn't be clear which value of p(T)
we should multiply by ΔT to obtain the probability. The point is that if ΔT is small
enough, then all of the p(T) values are roughly the same, so it doesn’t matter which
one we pick.

An alternative definition of the density p(T) is

P(temp lies between T − (ΔT)/2 and T + (ΔT)/2) = p(T) · ΔT. (4.4)

The only difference between this definition and the one in Eq. (4.2) is that we’re now
using the value of p(T) at the midpoint of the temperature range, instead of the left-end
value we used in Eq. (4.2). Both definitions are equally valid, because they give
essentially the same result for p(T), provided that ΔT is small. Similarly, we could
use the value of p(T) at the right end of the temperature range.

How small do we need ΔT to be? The answer to this will be evident when we talk
about probability in terms of area in Section 4.2.3. In short, we need the change in
p(T) over the span of ΔT to be small compared with the values of p(T) in that span.

3. The probability density function involves only (1) the value of T (or whatever) we're 
concerned with, and (2) the specifics of the situation at hand (meteorological data in 
the above temperature example, etc.). The density is completely independent of the 
arbitrary value of ΔT that we choose. This is how things work with any kind of density.

For example, consider the mass density of gold. This mass density is a property of the 
gold itself. More precisely, it is a function of each point in the gold. For pure gold, the 
density is constant throughout the volume, but we could imagine impurities that would 
make the mass density be a varying function of position, just as the above probability 
density is a varying function of temperature. Let’s call the mass density ρ(r), where
r signifies the possible dependence of ρ on the location of a given point within the
volume. (The position of a given point can be described by the vector pointing from
the origin to the point. And vectors are generally denoted by boldface letters like r.)
Let’s call the small volume we're concerned with ΔV. Then the mass in the small
volume ΔV is given by the product of the density and the volume, that is, ρ(r) · ΔV.
This is directly analogous to the fact that the probability in the above temperature
example is given by the product of the probability density and the temperature span,
that is, p(T) · ΔT. The correspondence among the various quantities is

Mass in ΔV around location r  ⟺  Prob that temp lies in ΔT around T
ρ(r)  ⟺  p(T)
ΔV  ⟺  ΔT. ♣ (4.5)


4.2.3 Probability equals area 

The graphical interpretation of the product p(T) · ΔT in Eq. (4.2) is that it is the
area of the rectangle shown in Fig. 4.5. This is true because ΔT is the base of the
rectangle, and p(T) is the height.

Figure 4.5: Interpretation of the product p(T) · ΔT as an area.

We have chosen ΔT to be 2° in the figure. With this choice, the area of the rectangle,
which equals p(70°) • (2°), gives a reasonably good approximation to the probability 
that the temperature lies between 70° and 72°. But it isn’t exact, because p(T) isn’t 
constant over the 2° interval. A better approximation to the probability that the 
temperature lies between 70° and 72° is achieved by splitting the 2° interval into 
two intervals of 1° each, and then adding up the probabilities of lying in each of 
these two intervals. These two probabilities are approximately equal to p(70°) • (1°) 
and p(71°) • (1°), and the two corresponding rectangles are shown in Fig. 4.6. 

But again, the sum of the areas of these two rectangles is still only an approximate 
result for the true probability that the temperature lies between 70° and 72°, 
because p(T) isn’t constant over the 1° intervals either. A better approximation is 
achieved by splitting the 1° intervals into smaller intervals, and then again into even 
smaller ones. And so on. When we get to the point of having 100 or 1000 extremely 
thin rectangles, the sum of their areas will essentially be the area shown in Fig. 4.7. 
This area is the correct probability that the temperature lies between 70° and 72°. 
So in retrospect, we see that the rectangular area in Fig. 4.5 exceeds the true probability 
by the area of the tiny triangular-ish region in the upper righthand corner of 
the rectangle. 
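
The subdivision argument can be tried numerically. The following is only a sketch under stated assumptions: the bell-shaped function `density` (centered at 65° with a width of 5°) is a made-up stand-in for the hand-drawn p(T) curve, and the names `density` and `rectangle_sum` are invented for this illustration. It adds up the areas of n equal-width rectangles between 70° and 72°, taking each height at the left edge of its rectangle, as in Figs. 4.5 and 4.6.

```python
import math

def density(T, mean=65.0, width=5.0):
    """Assumed bell-shaped temperature density (a stand-in for the p(T) curve)."""
    return math.exp(-((T - mean) ** 2) / (2 * width ** 2)) / (width * math.sqrt(2 * math.pi))

def rectangle_sum(p, start, stop, n):
    """Sum of the areas of n equal-width rectangles, with heights taken at the left edges."""
    dT = (stop - start) / n
    return sum(p(start + i * dT) * dT for i in range(n))

if __name__ == "__main__":
    # More (and thinner) rectangles give sums that settle down to the area under the curve.
    for n in (1, 2, 10, 100, 1000):
        print(n, rectangle_sum(density, 70.0, 72.0, n))
```

Because this assumed density decreases between 70° and 72°, every left-edge rectangle sticks out above the curve, so the sums shrink toward the true area as n grows. This is the same overshoot as the triangular-ish region mentioned above.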

Figure 4.6: Subdividing the area, to produce a better approximation to the probability. 

Figure 4.7: The area below the curve between 70° and 72° equals the probability that the 
temperature lies between 70° and 72°. 

We therefore arrive at a more precise definition (compared with Eq. (4.2)) of the 
probability density, p(T): 

• Improved definition of the probability density function, p(T): 

p(T) is the function of T for which the area under the p(T) curve between T 
and T + ΔT gives the probability that the temperature (or whatever quantity 
we’re dealing with) lies between T and T + ΔT. 

This is an exact definition, and there is no need for ΔT to be small, as there was in the 
definition in Eq. (4.2). The difference is that the present definition involves the exact 
area, whereas Eq. (4.2) involved the area of a rectangle (via simple multiplication 
by ΔT), which was only an approximation. But technically the only thing we need 
to add to Eq. (4.2) is the requirement that we take the ΔT → 0 limit. That makes 
the definition rigorous. 

The total area under any probability density curve must be 1, because this area 
equals the probability that the temperature (or whatever) takes on some value between 
-∞ and +∞, and because every possible result is included in the -∞ to +∞ 
range. However, in any realistic case, the density is essentially zero outside a specific 
finite region. So there is essentially no contribution to the area from the parts 
of the plot outside that region. There is therefore no need to go to ±∞. The total 
area under each of the curves in the above figures, including the tails on either side 
which we haven’t bothered to draw, is indeed equal to 1 (at least roughly; the curves 
were drawn by hand). 

Given a probability density function f(x), the cumulative distribution function 
F(x) is defined to be the probability that X takes on a value that is less than or 
equal to x. That is, F(x) = P(X ≤ x). For a continuous distribution, this definition 
implies that F(x) equals the area under the f(x) curve from -∞ up to the given x 
value. A quick corollary is that the probability P(a < X ≤ b) that X lies between 
two given values a and b is equal to F(b) - F(a). For a discrete distribution, 
the definition F(x) = P(X ≤ x) still applies, but we now calculate P(X ≤ x) 
by forming a discrete sum instead of finding an area. Although the cumulative 
distribution function can be very useful in probability and statistics, we won’t use it 
much in this book. 
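
For the discrete case, the "sum instead of area" recipe can be sketched in a few lines. The example distribution (one roll of a fair six-sided die) and the names `pmf`, `cdf`, and `prob_between` are assumptions chosen for illustration; exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction

# Probabilities for one roll of a fair six-sided die (an illustrative discrete distribution).
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def cdf(x):
    """F(x) = P(X <= x), found by summing the probabilities of all values up to x."""
    return sum((prob for value, prob in pmf.items() if value <= x), Fraction(0))

def prob_between(a, b):
    """P(a < X <= b) = F(b) - F(a)."""
    return cdf(b) - cdf(a)

if __name__ == "__main__":
    print(cdf(2))              # P(X <= 2)
    print(prob_between(2, 5))  # P(2 < X <= 5), that is, the faces 3, 4, and 5
```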

We’ll now spend a fair amount of time in Sections 4.3-4.8 discussing some 
common types of probability distributions. There is technically an infinite number 
of possible distributions, although only a hundred or so come up frequently enough 
to have names. And even many of these are rather obscure. A handful, however, 
come up again and again in a variety of settings, so we’ll concentrate on these. 
They are the uniform, Bernoulli, binomial, exponential, Poisson, and Gaussian (or 
normal) distributions. 


4.3 Uniform distribution 


We’ll start with a very simple continuous probability distribution, one that is uniform 
over a given interval, and zero otherwise. Such a distribution might look like 
the one shown in Fig. 4.8. If the distribution extends from x_1 to x_2, then the value 
of p(x) in that region must be 1/(x_2 - x_1), so that the total area is 1. 

Figure 4.8: A uniform distribution. (Axes: p(x) versus x; the height is 1/(x_2 - x_1) between x_1 and x_2.) 

This type of distribution could arise, for example, from a setup where a rubber 
ball bounces around in an empty rectangular room. When it finally comes to rest, 
we measure its distance x from a particular one of the walls. If you initially throw 
the ball hard enough, then it’s a pretty good approximation to say that x is equally 
likely to take on any value between 0 and L, where L is the length of the room in 
the relevant direction. In this setup, the x_1 in Fig. 4.8 equals 0 (so we would need 
to shift the rectangle to the left), and the x_2 equals L. 

The random variable here is X, and the value it takes is denoted by x. So x is 
what we plot on the horizontal axis. Since we’re dealing with a continuous distribution, 
we plot the probability density (not the probability!) on the vertical axis. If 
L equals 10 feet, then outside the region 0 < x < 10, the probability density p(x) 
equals zero. Inside this region, the density equals the total probability divided by 
the total interval, which gives 1 per 10 feet, or equivalently 1/10 per foot. If we 
want to find the actual probability that the ball ends up between, say, x = 6 and 
x = 8, then we just multiply p(x) by the interval length, which is 2 feet. The result 
is (1/10 per foot)(2 feet), which equals 2/10 = 1/5. This makes sense, of course, 
because the 2-foot interval is 1/5 of the total distance. 
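
The bouncing-ball numbers can be checked with a short sketch. The function names and the use of a pseudorandom number generator to stand in for the ball are assumptions made for illustration; the simulated fraction is only expected to come out near 1/5, not exactly equal to it.

```python
import random

def uniform_density(x, x1, x2):
    """Density of a uniform distribution on [x1, x2]: 1/(x2 - x1) inside, 0 outside."""
    return 1.0 / (x2 - x1) if x1 <= x <= x2 else 0.0

def uniform_probability(a, b, x1, x2):
    """Probability of ending up in [a, b]: the density times the overlap with [x1, x2]."""
    overlap = max(0.0, min(b, x2) - max(a, x1))
    return overlap / (x2 - x1)

def simulated_fraction(a, b, x1, x2, trials=100_000, seed=0):
    """Fraction of simulated resting positions that fall in [a, b]."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if a <= rng.uniform(x1, x2) <= b)
    return hits / trials

if __name__ == "__main__":
    print(uniform_density(7, 0, 10))         # density inside the room, per foot
    print(uniform_probability(6, 8, 0, 10))  # density times the 2-foot interval
    print(simulated_fraction(6, 8, 0, 10))   # should be close to the line above
```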

A uniform density is easy to deal with, because the area under a given part 
of the curve (which equals the probability) is simply a rectangle. And the area 
of a rectangle is just the base times the height, which is the interval length times 
the density. This is exactly the product we formed above. When the density isn’t 
uniform, it can be very difficult sometimes to find the area under a given part of the 
curve. 

Note that the larger the region of nonzero p(x) in a uniform distribution, the 
smaller the value of p(x). This follows from the fact that the total area under the 
density “curve” (which is just a straight line segment in this case) must equal 1. So 
if the base becomes longer, the height must become shorter. 


4.4 Bernoulli distribution 

We’ll now consider a very simple discrete distribution, called the Bernoulli distribution. 
This is the distribution for a process in which only two possible outcomes, 
1 and 0, can occur, with probabilities p and 1 - p, respectively. (They must add up 
to 1, of course.) The plot of this probability distribution is shown in Fig. 4.9. It is 
common to call the outcome of 1 a success and the outcome of 0 a failure. A special 
case of a Bernoulli distribution is the distribution for a coin toss, where the probabilities 
for Heads and Tails (which we can assign the values of 1 and 0, respectively) 
are both equal to 1/2. 


Figure 4.9: A Bernoulli distribution takes on the values 1 and 0 with probabilities p and 
1 - p. 


The Bernoulli distribution is the simplest of all distributions, with the exception 
of the trivial case where only one possible outcome can occur, which therefore has 
a probability of 1. The uniform and Bernoulli distributions are simple enough that 
there isn’t much to say. In contrast, the distributions in the following four sections 
(binomial, exponential, Poisson, and Gaussian) are a bit more interesting, so we’ll 
have plenty to say about them. 


4.5 Binomial distribution 

The binomial distribution, which is discrete, is an extension of the Bernoulli distribution. 
The binomial distribution is defined to be the probability distribution for 
the total number of successes that arise in an arbitrary number of independent and 
identically distributed Bernoulli processes. An example of a binomial distribution is 
the probability distribution for the number of Heads in, say, five coin tosses, which 
we discussed in Section 4.1. We could just as well pick any other number of tosses. 

In the case of five coin tosses, each coin toss is a Bernoulli process. When 
we put all five tosses together and look at the total number of successes (Heads), 
we get a binomial distribution. Let’s label the total number of successes as k. In 
this specific example, there are n = 5 Bernoulli processes, with each one having a 
p = 1/2 probability of success. The probability distribution P(k) is simply the one 
we plotted earlier in Fig. 4.1, where we counted the number of Heads. 

Let’s now find the binomial distribution associated with a general number n of 
independent Bernoulli trials, each with the same probability of success, p. So our 
goal is to find the value of P(k) for all of the different possible values of the total 
number of successes, k. The possible values of k range from 0 up to the number of 
trials, n. 

To calculate the binomial distribution (for given n and p), we first note that p^k is 
the probability that a specific set of k of the n Bernoulli processes all yield success, 
because each of the k processes has a p probability of yielding success. We then 
need the other n - k processes to not yield success, because we want exactly k 
successes. This happens with probability (1 - p)^(n-k), because each of the n - k 
processes has a 1 - p probability of yielding failure. The probability that a specific 
set of k processes (and no others) all yield success is therefore p^k · (1 - p)^(n-k). 
Finally, since there are \binom{n}{k} (that is, “n choose k”) ways to pick a specific set 
of k processes, we see that the probability that exactly k of the n processes yield 
success is 

$$P(k) = \binom{n}{k} p^k (1-p)^{n-k} \qquad \text{(binomial distribution)} \qquad (4.6)$$



This is the desired binomial distribution. Note that this distribution depends on 
two parameters - the number n of Bernoulli trials and the probability p of success in 
each trial. If you want to make these parameters explicit, you can write the binomial 
distribution P(k) as B_{n,p}(k). That is, 

$$B_{n,p}(k) = \binom{n}{k} p^k (1-p)^{n-k}. \qquad (4.7)$$

But we’ll generally just use the simple P(k) notation. 

In the special case of a binomial distribution generated from n coin tosses, we 
have p = 1/2. So Eq. (4.6) gives the probability of obtaining k Heads as 

$$P(k) = \frac{1}{2^n} \binom{n}{k}. \qquad (4.8)$$

To recap: In Eq. (4.6), n is the total number of Bernoulli processes, p is the probability 
of success in each Bernoulli process, and k is the total number of successes 
in the n processes. (So k can be anything from 0 to n.) Fig. 4.10 shows the binomial 
distribution for the cases of n = 30 and p = 1/2 (which arises from 30 coin tosses), 
and n = 30 and p = 1/6 (which arises from 30 die rolls, with a particular one of the 
six numbers representing success). 
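
Eq. (4.6) translates directly into a few lines of code. This is a minimal sketch; the function name `binomial_pmf` is an assumption made for illustration, and `math.comb` from the Python standard library supplies the binomial coefficient.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli trials, Eq. (4.6)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

if __name__ == "__main__":
    # Five coin tosses: the probabilities of 0, 1, ..., 5 Heads.
    print([binomial_pmf(k, 5, 0.5) for k in range(6)])
    # The two cases in Fig. 4.10: the most likely number of successes in each.
    for p in (1 / 2, 1 / 6):
        probs = [binomial_pmf(k, 30, p) for k in range(31)]
        print(p, max(range(31), key=lambda k: probs[k]))
```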
Figure 4.10: Two binomial distributions with n = 30 but different values of p. 


Example (Equal probabilities): Given n, for what value of p is the probability of 
zero successes equal to the probability of one success? 

Solution: In Eq. (4.6) we want P(0) to equal P(1). This gives 

$$\binom{n}{0} p^0 (1-p)^n = \binom{n}{1} p^1 (1-p)^{n-1}$$
$$\Longrightarrow \quad 1 \cdot (1-p)^n = n \cdot p \cdot (1-p)^{n-1}$$
$$\Longrightarrow \quad 1 - p = np \quad \Longrightarrow \quad p = \frac{1}{n+1}. \qquad (4.9)$$

This p = 1/(n + 1) value is the special value of p for which various competing effects 
cancel. On one hand, P(1) contains an extra factor of n from the \binom{n}{1} coefficient, which 
arises from the fact that there are n different ways for one success to happen. But on 
the other hand, P(1) also contains a factor of p, which arises from the fact that one 
success does happen. The first of these effects makes P(1) larger than P(0), while the 
second makes it smaller.³ The effects cancel when p = 1/(n + 1). Fig. 4.11 shows the 
plot for n = 10 and p = 1/11. 

³ Another effect is that P(1) is larger because it contains one fewer factor of (1 - p). But this effect 
is minor when p is small, which is the case if n is large, due to the p = 1/(n + 1) form of the answer. 

Figure 4.11: P(0) equals P(1) if p = 1/(n + 1). 

The p = 1/(n + 1) case is the cutoff between the maximum of P(k) occurring when 
k is zero or nonzero. If p is larger than 1/(n + 1), as it is in both plots in Fig. 4.10 
above, then the maximum occurs at a nonzero value of k. That is, the distribution has 
a bump. On the other hand, if p is smaller than 1/(n + 1), then the maximum occurs 
at k = 0. That is, the distribution has its peak at k = 0 and falls off from there. 
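
The cutoff at p = 1/(n + 1) can be checked exactly for a particular n. The sketch below is an illustration only: the helper names are invented, and the sample values p = 1/20 and p = 1/5 are arbitrary choices on either side of the cutoff for n = 10. Exact fractions are used so that the equality P(0) = P(1) isn't blurred by rounding.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(k, n, p):
    """Binomial probability from Eq. (4.6); works with exact fractions too."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def most_likely_k(n, p):
    """Smallest k at which P(k) takes its largest value."""
    probs = [binomial_pmf(k, n, p) for k in range(n + 1)]
    return probs.index(max(probs))

if __name__ == "__main__":
    n = 10
    p = Fraction(1, n + 1)
    print(binomial_pmf(0, n, p) == binomial_pmf(1, n, p))  # exact comparison
    print(most_likely_k(n, Fraction(1, 20)))  # p below 1/(n + 1)
    print(most_likely_k(n, Fraction(1, 5)))   # p above 1/(n + 1)
```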


Having derived the binomial distribution in Eq. (4.6), there is a simple double 
check that we can perform on the result. Since the number of successes, k, can take 
on any integer value from 0 to n, the sum of the P(k) probabilities from k = 0 
to k = n must equal 1. The P(k) expression in Eq. (4.6) does indeed satisfy this 
requirement, due to the binomial expansion, which tells us that 

$$\bigl(p + (1-p)\bigr)^n = \sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k}. \qquad (4.10)$$

This is just Eq. (1.21) from Section 1.8.3, with a = p and b = 1 - p. The lefthand 
side of Eq. (4.10) is simply 1^n = 1. And each term in the sum on the righthand side 
is a P(k) term from Eq. (4.6). So Eq. (4.10) becomes 

$$1 = \sum_{k=0}^{n} P(k), \qquad (4.11)$$

as we wanted to show. You are encouraged to verify this result for the probabilities 
in, say, the left plot in Fig. 4.10. Feel free to make rough estimates of the probabilities 
when reading them off the plot. You will find that the sum is indeed 1, up to the 
rough estimates you make. 

The task of Problem 4.4 is to use Eq. (3.4) to explicitly demonstrate that the expectation 
value of the binomial distribution in Eq. (4.6) equals pn. In other words, 
if our binomial distribution is derived from n Bernoulli trials, each having a probability 
p of success, then we should expect a total of pn successes (on average, if 
we do a large number of sets of n trials). This must be true, of course, because a 
fraction p of the n trials yield success, on average, by the definition of p for the 
given Bernoulli process. 
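
Both checks (total probability 1 and expectation value pn) can be done numerically for specific n and p. This is a sketch with invented function names; it is a numerical spot check, not a substitute for the derivation asked for in Problem 4.4.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Binomial probability from Eq. (4.6)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def total_probability(n, p):
    """Sum of P(k) over k = 0, ..., n; this should equal 1."""
    return sum(binomial_pmf(k, n, p) for k in range(n + 1))

def expected_successes(n, p):
    """Expectation value: the sum over k of k * P(k); this should equal p * n."""
    return sum(k * binomial_pmf(k, n, p) for k in range(n + 1))

if __name__ == "__main__":
    for p in (1 / 2, 1 / 6):
        print(p, total_probability(30, p), expected_successes(30, p), p * 30)
```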

Remark: We should emphasize what is meant by a probability distribution. Let’s say that 
you want to experimentally verify that the left plot in Fig. 4.10 is the correct probability 
distribution for the total number of Heads that show up in 30 coin flips. You of course 
can’t do this by flipping a coin just once. And you can’t even do it by flipping a coin 30 
times, because all you’ll get from that is just one number for the total number of Heads. For 
example, you might obtain 17 Heads. In order to experimentally verify the distribution, you 
need to perform a large number of sets of 30 coin flips, and you need to record the total 
number of Heads you get in each 30-flip set. The result will be a long string of numbers 
such as 13, 16, 15, 16, 18, 14, 11, 17, .... If you then calculate the fractions of the time that 
each number appears, these fractions should (roughly) agree with the probabilities shown in 
Fig. 4.10. The longer the string of numbers, the better the agreement, in general. The main 
point here is that the distribution doesn’t say much about one particular set of 30 flips. Rather, 
it says what the expected distribution of outcomes is for a large number of sets of 30 flips. ♣ 
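
The experiment described in this remark is tedious with a real coin but quick to imitate on a computer. The sketch below is an illustration under assumptions: a pseudorandom generator stands in for the coin, the number of sets (20,000) and the seed are arbitrary, and the function names are invented. The observed fractions should only roughly agree with Eq. (4.6).

```python
import random
from collections import Counter
from math import comb

def heads_in_one_set(rng, flips=30):
    """Number of Heads in one set of simulated fair coin flips."""
    return sum(rng.random() < 0.5 for _ in range(flips))

def observed_fractions(num_sets, flips=30, seed=1):
    """Fraction of the sets in which each possible number of Heads showed up."""
    rng = random.Random(seed)
    counts = Counter(heads_in_one_set(rng, flips) for _ in range(num_sets))
    return {k: counts[k] / num_sets for k in range(flips + 1)}

def binomial_pmf(k, n, p):
    """Binomial probability from Eq. (4.6)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

if __name__ == "__main__":
    observed = observed_fractions(20_000)
    for k in range(10, 21):
        print(k, round(observed[k], 4), round(binomial_pmf(k, 30, 0.5), 4))
```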


4.6 Exponential distribution 

In Sections 4.6-4.8 we’ll look at three probability distributions (exponential, Poisson, 
and Gaussian) that are a bit more involved than the three we’ve just discussed 
(uniform, Bernoulli, and binomial). We’ll start with the exponential distribution, 
which takes the general form 

$$p(t) = A e^{-bt}, \qquad (4.12)$$

where A and b are quantities that depend on the specific situation at hand. We will 
find below in Eq. (4.26) that these quantities must be related in a certain way in 
order for the total probability to be 1. The parameter t corresponds to whatever 
the random variable is. The exponential distribution is a continuous one, so p(t) is 
a probability density. The most common type of situation where this distribution 
arises is the following. 

Consider a repeating event that happens completely randomly in time. By “completely 
randomly” we mean that there is a uniform probability that the event happens 
at any given instant (or more precisely, in any small time interval of a given length), 
independent of what has already happened. That is, the process has no “memory.” 
The exponential distribution that we’ll eventually arrive at (after a lot of work!) in 
Eq. (4.26) gives the probability distribution for the waiting time until the next event 
occurs. Since the time t is a continuous quantity, we’ll need to develop some formalism 
to analyze the distribution. To ease into it, let’s start with the slightly easier 
case where time is assumed to be discrete. 

4.6.1 Discrete case 

Consider a process where we roll a hypothetical 10-sided die once every second. So 
time is discretized into 1-second intervals. It’s actually not necessary to introduce 
time here at all. We could simply talk about the number of iterations of the process. 
But it’s easier to talk about things like the “waiting time” than the “number of iterations 
you need to wait for.” So for convenience, we’ll discuss things in the context 
of time. 

If the die shows a “1,” we’ll consider that a success. The other nine numbers represent 
failure. There are two reasonable questions we can ask: What is the average 
waiting time (that is, the expectation value of the waiting time) between successes? 
And what is the probability distribution of the waiting times between successes? 

Average waiting time 

It is fairly easy to determine the average waiting time. There are 10 possible numbers 
on the die, so on average we can expect 1/10 of the rolls to be 1’s. If we run the 
process for a long time, say, an hour (which consists of 3600 seconds), then we 
can expect about 360 1’s. The average waiting time between successes is therefore 
(3600 seconds)/360 = 10 seconds. 

More generally, if the probability of success in each trial is p, then the average 
waiting time is 1/p (assuming that the trials happen at 1-second intervals). This 
can be seen by the same reasoning as above. If we perform n trials of the process, 
then pn of them will yield success, on average. The average waiting time between 
successes is the total time (n) divided by the number of successes (pn): 

$$\text{Average waiting time} = \frac{n}{pn} = \frac{1}{p}. \qquad (4.13)$$

Note that the preceding reasoning gives us the average waiting time, without 
requiring any knowledge of the actual probability distribution of the waiting times 
(which we will calculate below). Of course, once we do know what the probability 
distribution is, we should be able to calculate the average (the expectation value) of 
the waiting times. This is the task of Problem 4.7. 


Distribution of waiting times 


Finding the probability distribution of the waiting times requires a little more work 
than finding the average waiting time. For the 10-sided die example, the question 
we’re trying to answer is: What is the probability that if we consider two successive 
1’s, the time between them will be 6 seconds? Or 30 seconds? Or 1 second? And 
so on. Although the average waiting time is 10 seconds, this certainly doesn’t mean 
that the waiting time will always be 10 seconds. In fact, we will find below that the 
probability that the waiting time is exactly 10 seconds is quite small. 

Let’s be general and say that the probability of success in each trial is p (so 
p = 1/10 in our present setup). Then the question is: What is the probability, P(k), 
that we will have to wait exactly k iterations (each of which is 1 second here) to 
obtain the next success? 

To answer this, note that in order for the next success to happen on the kth 
iteration, there must be failure (which happens with probability 1 - p) on the first 
k - 1 iterations, and then success on the kth one. The probability of this happening 
is 

$$P(k) = (1-p)^{k-1} p \qquad \text{(geometric distribution)} \qquad (4.14)$$



This is the desired (discrete) probability distribution for the waiting time. This 
distribution goes by the name of the geometric distribution, because the probabilities 
form a geometric progression, due to the increasing power of the (1 - p) factor. The 
geometric distribution is the discrete version of the exponential distribution that 
we’ll arrive at in Eq. (4.26) below. 

Eq. (4.14) tells us that the probability that the next success comes on the very 
next iteration is p, the probability that it comes on the second iteration is (1 - p)p, 
the probability that it comes on the third iteration is (1 - p)^2 p, and so on. Each 
probability is smaller than the previous one by the factor (1 - p). A plot of the 
distribution for p = 1/10 is shown in Fig. 4.12. The distribution is maximum at 
k = 1 and falls off from that value. Even though k = 10 is the average waiting 
time, the probability of the waiting time being exactly k = 10 is only P(10) = 
(0.9)^9 (0.1) ≈ 0.04 = 4%. 

Figure 4.12: The geometric distribution with p = 1/10. 


If p is large (close to 1), the plot of P(k) starts high (at p, which is close to 1) 
and then falls off quickly, because the factor (1 - p) is close to 0. On the other hand, 
if p is small (close to 0), the plot of P(k) starts low (at p, which is close to 0) and 
then falls off slowly, because the factor (1 - p) is close to 1. 
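
The numbers quoted for the 10-sided die follow directly from Eq. (4.14). Here is a minimal sketch, with the function name `geometric_pmf` chosen only for illustration.

```python
def geometric_pmf(k, p):
    """Probability that the next success comes on iteration k, Eq. (4.14)."""
    return (1 - p) ** (k - 1) * p

if __name__ == "__main__":
    p = 1 / 10
    print(geometric_pmf(1, p))   # the largest probability, equal to p
    print(geometric_pmf(10, p))  # roughly 0.04, even though 10 is the average wait
    # Each probability is smaller than the previous one by the factor (1 - p).
    print(geometric_pmf(5, p) / geometric_pmf(4, p))
```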

As a double check on the result in Eq. (4.14), we know that the next success has 
to eventually happen sometime, so the sum of all the P(k) probabilities must be 1. 
These P(k) probabilities form a geometric series whose first term is p and whose 
ratio is 1 - p. The general formula for the sum of a geometric series with first term 
a and ratio r is a/(1 - r), so we have 

$$P(1) + P(2) + P(3) + \cdots = p + p(1-p) + p(1-p)^2 + \cdots = \frac{p}{1-(1-p)} = 1, \qquad (4.15)$$

as desired. As another check, we can verify that the expectation value (the average) 
of the waiting time for the geometric distribution in Eq. (4.14) equals 1/p, as we 
already found above; see Problem 4.7. 
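
Both checks can be approximated by adding up a finite number of terms. The sketch below uses invented helper names and an arbitrary cutoff `k_max`; since the sums are truncated, they approach 1 and 1/p from below rather than hitting them exactly.

```python
def geometric_pmf(k, p):
    """Geometric probability from Eq. (4.14)."""
    return (1 - p) ** (k - 1) * p

def partial_sum(p, k_max):
    """P(1) + P(2) + ... + P(k_max)."""
    return sum(geometric_pmf(k, p) for k in range(1, k_max + 1))

def partial_mean(p, k_max):
    """Contribution of k = 1, ..., k_max to the expectation value of the waiting time."""
    return sum(k * geometric_pmf(k, p) for k in range(1, k_max + 1))

if __name__ == "__main__":
    p = 1 / 10
    for k_max in (10, 100, 1000):
        print(k_max, partial_sum(p, k_max), partial_mean(p, k_max))
```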

You are encouraged to use a coin to experimentally “verify” Eq. (4.14) (or equivalently, 
the plot analogous to Fig. 4.12) for the case of p = 1/2. Just flip a coin as 
many times as you can in ten minutes, each time writing down a 1 if you get Heads 
and a 0 if you get Tails. Then make a long list of the waiting times between the 1’s. 
Then count up the number of one-toss waits, the number of two-toss waits, and so 
on. Then divide each of these numbers by the total number of waits (not the total 
number of tosses!) to find the probability of each waiting length. The results should 
be (roughly) consistent with Eq. (4.14) for p = 1/2. In this case, the probabilities in 
Eq. (4.14) for k = 1, 2, 3, 4,... are 1/2, 1/4, 1/8, 1/16, .... 
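
If you would rather let a computer do the flipping, the bookkeeping described above can be sketched as follows. The function names, the seed, and the choice of 100,000 simulated tosses are assumptions made for illustration.

```python
import random
from collections import Counter

def waiting_times(tosses):
    """Gaps (in tosses) between successive 1's in a list of 0's and 1's."""
    positions = [i for i, toss in enumerate(tosses) if toss == 1]
    return [later - earlier for earlier, later in zip(positions, positions[1:])]

def wait_fractions(tosses):
    """Number of waits of each length, divided by the total number of waits."""
    waits = waiting_times(tosses)
    counts = Counter(waits)
    return {k: counts[k] / len(waits) for k in sorted(counts)}

if __name__ == "__main__":
    rng = random.Random(2)
    tosses = [rng.randint(0, 1) for _ in range(100_000)]
    # Compare the observed fractions with (1/2)^k from Eq. (4.14).
    for k, fraction in list(wait_fractions(tosses).items())[:5]:
        print(k, round(fraction, 3), 0.5 ** k)
```

Note that the division is by the number of waits, not the number of tosses, exactly as cautioned above.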

4.6.2 Rates, expectation values, and probabilities 

Let’s now consider the case where time is a continuous quantity. That is, let’s assume 
that we can have a “successful” event at any instant, not just at the evenly 
spaced 1-second marks as above. A continuous process whose probability is uniform 
in time can be completely described by just one number - the average rate of 
success, which we’ll call λ. We generally won’t bother writing the word “average,” 
so we’ll just call λ the “rate.” Before getting into the derivation of the continuous 
exponential distribution in Section 4.6.3, we’ll need to talk a little about rates. 

The rate λ can be determined by counting the number of successful events that 
occur during a long time interval, and then dividing by this time. For example, if 
300 (successful) events happen during 100 minutes, then the rate λ is 3 events per 
minute. Of course, if you count the number of events in a different span of 100 
minutes, you will most likely get a slightly different number, perhaps 313 or 281. 
But in the limit of a very long time interval, you will find essentially the same rate, 
independent of which specific long interval you use. 

If the rate λ is 3 events per minute, you can alternatively write this as 1 event 
per 20 seconds, or 1/20 of an event per second. There is an infinite number of ways 
to write λ, and it’s personal preference which one you pick. Just remember that you 
have to state the “per time” interval you’re using. If you just say that the rate is 3, 
that doesn’t mean anything. 

What is the expectation value of the number of events that happen during a time 
t? This expected number simply equals the product λt, from the definition of λ. 
If the expected number were anything other than λt, then if we divided it by t to 
obtain the rate, we wouldn’t get λ. If you want to be a little more rigorous, consider 
a very large number n of intervals with length t. The total time in these intervals 
is nt. This total time is very large, so the number of events that happen during this 
time is (approximately) equal to (nt)λ, by the definition of λ. The expected number 
of events in each of the n intervals with length t is therefore ntλ/n = λt, as above. 
So we can write 

$$(\text{Expected number of events in time } t) = \lambda t \qquad (4.16)$$



In the above setup where λ equals 3 events per minute, the expected number of 
events that happen in, say, 5 minutes is 

$$\lambda t = (3 \text{ events per minute})(5 \text{ minutes}) = 15 \text{ events}. \qquad (4.17)$$

Does this mean that we are guaranteed to have exactly 15 events during a particular 
5-minute span? Absolutely not. We can theoretically have any number of events, 
although there is essentially zero chance that the number will differ significantly 
from 15. (The probability of obtaining the various numbers of events is governed 
by the Poisson distribution, which we’ll discuss in Section 4.7.) But the expectation 
value is 15. That is, if we perform a large number of 5-minute trials and then 
calculate the average number of events that occur in each trial, the result will be 
close to 15. 

A trickier question to ask is: What is the probability that exactly one event 
happens during a time t? Since λ is the rate, you might think that you can just 
multiply λ by t, as we did above, to say that the probability is λt. But this certainly 
can’t be correct, because it would imply a probability of 15 for a 5-minute interval 
in the above setup. This is nonsense, because probabilities can’t be larger than 1. If 
we instead pick a time interval of 20 seconds (1/3 of a minute), we obtain a λt value 
of 1. This doesn’t have the fatal flaw of being larger than 1, but it has another issue, 
in that it says that exactly one event is guaranteed to happen during a 20-second 
interval. This can’t be correct either, because it’s certainly possible for zero (or 
two or three, etc.) events to occur. We’ll figure out the exact probabilities of these 
numbers in Section 4.7. 

The strategy of multiplying λ by t to obtain a probability doesn’t seem to work. 
However, there is one special case where it does work. If the time interval is extremely 
small (let’s call it ε, which is a standard letter to use for something that is 
very small), then it is true that the probability of exactly one event occurring during 
the ε time interval is essentially equal to λε. We’re using the word “essentially” 
because, although this statement is technically not true, it becomes arbitrarily close 
to being true in the limit where ε approaches zero. In the above example with 
λ = 1/20 events per second, the statement, “λt is the probability that exactly one 
event happens during a time t,” is a lousy approximation if t = 20 seconds, a decent 
approximation if t = 2 seconds, and a very good approximation if t = 0.2 seconds. 
And it only gets better as the time interval gets smaller. We’ll explain why in the 
first remark below. 
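
To put numbers on "lousy," "decent," and "very good," we can compare λt with the exact probability of exactly one event. The sketch below borrows the standard Poisson result P_t(1) = λt e^(-λt), which belongs to the Poisson distribution of Section 4.7 and is quoted here without derivation; the function name `exact_one_event` is invented for this illustration.

```python
import math

def exact_one_event(rate, t):
    """Poisson probability of exactly one event in time t (standard result, quoted ahead of Section 4.7)."""
    return rate * t * math.exp(-rate * t)

if __name__ == "__main__":
    rate = 1 / 20  # events per second
    for t in (20, 2, 0.2):
        approx = rate * t
        exact = exact_one_event(rate, t)
        # The last column is the ratio approx/exact, which tends to 1 as t shrinks.
        print(t, approx, round(exact, 5), round(approx / exact, 3))
```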

We can therefore say that if P_ε(1) stands for the probability that exactly one 
event happens during a small time interval ε, then 

$$P_\varepsilon(1) \approx \lambda\varepsilon \qquad \text{(if } \varepsilon \text{ is very small)} \qquad (4.18)$$

The smaller ε is, the better this approximation is. Technically, the condition in 
Eq. (4.18) is really “if λε is very small.” But we’ll generally be dealing with “normal” 
sized λ’s, so λε being small is equivalent to ε being small. When we deal with 
continuous time below, we’ll actually be taking the ε → 0 limit. In this mathematical 
limit, the “≈” sign in Eq. (4.18) becomes an exact “=” sign. To sum up: 

• If t is very small, then λt is both the expected number of events that happen 
during the time t and (essentially) the probability that exactly one event 
happens during the time t. 

• If t isn’t very small, then λt is only the expected number of events. 


Remarks: 

1. We claimed above that λt equals the probability of exactly one event occurring, only 
if t is very small. The reason for this restriction is that if t isn’t small, then there is the 
possibility of multiple events occurring during the time t. We can be explicit about this 
as follows. Since we know from Eq. (4.16) that the expected number of events during 
any time t is λt, we can use the expression for the expectation value in Eq. (3.4) to 
write 

$$\lambda t = P_t(0) \cdot 0 + P_t(1) \cdot 1 + P_t(2) \cdot 2 + P_t(3) \cdot 3 + \cdots, \qquad (4.19)$$

where P_t(k) is the probability of obtaining exactly k events during the time t. Solving 
for P_t(1) gives 

$$P_t(1) = \lambda t - P_t(2) \cdot 2 - P_t(3) \cdot 3 - \cdots. \qquad (4.20)$$

We see that P_t(1) is smaller than λt due to the P_t(2) and P_t(3), etc., probabilities. So 
P_t(1) isn’t equal to λt. However, if all of the probabilities of multiple events occurring 
(P_t(2), P_t(3), etc.) are very small, then P_t(1) is essentially equal to λt. And this is 
exactly what happens if the time interval is very small. For small times, there is hardly 
any chance of the event even occurring once. So it is even less likely that it will occur 
twice, and even less likely for three times, etc. 

We can be a little more precise about this. The following argument isn’t completely 
rigorous, but it should convince you that if t is very small, then P_t(1) is essentially 
equal to λt. If t is very small, then assuming we don’t know yet that P_t(1) equals λt, 
we can still say that it should be roughly proportional to λt. This is true because if an 
event has only a tiny chance of occurring, then if you cut λ in half, the probability is 
essentially cut in half. Likewise if you cut t in half. This proportionality then implies 
that the probability that exactly two events occur is essentially proportional to (λt)^2. 
We’ll see in Section 4.7 that there is actually a factor of 1/2 involved here, but that 
is irrelevant in the present argument. The important point is the quadratic nature of 
(λt)^2. If λt is sufficiently small, then (λt)^2 is negligible compared with λt. Likewise 
for P_t(3) ∝ (λt)^3, etc. We can therefore ignore the scenarios where multiple events 
occur. So with t → ε, Eq. (4.20) becomes 

$$P_\varepsilon(1) \approx \lambda\varepsilon - \cancel{P_\varepsilon(2) \cdot 2} - \cancel{P_\varepsilon(3) \cdot 3} - \cdots, \qquad (4.21)$$

in agreement with Eq. (4.18). As mentioned above, if λε is small, it is because ε is 
small, at least in the situations we’ll be dealing with. 

2. Imagine drawing the $\lambda$ vs. $t$ “curve.” We have put “curve” in quotes because the curve
is actually just a straight horizontal line, since we’re assuming a constant $\lambda$. If we
consider a time interval $\Delta t$, the associated area under the curve equals $\lambda\,\Delta t$,
because we have a simple rectangular region. So from Eq. (4.18), this area gives the probability
that an event occurs during a time $\Delta t$, provided that $\Delta t$ is very small. This might make
you think that $\lambda$ can be interpreted as a probability distribution, because we found in
Section 4.2.3 that the area under a distribution curve gives the probability. However,
the $\lambda$ “curve” cannot be interpreted as a probability distribution, because this
area-equals-probability result holds only for very small $\Delta t$. The area under a distribution
curve has to give the probability for any interval on the horizontal axis. The $\lambda$ “curve”
doesn’t satisfy this property. The total area under the $\lambda$ “curve” is infinite (because the
straight horizontal line extends for all time), whereas actual probability distributions
must have a total area of 1.

3. Since only one quantity, $\lambda$, is needed to describe everything about a random process
whose probability is uniform in time, any other quantity we might want to determine
must be able to be written in terms of $\lambda$. This will become evident below. ♣
4.6.3 Continuous case

In the case of discrete time in Section 4.6.1, we asked two questions: What is the 
average waiting time between successes? And what is the probability distribution 
of the waiting times between successes? We’ll now answer these two questions in 
the case where time is a continuous quantity. 


Average waiting time 

As in the discrete case, the first of the two questions is fairly easy to answer. Let the
average rate of success be $\lambda$, and consider a large time $t$. We know from Eq. (4.16)
that the average total number of events that occur during the time $t$ is $\lambda t$. The
average waiting time (which we’ll call $\tau$) is the total time divided by the total number
of events, $\lambda t$. That is,

$$\tau = \frac{t}{\lambda t} = \frac{1}{\lambda} \qquad \text{(average waiting time)} \tag{4.22}$$


We see that the average waiting time is simply the reciprocal of the rate at which 
the events occur. For example, if the rate is 5 events per second, then the average 
waiting time is 1/5 of a second, which makes sense. This would of course be true 
in the nonrandom case where the events occur at exactly equally spaced intervals of 
1/5 second. But the nice thing is that Eq. (4.22) holds even for the random process 
we’re discussing, where the intervals aren’t equally spaced. 

It makes sense that the rate $\lambda$ is in the denominator in Eq. (4.22), because if $\lambda$ is
small, the average waiting time is large. And if $\lambda$ is large, the average waiting time
is small. And as promised in the third remark above, $\tau$ depends on $\lambda$.


Distribution of waiting times 

Now let’s answer the second (more difficult) question: What is the probability
distribution of the waiting times between successes? Equivalently, what is the probability
that the waiting time from a given event to the next event is between $t$ and $t + \Delta t$,
where $\Delta t$ is small? To answer this, we’ll use the same general strategy that we used
in the discrete case in Section 4.6.1, except that now the time interval between
iterations will be a very small time $\epsilon$ instead of 1 second. We will then take the
$\epsilon \to 0$ limit, which will make time continuous.

The division of time into little intervals is summarized in Fig. 4.13. From time
zero (which is when we’ll assume the initial event happens) to time $t$, we’ll break
up time into a very large number of very small intervals with length $\epsilon$ (which means
that there are $t/\epsilon$ of these intervals). And then the interval of $\Delta t$ sits at the end.
Both $\epsilon$ and $\Delta t$ are assumed to be very small, but they need not have anything to do
with each other; $\epsilon$ exists as a calculational tool only, while $\Delta t$ is the
arbitrarily chosen small time interval that appears in Eq. (4.2).

In order for the next success (event) to happen between $t$ and $t + \Delta t$, there must
be failure during every one of the $t/\epsilon$ intervals of length $\epsilon$ shown in Fig. 4.13,
and then there must be success between $t$ and $t + \Delta t$. From Eq. (4.18), the latter happens
with probability $\lambda\,\Delta t$, because $\Delta t$ is assumed to be very small. Also, Eq. (4.18)
says that the probability of success in any given small interval of length $\epsilon$ is
$\lambda\epsilon$, which means that the probability of failure is $1 - \lambda\epsilon$. And since
there are $t/\epsilon$ of these intervals, the probability of failure in all of them is
$(1 - \lambda\epsilon)^{t/\epsilon}$. The probability that the next success happens between $t$ and
$t + \Delta t$, which we’ll label as $P(t, \Delta t)$, is therefore

$$P(t, \Delta t) = \Big( (1 - \lambda\epsilon)^{t/\epsilon} \Big) (\lambda\,\Delta t). \tag{4.23}$$

The reasoning that led to this equation is in the same spirit as the reasoning that led
to Eq. (4.14). See the first remark below.

Figure 4.13: Dividing time into little intervals (number of intervals $= t/\epsilon$).

It’s now time to use one of the results from Appendix C, namely the approximation
given in Eq. (7.14), which says that for small $a$ we can write⁴

$$(1 + a)^n \approx e^{na}. \tag{4.24}$$

This works for negative $a$ as well as positive $a$. Here $e$ is Euler’s number, which
has the value of $e \approx 2.71828$. (If you want to know more about $e$, there’s plenty
of information in Appendix B!) For the case at hand, a comparison of Eqs. (4.23)
and (4.24) shows that we want to define $a \equiv -\lambda\epsilon$ and $n \equiv t/\epsilon$, which yields
$na = (t/\epsilon)(-\lambda\epsilon) = -\lambda t$. Eq. (4.24) then gives
$(1 - \lambda\epsilon)^{t/\epsilon} \approx e^{-\lambda t}$, so Eq. (4.23) becomes

$$P(t, \Delta t) = e^{-\lambda t}\,\lambda\,\Delta t. \tag{4.25}$$
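The limit used in this step can be checked numerically. The following is a minimal sketch, assuming illustrative values $\lambda = 5$ and $t = 0.3$ (these numbers and the function name `survival_discrete` are not from the text). It evaluates the all-failures probability $(1 - \lambda\epsilon)^{t/\epsilon}$ for shrinking $\epsilon$ and prints it next to $e^{-\lambda t}$, and it also performs the calculator check suggested in the footnote.

```python
import math

def survival_discrete(lam: float, t: float, eps: float) -> float:
    """Probability of failure in all t/eps little intervals: (1 - lam*eps)^(t/eps)."""
    return (1.0 - lam * eps) ** (t / eps)

lam, t = 5.0, 0.3                      # assumed illustrative rate and time
limit = math.exp(-lam * t)             # the claimed small-eps limit, e^(-lam*t)

for eps in (1e-1, 1e-2, 1e-3, 1e-4):
    print(eps, survival_discrete(lam, t, eps), limit)

# The check suggested in the footnote: a = 0.01 and n = 200 in (1 + a)^n vs e^(na).
print((1 + 0.01) ** 200, math.exp(200 * 0.01))
```

The printed values should approach each other as $\epsilon$ shrinks, which is all that Eq. (4.24) asserts.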


The probability distribution (or density) is obtained by simply erasing the $\Delta t$,
because Eq. (4.2) says that the density is obtained by dividing the probability by
the interval length. We therefore see that the desired probability distribution for the
waiting time between successes is

$$p(t) = \lambda e^{-\lambda t} \qquad \text{(exponential distribution)} \tag{4.26}$$

This is known as the exponential distribution. This name is appropriate, of course,
because the distribution decreases exponentially with $t$. As promised in the third
remark on page 201, the distribution depends on $\lambda$ (along with $t$, of course). In the
present setup involving waiting times, it is often more natural to work in terms of the
average waiting time $\tau$ than the rate $\lambda$, in which case the preceding result becomes
(using $\lambda = 1/\tau$ from Eq. (4.22))

$$p(t) = \frac{e^{-t/\tau}}{\tau} \qquad \text{(exponential distribution)} \tag{4.27}$$


4 You are strongly encouraged to read Appendix C at this point, if you haven’t already. But if you 
want to take Eq. (4.24) on faith, that’s fine too. However, you should at least verify with a calculator that 
it works fairly well for, say, a = 0.01 and n = 200. 
In the notation of Eq. (4.12), both $A$ and $b$ are equal to $1/\tau$ (or $\lambda$). So they are in
fact related, as we noted right after Eq. (4.12).

Fig. 4.14 shows plots of $p(t)$ for a few different values of the average waiting
time, $\tau$. The two main properties of each of these curves are the starting value at
$t = 0$ and the rate of decay as $t$ increases. From Eq. (4.27), the starting value at $t = 0$
is $e^0/\tau = 1/\tau$. So the bigger $\tau$ is, the smaller the starting value. This makes sense,
because if the average waiting time $\tau$ is large (equivalently, if the rate $\lambda$ is small),
then there is only a small chance that the next event will happen right away.


Figure 4.14: Examples of exponential distributions with different values of the average
waiting time $\tau$ (plots of $p(t)$ versus $t$).


How fast do the curves decay? This is governed by the denominator of the
exponent in Eq. (4.27). For every $\tau$ units that $t$ increases by, $p(t)$ decreases by a
factor of $1/e$. This can be seen by plugging a time of $t + \tau$ into Eq. (4.27), which
gives

$$p(t + \tau) = \frac{e^{-(t+\tau)/\tau}}{\tau} = \frac{1}{e}\left(\frac{e^{-t/\tau}}{\tau}\right). \tag{4.28}$$

So $p(t + \tau)$ is $1/e$ times as large as $p(t)$, and this holds for any value of $t$. A few
particular values of $p(t)$ are

$$p(0) = \frac{1}{\tau}, \quad p(\tau) = \frac{1}{e\tau}, \quad p(2\tau) = \frac{1}{e^2\tau}, \quad p(3\tau) = \frac{1}{e^3\tau}, \tag{4.29}$$

and so on. If $\tau$ is large, the curve takes longer to decrease by a factor of $1/e$. This is
consistent with Fig. 4.14, where the large-$\tau$ curve falls off slowly, and the small-$\tau$
curve falls off quickly. To sum up, if $\tau$ is large, the $p(t)$ curve starts off low and then
decays slowly. And if $\tau$ is small, the curve starts off high and then decays quickly.
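The little-intervals picture behind Eq. (4.23) can also be simulated directly. The sketch below is only an illustration under assumed values ($\lambda = 5$, $\epsilon = 0.001$, a fixed random seed, and the hypothetical helper names `waiting_time` and `simulate`). It steps through intervals of length $\epsilon$, declaring a success with probability $\lambda\epsilon$ in each, and records how long the first success takes. The sample mean should land near $\tau = 1/\lambda$, and the fraction of waits longer than $\tau$ should land near $1/e$, since the probability of failure in all $t/\epsilon$ intervals is approximately $e^{-\lambda t}$.

```python
import random

def waiting_time(lam: float, eps: float, rng: random.Random) -> float:
    """Step through little intervals of length eps until the first success."""
    steps = 1
    while rng.random() >= lam * eps:   # failure, which has probability 1 - lam*eps
        steps += 1
    return steps * eps

def simulate(lam: float, eps: float, trials: int, seed: int = 0) -> list:
    """Collect `trials` independent waiting times from the discretized process."""
    rng = random.Random(seed)
    return [waiting_time(lam, eps, rng) for _ in range(trials)]

lam, eps = 5.0, 0.001                  # assumed illustrative rate and step size
tau = 1 / lam                          # average waiting time from Eq. (4.22)
waits = simulate(lam, eps, trials=10_000)

mean_wait = sum(waits) / len(waits)
frac_longer = sum(w > tau for w in waits) / len(waits)
print("sample mean:", mean_wait, "expected about:", tau)
print("fraction of waits longer than tau:", frac_longer)
```

Because the process is random, the printed numbers fluctuate with the seed and the number of trials; only rough agreement should be expected.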
Example (Same density): Person A measures a very large number of waiting times
for a process with $\tau = 5$. Person B does the same for a process with $\tau = 20$. To their
surprise, they find that for a special value of $t$, they both observe (roughly) the same
number of waiting times that fall into a given small interval around $t$. What is this
special value of $t$?


Solution: The given information tells us that the probability densities for the two
processes are equal at the special value of $t$. Plugging the $\tau$ values of 5 and 20 into
Eq. (4.27) and setting the results equal to each other gives

$$\frac{e^{-t/5}}{5} = \frac{e^{-t/20}}{20} \;\Longrightarrow\; \frac{20}{5} = e^{t/5 - t/20} \;\Longrightarrow\; \ln 4 = t\left(\frac{1}{5} - \frac{1}{20}\right) \;\Longrightarrow\; t = \frac{20}{3}\ln 4 \approx 9.24. \tag{4.30}$$

This result agrees (at least to the accuracy of a visual inspection) with the value of $t$
where the $\tau = 5$ and $\tau = 20$ curves intersect in Fig. 4.14.

Although it might seem surprising that there exists a value of $t$ for which the densities
associated with two different values of $\tau$ are equal, it is actually fairly clear, due to
the following continuity argument. For small values of $t$, the $\tau = 5$ process has a
larger density (because the events happen closer together), while for large values of
$t$, the $\tau = 20$ process has a larger density (because the events happen farther apart).
Therefore, by continuity, there must exist a particular value of $t$ for which the densities
are equal. But it takes the above calculation to find the exact value.
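The algebra in this example generalizes to any pair of average waiting times. As a small sketch (the function names `density` and `crossing_time` are made up for this illustration), the code below solves for the time at which two exponential densities are equal and then confirms that the two densities agree there.

```python
import math

def density(t: float, tau: float) -> float:
    """Exponential density of Eq. (4.27): e^(-t/tau) / tau."""
    return math.exp(-t / tau) / tau

def crossing_time(tau1: float, tau2: float) -> float:
    """Time at which the densities for tau1 and tau2 are equal (tau1 != tau2).

    Setting e^(-t/tau1)/tau1 = e^(-t/tau2)/tau2 and taking logs gives
    ln(tau2/tau1) = t * (1/tau1 - 1/tau2).
    """
    return math.log(tau2 / tau1) / (1 / tau1 - 1 / tau2)

t_star = crossing_time(5, 20)
print(t_star)                                   # compare with the 9.24 found above
print(density(t_star, 5), density(t_star, 20))  # the two densities at that time
```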


Remarks:

1. In comparing Eq. (4.23) with Eq. (4.14), we see in retrospect that we could have
obtained Eq. (4.23) by simply replacing the first $p$ in Eq. (4.14) with $\lambda\epsilon$ (because
$\lambda\epsilon$ is the probability of success at each intermediate step), the second $p$ with
$\lambda\,\Delta t$ (this is the probability of success at the last step), and $k - 1$ with
$t/\epsilon$ (this is the number of intermediate steps). But you might find these replacements
a bit mysterious without the benefit of the reasoning preceding Eq. (4.23).

2. The area under each of the curves in Fig. 4.14 must be 1. The waiting time has to be
something, so the sum of all the probabilities must be 1. The proof of this fact is very
quick, but it requires calculus, so we’ll relegate it to Problem 4.8(a). (But note that
we did demonstrate this for the discrete case in Eq. (4.15).) Likewise, the expectation
value of the waiting time must be $\tau$, because that’s how $\tau$ was defined. Again, the
proof is quick but requires calculus; see Problem 4.8(c). (The demonstration for the
discrete case is the task of Problem 4.7.)

3. We’ve been referring to $p(t)$ as the probability distribution of the waiting times from
one event to the next. However, $p(t)$ is actually the distribution of the waiting times
from any point in time to the occurrence of the next event. That is, you can start
your stopwatch at any time, not just at the occurrence of an event. If you go back
through the above discussion, you will see that nowhere did we use the fact that an
event actually occurred at $t = 0$.

However, beware of the following incorrect reasoning. Let’s say that an event happens
at $t = 0$, but that you don’t start your stopwatch until, say, $t = 1$. The fact that the
next event after $t = 1$ doesn’t happen (on average) until $t = 1 + \tau$ (from the previous
paragraph) seems to imply that the average waiting time from $t = 0$ is $1 + \tau$. But it
better not be, because we know from above that it’s just $\tau$. The error here is that we
forgot about the scenarios where the next event after $t = 0$ happens between $t = 0$ and
$t = 1$. When these events are included, the average waiting time, starting at $t = 0$,
ends up correctly being $\tau$. (The demonstration of this fact requires calculus.) In short,
the waiting time from $t = 1$ is indeed $\tau$, but the next event (after the $t = 0$ event) might
have already happened before $t = 1$.

4. In a sense, the curves for all of the different values of $\tau$ in Fig. 4.14 are really the same
curve. They’re just stretched or squashed in the horizontal and vertical directions.
The general form of the curve described by the expression in Eq. (4.27) is shown in
Fig. 4.15.


Figure 4.15: The general form of the exponential distribution.


As long as we change the scales on the axes so that $\tau$ and $1/\tau$ are always located at
the same positions, then the curves will look the same for any $\tau$. For example, as we
saw in Eq. (4.29), no matter what the value of $\tau$ is, the value of the curve at $t = \tau$
is always $1/e$ times the value at $t = 0$. Of course, when we plot things, we usually
keep the scales fixed, in which case the $\tau$ and $1/\tau$ positions move along the axes, as
shown in Fig. 4.16 (these are the same curves as in Fig. 4.14). But by suitable uniform
stretching/squashing of the axes, the curve in Fig. 4.15 can be turned into any of the
curves in Fig. 4.16.



Figure 4.16: These curves can be obtained from the curve in Fig. 4.15 by suitable 
stretching/squashing of the axes. 


5. The fact that any of the curves in Fig. 4.16 can be obtained from any of the other curves
by stretching and squashing the two directions by inverse (as you can verify) factors
implies that the areas under all of the curves are the same. (This is consistent with the
fact that all of the areas must be 1.) To see how these inverse factors work together to
keep the area constant, imagine the area being broken up into a large number of thin
vertical rectangles, stacked side by side under the curve. The stretching and squashing
of the curve does the same thing to each rectangle. All of the widths get stretched
by a factor of $f$, and all of the heights get squashed by the same factor of $f$ (or $1/f$,
depending on your terminology). So the area of each rectangle remains the same. The
same thing must then be true for the area under the whole curve.

6. Note that the distribution for the waiting time is a discrete distribution in the case of
discrete time (see Eq. (4.14)), and a continuous distribution in the case of continuous
time (see Eq. (4.27)). Although these facts make perfect sense, one should be careful
about extrapolating to a general conclusion. In the Poisson discussion in the following
section, we’ll encounter a discrete distribution in the case of continuous time. ♣
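The second and fifth remarks above state that every exponential curve has area 1 and mean $\tau$, with the proofs deferred to calculus problems. Short of doing the calculus, the claims can be checked numerically with the thin-rectangles picture from the fifth remark. The sketch below is an illustration only: the helper names, the cutoff at $50\tau$, and the number of rectangles are assumptions chosen so that the neglected tail is negligible.

```python
import math

def rectangle_sum(f, a: float, b: float, n: int = 100_000) -> float:
    """Approximate the area under f on [a, b] using n thin rectangles (midpoint heights)."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width

def area_and_mean(tau: float):
    """Return (area under p, average of t weighted by p) for p(t) = e^(-t/tau)/tau."""
    p = lambda t: math.exp(-t / tau) / tau
    upper = 50 * tau                      # assumed cutoff; the tail beyond it is negligible
    area = rectangle_sum(p, 0.0, upper)
    mean = rectangle_sum(lambda t: t * p(t), 0.0, upper)
    return area, mean

for tau in (5.0, 10.0, 20.0):
    print(tau, area_and_mean(tau))
```

For each $\tau$ the first number should be very close to 1 and the second very close to $\tau$.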


4.7 Poisson distribution 

The goal of this section is to derive the Poisson probability distribution,

$$P(k) = \frac{a^k e^{-a}}{k!} \qquad \text{(Poisson distribution)} \tag{4.31}$$

The parameter $a$ depends on the situation at hand, and $k$ is the value of the random
variable, which is the number of events that occur in a certain region of time (or
space, or whatever), as we’ll discuss below. Since $k$ is an integer (because it is the
number of events that occur), the Poisson distribution is a discrete one. A common
type of situation where this distribution arises is the following.

As with the exponential distribution in the previous section, consider a repeating
event that happens completely randomly in time. We will show that the probability
distribution of the number of events that occur during a given time interval takes
the form of the above Poisson distribution. Whereas the exponential distribution
deals with the waiting time until the next event, the Poisson distribution deals with
the number of events in a given time interval. As in the case of the exponential
distribution, our strategy for deriving the Poisson distribution will be to first consider
the case of discrete time, and then the case of continuous time.


4.7.1 Discrete case 

Consider a process that is repeated each second (so time is discretized into 1-second 
intervals), and let the probability of success in each trial be p (the same for all 
trials). For example, as in Section 4.6.1, we can roll a hypothetical 10-sided die 
once every second, and if the die shows a “1,” then we consider that a success. The 
other nine numbers represent failure. As in Section 4.6.1, it isn’t actually necessary 
to introduce time here. We could simply talk about the number of iterations of the 
process, as we will in the balls-in-boxes example below. 

The question we will answer here is: What is the probability distribution of the
number of successes that occur in a time interval of $n$ seconds? In other words,
what is the probability, $P(k)$, that exactly $k$ events happen during a time span of
$n$ seconds? It turns out that this is exactly the same question that we answered in
Section 4.5 when we derived the binomial distribution in Eq. (4.6). So we can just
copy over the reasoning here. We’ll formulate things in the language of rolls of a
die, with a “1” being a success. But the setup could be anything with a probability
$p$ of success.

The probability that a specific set of $k$ of the $n$ rolls all yield a 1 equals $p^k$,
because each of the $k$ rolls has a $p$ probability of yielding a 1. We then need the
other $n - k$ rolls to not yield a 1, because we want exactly $k$ 1’s. This happens with
probability $(1 - p)^{n-k}$, because each of the $n - k$ rolls has a $1 - p$ probability of
being something other than a 1. The probability that a specific set of $k$ rolls (and
no others) all yield success is therefore $p^k \cdot (1 - p)^{n-k}$. Finally, since there are
$\binom{n}{k}$ ways to pick a specific set of $k$ rolls, we see that the probability that exactly
$k$ of the $n$ rolls yield a 1 is

$$P(k) = \binom{n}{k} p^k (1 - p)^{n-k}. \tag{4.32}$$

This distribution is exactly the same as the binomial distribution in Eq. (4.6), so
there’s nothing new here. But there will indeed be something new when we discuss
the continuous case in Section 4.7.2.


Example (Balls in boxes): Let $n$ balls be thrown randomly into $b$ boxes. What is the
probability, $P(k)$, that a given box has exactly $k$ balls in it?

Solution: This is a restatement of the problem we just solved. Imagine randomly
throwing one ball each second into the boxes, and consider a particular box. (As
mentioned above, the time interval of one second is irrelevant. All that matters is that
we perform $n$ iterations of the process, sooner or later.) If a given ball ends up in that
box, we’ll call that a success. For each ball, this happens with probability $1/b$, because
there are $b$ boxes. So the $p$ in the above discussion equals $1/b$. Since we’re throwing
$n$ balls into the boxes, we’re simply performing $n$ iterations of a process that has a
probability $p = 1/b$ of success. Eq. (4.32) is therefore applicable, and with $p = 1/b$
it gives the probability of obtaining exactly $k$ successes (that is, exactly $k$ balls in a
particular box) as

$$P(k) = \binom{n}{k} \left(\frac{1}{b}\right)^k \left(1 - \frac{1}{b}\right)^{n-k}. \tag{4.33}$$

We’ve solved the problem, but let’s now see if our answer makes sense. As a concrete
example, consider the case where we have $n = 1000$ balls and $b = 100$ boxes. On
average, we expect to have $n/b = 10$ balls in each box. But many (in fact, most) of the
boxes will have other numbers of balls. In theory, the number $k$ of balls in a particular
box can take on any value from 0 to $n = 1000$. But intuitively we expect most of
the boxes to have roughly 10 balls (say, between 5 and 15 balls). We certainly don’t
expect many boxes to have 2 or 50 balls.

Fig. 4.17 shows a plot of the $P(k)$ in Eq. (4.33), for the case where $n = 1000$ and
$b = 100$. As expected, it is peaked near the average value, $n/b = 10$, and it becomes
negligible a moderate distance away from $k = 10$. There is very little chance of having
fewer than 3 or more than 20 balls in a given box; Eq. (4.33) gives $P(2) \approx 0.2\%$
and $P(21) \approx 0.1\%$. We’ve arbitrarily chopped off the plot at $k = 30$ because the
probabilities between $k = 30$ (or even earlier) and $k = 1000$ are indistinguishable from
zero. But technically all of these probabilities are nonzero. For example,
$P(1000) = (1/100)^{1000}$, because if $k = 1000$ then all of the 1000 balls need to end up in
the given box, and each one ends up there with probability 1/100. The resulting probability
of $10^{-2000}$ is utterly negligible.
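The numbers quoted in this example can be reproduced by evaluating Eq. (4.33) directly. The following sketch uses Python’s exact binomial coefficient `math.comb`; the function name `p_balls` is just a label chosen for this illustration. It prints the probabilities at $k = 2$ and $k = 21$, the location of the peak, and the sum over all $k$.

```python
from math import comb

def p_balls(k: int, n: int, b: int) -> float:
    """Eq. (4.33): probability that a given box holds exactly k of the n balls."""
    p = 1 / b
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, b = 1000, 100
# Note: far out in the tail, p**k underflows to 0.0 in floating point, even though
# the true probabilities are technically nonzero (as discussed above).
probs = [p_balls(k, n, b) for k in range(n + 1)]

print("P(2)  =", probs[2])
print("P(21) =", probs[21])
print("most likely k:", max(range(n + 1), key=lambda k: probs[k]))
print("sum of all P(k):", sum(probs))
```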


Figure 4.17: The probability distribution for the number of balls in a given box, if
$n = 1000$ balls are thrown into $b = 100$ boxes.


4.7.2 Continuous case 

As with the exponential distribution in Section 4.6.3, we’ll now consider the case
where time is continuous. That is, we’ll assume that we can have a successful event
at any instant, not just at the evenly spaced 1-second marks, as we assumed above.
As in Section 4.6.3, such a process can be completely described by just one number,
the average rate of events, which we’ll again call $\lambda$. Eq. (4.18) tells us that
$\lambda\epsilon$ is the probability that exactly one event occurs in a very small time interval
$\epsilon$. The smaller the $\epsilon$, the smaller the probability that the event occurs. We’re
assuming that $\lambda$ is constant in time, that is, the event is just as likely to occur at one
time as any other.

Our goal here is to answer the question: What is the probability, $P(k)$, that
exactly $k$ events occur during a given time span of $t$? To answer this, we’ll use the
same general strategy that we used above in the discrete case, except that now the
time interval between iterations will be a very small time $\epsilon$ instead of 1 second. We
will then take the $\epsilon \to 0$ limit, which will make time continuous. The division of
time into little intervals is summarized in Fig. 4.18. We’re dividing the time interval
$t$ into a very large number of very small intervals with length $\epsilon$. There are $t/\epsilon$ of
these intervals, which we’ll label as $n$. There is no need to stick a $\Delta t$ interval on the
end, as there was in Fig. 4.13.

Compared with the discrete case we addressed above, Eq. (4.18) tells us that
the probability of exactly one event occurring in a given small interval of length
$\epsilon$ is now $\lambda\epsilon$ instead of $p$. So we can basically just repeat the derivation preceding
Eq. (4.32), which itself was a repetition of the derivation preceding Eq. (4.6). You’re
probably getting tired of it by now!
Figure 4.18: Dividing time into little intervals (each of length $\epsilon$; number of
intervals $= t/\epsilon \equiv n$).


The probability that a specific set of $k$ of the $n$ little intervals all yield exactly
one event each equals $(\lambda\epsilon)^k$, because each of the $k$ intervals has a $\lambda\epsilon$
probability of yielding one event. We then need the other $n - k$ intervals to not yield an
event, because we want exactly $k$ events. This happens with probability
$(1 - \lambda\epsilon)^{n-k}$, because each of the $n - k$ intervals has a $1 - \lambda\epsilon$ chance of
yielding zero events. The probability that a specific set of $k$ intervals (and no others) all
yield an event is therefore $(\lambda\epsilon)^k \cdot (1 - \lambda\epsilon)^{n-k}$. Finally, since there
are $\binom{n}{k}$ ways to pick a specific set of $k$ intervals, we see that the probability that
exactly $k$ of the $n$ intervals yield an event is

$$P(k) = \binom{n}{k} (\lambda\epsilon)^k (1 - \lambda\epsilon)^{n-k}. \tag{4.34}$$

This is simply Eq. (4.32) with $p$ replaced by $\lambda\epsilon$.

Now it’s time to have some mathematical fun. Let’s see what Eq. (4.34) reduces
to in the $\epsilon \to 0$ limit, which will give us the desired continuous-time limit. Note
that $\epsilon \to 0$ implies that $n \equiv t/\epsilon \to \infty$. The math here will be a little more
involved than the math that led to the exponential distribution in Eq. (4.26).

If we write out the binomial coefficient and expand things a bit, Eq. (4.34) becomes

$$P(k) = \frac{n!}{(n-k)!\,k!} (\lambda\epsilon)^k (1 - \lambda\epsilon)^n (1 - \lambda\epsilon)^{-k}. \tag{4.35}$$

Of the various letters in this equation, $n$ is huge, $\epsilon$ is tiny, and $\lambda$ and $k$ are
“normal,” not assumed to be huge or tiny. $\lambda$ is determined by the setup, and $k$ is the
number of events we’re concerned with. (We’ll see below that the relevant $k$’s are roughly
the size of the product $\lambda t = \lambda n\epsilon$.) In the $\epsilon \to 0$ limit (and hence
$n \to \infty$ limit), we can make three approximations to Eq. (4.35):


• First, in the $n \to \infty$ limit, we can say that

$$\frac{n!}{(n-k)!} \approx n^k, \tag{4.36}$$

at least in a multiplicative sense (we don’t care about an additive sense). This
follows from the fact that $n!/(n-k)!$ is the product of the $k$ numbers from
$n$ down to $n - k + 1$. And if $n$ is large compared with $k$, then all of these $k$
numbers are essentially equal to $n$ (multiplicatively). Therefore, since there
are $k$ of them, we simply get $n^k$. You can verify this for, say, the case of
$n = 1{,}000{,}000$ and $k = 10$. The product of the 10 numbers from 1,000,000
down to 999,991 equals $1{,}000{,}000^{10}$ to within an error of 0.005%.


• Second, we can apply the $(1 + a)^n \approx e^{na}$ approximation from Eq. (7.14) in
Appendix C, which we already used once in the derivation of the exponential
distribution; see the discussion following Eq. (4.24). We can use this
approximation to simplify the $(1 - \lambda\epsilon)^n$ term. With $a \equiv -\lambda\epsilon$, Eq. (7.14) gives

$$(1 - \lambda\epsilon)^n \approx e^{-n\lambda\epsilon}. \tag{4.37}$$

• Third, in the $\epsilon \to 0$ limit, we can use the $(1 + a)^n \approx e^{na}$ approximation again,
this time to simplify the $(1 - \lambda\epsilon)^{-k}$ term. The result is

$$(1 - \lambda\epsilon)^{-k} \approx e^{k\lambda\epsilon} \approx e^0 = 1, \tag{4.38}$$

because for any fixed values of $k$ and $\lambda$, the $k\lambda\epsilon$ exponent becomes
infinitesimally small as $\epsilon \to 0$. Basically, in $(1 - \lambda\epsilon)^{-k}$ we’re forming a
finite power of a number that is essentially equal to 1. Note that this reasoning doesn’t
apply to the $(1 - \lambda\epsilon)^n$ term in Eq. (4.37), because $n$ isn’t a fixed number. It
changes with $\epsilon$, in that it becomes large as $\epsilon$ becomes small.
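Each of the three approximations above can be checked with a few lines of arithmetic. The sketch below is illustrative only; the helper names and the sample values $\lambda = 2$, $t = 3$, $k = 4$ are assumptions made for the check, not values from the text. The first function computes $n!/(n-k)!$ divided by $n^k$, and the other two evaluate the $(1 - \lambda\epsilon)^n$ and $(1 - \lambda\epsilon)^{-k}$ terms with $\epsilon = t/n$.

```python
import math

def falling_ratio(n: int, k: int) -> float:
    """n!/(n-k)! divided by n^k, i.e. the product of (1 - j/n) for j = 0, ..., k-1."""
    ratio = 1.0
    for j in range(k):
        ratio *= (n - j) / n
    return ratio

def big_power(lam: float, t: float, n: int) -> float:
    """The (1 - lam*eps)^n term with eps = t/n; should approach e^(-lam*t)."""
    return (1 - lam * t / n) ** n

def small_power(lam: float, t: float, n: int, k: int) -> float:
    """The (1 - lam*eps)^(-k) term with eps = t/n; should approach 1."""
    return (1 - lam * t / n) ** (-k)

# First approximation: fractional error for n = 1,000,000 and k = 10.
print("fractional error:", 1 - falling_ratio(1_000_000, 10))

# Second and third approximations, for assumed values lam = 2, t = 3, k = 4.
lam, t, k = 2.0, 3.0, 4
for n in (10**2, 10**4, 10**6):
    print(n, big_power(lam, t, n), math.exp(-lam * t), small_power(lam, t, n, k))
```

The first printed value should be consistent with the 0.005% figure quoted in the first bullet, and the later rows should show the second column approaching the third while the last column approaches 1.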

In the $\epsilon \to 0$ and $n \to \infty$ limits, the “$\approx$” signs in the approximations in the
preceding three equations turn into exact “=” signs. Applying these three
approximations to Eq. (4.35) gives

$$\begin{aligned}
P(k) &= \frac{n!}{(n-k)!\,k!} (\lambda\epsilon)^k (1 - \lambda\epsilon)^n (1 - \lambda\epsilon)^{-k} \\
&= \frac{n^k}{k!} (\lambda\epsilon)^k e^{-n\lambda\epsilon} \cdot 1 \\
&= \frac{1}{k!} (\lambda \cdot n\epsilon)^k e^{-\lambda \cdot n\epsilon} \\
&= \frac{1}{k!} (\lambda t)^k e^{-\lambda t},
\end{aligned} \tag{4.39}$$

where we have used $n \equiv t/\epsilon \Longrightarrow n\epsilon = t$ to obtain the last line. Now, from
Eq. (4.16), $\lambda t$ is the average number of events that are expected to occur in the time $t$.
Let’s label this average number of events as $a \equiv \lambda t$. We can then write Eq. (4.39) as

$$P(k) = \frac{a^k e^{-a}}{k!} \qquad \text{(Poisson distribution)} \tag{4.40}$$

where $a$ is the average number of events in the time interval under consideration. If
you want, you can indicate the $a$ value by writing $P(k)$ as $P_a(k)$.
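To see the limit at work, one can evaluate the finite-$n$ expression in Eq. (4.34) for increasing $n$ and compare it with the Poisson formula. The sketch below is a hedged illustration with assumed values $\lambda = 2$ and $t = 2$ (so that $a = 4$); the function names are made up here. It reports the largest pointwise difference over $k = 0, \ldots, 20$ for each $n$.

```python
from math import comb, exp, factorial

def binomial_small_intervals(k: int, lam: float, t: float, n: int) -> float:
    """Eq. (4.34) with eps = t/n: C(n,k) (lam*eps)^k (1 - lam*eps)^(n-k)."""
    eps = t / n
    return comb(n, k) * (lam * eps) ** k * (1 - lam * eps) ** (n - k)

def poisson(k: int, a: float) -> float:
    """Eq. (4.40): a^k e^(-a) / k!."""
    return a**k * exp(-a) / factorial(k)

def max_gap(lam: float, t: float, n: int, kmax: int = 20) -> float:
    """Largest |binomial - Poisson| over k = 0, ..., kmax."""
    a = lam * t
    return max(abs(binomial_small_intervals(k, lam, t, n) - poisson(k, a))
               for k in range(kmax + 1))

lam, t = 2.0, 2.0                         # assumed illustrative values, giving a = 4
gaps = [max_gap(lam, t, n) for n in (100, 1000, 10_000, 100_000)]
print(gaps)                               # the gaps should shrink as n grows
```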

Since $a$ is the only parameter left on the righthand side of Eq. (4.40), the
distribution is completely specified by $a$. The individual values of $\lambda$ and $t$ don’t matter.
All that matters is their product $a \equiv \lambda t$. This means that if we, say, double the
time interval $t$ under consideration and also cut the rate $\lambda$ in half, then $a$ remains
unchanged; so we have exactly the same distribution $P(k)$. Although it is clear that
doubling $t$ and halving $\lambda$ yields the same average number of events (since the
average equals the product $\lambda t$), it might not be intuitively obvious that the entire $P(k)$
distribution is the same. But the result in Eq. (4.40) shows that this is indeed the
case.

The Poisson distribution in Eq. (4.40) gives the probability of obtaining exactly
$k$ events during a period of time for which the expected number is $a$. Since $k$ is
a discrete variable (being the integer number of times that an event occurs), the
Poisson distribution is a discrete distribution. Although the Poisson distribution
is derived from a continuous process (in that the time $t$ is continuous, which means
that an event can happen at any time), the distribution itself is a discrete distribution,
because $k$ must be an integer. Note that while the observed number of events $k$ must
be an integer, the average number of events $a$ need not be.

Remark: Let’s discuss this continuous/discrete issue a little further. In the last remark in
Section 4.6.3, we noted that the exponential distribution for the waiting time, $t$, is a discrete
distribution in the case of discrete time, and a continuous distribution in the case of
continuous time. This seems reasonable. But for the Poisson distribution, the distribution for the
number of events, $k$, is a discrete distribution in the case of discrete time, and also (as we just
noted) a discrete distribution in the case of continuous time. It is simply always a discrete
distribution, because the random variable is the number of events, $k$, which is discrete. The
fact that time might be continuous is irrelevant, as far as the discreteness of $k$ goes. The
difference in the case of the exponential distribution is that time itself is the random variable
(because we’re considering waiting times). So if we make time continuous, then by definition
we’re also making the random variable continuous, which means that we have a continuous
distribution. ♣


Example (Number of shoppers): On average, one shopper enters a given store every
15 seconds. What is the probability that in a given time interval of one minute, zero
shoppers enter the store? Four shoppers? Eight shoppers?

Solution: The given average time interval of 15 seconds tells us that the average
number of shoppers who enter the store in one minute is $a = 4$. Having determined $a$,
we simply need to plug the various values of $k$ into Eq. (4.40). For $k = 0$, 4, and 8 we
have

$$\begin{aligned}
P(0) &= \frac{4^0 e^{-4}}{0!} = 1 \cdot e^{-4} \approx 0.018 \approx 2\%, \\
P(4) &= \frac{4^4 e^{-4}}{4!} = \frac{32}{3} \cdot e^{-4} \approx 0.195 \approx 20\%, \\
P(8) &= \frac{4^8 e^{-4}}{8!} = \frac{512}{315} \cdot e^{-4} \approx 0.030 = 3\%.
\end{aligned} \tag{4.41}$$

We see that the probability that four shoppers enter the store in a given minute is about
10 times the probability that zero shoppers enter. The probabilities quickly die off as
$k$ gets larger. For example, $P(12) \approx 0.06\%$.

The above results are a subset of the information contained in the plot of $P(k)$ shown
in Fig. 4.19. Note that $P(3) = P(4)$. This is evident from the above expression for
$P(4)$, because if we cancel a factor of 4 in the numerator and denominator, we end up
with $4^3 e^{-4}/3!$, which equals $P(3)$. See Problem 4.10 for more on this equality.
Remember that when finding $P(k)$, the only parameter that matters is $a$. If we modify
the problem by saying that on average one shopper enters the store every 15 minutes,
and if we change the time interval to one hour (in which case $a$ again equals 4), then
all of the $P(k)$ values are exactly the same as above.
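The shopper probabilities are quick to reproduce. As a brief sketch (the function name `poisson` is simply a label for Eq. (4.40)), the code below computes $a$ from the average spacing of 15 seconds and the one-minute window, and then prints $P(k)$ for the values of $k$ discussed above.

```python
from math import exp, factorial

def poisson(k: int, a: float) -> float:
    """Eq. (4.40): probability of exactly k events when the average number is a."""
    return a**k * exp(-a) / factorial(k)

a = 60 / 15                      # one shopper per 15 seconds, over a 60-second window
for k in (0, 3, 4, 8, 12):
    print(k, poisson(k, a))

# Only the product a matters: one shopper per 15 minutes over 60 minutes gives the same a.
print(poisson(4, 60 / 15) == poisson(4, (1 / 15) * 60))
```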
Figure 4.19: The Poisson distribution with $a = 4$.


Example (Balls in boxes, again): Although Eq. (4.40) technically holds only in the 
limit of a continuous process, it still provides a very good approximation for discrete 
processes, as long as the numbers involved are fairly large. Consider the balls-in-boxes 
example in Section 4.7.1. With n = 1000 and b = 100, the average number of balls in 
a box is a = n/b = 10. Since b is fairly large, we expect that the Poisson distribution 
in Eq. (4.40) with a = 10 will provide a good approximation to the exact binomial 
distribution in Eq. (4.33) with n = 1000 and b = 100, or equivalently Eq. (4.32) with 
n = 1000 and p=l/b= 1/100. 

Let's see how good the approximation is. Fig. 4.20 shows plots for two different sets 
of n and b values: n = 100, b = 10; and n = 1000, b = 100. With these values, both 
plots have a = 10. The dots in the second plot are a copy of the dots in Fig. 4.17. 
In both plots we have superimposed the exact discrete binomial distribution (the dots) 
and the Poisson distribution (the curves). 5 Since the plots have the same value of a, 
they have the same Poisson curve. In the right plot, the points pretty much lie on the 
curve, so the approximate Poisson probabilities in Eq. (4.40) are essentially the same 
as the exact binomial probabilities in Eq. (4.33). In other words, the approximation is 
a very good one. 

However, in the left plot, the points lie slightly off the curve. The average a = n/b still 
equals 10, so the Poisson curve is exactly the same as in the right plot. But the exact 
binomial probabilities in Eq. (4.33) are changed from the n = 1000 and b = 100 case. 
The Poisson approximation doesn't work as well here, although it’s still reasonably 
good. The condition under which the Poisson approximation is a good one turns out 
to be the very simple relation, p = 1/b ≪ 1. See Problem 4.14. 
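
The comparison in Fig. 4.20 can also be checked numerically. The following is a minimal sketch, assuming the binomial form with p = 1/b and the Poisson form of Eq. (4.40); the function names and the cutoff k_max = 30 are illustrative choices, not part of the text.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """Exact probability of k successes in n trials, each with probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, a):
    """Poisson probability of k events when the expected number is a."""
    return a**k * exp(-a) / factorial(k)

def max_abs_difference(n, b, k_max=30):
    """Largest gap between the binomial and Poisson values for k = 0..k_max."""
    a = n / b  # average number of balls per box
    return max(abs(binomial_pmf(k, n, 1 / b) - poisson_pmf(k, a))
               for k in range(k_max + 1))

# The two cases plotted in Fig. 4.20; both have a = 10.
for n, b in [(100, 10), (1000, 100)]:
    gap = max_abs_difference(n, b)
    print(f"n = {n}, b = {b}: max |binomial - Poisson| = {gap:.5f}")
```

The gap for n = 1000, b = 100 should come out noticeably smaller than the gap for n = 100, b = 10, in line with the dots sitting closer to the curve in the right plot.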


[Figure 4.20 panels: left, n = 100, b = 10 (p = 1/10); right, n = 1000, b = 100 (p = 1/100). 
Dots = exact binomial result. Curves = Poisson approximation (both plots have a = 10).] 

Figure 4.20: Comparison between the exact binomial result and the Poisson approximation. 

5 We’ve drawn the Poisson distribution as a continuous curve (the k! in Eq. (4.40) can be extrapolated 
to non-integer values of k), because it would be difficult to tell what’s going on in the figure if we plotted 
two sets of points nearly on top of each other. But you should remember that we’re really only concerned 
with integer values of k, since the k in Eq. (4.40) is the number of times something occurs. We’ve plotted 
the whole curve for visual convenience only. 

The Poisson distribution in Eq. (4.40) works perfectly well for small a, even 
a < 1. It’s just that in this case, the plot of P(k) doesn’t have a bump, as it does in 
Figs. 4.19 and 4.20. Instead, it starts high and then falls off as k increases. Fig. 4.21 
shows the plot of P(k) for various values of a. We’ve arbitrarily decided to cut off 
the plots at k = 20, even though they technically go on forever. Since we are assuming 
that time is continuous, we can theoretically have an arbitrarily large number of 
events in any given time interval, although the probability will be negligibly small. 
In the plots, the probabilities are effectively zero by k = 20, except in the a = 15 
case. 

Figure 4.21: The Poisson distribution for various values of a. 


As a increases, the bump in the plots (once it actually becomes a bump) does 
three things: (1) it shifts to the right, because it is centered near k = a, due to the 
result in Problem 4.10, (2) it decreases in height, due to the result in Problem 4.11, 
and (3) it becomes wider, due to the result in Problem 4.13. The last two of these 
properties are consistent with each other, in view of the fact that the sum of all the 
probabilities must equal 1, for any value of a. 
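
These three trends can be seen in a quick numerical experiment. The sketch below is only illustrative: the values a = 2.5, 7.5, 15.5 are our own choice (non-integer, so that the peak is unambiguous), the cutoff k_max = 100 is an assumption that is safe for these a values, and the "width" is measured here by the standard deviation of the truncated distribution.

```python
from math import exp, factorial, sqrt

def poisson_pmf(k, a):
    """Poisson probability of k events when the expected number is a."""
    return a**k * exp(-a) / factorial(k)

def bump_summary(a, k_max=100):
    """Return (location of peak, height of peak, standard deviation) for k = 0..k_max."""
    probs = [poisson_pmf(k, a) for k in range(k_max + 1)]
    peak_k = max(range(k_max + 1), key=lambda k: probs[k])
    mean = sum(k * p for k, p in enumerate(probs))
    variance = sum((k - mean)**2 * p for k, p in enumerate(probs))
    return peak_k, probs[peak_k], sqrt(variance)

for a in (2.5, 7.5, 15.5):
    peak_k, height, width = bump_summary(a)
    print(f"a = {a}: peak at k = {peak_k}, height = {height:.4f}, width = {width:.4f}")
```

As a increases, the printed peak location should move to the right, the height should drop, and the width should grow.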

Eq. (4.40) gives the probability of obtaining zero events as P(0) = e^{-a}. If 
a = 0.5 then P(0) = e^{-0.5} ≈ 0.61. This agrees with the first plot in Fig. 4.21. 
Likewise, if a = 1 then P(0) = e^{-1} ≈ 0.37, in agreement with the second plot. If a 
is large then the P(0) = e^{-a} probability goes to zero, in agreement with the bottom 
three plots. This makes sense; if the average number of events is large, then it is 
very unlikely that we will obtain zero events. In the opposite extreme, if a is very 
small (for example, a = 0.01), then the P(0) = e^{-a} probability is very close to 1. 
This again makes sense; if the average number of events is very small, then it is very 
likely that we will obtain zero events. 
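
The P(0) values quoted above are easy to reproduce. This is a minimal sketch; the function name prob_zero_events and the particular list of a values are our own illustrative choices.

```python
from math import exp

def prob_zero_events(a):
    """P(0) for a Poisson distribution with expected number of events a."""
    return exp(-a)

# Small, moderate, and large values of a.
for a in (0.01, 0.5, 1, 10, 15):
    print(f"a = {a}: P(0) = {prob_zero_events(a):.6f}")
```

The a = 0.5 and a = 1 lines should round to 0.61 and 0.37, the a = 0.01 line should be close to 1, and the a = 10 and a = 15 lines should be close to zero.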

To make it easier to compare the six plots in Fig. 4.21, we have superimposed 
them in Fig. 4.22. Although we have drawn these Poisson distributions as continuous 
curves to make things clearer, remember that the distribution applies only to 
integer values of k. 

Figure 4.22: Superimposing the plots in Fig. 4.21, drawn as continuous curves. 

Problems 4.9 through 4.13 cover various aspects of the Poisson distribution, 
namely: the fact that the total probability is 1, the location of the maximum, the 
value of the maximum, the expectation value, and the variance. 


4.8 Gaussian distribution 


The Gaussian probability distribution (also known as the “normal distribution” or 
the “bell curve”) is the most important of all the probability distributions. The 
reason, as we will see in Chapter 5, is that in the limit of large numbers, many other 
distributions reduce to a Gaussian. But for now, we’ll just examine the mathematical 
properties of the Gaussian distribution. The distribution is commonly written in 
either of the following forms: 

$$ f(x) = \sqrt{\frac{b}{\pi}}\, e^{-b(x-\mu)^2} \quad\text{or}\quad \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/(2\sigma^2)}. \tag{4.42} $$


If you want to explicitly indicate the parameters that appear, you can write the 
distribution as f_{μ,b}(x) or f_{μ,σ}(x). The Gaussian distribution is a continuous one. 
That is, x can take on a continuum of values, like t in the exponential distribution, 
but unlike k in the binomial and Poisson distributions. The Gaussian probability 
distribution is therefore a probability density. As mentioned at the beginning of 
Section 4.2.2, the standard practice is to use lowercase letters (like the f in f(x)) 
for probability densities, and to use uppercase letters (like the P in P(k)) for actual 
probabilities. 

The second expression in Eq. (4.42) is obtained from the first by letting b = 
1/(2σ²). The first expression is simpler, but the second one is more common. This 
is due to the fact that the standard deviation, which we introduced in Section 3.3, 
turns out simply to be σ. Hence our use of the letter σ here. Note that b (or σ) 
appears twice in the distribution: in the exponent and in the prefactor. These two 
appearances conspire to make the total area under the distribution equal to 1. See 
Problem 4.22 for a proof of this fact. 
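
Both claims (that the two forms agree when b = 1/(2σ²), and that the total area is 1) can be checked numerically. The following is a rough sketch, not a proof: the function names, the sample values μ = 6 and σ = 0.5, and the simple midpoint-rule integrator are our own assumptions.

```python
from math import exp, pi, sqrt

def gaussian_b(x, mu, b):
    """First form in Eq. (4.42), written in terms of b."""
    return sqrt(b / pi) * exp(-b * (x - mu)**2)

def gaussian_sigma(x, mu, sigma):
    """Second form in Eq. (4.42), written in terms of sigma."""
    return exp(-(x - mu)**2 / (2 * sigma**2)) / sqrt(2 * pi * sigma**2)

def area(f, lo, hi, steps=20_000):
    """Midpoint-rule estimate of the area under f between lo and hi."""
    dx = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * dx) for i in range(steps)) * dx

mu, sigma = 6.0, 0.5            # illustrative values
b = 1 / (2 * sigma**2)          # the substitution that links the two forms (b = 2 here)

print(gaussian_b(6.3, mu, b), gaussian_sigma(6.3, mu, sigma))   # the two forms should agree
print(area(lambda x: gaussian_b(x, mu, b), 0.0, 12.0))          # should be very close to 1
```

The integration range 0 to 12 is wide enough here that the neglected tails are negligible; for a larger σ the range would need to be widened accordingly.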

The quantities μ and b (or μ and σ) depend on the specific situation at hand. 
Let’s look at how these quantities affect the shape and location of the curve. We’ll 
work mainly with the first form in Eq. (4.42) here, but any statements we make 
about b can be converted into statements about σ by replacing b with 1/(2σ²). 

Mean 

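Let’s consider μ first. Fig. 4.23 shows the plots of two Gaussian distributions, one 
with b = 2 and μ = 6, and the other with b = 2 and μ = 10. The two functions are 

$$ f(x) = \sqrt{\frac{2}{\pi}}\, e^{-2(x-6)^2} \quad\text{and}\quad f(x) = \sqrt{\frac{2}{\pi}}\, e^{-2(x-10)^2}. \tag{4.43} $$
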


Figure 4.23: Gaussian distributions with different means. 

It is clear from the plots that μ is the location of the maximum of the curve. 
Mathematically, this is true because the e^{-b(x-μ)^2} exponential factor has an 
exponent that is either zero or negative (because a square is always zero or positive). So 
this exponential factor is always less than or equal to 1. Its maximum value occurs 
when the exponent is zero, that is, when x = μ. The peak is therefore located at 
x = μ. If we increase μ (while keeping b the same), the whole curve just shifts to 
the right, keeping the same shape. This is evident from the figure. 

Because the curve is symmetric around the maximum, μ is also the mean (or 
expectation value) of the distribution: 

Mean = μ. (4.44) 

Since we used the letter μ for the mean throughout Chapter 3, it was a natural choice 
to use μ the way we did in Eq. (4.42). Of course, for the same reason, it would also 
have been natural to use μ for the mean of the exponential and Poisson distributions. 
But we chose to label those means as τ and a, so that there wouldn’t be too many 
μ’s floating around in this chapter. 

Height 

Now let’s consider b. Fig. 4.24 shows the plots of two Gaussian distributions, one 
with b = 2 and μ = 6, and the other with b = 8 and μ = 6. The two functions are 

$$ f(x) = \sqrt{\frac{2}{\pi}}\, e^{-2(x-6)^2} \quad\text{and}\quad f(x) = \sqrt{\frac{8}{\pi}}\, e^{-8(x-6)^2}. \tag{4.45} $$

Note that the scales on both the x and y axes in Fig. 4.24 are different from those in 
Fig. 4.23. The first function here is the same as the first function in Fig. 4.23. 

Figure 4.24: Gaussian distributions with different values of b. Both the heights and the 
widths differ. 


It is clear from the plots that b affects both the height and width of the curve. 
Let’s see how these two effects come about. The effect on the height is easy to 
understand, because the height of the curve (the maximum value of the function) is 
simply √(b/π). This is true because when x equals μ (which is the location of the 
maximum), the e^{-b(x-μ)^2} factor equals 1, in which case the value of √(b/π) e^{-b(x-μ)^2} 
is just √(b/π). (By the same reasoning, the second expression in Eq. (4.42) gives the 
height in terms of σ as 1/√(2πσ²).) Looking at the two functions in Eq. (4.45), we 
see that the ratio of the heights is √(8/2) = 2. And this is indeed the ratio we observe 
in Fig. 4.24. To summarize: 

$$ \text{Height} = \sqrt{\frac{b}{\pi}} = \frac{1}{\sqrt{2\pi\sigma^2}}. \tag{4.46} $$


Width in terms of b 

Now for the width. We see that the second function in Fig. 4.24 is both taller and 
narrower than the first. (But it has the same midpoint, because we haven’t changed 
μ.) The factor by which it is shrunk in the horizontal direction appears to be about 
1/2. And in fact, it is exactly 1/2. It turns out that the width of a Gaussian curve 
is proportional to 1/√b. This means that since we increased b by a factor of 4 in 
constructing the second function, we decreased the width by a factor of 1/√4 = 
1/2. Let’s now show that the width is in fact proportional to 1/√b. 

But first, what do we mean by “width”? A vertical rectangle has a definite width, 
but a Gaussian curve doesn’t, because the “sides” are tilted. We could arbitrarily 
define the width to be how wide the curve is at a height equal to half the maximum 
height. Or instead of half, we could say a third. Or a tenth. We can define it 
however we want, but the nice thing is that however we choose to define it, the above 
“proportional to 1/√b” result will still hold, as long as we pick one definition and 
stick with it for whatever curves we’re looking at. Similarly, if we want to work 
with the second expression in Eq. (4.42), then since 1/√b ∝ σ, the width will be 
proportional to σ, independent of the specifics of our arbitrary definition. 

The definition we’ll choose here is: The width of a curve is the width at the 
height equal to 1/e (which happens to be about 0.37) times the maximum height 
(which is √(b/π)). This 1/e choice is a natural one, because the x values that 
correspond to this height are easy to find. They are simply μ ± 1/√b, because the first 
expression in Eq. (4.42) gives 

$$ \begin{aligned}
f(\mu \pm 1/\sqrt{b}) &= \sqrt{\frac{b}{\pi}}\, e^{-b[(\mu \pm 1/\sqrt{b}) - \mu]^2} \\
&= \sqrt{\frac{b}{\pi}}\, e^{-b(\pm 1/\sqrt{b})^2} \\
&= \sqrt{\frac{b}{\pi}}\, e^{-b/b} \\
&= \sqrt{\frac{b}{\pi}} \cdot \frac{1}{e},
\end{aligned} \tag{4.47} $$

as desired. Since the difference between μ + 1/√b and μ - 1/√b equals 2/√b, the 
width of the Gaussian curve (by our arbitrary definition) is 2/√b. So 1/√b is half 
of the width, which we’ll call the “half-width”. (The term “half-width” can also 
refer to the full width of the curve at half of the maximum height. We won’t use 
that meaning here.) Again, any other definition of the width would also yield the 
√b in the denominator. That’s the important part. The 2 in the numerator doesn’t 
have much significance. The half-width is shown below in Fig. 4.25, following the 
discussion of the width in terms of σ. 
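
The height and 1/e half-width results can be checked for the two curves in Fig. 4.24. This is a small sketch, assuming the first form in Eq. (4.42); the function name gaussian_b and the results dictionary are illustrative.

```python
from math import e, exp, pi, sqrt

def gaussian_b(x, mu, b):
    """First form of the Gaussian density in Eq. (4.42)."""
    return sqrt(b / pi) * exp(-b * (x - mu)**2)

mu = 6
results = {}
for b in (2, 8):                                   # the two b values in Eq. (4.45)
    height = gaussian_b(mu, mu, b)                 # value at the peak, x = mu
    half_width = 1 / sqrt(b)                       # the 1/e half-width
    at_edge = gaussian_b(mu + half_width, mu, b)   # value one half-width from the peak
    results[b] = (height, half_width, at_edge)
    print(f"b = {b}: height = {height:.4f}, half-width = {half_width:.4f}, "
          f"edge/height = {at_edge / height:.4f}")
```

Increasing b from 2 to 8 should double the height and halve the half-width, and in both cases the edge-to-height ratio should equal 1/e (about 0.37).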


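Width in terms of σ 

When working with the second form in Eq. (4.42) (which is the more common of 
the two), the default definition of the width is the width at the height equal to 1/√e 
times the maximum height. This definition (which is different from the above 1/e 
definition) is used because the values of x that correspond to this height are simply 
μ ± σ. This is true because if we plug x = μ ± σ into the second expression in 
Eq. (4.42), we obtain 

$$ \begin{aligned}
f(\mu \pm \sigma) &= \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-[(\mu \pm \sigma) - \mu]^2/(2\sigma^2)} \\
&= \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(\pm\sigma)^2/(2\sigma^2)} \\
&= \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-1/2} \\
&= \frac{1}{\sqrt{2\pi\sigma^2}} \cdot \frac{1}{\sqrt{e}}.
\end{aligned} \tag{4.48} $$
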

The factor of 1/√e here equals 1/√2.718 ≈ 0.61, which is larger than the 1/e ≈ 
0.37 factor in our earlier definition. This is consistent with the fact that the x = μ ± σ 
points (where the height is 1/√e ≈ 0.61 times the maximum) are closer to the 
center than the x = μ ± 1/√b = μ ± √2 σ points (where the height is 1/e ≈ 0.37 
times the maximum). This is summarized in Fig. 4.25; we have chosen μ = 0 for 
convenience. 
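
The two half-width definitions can be compared directly in the σ form. A minimal sketch follows; the value σ = 1.5 is an arbitrary illustrative choice, and the variable names are our own.

```python
from math import e, exp, pi, sqrt

def gaussian_sigma(x, mu, sigma):
    """Second form of the Gaussian density in Eq. (4.42)."""
    return exp(-(x - mu)**2 / (2 * sigma**2)) / sqrt(2 * pi * sigma**2)

mu, sigma = 0.0, 1.5                 # illustrative values; mu = 0 as in Fig. 4.25
peak = gaussian_sigma(mu, mu, sigma)

# Height relative to the peak at one sigma, and at sqrt(2) sigma = 1/sqrt(b).
ratio_at_sigma = gaussian_sigma(mu + sigma, mu, sigma) / peak
ratio_at_sqrt2_sigma = gaussian_sigma(mu + sqrt(2) * sigma, mu, sigma) / peak

print(f"height ratio at mu + sigma:         {ratio_at_sigma:.4f}")        # 1/sqrt(e), about 0.6065
print(f"height ratio at mu + sqrt(2) sigma: {ratio_at_sqrt2_sigma:.4f}")  # 1/e, about 0.3679
```

The ratios do not depend on the σ chosen, which is the point of defining the width this way.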




Figure 4.25: Different definitions of the half-width, in terms of b and σ. 


Although the x = μ ± σ points yield a nice value of the Gaussian distribution 
(1/√e times the maximum), the really nice thing about the x = μ ± σ points is that 
they are one standard deviation from the mean μ. It can be shown (with calculus, 
see Problem 4.23) that the standard deviation (defined in Eq. (3.40)) of the Gaussian 
distribution given by the second expression in Eq. (4.42) is simply σ. This is why 
the second form in Eq. (4.42) is more widely used than the first. And for the same 
reason, people usually choose to (arbitrarily) define the half-width of the Gaussian 
curve to be σ instead of the 1/√b = √2 σ half-width that we found earlier. That 
is, they’re defining the width by looking at where the function is 1/√e times the 
maximum, instead of 1/e times the maximum. As we noted earlier, any such definition 
is perfectly fine; it’s a matter of personal preference. The critical point is that the 
width is proportional to σ (or 1/√b). The exact numerical factor involved is just a 
matter of definition. 

As mentioned on page 153, it can be shown numerically that about 68% of the 
total area (probability) under the Gaussian curve lies between the points μ ± σ. 
In other words, you have a 68% chance of obtaining a value of x that is within 
one standard deviation from the mean μ. We used the word “numerically” above, 
because although the areas under the curves (or the discrete sums) for all of the other 
distributions we’ve dealt with in the chapter can be calculated in closed form, this 
isn’t true for the Gaussian distribution. So when finding the area under the Gaussian 
curve, you always need to specify the numerical endpoints of your interval, and then 
you can use a computer to calculate the area (numerically, to whatever accuracy 
you want). It can likewise be shown that the percentage of the total area that is 
within two standard deviations from μ (that is, between the points μ ± 2σ) is about 
95%. And the percentage within three standard deviations from μ is about 99.7%. 
These percentages are consistent with a visual inspection of the shaded areas in 
Fig. 4.26. The percentages rapidly approach 100%. The percentage within five 
standard deviations from μ is about 99.99994%. 
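
As an example of letting a computer do the numerical work, here is a short sketch that reproduces these percentages. It relies on a standard fact not derived in this book: the area within n standard deviations of the mean equals erf(n/√2), where erf is the error function, which Python’s math module evaluates numerically. The function name fraction_within is our own.

```python
from math import erf, sqrt

def fraction_within(n_sigma):
    """Fraction of the Gaussian area within n_sigma standard deviations of the mean."""
    return erf(n_sigma / sqrt(2))

for n in (1, 2, 3, 5):
    print(f"within {n} sigma: {100 * fraction_within(n):.5f}%")
```

The result does not depend on μ or σ, because the interval is measured in units of σ around μ.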




[Figure 4.26 legend: percentage of total area within σ: 68%; within 2σ: 95%; within 3σ: 99.7%.] 

Figure 4.26: Areas under a Gaussian distribution within σ, 2σ, and 3σ from the mean. 


Remarks: 

1. The Gaussian distribution is a continuous one, because x can take on any value. The 
distribution applies (either exactly or approximately) to a nearly endless list of 
processes with continuous random variables such as length, time, light intensity, affinity 
for butternut squash, etc. 

We’ll find in Sections 5.1 and 5.3 that the Gaussian distribution is a good 
approximation to the binomial and Poisson distributions if the numbers involved are large. 
In these cases, only integer values of x are relevant, so the distribution is effectively 
discrete. You can still draw the continuous curve described by Eq. (4.42), but it is 
relevant only for integer values of x. 

2. We mentioned near the beginning of this section that the value of the prefactor in the 
expressions in Eq. (4.42) makes the total area under the distribution curve be equal to 
1. Problem 4.22 gives a proof of this, but for now we can at least present an argument 
that explains why the prefactor must be proportional to 1/σ (or equivalently, to √b). 
Basically, since the width of the curve is proportional to σ (as we showed above), the 
height must be proportional to 1/σ. This is true because if you increase σ by a factor 
of, say, 10 and thereby stretch the curve by a factor of 10 in the horizontal direction, 
then you also have to squash the curve by a factor of 10 in the vertical direction, if 
you want to keep the area the same. (See the fifth remark on page 206.) A factor of 
1/σ in the prefactor accomplishes this. But note that this reasoning tells us only that 
the prefactor is proportional to 1/σ, and not what the constant of proportionality is. It 
happens to be 1/√(2π). 

3. Two parameters are needed to describe the Gaussian distribution: μ and σ (or μ and 
b). This should be contrasted with the Poisson distribution, where only one parameter, 
a, is needed. Similarly, the exponential distribution depends on only the one parameter 
λ (or τ). In the Poisson case, not only does the width determine the height, but it also 
determines the location of the bump. In contrast, the Gaussian mean μ need not have 
anything to do with σ (or b). 


4.9 Summary 

In this chapter we learned about probability distributions. In particular, we learned: 

• A probability distribution is the collective information about how the total 
probability (which is always 1) is distributed among the various possible 
outcomes of the random variable. 

• A probability distribution for a continuous random variable is given in terms 
of a probability density. To obtain an actual probability, the density must be 
multiplied by an interval of the random variable. More generally, the 
probability equals the area under the density curve. 

We discussed six specific probability distributions: 

• 1. Uniform: (Continuous) The probability density is uniform over a given 
span of random-variable values, and zero otherwise. The uniform distribution 
can be described by two parameters: the mean and the width, or alternatively 
the endpoints of the nonzero region. These two parameters then determine 
the height. 

• 2. Bernoulli: (Discrete) The random variable can take on only two values, 1 
and 0, with probabilities p and 1 - p. An example with p = 1/2 is a coin toss 
with Heads = 1 and Tails = 0. The Bernoulli distribution is described by one 
parameter: p. 

• 3. Binomial: (Discrete) The random variable is the number k of successes 
in a collection of n Bernoulli processes. An example is the total number of 
Heads in n coin tosses. The distribution takes the form, 

$$ P(k) = \binom{n}{k}\, p^k (1-p)^{n-k}. \tag{4.49} $$

The number k of successes must be an integer, of course. The binomial 
distribution is described by two parameters: n and p. 

• 4. Exponential: (Continuous) This is the probability distribution for the waiting 
time t until the next event, for a completely random process. We derived 
this by taking the continuum limit of the analogous discrete result, which was 
the geometric distribution given in Eq. (4.14). The exponential distribution 
takes the form, 

$$ p(t) = \frac{e^{-t/\tau}}{\tau}, \tag{4.50} $$

where τ is the average waiting time. Equivalently, p(t) = λe^{-λt}, where 
λ = 1/τ is the average rate at which the events happen. The exponential 
distribution is described by one parameter: τ (or λ). 

• 5. Poisson: (Discrete) This is the probability distribution for the number of 
events that happen in a given region (of time, space, etc.), for a completely 
random process. We derived this by taking the continuum limit of the 
analogous discrete result, which was simply the binomial distribution. The Poisson 
distribution takes the form, 

$$ P(k) = \frac{a^k e^{-a}}{k!}, \tag{4.51} $$

where a is the expected number of events in the given region. The number 
k of observed events must be an integer, of course. But a need not be. The 
Poisson distribution is described by one parameter: a. 

• 6. Gaussian: (Continuous) This distribution takes the form, 

$$ f(x) = \sqrt{\frac{b}{\pi}}\, e^{-b(x-\mu)^2} \quad\text{or}\quad \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/(2\sigma^2)}. \tag{4.52} $$

σ is the standard deviation of the distribution. About 68% of the probability 
is contained in the range from μ - σ to μ + σ. The width of the distribution 
is proportional to σ (and to 1/√b). The Gaussian distribution is described by 
two parameters: μ and σ (or μ and b). 


4.10 Exercises 

See www.people.fas.harvard.edu/~djmorin/book.html for a supply of problems 
without included solutions. 


4.11 Problems 

Section 4.2: Continuous distributions 

4.1. Fahrenheit and Celsius * 

Fig. 4.4 shows the probability density for the temperature, with the temperature 
measured in Fahrenheit. Draw a reasonably accurate plot of the same 
probability density, but with the temperature measured in Celsius. (The 
conversion from Fahrenheit temperature F to Celsius temperature C is C = 
(5/9)(F - 32). So it takes a ΔF of 9/5 = 1.8 degrees to create a ΔC of 1 
degree.) 

4.2. Expectation of a continuous distribution * (calculus) 

The expectation value of a discrete random variable is given in Eq. (3.4). 
Given a continuous random variable with probability density p(x), explain 
why the expectation value is given by the integral ∫ x p(x) dx. 

Section 4.3: Uniform distribution 

4.3. Variance of the uniform distribution * (calculus) 

Using the general idea from Problem 4.2, find the variance of a uniform 
distribution that extends from x = 0 to x = a. 

Section 4.5: Binomial distribution 

4.4. Expectation of the binomial distribution ** 

Use Eq. (3.4) to explicitly demonstrate that the expectation value of the 
binomial distribution in Eq. (4.6) equals pn. This must be true, of course, because 
a fraction p of the n trials yield success, on average, by the definition of p. 
Hint: The goal is to produce the result of pn, so try to factor a pn out of the 
sum in Eq. (3.4). You will eventually need to use an expression analogous to 
Eq. (4.10). 

4.5. Variance of the binomial distribution *** 

As we saw in Problem 4.4, the expectation value of the binomial distribution 
is μ = pn. Use the technique in either of the solutions to that problem to 
show that the variance of the binomial distribution is np(1 - p) = npq (in 
agreement with Eq. (3.33)). Hint: The form of the variance in Eq. (3.34) 
works best. When finding the expectation value of k² (or really K², where K 
is the random variable whose value is k), it is easiest to find the expectation 
value of k(k - 1) and then add on the expectation value of k. 

4.6. Hypergeometric distribution *** 

(a) A box contains N balls. K of them are red, and the other N - K are 
blue. (K here is just a given number, not a random variable.) If you 
draw n balls without replacement, what is the probability of obtaining 
exactly k red balls? The resulting probability distribution is called the 
hypergeometric distribution. 

(b) In the limit where N and K are very large, explain in words why the 
hypergeometric distribution reduces to the binomial distribution given 
in Eq. (4.6), with p = K/N. Then demonstrate this fact mathematically. 
What exactly is meant by “N and K are very large”? 

Section 4.6: Exponential distribution 

4.7. Expectation of the geometric distribution ** 

Verify that the expectation value of the geometric distribution in Eq. (4.14) 
equals 1/p. (This is the waiting time we found by an easier method in 
Eq. (4.13).) The calculation involves a math trick, so you should do 
Problem 3.1 before solving this one. 

4.8. Properties of the exponential distribution ** (calculus) 

(a) By integrating the exponential distribution in Eq. (4.27) from t = 0 to 
t = ∞, show that the total probability is 1. 

(b) What is the median value of t? That is, for what value t_med are you equally 
likely to obtain a t value larger or smaller than t_med? 

(c) By using the result from Problem 4.2, show that the expectation value 
is τ, as we know it must be. 

(d) Again by using Problem 4.2, find the variance. 


Section 4.7: Poisson distribution 

4.9. Total probability * 

Show that the sum of all of the probabilities in the Poisson distribution given 
in Eq. (4.40) equals 1, as we know it must. Hint: You will need to use Eq. (7.7) 
in Appendix B. 

4.10. Location of the maximum ** 

For what (integer) value of k is the Poisson distribution P(k) maximum? 

4.11. Value of the maximum * 

For large a, what approximately is the height of the bump in the Poisson P(k) 
plot? You will need the result from the previous problem. Hint: You will also 
need to use Stirling’s formula, given in Eq. (2.64) in Section 2.6. 

4.12. Expectation of the Poisson distribution ** 

Use Eq. (3.4) to verify that the expectation value of the Poisson distribution 
equals a. This must be the case, of course, because a is defined to be the 
expected number of events in the given interval. 

4.13. Variance of the Poisson distribution ** 

As we saw in Problem 4.12, the expectation value of the Poisson distribution 
is μ = a. Use the technique in the solution to that problem to show that 
the variance of the Poisson distribution is a (which means that the standard 
deviation is √a). Hint: When finding the expectation value of k², it is easiest 
to find the expectation value of k(k - 1) and then add on the expectation value 
of k. 

4.14. Poisson accuracy ** 

In the “balls in boxes, again” example on page 213, we saw that in the right 
plot in Fig. 4.20, the Poisson distribution is an excellent approximation to the 
exact binomial distribution. But in the left plot, it is only a so-so 
approximation. What parameter(s) determine how good the approximation is? 

To answer this, we’ll define the “goodness” of the approximation to be the 
ratio of the Poisson expression P_P(k) in Eq. (4.40) to the exact binomial 
expression P_B(k) in Eq. (4.32), with both functions evaluated at the expected 
value of k, namely a = pn, which we’ll assume is an integer. (We’re using 
Eq. (4.32) instead of Eq. (4.33), just because it’s easier to work with. The 
expressions are equivalent, with p = 1/b.) The closer the ratio P_P(pn)/P_B(pn) 
is to 1, the better the Poisson approximation is. Calculate this ratio. You will 
need to use Stirling’s formula, given in Eq. (2.64). You may assume that n is 
large (because otherwise there wouldn’t be a need to use the Poisson 
approximation). 

4.15. Bump or no bump * 

In Fig. 4.21, we saw that P(0) = P(1) when a = 1. (This is the cutoff 
between the distribution having or not having a bump.) Explain why this is 
consistent with what we noted about the binomial distribution (namely, that 
P(0) = P(1) when p = 1/(n + 1)) in the example in Section 4.5. 

4.16. Typos * 

A hypothetical writer has an average of one typo per 50 pages of work. (Wishful 
thinking, perhaps!) What is the probability that there are no typos in a 
350-page book? 

4.17. Boxes with zero balls * 

You randomly throw n balls into 1000 boxes and note the number of boxes 
that end up with zero balls in them. If you repeat this process a large number 
of times and observe that the average number of boxes with zero balls is 20, 
what is n? 


4.18. Twice the events ** 


(a) Assume that on average, the events in a random process happen a times, 
where a is large, in a given time interval t. With the notation P_a(k) 
representing the Poisson distribution, use Stirling’s formula (given in 
Eq. (2.64)) to produce an approximate expression for the probability 
P_a(a) that exactly a events happen during the time t. 

(b) Consider the probability that exactly twice the number of events, 2a, 
happen during twice the time, 2t. What is the ratio of this probability to 
P_a(a)? 

(c) Consider the probability that exactly twice the number of events, 2a, 
happen during the same time, t. What is the ratio of this probability to 
P_a(a)? 


4.19. P(0) the hard way *** 

For a Poisson process with a expected events, Eq. (4.40) gives the probability 
of having zero events as 

$$ P(0) = \frac{a^0 e^{-a}}{0!} = e^{-a} = 1 - \left( a - \frac{a^2}{2!} + \frac{a^3}{3!} - \cdots \right), \tag{4.53} $$

where we have used the Taylor series for e^x given in Eq. (7.7). With the 
above grouping of the terms, the sum in parentheses must be the probability 
of having at least one event, because when this is subtracted from 1, we obtain 
the probability of zero events. Explain why this is the case, by accounting 
for the various multiple events that can occur. You will want to look at the 
remark in the solution to Problem 2.3 first. The task here is to carry over that 
reasoning to a continuous Poisson process. 

4.20. Probability of at least 1 ** 

A million balls are thrown at random into a billion boxes. Consider a particular 
one of the boxes. What (approximately) is the probability that at least one 
ball ends up in that box? Solve this by: 

(a) using the Poisson distribution in Eq. (4.40); you will need to use the 
approximation in Eq. (7.9), 

(b) working with probabilities from scratch; you will need to use the 
approximation in Eq. (7.14). 

Note that since the probability you found is very small, it is also approximately 
the probability of obtaining exactly one ball in the given box, because 
multiple events are extremely rare; see the discussion in the first remark in 
Section 4.6.2. 

4.21. Comparing probabilities *** 

(a) A hypothetical 1000-sided die is rolled three times. What is the 
probability that a given number (say, 1) shows up all three times? 

(b) A million balls are thrown at random into a billion boxes. (So from the 
result in Problem 4.20, the probability that exactly one ball ends up in 
a given box is approximately 1/1000.) If this process (of throwing a 
million balls into a billion boxes) is performed three times, what 
(approximately) is the probability that exactly one ball lands in a given box 
all three times? (It can be a different ball each time.) 

(c) A million balls are thrown at random into a billion boxes. This process 
is performed a single time. What (approximately) is the probability that 
exactly three balls end up in a given box? Solve this from scratch by 
using a counting argument. 

(d) Solve part (c) by using the Poisson distribution. 

(e) The setups in parts (b) and (c) might seem basically the same, because 
both setups involve three balls ending up in the given box, and there is 
a 1/b = 1/10^9 probability that any given ball ends up in the given box. 
Give an intuitive explanation for why the answers differ. 


Section 4.8: Gaussian distribution 

4.22. Area under a Gaussian curve ** (calculus) 

Show that the area (from -∞ to ∞) under the Gaussian distribution, f(x) = 
√(b/π) e^{-bx^2}, equals 1. That is, show that the total probability equals 1. (We 
have set μ = 0 for convenience, since μ doesn’t affect the total area.) There is 
a very sneaky way to do this. But since it’s completely out of the blue, we’ll 
give a hint: Calculate the square of the desired integral by multiplying it by 
the integral of √(b/π) e^{-by^2}. Then make use of a change of variables from 
Cartesian to polar coordinates, to convert the Cartesian double integral into a 
polar double integral. 

4.23. Variance of the Gaussian distribution ** (calculus) 

Show that the variance of the second Gaussian expression in Eq. (4.42) equals 
σ² (which means that the standard deviation is σ). You may assume that 
μ = 0 (because μ doesn’t affect the variance), in which case the expression 
for the variance in Eq. (3.19) becomes E(X²). And then by the reasoning 
in Problem 4.2, this expectation value is ∫ x² f(x) dx. So the task of this 
problem is to evaluate this integral. The straightforward method is to use 
integration by parts. 


4.12 Solutions 

4.1. Fahrenheit and Celsius 

A density is always given in terms of “something per something else." In the temper¬ 
ature example in Section 4.2, the “units” of probability density were probability per 
Fahrenheit degree. These units are equivalent to saying that we need to multiply the 
density by a certain number of Fahrenheit degrees (the AT) to obtain a probability; see 
Eq. (4.2). Analogously, we need to multiply a mass density (mass per volume) by a 
volume to obtain a mass. 

If we want to instead write the probability density in terms of probability per Celsius 
degree, we can’t simply use the same function p(T) that appears in Fig. 4.4. Since 
there are 1.8 Fahrenheit degrees in each Celsius degree, the correct plot of p(T) is 
shown in Fig. 4.27. 


Figure 4.27: Expressing Fig. 4.4 in terms of Celsius instead of Fahrenheit. 

This plot differs from Fig. 4.4 in three ways. First, since the peak of the curve in 
Fig. 4.4 was at about 68 degrees Fahrenheit, it is now shifted and located at about 
(5/9) (68 - 32) = 20 degrees Celsius in Fig. 4.27. 

Second, compared with Fig. 4.4, the curve in Fig. 4.27 is contracted by a factor of 
1.8 in the horizontal direction due to the conversion from Fahrenheit to Celsius. The 
span is only about 11 Celsius degrees, compared with a span of about 20 Fahrenheit 
degrees in Fig. 4.4. This follows from the fact that each Celsius degree is worth 1.8 
Fahrenheit degrees. 

Third, since the area under the entire curve in Fig. 4.27 must still be 1, the curve must 
also be expanded by a factor of 1.8 in the vertical direction. So the maximum value is 
about 0.13, compared with the maximum value of about 0.07 in Fig. 4.4. 

Remark: These contraction and expansion countereffects cause the probabilities we
calculate here to be consistent with ones we calculated in Section 4.2. For example,
we found in Eq. (4.3) that the probability of the temperature falling between 70 °F and
71 °F is about 7%. Now, 70 °F and 71 °F correspond to 21.11 °C and 21.67 °C, as
you can show using $C = (5/9)(F - 32)$. So the probability of the temperature falling
between 21.11 °C and 21.67 °C had better also be 7%. It’s the same temperature
interval; we’re just describing it in a different way. And indeed, from the Celsius
plot, the value of the density near 21 °C is about 0.12. Therefore, the probability of
falling between 21.11 °C and 21.67 °C, which equals the density times the interval, is
$(0.12)(21.67 - 21.11) = 0.067 \approx 7\%$, in agreement with the Fahrenheit calculation
(up to the rough readings we made from the plots). If we had forgotten to expand the
height of the curve by the factor of 1.8 in Fig. 4.27, we would have obtained only about
half of this probability, and therefore a different answer to exactly the same question
(asked in a different language). That wouldn’t be good. ♣
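
The consistency check in the remark can be reproduced with a few lines of Python. This is only a sketch: the two density values are the rough plot readings quoted in the text, and the names `f_to_c`, `density_per_f`, and `density_per_c` are introduced here for illustration.

```python
def f_to_c(f):
    """Convert a Fahrenheit temperature to Celsius."""
    return (5 / 9) * (f - 32)

# Rough readings from the two plots (assumed values, as quoted in the text).
density_per_f = 0.07   # probability per Fahrenheit degree, near 70 F
density_per_c = 0.12   # probability per Celsius degree, near 21 C

width_f = 71 - 70                     # interval width in Fahrenheit degrees
width_c = f_to_c(71) - f_to_c(70)     # the same interval in Celsius degrees

prob_f = density_per_f * width_f      # density times interval
prob_c = density_per_c * width_c

print(f"Celsius width: {width_c:.3f}")
print(f"Probability (Fahrenheit): {prob_f:.3f}")
print(f"Probability (Celsius):    {prob_c:.3f}")
```

The two probabilities agree only to within the accuracy of the plot readings, which is all the remark claims.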

4.2. Expectation of a continuous distribution 

For a general probability density $p(x)$, the probability associated with a span $dx$
around a given value of $x$ is $p(x)\,dx$; this is true by the definition of the probability
density. Now, the expectation value of a discrete random variable is given in Eq. (3.4).
To extract from this expression the expectation value of a continuous random variable,
we can imagine dividing up the $x$ axis into a very large number of little intervals $dx$.
The probabilities $p_i$ in Eq. (3.4) get replaced with the various $p(x)\,dx$ probabilities.
And the outcomes $x_i$ in Eq. (3.4) get replaced with the various values of $x$. 

In making these replacements, we're pretending that all of the $x$ values in a tiny
interval $dx$ are equal to the value at, say, the midpoint (call it $x_i$). This $x_i$ then occurs
with probability $p_i = p(x_i)\,dx$. We therefore have a discrete distribution that in the
$dx \to 0$ limit is the same as the original continuous distribution. The discreteness of
our approximate distribution allows us to apply Eq. (3.4) and say that the expectation
value equals

$$\text{Expectation value} = \sum p_i x_i = \sum \big(p(x_i)\,dx\big)\, x_i. \tag{4.54}$$

In the $dx \to 0$ limit, this discrete sum turns into the integral,

$$\text{Expectation value} = \int \big(p(x)\,dx\big)\, x = \int x\,p(x)\,dx, \tag{4.55}$$



as desired. This is the general expression for the expectation value of a continuous
random variable. The limits of the integral are technically $-\infty$ to $\infty$, although it is
often the case that $p(x) = 0$ everywhere except in a finite region. For example, the
density $p(t)$ for the exponential distribution is zero for $t < 0$, and it becomes negligibly
small for $t \gg \tau$, where $\tau$ is the average waiting time. 

The above result generalizes to the expectation value of things other than $x$. For
example, the same reasoning shows that the expectation value of $x^2$ (which is relevant
when calculating the variance) equals $\int x^2 p(x)\,dx$. And the expectation value of $x^7$
(if you ever happened to be interested in such a quantity) equals $\int x^7 p(x)\,dx$. 
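
The discretization argument above can be mimicked numerically. The following Python sketch approximates the integral of g(x) p(x) dx by a sum over many small intervals, each represented by its midpoint. The helper name `expectation`, the choice of the exponential density as the test case, and the value of `tau` are assumptions made for this illustration.

```python
import math

def expectation(g, density, lo, hi, steps=200_000):
    """Midpoint sum approximating the integral of g(x) * density(x) dx."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx          # midpoint of the i-th little interval
        total += g(x) * density(x) * dx  # outcome times probability p(x) dx
    return total

tau = 2.0  # assumed average waiting time for this example

def density(t):
    """Exponential density, zero for t < 0."""
    return math.exp(-t / tau) / tau

# The density is negligible for t >> tau, so a finite upper limit suffices.
mean = expectation(lambda t: t, density, 0.0, 50 * tau)
mean_sq = expectation(lambda t: t**2, density, 0.0, 50 * tau)
print(mean, mean_sq)
```

For the exponential density the two sums should come out close to tau and 2 tau^2, the values derived in the solution to Problem 4.8.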

4.3. Variance of the uniform distribution 

First solution: Since the nonzero part of the distribution has length $a$ on the $x$ axis,
the value of the distribution in that region must be $1/a$, so that the total area is 1. We’ll
use the $E(X^2) - \mu^2$ form of the variance in Eq. (3.34), with $\mu = a/2$ here. Our task is
therefore to calculate $E(X^2)$. From the last comment in the solution to Problem 4.2,
this equals

$$E(X^2) = \int_0^a x^2 p(x)\,dx = \int_0^a x^2 \cdot \frac{1}{a}\,dx = \frac{1}{a} \cdot \frac{a^3}{3} = \frac{a^2}{3}. \tag{4.56}$$

The variance is then

$$\mathrm{Var}(X) = E(X^2) - \mu^2 = \frac{a^2}{3} - \left(\frac{a}{2}\right)^2 = \frac{a^2}{12}. \tag{4.57}$$

The standard deviation is therefore $a/(2\sqrt{3}) \approx (0.29)a$.


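Second solution: Let’s shift the distribution so that it is nonzero from $x = -a/2$ to
$x = a/2$. This shift doesn't affect the variance, which is now simply $E(X^2)$, because
$\mu = 0$. So

$$\mathrm{Var}(X) = E(X^2) = \int_{-a/2}^{a/2} x^2 p(x)\,dx = \frac{1}{a} \cdot \frac{x^3}{3}\bigg|_{-a/2}^{a/2} = \frac{1}{3a}\left(\left(\frac{a}{2}\right)^3 - \left(-\frac{a}{2}\right)^3\right) = \frac{a^2}{12}. \tag{4.58}$$
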


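Third solution: We can use the $E[(X - \mu)^2]$ form of the variance in Eq. (3.19), with
the original $0 < x < a$ span. This gives

$$
\begin{aligned}
\mathrm{Var}(X) = E\big[(X - a/2)^2\big] &= \int_0^a (x - a/2)^2\, p(x)\,dx \\
&= \frac{1}{a} \int_0^a \big(x^2 - ax + a^2/4\big)\,dx \\
&= \frac{1}{a} \left(\frac{a^3}{3} - a \cdot \frac{a^2}{2} + \frac{a^2}{4} \cdot a\right) = \frac{a^2}{12}.
\end{aligned}
\tag{4.59}
$$
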
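
As a sanity check on the three derivations, a simple simulation can estimate the variance directly. This Python sketch draws samples uniformly from 0 to a and compares the sample variance with a^2/12. The value of `a`, the sample size, and the fixed seed are arbitrary choices for the illustration.

```python
import random
import statistics

random.seed(0)   # fixed seed so the run is repeatable
a = 3.0          # assumed width of the uniform distribution

samples = [random.uniform(0, a) for _ in range(200_000)]
sample_var = statistics.pvariance(samples)
sample_std = sample_var ** 0.5

print(f"sample variance: {sample_var:.4f}   a^2/12:      {a**2 / 12:.4f}")
print(f"sample std dev:  {sample_std:.4f}   a/(2*sqrt3): {a / (2 * 3**0.5):.4f}")
```

The agreement is statistical rather than exact, so the sample values fluctuate slightly around the derived ones.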


4.4. Expectation of the binomial distribution 

First solution: The $k = 0$ term doesn't contribute anything to the sum in Eq. (3.4), so
we can start with the $k = 1$ term. The sum goes up to $k = n$. Plugging the probabilities
from Eq. (4.6) into Eq. (3.4), we obtain an expectation value of

$$\sum_{k=1}^{n} k \cdot P(k) = \sum_{k=1}^{n} k \cdot \binom{n}{k} p^k (1-p)^{n-k}. \tag{4.60}$$

If the factor of $k$ weren't on the righthand side, we would know how to evaluate this
sum; see Eq. (4.10). So let's get rid of the $k$ and create a sum that looks like Eq. (4.10).
The steps are the following.

$$
\begin{aligned}
&\sum_{k=1}^{n} k \cdot P(k) \\
&= \sum_{k=1}^{n} k \cdot \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k} && \text{(expanding the binomial coeff.)} \\
&= pn \sum_{k=1}^{n} k \cdot \frac{(n-1)!}{k!\,(n-k)!}\, p^{k-1} (1-p)^{n-k} && \text{(factoring out $pn$)} \\
&= pn \sum_{k=1}^{n} \frac{(n-1)!}{(k-1)!\,(n-k)!}\, p^{k-1} (1-p)^{n-k} && \text{(canceling the $k$)} \\
&= pn \sum_{k=1}^{n} \binom{n-1}{k-1} p^{k-1} (1-p)^{(n-1)-(k-1)} && \text{(rewriting)} \\
&= pn \sum_{j=0}^{n-1} \binom{n-1}{j} p^{j} (1-p)^{(n-1)-j} && \text{(defining $j = k-1$)} \\
&= pn \big(p + (1-p)\big)^{n-1} && \text{(using the binomial expansion)} \\
&= pn \cdot 1,
\end{aligned}
\tag{4.61}
$$


as desired. Note that in the sixth line, the sum over j goes from 0 to n - 1, because the 
sum over k went from 1 to n. 

Even though we know that the expectation value has to be pn (as mentioned in the 
statement of the problem), it’s nice to see that the math does in fact work out. 


Second solution: Here is another (sneaky) way to obtain the expectation value. This
method uses calculus. The binomial expansion tells us that

$$(p+q)^n = \sum_{k=0}^{n} \binom{n}{k} p^k q^{n-k}. \tag{4.62}$$

This relation is identically true for arbitrary values of $p$ (and $q$), so we can take the
derivative with respect to $p$ to obtain another valid relation:

$$n(p+q)^{n-1} = \sum_{k=1}^{n} k \binom{n}{k} p^{k-1} q^{n-k}. \tag{4.63}$$

If we now multiply both sides by $p$ and then set $q$ equal to $1 - p$ (the relation is true
for all values of $q$, in particular this specific one), we obtain

$$np(1)^{n-1} = \sum_{k=1}^{n} k \binom{n}{k} p^{k} (1-p)^{n-k} \implies np = \sum_{k=1}^{n} k \cdot P(k), \tag{4.64}$$


as desired. 

4.5. Variance of the binomial distribution 

First solution: As suggested in the statement of the problem, let’s find the expectation
value of $k(k-1)$. Since we've already done a calculation like this in Problem 4.4,
we won't list out every step here as we did in Eq. (4.61). The $k = 0$ and $k = 1$ terms
don’t contribute anything to the expectation value of $k(k-1)$, so we can start the sum
with the $k = 2$ term. We have (with $j = k - 2$ in the 5th line)

$$
\begin{aligned}
&\sum_{k=2}^{n} k(k-1) \cdot P(k) \\
&= \sum_{k=2}^{n} k(k-1) \cdot \binom{n}{k} p^k (1-p)^{n-k} \\
&= p^2 n(n-1) \sum_{k=2}^{n} \frac{(n-2)!}{(k-2)!\,(n-k)!}\, p^{k-2} (1-p)^{n-k} \\
&= p^2 n(n-1) \sum_{k=2}^{n} \binom{n-2}{k-2} p^{k-2} (1-p)^{(n-2)-(k-2)} \\
&= p^2 n(n-1) \sum_{j=0}^{n-2} \binom{n-2}{j} p^{j} (1-p)^{(n-2)-j} \\
&= p^2 n(n-1) \big(p + (1-p)\big)^{n-2} \\
&= p^2 n(n-1) \cdot 1.
\end{aligned}
\tag{4.65}
$$

The expectation value of $k^2$ equals the expectation value of $k(k-1)$ plus the
expectation value of $k$. The latter is just $pn$, from Problem 4.4. So the expectation value of
$k^2$ is

$$p^2 n(n-1) + pn. \tag{4.66}$$

To obtain the variance, Eq. (3.34) tells us that we need to subtract off $\mu^2 = (pn)^2$ from
this result. The variance is therefore

$$
\begin{aligned}
\big(p^2 n(n-1) + pn\big) - p^2 n^2 &= \big(p^2 n^2 - p^2 n + pn\big) - p^2 n^2 \\
&= pn(1-p) \\
&= npq,
\end{aligned}
\tag{4.67}
$$

as desired. The standard deviation is then $\sqrt{npq}$. 


Second solution: Instead of taking just one derivative, as we did in the second solution
in Problem 4.4, we'll take two derivatives here. Starting with the binomial expansion,

$$(p+q)^n = \sum_{k=0}^{n} \binom{n}{k} p^k q^{n-k}, \tag{4.68}$$

we can take two derivatives with respect to $p$ to obtain

$$n(n-1)(p+q)^{n-2} = \sum_{k=2}^{n} k(k-1) \binom{n}{k} p^{k-2} q^{n-k}. \tag{4.69}$$

If we now multiply both sides by $p^2$ and then set $q$ equal to $1 - p$, we obtain

$$
\begin{aligned}
p^2 n(n-1)(1)^{n-2} &= \sum_{k=2}^{n} k(k-1) \binom{n}{k} p^{k} (1-p)^{n-k} \\
\implies \quad p^2 n(n-1) &= \sum_{k=2}^{n} k(k-1) \cdot P(k).
\end{aligned}
\tag{4.70}
$$

The expectation value of $k(k-1)$ is therefore $p^2 n(n-1)$, in agreement with Eq. (4.65)
in the first solution. The solution proceeds as above. 
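
The results of Problems 4.4 and 4.5 are easy to confirm by brute-force summation. The Python sketch below sums k P(k) and k(k-1) P(k) over all k for one assumed pair of values (n = 20, p = 0.3) and compares the outcome with pn, p^2 n(n-1), and npq. The function name `binomial_pmf` is introduced here for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of k successes in n trials, each with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 20, 0.3   # assumed example values
q = 1 - p

mean = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))
kk1 = sum(k * (k - 1) * binomial_pmf(k, n, p) for k in range(n + 1))

# E(k^2) = E(k(k-1)) + E(k); the variance is E(k^2) minus the squared mean.
variance = (kk1 + mean) - mean**2

print(mean, p * n)
print(kk1, p**2 * n * (n - 1))
print(variance, n * p * q)
```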

4.6. Hypergeometric distribution 

(a) There are $\binom{N}{n}$ possible sets of $n$ balls (drawn without replacement), and all of
these sets are equally likely to be drawn. We therefore simply need to count the
number of sets that have exactly $k$ red balls. There are $\binom{K}{k}$ ways to choose $k$
red balls from the $K$ red balls in the box. And there are $\binom{N-K}{n-k}$ ways to choose
the other $n - k$ balls (which we want to be blue) from the $N - K$ blue balls in
the box. So the number of sets that have exactly $k$ red balls is $\binom{K}{k}\binom{N-K}{n-k}$. The
desired probability of obtaining exactly $k$ red balls when drawing $n$ balls without
replacement is therefore

$$P(k) = \frac{\dbinom{K}{k}\dbinom{N-K}{n-k}}{\dbinom{N}{n}} \qquad \text{(Hypergeometric distribution)} \tag{4.71}$$



Remark: Since the number of red balls, $k$, that you draw can't be larger than
either $K$ or $n$, we see that $P(k)$ is nonzero only if $k \le \min(K, n)$. Likewise, the
number of blue balls, $n - k$, that you draw can’t be larger than either $N - K$ or $n$.
So $P(k)$ is nonzero only if $n - k \le \min(N - K, n) \implies n - \min(N - K, n) \le k$.
Putting these inequalities together, we see that $P(k)$ is nonzero only if

$$n - \min(N - K, n) \le k \le \min(K, n). \tag{4.72}$$

If both $K$ and $N - K$ are larger than $n$, then this reduces to the simple relation,
$0 \le k \le n$. ♣

(b) If N and K are small, then the probabilities of drawing red/blue balls change 
after each draw. This is true because you aren't replacing the balls, so the ratio 
of red and blue balls changes after each draw. 

However, if N and K are large, then the “without replacement” qualifier is
inconsequential. The ratio of red and blue balls remains essentially unchanged
after each draw. Removing one red ball from a set of a million red balls is
hardly noticeable. The probability of drawing a red ball at each stage therefore
remains essentially fixed at the value $K/N$. Likewise, the probability of drawing
a blue ball at each stage remains essentially fixed at the value $(N - K)/N$. If
we define the red-ball probability as $p = K/N$, then the blue-ball probability
is $1 - p$. We therefore have exactly the setup that generates the binomial
distribution, with red corresponding to success, and blue corresponding to failure.
Hence we obtain the binomial distribution in Eq. (4.6). 

Let’s now show mathematically that the hypergeometric distribution in Eq. (4.71)
reduces to the binomial distribution in Eq. (4.6). Expanding the binomial
coefficients in Eq. (4.71) gives

$$P(k) = \frac{\dfrac{K!}{k!\,(K-k)!} \cdot \dfrac{(N-K)!}{(n-k)!\,\big((N-K)-(n-k)\big)!}}{\dfrac{N!}{n!\,(N-n)!}}. \tag{4.73}$$

If $K \gg k$, then we can say that

$$\frac{K!}{(K-k)!} = K(K-1)(K-2) \cdots (K-k+1) \approx K^k. \tag{4.74}$$

This is true because all of the factors here are essentially equal to $K$, in a
multiplicative sense. (The “$\gg$” sign in $K \gg k$ means “much greater than” in a
multiplicative, not additive, sense.) We can make similar approximations to
$(N-K)!/\big((N-K)-(n-k)\big)!$ and $N!/(N-n)!$, so Eq. (4.73) becomes

$$P(k) \approx \frac{\dfrac{K^k}{k!} \cdot \dfrac{(N-K)^{n-k}}{(n-k)!}}{\dfrac{N^n}{n!}} = \frac{n!}{k!\,(n-k)!} \left(\frac{K}{N}\right)^{k} \left(\frac{N-K}{N}\right)^{n-k} = \binom{n}{k} p^k (1-p)^{n-k}, \tag{4.75}$$

where $p = K/N$. This is the desired binomial distribution, which gives the
probability of $k$ successes in $n$ trials, where the probability of success in each
trial takes on the fixed value of $p$. 

We made three approximations in the above calculation, and they relied on the
three assumptions,

$$(1)\ K \gg k, \qquad (2)\ N - K \gg n - k, \qquad (3)\ N \gg n. \tag{4.76}$$

In words, these three assumptions are: (1) the number of red balls you draw is
much smaller than the total number of red balls in the box, (2) the number of
blue balls you draw is much smaller than the total number of blue balls in the
box, and (3) the total number of balls you draw is much smaller than the total
number of balls in the box. (The third assumption follows from the other two.)
These three assumptions are what we mean by “$N$ and $K$ are very large.” 
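
The reduction of the hypergeometric distribution to the binomial distribution can also be seen numerically. In this Python sketch the function names and the specific values of N, K, n, and k are assumptions chosen for the illustration; in both boxes the fraction of red balls is p = 0.4.

```python
from math import comb

def hypergeometric_pmf(k, N, K, n):
    """P(k red) when drawing n balls without replacement from N balls, K of them red."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def binomial_pmf(k, n, p):
    """P(k successes) in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, k = 10, 3

# Small box: the composition changes noticeably after each draw.
small = hypergeometric_pmf(k, N=20, K=8, n=n)

# Large box: K >> k, N - K >> n - k, and N >> n, so replacement barely matters.
large = hypergeometric_pmf(k, N=2_000_000, K=800_000, n=n)

limit = binomial_pmf(k, n, p=0.4)
print(small, large, limit)

# The hypergeometric probabilities sum to 1 over the allowed values of k.
total = sum(hypergeometric_pmf(j, N=20, K=8, n=n) for j in range(n + 1))
print(total)
```

`math.comb` returns zero when asked to choose more items than are available, so values of k outside the range in Eq. (4.72) contribute nothing to the sum.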

4.7. Expectation of the geometric distribution 

From Eq. (4.14), the probability that we need to wait just one iteration for the next
success is $p$. For two iterations it is $(1-p)p$, for three iterations it is $(1-p)^2 p$, and
so on. The expectation value of the number of iterations (that is, the waiting time) is
therefore

$$1 \cdot p + 2 \cdot (1-p)p + 3 \cdot (1-p)^2 p + 4 \cdot (1-p)^3 p + \cdots. \tag{4.77}$$

To calculate this sum, we’ll use the trick we introduced in Problem 3.1 and write the
sum as a geometric series starting with $p$, plus another geometric series starting with
$(1-p)p$, and so on. And we’ll use the fact that the sum of a geometric series with first
term $a$ and ratio $r$ is $a/(1-r)$. The expectation value in Eq. (4.77) then becomes

$$
\begin{aligned}
p + (1-p)p + (1-p)^2 p + (1-p)^3 p + \cdots \\
(1-p)p + (1-p)^2 p + (1-p)^3 p + \cdots \\
(1-p)^2 p + (1-p)^3 p + \cdots \\
(1-p)^3 p + \cdots
\end{aligned}
\tag{4.78}
$$

This has the correct number of each type of term. For example, the $(1-p)^2 p$ term
appears three times. The first line above is a geometric series that sums to $a/(1-r) =
p/\big(1-(1-p)\big) = 1$. The second line is also a geometric series, and it sums to
$(1-p)p/\big(1-(1-p)\big) = 1-p$. Likewise the third line sums to $(1-p)^2 p/\big(1-(1-p)\big) =
(1-p)^2$. And so on. The sum of the infinite number of lines in Eq. (4.78) therefore
equals

$$1 + (1-p) + (1-p)^2 + (1-p)^3 + \cdots. \tag{4.79}$$

But this itself is a geometric series, and it sums to $a/(1-r) = 1/\big(1-(1-p)\big) = 1/p$,
as desired. 
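
A direct partial sum of Eq. (4.77) gives the same answer. This Python sketch is only an illustration; the function name, the number of terms kept, and the sample values of p are assumptions.

```python
def geometric_expectation(p, terms=10_000):
    """Partial sum of k * (1 - p)**(k - 1) * p for k = 1, ..., terms."""
    return sum(k * (1 - p) ** (k - 1) * p for k in range(1, terms + 1))

for p in (0.5, 0.2, 0.05):
    # The partial sum should be very close to the exact value 1/p.
    print(p, geometric_expectation(p), 1 / p)
```

The terms fall off geometrically, so the neglected tail beyond 10,000 terms is far below floating-point precision for these values of p.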

4.8. Properties of the exponential distribution 


(a) The total probability equals the total area under the distribution curve. And this
area is given by the integral of the distribution. The integral of $e^{-t/\tau}/\tau$ equals
$-e^{-t/\tau}$, as you can verify by taking the derivative (and using the chain rule).
The desired integral is therefore

$$\int_0^\infty \frac{e^{-t/\tau}}{\tau}\,dt = -e^{-t/\tau}\Big|_0^\infty = -e^{-\infty} - \big(-e^{-0}\big) = 1, \tag{4.80}$$

as desired.



(b) This is very similar to part (a), except that we now want the probability from 0
to $t_{\rm med}$ to equal $1/2$. That is,

$$\frac{1}{2} = \int_0^{t_{\rm med}} \frac{e^{-t/\tau}}{\tau}\,dt = -e^{-t/\tau}\Big|_0^{t_{\rm med}} = -e^{-t_{\rm med}/\tau} - \big(-e^{-0}\big). \tag{4.81}$$

This yields $e^{-t_{\rm med}/\tau} = 1/2$. Taking the natural log of both sides then gives

$$-t_{\rm med}/\tau = -\ln 2 \implies t_{\rm med} = (\ln 2)\tau \approx (0.7)\tau. \tag{4.82}$$

So the median value of $t$ is $(0.7)\tau$. In other words, $(0.7)\tau$ is the value of $t$ for
which the two shaded areas in Fig. 4.28 are equal. 


Figure 4.28: The areas on either side of the median are equal. 


Note that the median value of $t$, namely $(0.7)\tau$, is smaller than the mean value
(the expectation value) of $t$, namely $\tau$. The reason for this is that the exponential
distribution has a tail that extends to large values of $t$. These values of $t$ drag the
mean to the right, more so than the small values of $t$ near zero drag it to the left
(because the former are generally farther from $t_{\rm med}$ than the latter). Whenever
you have an asymmetric distribution like this, the mean always lies on the “tail
side” of the median. 

(c) In the specific case of the exponential distribution, Eq. (4.55) in the solution to
Problem 4.2 gives

$$\text{Expectation value} = \int_0^\infty t \cdot \frac{e^{-t/\tau}}{\tau}\,dt. \tag{4.83}$$

You can evaluate this integral by performing “integration by parts,” or you can
just look it up. It turns out to be

$$
\begin{aligned}
\text{Expectation value} &= -(t + \tau)e^{-t/\tau}\Big|_0^\infty \\
&= -(\infty + \tau)e^{-\infty} + (0 + \tau)e^{-0} \\
&= 0 + \tau,
\end{aligned}
\tag{4.84}
$$

as desired. In the first term here, we have used the fact that the smallness of
$e^{-\infty}$ wins out over the largeness of $\infty$. You can check this on your calculator by
replacing $\infty$ with, say, 100. 

(d) Let’s use $T$ to denote the random variable whose value is $t$. Since the mean of the
exponential distribution is $\tau$, Eq. (3.34) tells us that the variance is $E(T^2) - \tau^2$.
So we need to find $E(T^2)$. Eq. (4.55) gives

$$E(T^2) = \int_0^\infty t^2 \cdot \frac{e^{-t/\tau}}{\tau}\,dt. \tag{4.85}$$

Evaluating this by integration by parts is rather messy, so let’s just look up the
integral. It turns out to be

$$
\begin{aligned}
E(T^2) &= -\big(t^2 + 2\tau t + 2\tau^2\big)e^{-t/\tau}\Big|_0^\infty \\
&= -0 + \big(0 + 0 + 2\tau^2\big)e^{-0} \\
&= 2\tau^2.
\end{aligned}
\tag{4.86}
$$

As in part (c), we have used the fact that the smallness of $e^{-\infty}$ makes the term
associated with the upper limit of integration be zero. The variance is therefore

$$\mathrm{Var}(T) = E(T^2) - \tau^2 = 2\tau^2 - \tau^2 = \tau^2. \tag{4.87}$$

The standard deviation is the square root of the variance, so it is simply $\tau$, which
interestingly is the same as the mean. As with all other quantities associated with
the exponential distribution, the variance and standard deviation depend only on
$\tau$, because that is the only parameter that appears in the distribution. 
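
The four results of this problem (total probability 1, median (ln 2)τ, mean τ, and variance τ²) can be checked by simulation. The Python sketch below draws exponentially distributed waiting times with the standard library; the value of `tau`, the sample size, and the fixed seed are assumptions for the illustration.

```python
import math
import random
import statistics

random.seed(1)   # fixed seed so the run is repeatable
tau = 3.0        # assumed average waiting time

# random.expovariate takes the rate, which is 1/tau.
samples = [random.expovariate(1 / tau) for _ in range(200_000)]

sample_mean = statistics.fmean(samples)
sample_median = statistics.median(samples)
sample_var = statistics.pvariance(samples)

print(f"mean:     {sample_mean:.3f}   expected {tau:.3f}")
print(f"median:   {sample_median:.3f}   expected {math.log(2) * tau:.3f}")
print(f"variance: {sample_var:.3f}   expected {tau**2:.3f}")
```

The sample median lands visibly below the sample mean, in line with the comment about the mean lying on the tail side of the median.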


4.9. Total probability 

The sum over $k$ ranges from 0 to $\infty$. The upper limit is $\infty$ because with continuous time
(or space, or whatever), theoretically an arbitrarily large number of events can occur
in a given time interval (although if $k$ is much larger than $a$, then $P(k)$ is negligibly
small). We have (invoking Eq. (7.7) from Appendix B to obtain the third line)

$$
\begin{aligned}
\sum_{k=0}^{\infty} P(k) &= \sum_{k=0}^{\infty} \frac{a^k e^{-a}}{k!} \\
&= e^{-a} \sum_{k=0}^{\infty} \frac{a^k}{k!} \\
&= e^{-a} e^{a} \\
&= 1,
\end{aligned}
\tag{4.88}
$$

as desired. You are encouraged to look at the derivation of Eq. (7.7) in Appendix B. 

4.10. Location of the maximum 


First solution: In this solution we’ll use the fact that the expression for $P(k)$ in
Eq. (4.40) is actually valid for all positive values of $k$, not just integers (even though
we're really only concerned with integers). This is due to the fact that it is possible
to extend the meaning of $k!$ to non-integers. We can therefore treat $P(k)$ as a
continuous distribution. The maximum value of this distribution might not occur at an
integer value of $k$, but we’ll be able to extract the appropriate value of $k$ that yields
the maximum when $k$ is restricted to integers.

A convenient way to narrow down the location of the maximum of $P(k)$ is to set
$P(k) = P(k+1)$. (In calculus, this is equivalent to finding the maximum by setting the
first derivative equal to zero.) This will tell us roughly where the maximum is, because
this relation can hold only if $k$ and $k + 1$ are on opposite sides of the maximum. This is
true because the relation $P(k) = P(k+1)$ can't be valid on the right side of the curve’s
peak, because the curve is decreasing there, so all those points have $P(k) > P(k+1)$.
Similarly, all the points on the left side of the peak have $P(k) < P(k+1)$. The only
remaining possibility is that $k$ is on the left side and $k + 1$ is on the right side. That is,
they are on opposite sides of the maximum. 


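Setting $P(k) = P(k+1)$ gives (after canceling many common factors to obtain the
second line)

$$
\begin{aligned}
P(k) = P(k+1) \implies \frac{a^k e^{-a}}{k!} &= \frac{a^{k+1} e^{-a}}{(k+1)!} \\
\implies \frac{1}{1} &= \frac{a}{k+1} \\
\implies k + 1 &= a \\
\implies k &= a - 1.
\end{aligned}
\tag{4.89}
$$
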


The two relevant points on either side of the maximum, namely $k$ and $k + 1$, are
therefore $a - 1$ and $a$. So the maximum of the $P(k)$ plot (extended to non-integers)
lies between $a - 1$ and $a$. Since we’re actually concerned only with integer values of $k$,
the maximum is located at the integer that lies between $a - 1$ and $a$ (or at both of these
values if $a$ is an integer). In situations where $a$ is large (which is often the case), the
distinction between $a - 1$ and $a$ (or somewhere in between) isn't all that important, so
we generally just say that the maximum of the probability distribution occurs roughly
at $a$. 


Second solution: We can avoid any issues about extending the Poisson distribution
to non-integer values of $k$, by simply finding the integer value of $k$ for which both
$P(k) \ge P(k+1)$ and $P(k) \ge P(k-1)$ hold. $P(k)$ is then the maximum, because it is
at least as large as the two adjacent $P(k \pm 1)$ values.

By changing the “$=$” sign in Eq. (4.89) to a “$\ge$” sign, we immediately see that
$P(k) \ge P(k+1)$ implies $k \ge a - 1$. And by slightly modifying Eq. (4.89), you can show that
$P(k) \ge P(k-1)$ implies $a \ge k$. Combining these two results, we see that the integer
value of $k$ that yields the maximum $P(k)$ satisfies $a - 1 \le k \le a$. The desired value
of $k$ is therefore the integer that lies between $a - 1$ and $a$ (or at both of these values if
$a$ is an integer), as we found in the first solution. 
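
A brute-force search over integer k confirms the location of the maximum. In this Python sketch the function names, the search cutoff `k_max`, and the sample values of a are assumptions for the illustration.

```python
from math import exp, factorial

def poisson_pmf(k, a):
    """Poisson probability of k events when the expected number is a."""
    return a**k * exp(-a) / factorial(k)

def poisson_mode(a, k_max=60):
    """Integer k in 0..k_max with the largest P(k), found by direct comparison."""
    return max(range(k_max + 1), key=lambda k: poisson_pmf(k, a))

for a in (0.5, 3.7, 10, 25.2):
    k_best = poisson_mode(a)
    # The maximizing integer should satisfy a - 1 <= k <= a.
    print(a, k_best, a - 1 <= k_best <= a)
```

When a is an integer, P(a - 1) and P(a) are equal in exact arithmetic, so floating-point rounding decides which of the two the search reports.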

4.11. Value of the maximum 

Since we know from the previous problem that the maximum of $P(k)$ occurs
essentially at $k = a$, our goal is to find $P(a)$. Stirling’s formula allows us to make a quick
approximation to this value. Plugging $k = a$ into Eq. (4.40) and using Stirling’s
formula, $n! \approx n^n e^{-n} \sqrt{2\pi n}$, yields

$$P(a) = \frac{a^a e^{-a}}{a!} \approx \frac{a^a e^{-a}}{a^a e^{-a} \sqrt{2\pi a}} = \frac{1}{\sqrt{2\pi a}}. \tag{4.90}$$

We see that the height is proportional to $1/\sqrt{a}$. So if $a$ goes up by a factor of, say, 4,
then the height goes down by a factor of 2.

It is easy to make quick estimates using this result. Consider the $a = 10$ plot in
Fig. 4.21. The maximum is between 0.1 and 0.2, a little closer to 0.1. Let’s say 0.13.
And indeed, if $a = 10$ (for which Stirling’s formula is quite accurate, from Table 2.6),
Eq. (4.90) gives

$$P(10) \approx \frac{1}{\sqrt{2\pi(10)}} \approx 0.126. \tag{4.91}$$

This is very close to the exact value of $P(10)$, which you can show is about 0.125.
Since $a$ is an integer here (namely, 10), Problem 4.10 tells us that $P(9)$ takes on this
same value. 
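
The comparison between the exact peak height and the Stirling estimate for a = 10 takes only a few lines of Python. The function name `poisson_pmf` is introduced here for illustration.

```python
from math import exp, factorial, pi, sqrt

def poisson_pmf(k, a):
    """Poisson probability of k events when the expected number is a."""
    return a**k * exp(-a) / factorial(k)

a = 10
exact = poisson_pmf(a, a)           # exact P(10)
stirling = 1 / sqrt(2 * pi * a)     # estimate from Eq. (4.90)
neighbor = poisson_pmf(a - 1, a)    # P(9), which should match P(10)

print(f"exact P(10):       {exact:.4f}")
print(f"Stirling estimate: {stirling:.4f}")
print(f"exact P(9):        {neighbor:.4f}")
```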

4.12. Expectation of the Poisson distribution 

The expectation value is the sum of $k \cdot P(k)$, from $k = 0$ to $k = \infty$. However, the
$k = 0$ term contributes nothing, so we can start the sum with the $k = 1$ term. Using
Eq. (4.40), the expectation value is therefore

$$
\begin{aligned}
\sum_{k=1}^{\infty} k \cdot P(k) &= \sum_{k=1}^{\infty} k \cdot \frac{a^k e^{-a}}{k!} \\
&= \sum_{k=1}^{\infty} \frac{a^k e^{-a}}{(k-1)!} && \text{(canceling the $k$)} \\
&= a \cdot \sum_{k=1}^{\infty} \frac{a^{k-1} e^{-a}}{(k-1)!} && \text{(factoring out an $a$)} \\
&= a \cdot \sum_{j=0}^{\infty} \frac{a^{j} e^{-a}}{j!} && \text{(defining $j = k-1$)} \\
&= a \cdot \sum_{j=0}^{\infty} P(j) && \text{(using Eq. (4.40))} \\
&= a \cdot 1, && \text{(total probability is 1)}
\end{aligned}
\tag{4.92}
$$

as desired. In the fourth line, we used the fact that since $j = k - 1$, the sum over $j$
starts with the $j = 0$ term, because the sum over $k$ started with the $k = 1$ term. If you
want to show explicitly that the total probability is 1, that was the task of Problem 4.9. 

4.13. Variance of the Poisson distribution 

As suggested in the statement of the problem, let’s find the expectation value of
$k(k-1)$. Since we’ve already done a calculation like this in Problem 4.12, we won’t list out
every step here as we did in Eq. (4.92). The $k = 0$ and $k = 1$ terms don’t contribute
anything to the expectation value of $k(k-1)$, so we can start the sum with the $k = 2$
term. We have (with $j = k - 2$ in the 3rd line)

$$
\begin{aligned}
\sum_{k=2}^{\infty} k(k-1) \cdot P(k) &= \sum_{k=2}^{\infty} k(k-1) \cdot \frac{a^k e^{-a}}{k!} \\
&= a^2 \cdot \sum_{k=2}^{\infty} \frac{a^{k-2} e^{-a}}{(k-2)!} \\
&= a^2 \cdot \sum_{j=0}^{\infty} \frac{a^{j} e^{-a}}{j!} \\
&= a^2 \cdot \sum_{j=0}^{\infty} P(j) \\
&= a^2 \cdot 1.
\end{aligned}
\tag{4.93}
$$

The expectation value of $k^2$ equals the expectation value of $k(k-1)$ plus the
expectation value of $k$. The latter is just $a$, from Problem 4.12. So the expectation value of
$k^2$ is $a^2 + a$. To obtain the variance, Eq. (3.34) tells us that we need to subtract off
$\mu^2 = a^2$ from this result. The variance is therefore

$$(a^2 + a) - a^2 = a, \tag{4.94}$$

as desired. The standard deviation is then $\sqrt{a}$. 

We will show in Section 5.3 that the standard deviation of the Poisson distribution
equals $\sqrt{a}$ when $a$ is large (when the Poisson looks like a Gaussian). But in this
problem we demonstrated the stronger result that the standard deviation of the Poisson
distribution equals $\sqrt{a}$ for any value of $a$, even a small one (when the Poisson doesn’t
look like a Gaussian).


Remark: We saw in Problem 4.11 that for large $a$, the height of the bump in the
Poisson $P(k)$ plot is $1/\sqrt{2\pi a}$, which is proportional to $1/\sqrt{a}$. The present $\sigma = \sqrt{a}$
result is consistent with this, because we know that the total probability must be 1.
For large $a$, the $P(k)$ plot is essentially a continuous curve, so we need the total area
under the curve to equal 1. A rough measure of the width of the bump is $2\sigma$. The area
under the curve equals (roughly) this width times the height. The product of $2\sigma$ and
the height must therefore be of order 1. And this is indeed the case, because $\sigma = \sqrt{a}$
implies that $(2\sigma)\big(1/\sqrt{2\pi a}\big) = \sqrt{2/\pi}$, which is of order 1. This order-of-magnitude
argument doesn’t tell us anything about specific numerical factors, but it does tell us
that the height and standard deviation must have inverse dependences on $a$. ♣
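
The Poisson results of Problems 4.9, 4.12, and 4.13 (total probability 1, mean a, variance a) can be verified with truncated sums. In this Python sketch the value of `a` and the cutoff at k = 100 are assumptions for the illustration; the neglected terms are negligibly small for this a.

```python
from math import exp, factorial

def poisson_pmf(k, a):
    """Poisson probability of k events when the expected number is a."""
    return a**k * exp(-a) / factorial(k)

a = 4.5           # assumed example value
ks = range(100)   # terms with k >= 100 are negligible for this a

total = sum(poisson_pmf(k, a) for k in ks)
mean = sum(k * poisson_pmf(k, a) for k in ks)
kk1 = sum(k * (k - 1) * poisson_pmf(k, a) for k in ks)

# E(k^2) = E(k(k-1)) + E(k); the variance is E(k^2) minus the squared mean.
variance = (kk1 + mean) - mean**2

print(total, mean, kk1, variance)
```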


4.14. Poisson accuracy

Replacing $a$ with $pn$ in the Poisson distribution in Eq. (4.40), and setting $k = pn$ as
instructed, gives

$$P_{\rm P}(pn) = \frac{(pn)^{pn} e^{-pn}}{(pn)!}. \tag{4.95}$$

Similarly, setting $k = pn$ in the exact binomial expression in Eq. (4.32) gives

$$P_{\rm B}(pn) = \binom{n}{pn} p^{pn} (1-p)^{n-pn} = \frac{n!}{(pn)!\,(n-pn)!}\, p^{pn} (1-p)^{n-pn}. \tag{4.96}$$

The $(pn)!$ term here matches up with the $(pn)!$ term in $P_{\rm P}(pn)$, so it will cancel in
the ratio $P_{\rm P}(pn)/P_{\rm B}(pn)$. Let’s apply Stirling’s formula, $m! \approx m^m e^{-m} \sqrt{2\pi m}$, to the
other two factorials in $P_{\rm B}(pn)$. Since $n - pn = n(1-p)$, we obtain (we’ll do the
simplification gradually here)

$$
\begin{aligned}
P_{\rm B}(pn) &\approx \frac{n^n e^{-n} \sqrt{2\pi n}}{(pn)! \cdot \big(n(1-p)\big)^{n(1-p)} e^{-n(1-p)} \sqrt{2\pi n(1-p)}} \cdot p^{pn} (1-p)^{n(1-p)} \\
&= \frac{n^n e^{-n}}{(pn)! \cdot n^{n(1-p)} e^{-n(1-p)} \sqrt{1-p}} \cdot p^{pn} \\
&= \frac{1}{(pn)! \cdot n^{-pn} e^{pn} \sqrt{1-p}} \cdot p^{pn} \\
&= \frac{1}{\sqrt{1-p}} \cdot \frac{(pn)^{pn} e^{-pn}}{(pn)!}.
\end{aligned}
\tag{4.97}
$$



This result fortuitously takes the same form as the $P_{\rm P}(pn)$ expression in Eq. (4.95),
except for the factor of $1/\sqrt{1-p}$ out front. The desired ratio is therefore simply

$$\frac{P_{\rm P}(pn)}{P_{\rm B}(pn)} = \sqrt{1-p}. \tag{4.98}$$



This is the factor by which the peak of the Poisson plot is smaller than the peak of the 
(exact) binomial plot. 

In the two plots in Fig. 4.20, the $p$ values are 1/10 and 1/100, so the $\sqrt{1-p}$ ratios
are $\sqrt{0.9} \approx 0.95$ and $\sqrt{0.99} \approx 0.995$. These correspond to percentage differences of
5% and 0.5%, or equivalently to fractional differences of 1/20 and 1/200. These are
consistent with a visual inspection of the two plots; the 0.5% difference is too small to
see in the second plot.

With the above $\sqrt{1-p}$ result, we can say that the Poisson approximation is a good
one if $\sqrt{1-p}$ is close to 1, or equivalently if $p$ is much smaller than 1. How much
smaller? That depends on how good an approximation you want. If you want accuracy
to 1%, then $p = 1/100$ works, but $p = 1/10$ doesn't. 


Remarks: 

1. A helpful mathematical relation that is valid for small $p$ is $\sqrt{1-p} \approx 1 - p/2$.
(You can check this by plugging a small number like $p = 0.01$ into your
calculator. Or you can square both sides to obtain $1 - p \approx 1 - p + p^2/4$, which is
correct up to the quadratic $p^2/4$ difference, which is very small if $p$ is small.)
With this relation, our $\sqrt{1-p}$ result becomes $1 - p/2$. The difference between
this result and 1 is therefore $p/2$. This makes it clear why we ended up with the
above ratios of 0.95 and 0.995 for $p = 1/10$ and $p = 1/100$.

2. Note that our “goodness” condition for the Poisson approximation involves only
$p$. That is, it is independent of $n$. This isn’t terribly obvious. For a given value
of $p$ (say, $p = 1/100$), we will obtain the same accuracy whether $n$ is, say, $10^3$
or $10^5$. Of course, the $a = pn$ expected values in these two cases are different
(10 and 1000). But the ratio of $P_{\rm P}(pn)$ to $P_{\rm B}(pn)$ is the same (at least in the
Stirling approximation).

3. In the language of balls and boxes, since $p = 1/b$, the $p \ll 1$ condition is
equivalent to saying that the number of boxes satisfies $b \gg 1$. So the more
boxes there are, the better the approximation. This condition is independent of
the number $n$ of balls (as long as $n$ is large).

4. The result in Eq. (4.98) is valid even if the expected number of events $pn$ is
small, for example, 1 or 2. This is true because the $(pn)!$ terms cancel in the ratio
of Eqs. (4.95) and (4.96), so we don't need to worry about applying Stirling’s
formula to a small number. The other two factorials, $n!$ and $(n - pn)!$, are large
because we are assuming that $n$ is large. ♣
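
The ratio in Eq. (4.98) can be checked against the exact distributions. The Python sketch below works with logarithms (via `math.lgamma`) to avoid overflow for large n; the function names and the three (n, p) pairs are assumptions for the illustration.

```python
from math import exp, lgamma, log, sqrt

def log_binomial_pmf(k, n, p):
    """Natural log of the exact binomial probability of k successes in n trials."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

def log_poisson_pmf(k, a):
    """Natural log of the Poisson probability of k events with expected number a."""
    return k * log(a) - a - lgamma(k + 1)

def peak_ratio(n, p):
    """Ratio of the Poisson value to the binomial value, both evaluated at k = pn."""
    k = round(p * n)
    return exp(log_poisson_pmf(k, p * n) - log_binomial_pmf(k, n, p))

for n, p in ((1000, 0.1), (1000, 0.01), (100_000, 0.01)):
    print(n, p, peak_ratio(n, p), sqrt(1 - p))
```

The last two cases share the same p but differ in n by a factor of 100, and the ratio is essentially unchanged, as the second remark states.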

4.15. Bump or no bump 

We saw in Section 4.7.2 that the Poisson distribution is obtained by taking the $n \to \infty$
and $p \to 0$ limits of the binomial distribution ($p$ took the form of $\lambda\epsilon$ in the derivation
in Section 4.7.2). But in the $n \to \infty$ limit, the $p = 1/(n+1)$ condition for $P(0) = P(1)$
in the binomial case becomes $p \approx 1/n$. So $pn \approx 1$. But $pn$ is just the average number
of events $a$ in the Poisson distribution. So $a \approx 1$ is the condition for $P(0) = P(1)$ in
the Poisson case, as desired. 

4.16. Typos 

First solution: Under the assumption that the typos occur randomly, the given setup
calls for the Poisson distribution. If the expected number of typos in 50 pages is
one, then the expected number of typos in a 350-page book is $a = 350/50 = 7$. So
Eq. (4.40) gives the probability of zero typos in the book as

$$P(0) = \frac{a^0 e^{-a}}{0!} = e^{-7} \approx 9 \cdot 10^{-4} \approx 0.1\%. \tag{4.99}$$

Second solution: (This is an approximate solution.) If there is one typo per 50 pages, 
then the expected number of typos per page is 1/50. This implies that the probability 
that a given page has at least one typo is approximately 2%, which means that the 
probability that a given page has zero typos is approximately 98%. We are using the 
word “approximately” here, because the probability of zero typos on a given page 
must in fact be slightly larger than 98%. This is true because if it were exactly 98%, 
then in the 2% of the pages where a typo occurs, there might actually be two (or 
three, etc.) typos. Although these occurrences are rare in the present setup, they will 
nevertheless cause the expected number of typos per page to be (slightly) larger than 
1/50, in contradiction to the given assumption. The actual probability of having zero 
typos per page must therefore be slightly larger than 98%, so that slightly fewer than 
2% of the pages have at least one typo. 

However, if we work in the (reasonable) approximation where the probability of
having zero typos per page equals 0.98, then the probability of having zero typos in 350
pages equals $(0.98)^{350} \approx 8.5 \cdot 10^{-4}$. This is close to the correct probability of $9 \cdot 10^{-4}$
in Eq. (4.99). Replacing 0.98 with a slightly larger number would yield the correct
probability of $9 \cdot 10^{-4}$. 

Remarks: What should the probability of 0.98 (for zero typos on a given page) be
increased to, if we want to obtain the correct probability of $9 \cdot 10^{-4}$ (for zero typos in
the book)? Since the expected number of typos per page is 1/50, we simply need to
plug $a = 1/50$ into the Poisson expression for $P(0)$. This gives the true probability of
having zero typos on a given page as

$$P(0) = \frac{a^0 e^{-a}}{0!} = e^{-a} = e^{-1/50} = 0.9802. \tag{4.100}$$

As expected, this is only a tiny bit larger than the approximate value of 0.98 that we
used above. If we use the new (and correct) value of 0.9802, the result of our second
solution is modified to $(0.9802)^{350} = 9 \cdot 10^{-4}$, which agrees with the correct answer
in Eq. (4.99).

The relation between the (approximate) second solution and the (correct) first solution
can be seen by writing our approximate answer of $(0.98)^{350}$ as

$$(0.98)^{350} = \left(1 - \frac{1}{50}\right)^{350} = \left(\left(1 - \frac{1}{50}\right)^{50}\right)^{7} \approx (e^{-1})^7 = e^{-7}, \tag{4.101}$$

which is the correct answer in Eq. (4.99). We have used the approximation in Eq. (7.4)
to produce the $e^{-1}$ term here. ♣
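
If you would like to check the numbers in this solution on a computer, the following short Python sketch reproduces them. The variable names are illustrative choices, not notation from the text; the only inputs are the 350 pages and the rate of one typo per 50 pages.

```python
import math

pages = 350
typos_per_page = 1 / 50
a = pages * typos_per_page            # expected typos in the whole book (7)

exact = math.exp(-a)                  # Poisson P(0) for the book, Eq. (4.99)
approx = 0.98 ** pages                # second solution with the rounded 98%
page_p0 = math.exp(-typos_per_page)   # true per-page P(0), Eq. (4.100)
corrected = page_p0 ** pages          # agrees with exp(-a) up to rounding

print(f"{exact:.3e} {approx:.3e} {page_p0:.4f} {corrected:.3e}")
```

The last two quantities make the point of the remark concrete: raising the true per-page probability to the 350th power gives back the Poisson answer, because $(e^{-1/50})^{350} = e^{-7}$.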

4.17. Boxes with zero balls 

First solution: The given information that 20 out of the 1000 boxes contain zero 
balls (on average) tells us that the probability that a given box contains zero balls is 
$P(0) = 20/1000 = 0.02$. The process at hand is approximately a Poisson process, just
as the balls-in-boxes setup in the example on page 213 was. We therefore simply need
to find the value of $a$ in Eq. (4.40) that makes $P(0) = 0.02$. That is,

$$\frac{a^0 e^{-a}}{0!} = 0.02 \;\Longrightarrow\; e^{a} = 50 \;\Longrightarrow\; a = \ln 50 = 3.912. \tag{4.102}$$

This a is the average number of balls in each of the 1000 boxes. The total number of 
balls in each trial is therefore n = (1000)a = 3912. 

Note that once we know what $a$ is, we can determine the number of boxes that contain
other numbers of balls. For example, $P(3) \approx (3.9)^3 e^{-3.9}/3! \approx 0.20$. So about 200
of the 1000 boxes end up with three balls, on average. $P(4)$ is about the same (a hair
smaller). About 4.5 boxes (on average) end up with 10 balls, as you can show.

Second solution: We can solve the problem from scratch, without using the Poisson
distribution. With $k = 0$, Eq. (4.33) tells us that the probability of obtaining zero balls
in a given box is $P(0) = (1 - 1/1000)^n$. Setting this equal to 0.02 and using the
approximation in Eq. (7.14) gives

$$(1 - 1/1000)^n = 0.02 \;\Longrightarrow\; e^{-n/1000} = 0.02 \;\Longrightarrow\; e^{n/1000} = 50 \;\Longrightarrow\; n/1000 = \ln 50 \;\Longrightarrow\; n = 3912. \tag{4.103}$$

Alternatively, we can solve for $n$ exactly, without using the approximation in Eq. (7.14).
We want to find the $n$ for which $(999/1000)^n = 0.02$. Taking the log of both sides
gives

$$n \ln(0.999) = \ln(0.02) \;\Longrightarrow\; n = \frac{-3.912}{-1.0005 \cdot 10^{-3}} = 3910. \tag{4.104}$$

Our approximate answer of $n = 3912$ was therefore off by only 2, or equivalently
0.05%.
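
The two solutions can be compared numerically with a few lines of Python. This is only a sketch: the helper `poisson` and the variable names are introduced here for illustration and are not part of the text.

```python
import math

boxes = 1000
p0 = 20 / boxes                                   # given P(0) = 0.02

a = -math.log(p0)                                 # Poisson: e^{-a} = p0, Eq. (4.102)
n_poisson = boxes * a                             # first solution
n_exact = math.log(p0) / math.log(1 - 1 / boxes)  # exact solution, Eq. (4.104)

def poisson(k, a):
    """Poisson probability of k events when a events are expected."""
    return a**k * math.exp(-a) / math.factorial(k)

print(round(a, 3), round(n_poisson), round(n_exact))
print(boxes * poisson(3, a), boxes * poisson(10, a))
```

The second line of output gives the expected number of boxes holding exactly three and exactly ten balls. The ten-ball figure shifts slightly depending on whether you use the full value of $a$ or the rounded value 3.9.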

4.18. Twice the events 

(a) This part of the problem is a repeat of Problem 4.11. The Poisson distribution is
$P_a(k) = a^k e^{-a}/k!$, so the probability of obtaining $a$ events is (using Stirling’s
formula for $a!$)

$$P_a(a) = \frac{a^a e^{-a}}{a!} \approx \frac{a^a e^{-a}}{a^a e^{-a}\sqrt{2\pi a}} = \frac{1}{\sqrt{2\pi a}}. \tag{4.105}$$

(b) The average number of events during the time $2t$ is twice the average number
during the time $t$. So we now have a Poisson process governed by an average of
$2a$. The distribution is therefore $P_{2a}(k)$, and our goal is to calculate $P_{2a}(2a)$.
In the same manner as in part (a), we find

$$P_{2a}(2a) = \frac{(2a)^{2a} e^{-2a}}{(2a)!} \approx \frac{(2a)^{2a} e^{-2a}}{(2a)^{2a} e^{-2a}\sqrt{2\pi(2a)}} = \frac{1}{\sqrt{4\pi a}}. \tag{4.106}$$

This is smaller than the result in part (a) by a factor of $1/\sqrt{2}$. In retrospect, we
could have obtained the result of $1/\sqrt{4\pi a}$ by simply substituting $2a$ for $a$ in the
$1/\sqrt{2\pi a}$ result in part (a). The setup is the same here; we’re still looking for the
value of the distribution when $k$ equals the average number of events. It’s just
that the average is now $2a$ instead of $a$.

(c) Since we’re back to considering the original time $t$ here, we’re back to the Poisson
distribution with an average of $a$. But since $k$ is now $2a$, we want to calculate
$P_a(2a)$. This equals

$$P_a(2a) = \frac{a^{2a} e^{-a}}{(2a)!} \approx \frac{a^{2a} e^{-a}}{(2a)^{2a} e^{-2a}\sqrt{2\pi(2a)}} = \frac{1}{2^{2a} e^{-a}\sqrt{4\pi a}} = \left(\frac{e}{4}\right)^{a} \frac{1}{\sqrt{4\pi a}}. \tag{4.107}$$

This is smaller than the result in part (a) by a factor of $(e/4)^a/\sqrt{2}$. The $(e/4)^a$
part of this factor is approximately $(0.68)^a$, which is very small for large $a$.
For example, if $a = 10$, then $(e/4)^a \approx 0.02$. And if $a = 100$, then
$(e/4)^a \approx 1.7 \cdot 10^{-17}$.

For a = 10, the above three results are summarized in Fig. 4.29. The three 
dots indicate (from highest to lowest) the answers to parts (a), (b), and (c). This 
figure makes it clear why the answer to part (c) is much smaller than the other 
two answers; the $P_{10}(20)$ dot is on the tail of a curve, whereas the other two
dots are near a peak. Although we have drawn the Poisson distributions as 
continuous curves, remember that the distribution applies only to integer values 
of k. The two highest dots aren’t right at the peak of the curve, because the 
peak of the continuous curve is located at a value of k between a - 1 and a; see 
Problem 4.10. 
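
To see how good the Stirling-based results in Eqs. (4.105) through (4.107) are, you can compare them with the exact Poisson values for $a = 10$. The sketch below does this in Python; the function `poisson` is a helper defined here for illustration, and it uses logarithms (via `math.lgamma`) only to avoid large intermediate numbers.

```python
import math

def poisson(k, a):
    """Exact Poisson probability P_a(k), computed via logs to avoid overflow."""
    return math.exp(k * math.log(a) - a - math.lgamma(k + 1))

a = 10
exact = (poisson(a, a), poisson(2 * a, 2 * a), poisson(2 * a, a))
stirling = (
    1 / math.sqrt(2 * math.pi * a),                    # Eq. (4.105)
    1 / math.sqrt(4 * math.pi * a),                    # Eq. (4.106)
    (math.e / 4) ** a / math.sqrt(4 * math.pi * a),    # Eq. (4.107)
)
for e, s in zip(exact, stirling):
    print(f"exact {e:.5f}   Stirling {s:.5f}")
```

The three exact values are the heights of the three dots described above, listed from highest to lowest.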





Figure 4.29: The Poisson curves for a = 10 and a = 20. 
4.19. P(0) the hard way

The given interval (of time, space, or whatever) associated with the Poisson process
has $a$ expected events. Let’s divide the interval into a very large number $n$ of tiny
intervals, each with a very small probability $p$ of an event occurring. For simplicity,
we are using $p$ here instead of the $\lambda\epsilon$ we used at the beginning of Section 4.7.2. As in
that section, we can ignore the distinction between the probability of an event in a tiny
interval and the expected number of events in that interval, because we are assuming
that $p = \lambda\epsilon$ is very small; see Eq. (4.18).

The tasks of Problems 2.2 and 2.3 were to derive the “Or” rules for three and four
events. Our goal here is basically to derive the “Or” rule for a large number $n$ of independent
events, each with a small probability $p$. These independent events are of
course nonexclusive; we can certainly have more than one event occurring. Throughout
this solution, you will want to have a picture like Fig. 2.17 in your mind. Although
that picture applies to three events, it contains the idea for general $n$. Simple circles
(each of which represents the probability that an event occurs in a given tiny interval)
won’t work for larger $n$, but it doesn’t matter what the exact shapes are.

As in the solution to Problem 2.2(d), our goal is to determine the total area contained 
in the n partially overlapping regions (each with tiny area p) in the generalization of 
Fig. 2.17. The total area equals the probability of “Event 1 or Event 2 or ... Event n,” 
which is the desired probability that at least one event occurs in the original interval. 
As in the solution to Problem 2.2(d), we can proceed as follows. 

• If we add up the individual areas of all n tiny regions, we obtain np. (Each region 
represents the probability p that an event occurs in that particular tiny interval, 
with no regard for what happens with any of the other n - 1 tiny intervals.) But 
np equals the total expected number of events a in the original interval, because 
p is the expected number of events in each of the n tiny intervals. The sum of 
the individual areas of all n tiny regions therefore equals a. This a is the first 
term in the parentheses in Eq. (4.53). 

• However, in adding up the individual areas of all $n$ tiny regions, we have double
counted each of the overlap regions where two events occur. The number of
these regions is $\binom{n}{2} = n(n - 1)/2$, which essentially equals (in a multiplicative
sense) $n^2/2$ for large $n$. The area (probability) of each double-overlap region is
$p^2$, because that is the probability that two given events occur (with no regard for
what else happens). The sum of the individual areas of the $n^2/2$ double-overlap
regions is therefore $(n^2/2)p^2 = (np)^2/2 = a^2/2$. Since we have counted this
area twice, and since we want to count it only once, we must correct for this by
subtracting it off once. Hence the $-a^2/2!$ term in the parentheses in Eq. (4.53).

• We have now correctly determined the areas (probabilities) where exactly one
or exactly two events occur. But what about the regions where three (or more)
events occur? Each of these “triple” regions was counted $\binom{3}{1} = 3$ times when
dealing with the “single” regions (because a triple region contains $\binom{3}{1}$ different
single regions), but then uncounted $\binom{3}{2} = 3$ times when dealing with the
“double” regions (because a triple region contains $\binom{3}{2}$ different double regions).
We have therefore counted each triple region $\binom{3}{1} - \binom{3}{2} = 0$ times. There are
$\binom{n}{3} = n(n - 1)(n - 2)/3! \approx n^3/3!$ of these regions. The area of each region is
$p^3$, because that is the probability that three given events occur (with no regard
for what else happens). The sum of the individual areas of the $n^3/3!$ triple regions
is therefore $(n^3/3!)p^3 = (np)^3/3! = a^3/3!$. Since we have not counted
this area at all, and since we want to count it once, we must correct for this by
adding it on once. Hence the $+a^3/3!$ term in the parentheses in Eq. (4.53).

• One more iteration for good measure: We have now correctly determined the
areas (probabilities) where exactly one or exactly two or exactly three events
occur. But what about the regions where four (or more) events occur? Each
of these “quadruple” regions was counted $\binom{4}{1} = 4$ times when dealing with the
single regions, then uncounted $\binom{4}{2} = 6$ times when dealing with the double
regions, then counted $\binom{4}{3} = 4$ times when dealing with the triple regions. We
have therefore counted each quadruple region $\binom{4}{1} - \binom{4}{2} + \binom{4}{3} = 2$ times. There
are $\binom{n}{4} = n(n - 1)(n - 2)(n - 3)/4! \approx n^4/4!$ of these regions. The area of each
region is $p^4$, because that is the probability that four given events occur (with
no regard for what else happens). The sum of the individual areas of the $n^4/4!$
quadruple regions is therefore $(n^4/4!)p^4 = (np)^4/4! = a^4/4!$. Since we have
counted this area twice, and since we want to count it only once, we must correct
for this by subtracting it off once. Hence the $-a^4/4!$ term in the parentheses in
Eq. (4.53).

Continuing in this manner gives the entire area in Fig. 2.17, or rather, the entire area
in the analogous figure for the case of $n$ events instead of three. In the $n \to \infty$ limit,
we will obtain an infinite number of terms inside the parentheses in Eq. (4.53). All of
the multiple counting is removed, so each region is counted exactly once. The total
area represents the probability that at least one event occurs. Subtracting this from 1
gives the probability $P(0)$ in Eq. (4.53) that zero events occur.

As mentioned in the remark in the solution to Problem 2.3, we have either
overcounted or undercounted each region once at every stage. This is the inclusion-exclusion
principle, and it follows from the binomial expansion of $0 = (1 - 1)^m$. Using
the expansion in Eq. (1.21) with $a = 1$ and $b = -1$, we have

$$(1 - 1)^m = \binom{m}{0} - \binom{m}{1} + \binom{m}{2} - \cdots + \binom{m}{m-1}(-1)^{m-1} + \binom{m}{m}(-1)^m. \tag{4.108}$$

The lefthand side equals zero, and the $\binom{m}{0}$ and $\binom{m}{m}$ terms equal 1, so we obtain

$$\binom{m}{1} - \binom{m}{2} + \binom{m}{3} - \cdots - \binom{m}{m-1}(-1)^{m-1} = 1 + (-1)^m. \tag{4.109}$$


From the pattern of reasoning in the above bullet points, the lefthand side here is the
number of times we have already counted each $m$-tuple region, in our handling of all of
the “lesser” regions: the single regions up through the $(m - 1)$-tuple regions. (We are
assuming inductively that we have overcounted or undercounted by 1 at each earlier
stage.) The righthand side is either 2 or 0, depending on whether $m$ is even or odd. We
have therefore either overcounted or undercounted each $m$-tuple region by 1, which is
consistent with the above results for $m = 2$, 3, and 4. There are $\binom{n}{m}$ of the $m$-tuple
regions, each of which has an area of $p^m$. So at each stage, we need to either subtract
or add an area (probability) of $\binom{n}{m}p^m \approx (n^m/m!)p^m = (np)^m/m! = a^m/m!$. These
are the terms in parentheses in Eq. (4.53).
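
Both ingredients of this argument are easy to check numerically. The Python sketch below first verifies the counting identity in Eq. (4.109) for several values of $m$, and then sums the alternating terms $a - a^2/2! + a^3/3! - \cdots$ described in the bullet points, subtracts the sum from 1, and compares the result with $e^{-a}$. The function name `p_zero`, the cutoff of 30 terms, and the test value $a = 2$ are assumptions made for this illustration.

```python
import math

# Eq. (4.109): the alternating sum over the "lesser" regions equals 2 or 0.
for m in range(2, 9):
    counted = sum((-1) ** (k + 1) * math.comb(m, k) for k in range(1, m))
    assert counted == 1 + (-1) ** m

def p_zero(a, terms=30):
    """1 minus the series a - a^2/2! + a^3/3! - ..., truncated after `terms` terms."""
    at_least_one = sum((-1) ** (m + 1) * a**m / math.factorial(m)
                       for m in range(1, terms + 1))
    return 1 - at_least_one

print(p_zero(2.0), math.exp(-2.0))
```

The two printed numbers agree to many digits, which is the statement that the inclusion-exclusion series reproduces $P(0) = e^{-a}$.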


Remark: In the end, the solution to this problem consists of the reasoning in the remark
in the solution to Problem 2.3, combined with the fact that if $n$ is large,
we can say that $\binom{n}{m}p^m \approx (n^m/m!)p^m$, which equals $a^m/m!$. Now, taking into account
all of the above double (and triple, etc.) counting is of course a much more
laborious way to find $P(0)$ than simply using Eq. (4.40). Equivalently, the double-counting
solution is more laborious than using Eq. (4.32) with $k = 0$, which quickly
gives $P(0) = (1 - p)^n \approx e^{-pn} = e^{-a}$, using Eq. (7.14). (Eq. (4.32) is equivalent to
Eq. (4.34), which led to Eq. (4.40).) The reasoning behind Eq. (4.32) involved directly
finding the probability that zero events occur, by multiplying together all of the probabilities
$(1 - p)$ that each event doesn’t occur. This is clearly a much quicker method
than the double-counting method of finding the probability that at least one event occurs,
and then subtracting that from 1. This double-counting method is exactly the
opposite of the helpful “art of not” strategy we discussed in Section 2.3.1! ♣

4.20. Probability of at least 1 


(a) In finding the probability that at least one ball ends up in the given box, our
strategy (in both parts of this problem) will be to find the probability that zero
balls end up in the given box, and then subtract this probability from 1. The
process at hand is approximately a Poisson process, just as the balls-in-boxes
setup in the example on page 213 was. So from the Poisson distribution in
Eq. (4.40), the probability that zero balls end up in the given box is
$P(0) = a^0 e^{-a}/0! = e^{-a}$. The probability that at least one ball ends up in the given box
is then $1 - e^{-a}$. This is an approximate result, because the process isn’t exactly
Poisson.

In the given setup, we have $n = 10^6$ balls and $b = 10^9$ boxes. So the average
number of balls in a given box is $a = n/b = 1/1000$. Since this number is small,
we can use the approximation in Eq. (7.9) (with $x = -a$) to write $e^{-a}$ as $1 - a$.
The desired probability that at least one ball ends up in the given box is therefore

$$1 - e^{-a} \approx 1 - (1 - a) = a = \frac{1}{1000}. \tag{4.110}$$


This makes sense. The expected number, $a$, of balls is small, which means that
double (or triple, etc.) events are rare. The probability that at least one ball ends
up in the given box is therefore essentially equal to $P(1)$. Additionally, since
double (or triple, etc.) events are rare, we have $P(1) \approx a$, because the expected
number of balls can be written as $a = P(1) \cdot 1 + P(2) \cdot 2 + \cdots \Longrightarrow P(1) \approx a$. The
two preceding sentences tell us that the probability that at least one ball ends up
in the given box is approximately equal to $a$, as desired.

(b) The probability that a particular ball ends up in the given box is $1/b$, where
$b = 10^9$ is the number of boxes. So the probability that the particular ball
doesn’t end up in the given box is $1 - 1/b$. This holds for all $n = 10^6$ of the
balls, so the probability that zero balls end up in the given box is $(1 - 1/b)^n$.
(This is just Eq. (4.33) with $k = 0$.) The probability that at least one ball ends
up in the given box is therefore $1 - (1 - 1/b)^n$. This is the exact answer.

We can now use the $(1 + \alpha)^n \approx e^{n\alpha}$ approximation in Eq. (7.14) to simplify the
answer. (We’re using $\alpha$ in place of the $a$ in Eq. (7.14), because we’ve already
reserved the letter $a$ for the average number of balls, $n/b$, here.) With $\alpha = -1/b$,
Eq. (7.14) turns the $1 - (1 - 1/b)^n$ probability into

$$1 - (1 - 1/b)^n \approx 1 - e^{-n/b} = 1 - e^{-a}. \tag{4.111}$$

The $e^{-a} \approx 1 - a$ approximation then turns this into $a$, as in part (a).

Remark: We have shown that for small $a = n/b$, the probability that at least
one ball ends up in the given box is approximately $a$. This result of course
doesn’t hold for non-small $a$ because, for example, if we consider the $a = 1$
case, there certainly isn’t a probability of 1 that at least one ball ends up in the
given box. And we would obtain a nonsensical probability larger than 1 if $a > 1$.
From either Eq. (4.110) or Eq. (4.111), the correct probability (in the Poisson
approximation) that at least one ball ends up in the given box is $1 - e^{-a}$. For
non-small $a$, we can’t use the $e^{-a} \approx 1 - a$ approximation to turn $1 - e^{-a}$ into
$a$. ♣
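
The three levels of approximation in this problem (the exact answer, the Poisson answer, and the small-$a$ answer) can be compared directly. The following Python sketch is one way to do so. It uses `math.log1p` and `math.expm1`, which are chosen here only because they stay accurate when $1/b$ and $a$ are tiny; the variable names are illustrative.

```python
import math

n = 10**6        # balls
b = 10**9        # boxes
a = n / b        # average number of balls per box

# log1p and expm1 keep precision when 1/b and a are tiny.
exact = -math.expm1(n * math.log1p(-1 / b))   # 1 - (1 - 1/b)^n, part (b)
poisson = -math.expm1(-a)                     # 1 - e^{-a}, Eq. (4.111)

print(exact, poisson, a)

# For non-small a the shortcut "probability = a" fails:
for big_a in (1.0, 2.0):
    print(big_a, 1 - math.exp(-big_a))
```

The last loop illustrates the remark: for $a = 1$ and $a = 2$ the correct Poisson probability $1 - e^{-a}$ stays below 1, whereas the small-$a$ shortcut would give 1 and 2.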


4.21. Comparing probabilities 


(a) The three events are independent. So with $p = 1/1000$, the desired probability
is simply $p^3$, which equals $10^{-9}$.

(b) The three trials of the process are independent, so the desired probability is
again $p^3$, where $p = 1/1000$ is the probability that exactly one ball lands in the
given box in a given trial of the process. So we again obtain an answer of $10^{-9}$.
This setup is basically the same as the setup in part (a).

(c) If we perform a single trial of throwing a million balls into a billion boxes, the
probability that three specific balls end up in the given box is $(1/b)^3$ (where
$b = 10^9$), because each ball has a $1/b$ chance of landing in the box.⁶ There
are $\binom{n}{3}$ ways to pick the three specific balls from the $n = 10^6$ balls, so the
probability that exactly three balls end up in the box is $\binom{n}{3}/b^3$. We can simplify
this result by making an approximation to the binomial coefficient. Using the
fact that $n - 1$ and $n - 2$ are both essentially equal (multiplicatively) to $n$ if $n$ is
large, we have

$$\binom{n}{3}\frac{1}{b^3} = \frac{n(n-1)(n-2)}{3!}\,\frac{1}{b^3} \approx \frac{n^3}{3!}\,\frac{1}{b^3} = \frac{1}{3!}\left(\frac{n}{b}\right)^{3} = \frac{(10^{-3})^3}{3!} = \frac{10^{-9}}{3!}. \tag{4.112}$$


(d) The process in part (c) is approximately a Poisson process with $a = n/b = 1/1000$.
The probability that exactly three balls end up in the given box is therefore
given by Eq. (4.40) as

$$P(3) = \frac{a^3 e^{-a}}{3!}. \tag{4.113}$$

Since $a = 1/1000$ is small, the $e^{-a}$ factor is essentially equal to 1, so we can
ignore it. We therefore end up with

$$P(3) \approx \frac{a^3}{3!} = \frac{(10^{-3})^3}{3!} = \frac{10^{-9}}{3!}, \tag{4.114}$$

in agreement with the result in part (c).

In all of the parts to this problem, there is of course nothing special about the
number 3 in the statement of the problem. If 3 is replaced by a general number
$k$, then the results in parts (c) and (d) simply involve $k!$ instead of $3!$. (Well,
technically $k$ needs to be small compared with $n$, but that isn’t much of a restriction
in the present setup with $n = 10^6$.)
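
As a numerical check on parts (b) through (d), the Python sketch below evaluates the exact binomial probability of three balls in the box (including the factor for the other $n - 3$ balls missing the box), the Poisson value in Eq. (4.113), and the simplified $10^{-9}/3!$ estimate. The variable names are illustrative choices.

```python
import math

n = 10**6             # balls
b = 10**9             # boxes
a = n / b             # 1/1000

# Exact binomial probability of exactly three balls in the given box.
binomial = math.comb(n, 3) * (1 / b) ** 3 * (1 - 1 / b) ** (n - 3)
poisson = a**3 * math.exp(-a) / math.factorial(3)   # Eq. (4.113)
estimate = 1e-9 / math.factorial(3)                 # Eqs. (4.112), (4.114)
three_trials = (1 / 1000) ** 3                      # parts (a) and (b)

print(binomial, poisson, estimate, three_trials / estimate)
```

The final ratio is the factor of $3! = 6$ by which the three-trial answer exceeds the single-trial answer.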


⁶ There is technically a nonzero probability that other balls also land in the box. But this probability
is negligible, so we don’t have to worry about subtracting it off, even though we want exactly three balls
in the box. Equivalently, the binomial distribution also involves a factor of $(1 - 1/b)^{n-3}$ (which ensures
that the other $n - 3$ balls don’t land in the box), but this factor is essentially equal to 1 in the present
setup.

(e) The result in part (c) is smaller than the result in part (b) by a factor of $1/3! = 1/6$.
Let’s explain intuitively why this is the case.

In comparing the setups in parts (b) and (c), let’s compare the respective probabilities
(labeled $p_3^{(b)}$ and $p_3^{(c)}$) that three specific balls (labeled A, B, and C)
end up in the given box. Although we solved part (b) in a quicker manner (by
simply cubing $p$), we’ll need to solve it here in the same way that we solved
part (c), in order to compare the two setups. Note that in comparing the setups,
it suffices to compare the probabilities for three specific balls, because both setups
involve the same number of groups of three specific balls, namely $\binom{n}{3}$. So
the total probabilities in each case are $\binom{n}{3}p_3^{(b)}$ and $\binom{n}{3}p_3^{(c)}$, with the $\binom{n}{3}$ factor
being common to both.

Consider first the setup in part (c), with the single trial. There is only one way
that all three of A, B, and C can end up in the box: If you successively throw
down the $n$ balls, then when you get to ball A, it must end up in the box (which
happens with probability $1/b$); and then when you get to ball B, it must also end
up in the box (which again happens with probability $1/b$); and finally when you
get to ball C, it must also end up in the box (which again happens with probability
$1/b$). The probability that all three balls end up in the box is therefore
$p_3^{(c)} = (1/b)^3$. (This is just a repeat of the reasoning we used in part (c).)

Now consider the setup in part (b), with the three trials. There are now six ways
that the three balls can end up in the box, because there are 3! permutations of
the three balls. Ball A can end up in the box in the first of the three trials of $n$
balls (which happens with probability $1/b$), and then B can end up in the box
in the second trial (which again happens with probability $1/b$), and then C can
end up in the box in the third trial (which again happens with probability $1/b$).
We’ll label this scenario as ABC. But the order in which the balls go into the
boxes in the three successive trials can take five other permutations too, namely
ACB, BAC, BCA, CAB, CBA. Each of the six possible permutations occurs
with probability $(1/b)^3$, so the probability that all three balls (A, B, and C) end
up in the box equals $p_3^{(b)} = 6(1/b)^3$. This explains why the answer to part (b) is
six times the answer to part (c).

As mentioned above, if we want to determine the total probabilities in each
setup, we just need to multiply each of $p_3^{(b)}$ and $p_3^{(c)}$ by the number $\binom{n}{3} \approx n^3/3!$
of groups of three balls. This was our strategy in part (c), and the result was
$(n/b)^3/3!$. In part (b) this gives $(n^3/3!)(6/b^3) = (n/b)^3 = p^3$, in agreement
with our original (quicker) solution. Note that it isn’t an extra factor of 3! in the
denominator that makes the answer to part (c) be smaller; parts (b) and (c) both
have the 3! arising from the binomial coefficient. Rather, the answer to part
(c) is smaller because it doesn’t have the extra 3! in the numerator arising from
the different permutations.

Remark: Alternatively, you can think in terms of probabilities instead of permutations.
In part (c) the probability (as we noted above) that three specific balls
end up in the box is $(1/b)(1/b)(1/b)$, because each of the three balls must end
up in the box when you throw it down. In contrast, in part (b) the probability
that three specific balls end up in the box is $(3/b)(2/b)(1/b)$, because in the first
trial of $n$ balls, any of the three specific balls can end up in the box. And then in
the second trial, one of the two other balls must end up in the box. And finally
in the third trial, the remaining one of the three balls must end up in the box.
The probability in part (b) is therefore larger by a factor of $3! = 6$.

Intuitively, it makes sense that the probability in part (b) is larger, because in
part (c) if ball A doesn’t end up in the box when you throw it down, you are
guaranteed failure (for the three specific balls A, B, and C). But in part (b) if
ball A doesn’t end up in the box in the first of the three trials of $n$ balls, you still
have two more chances (with balls B and C) in that trial to get a ball in the box.
So you have three chances to put one of the balls in the box in the first trial. And
likewise you have two chances in the second trial. ♣


4.22. Area under a Gaussian curve 

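Let $I$ be the desired integral. Then following the hint, we have

$$I^2 = \left(\sqrt{\frac{b}{\pi}}\int_{-\infty}^{\infty} e^{-bx^2}\,dx\right)\left(\sqrt{\frac{b}{\pi}}\int_{-\infty}^{\infty} e^{-by^2}\,dy\right) = \frac{b}{\pi}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-b(x^2+y^2)}\,dx\,dy. \tag{4.115}$$
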
If we convert from Cartesian to polar coordinates, then $x^2 + y^2$ becomes $r^2$ (by the
Pythagorean theorem), and the area element $dx\,dy$ in the plane becomes $r\,dr\,d\theta$. This
expression follows from the fact that we can imagine covering the plane with infinitesimal
rectangles with sides of length $dr$ in the radial direction and $r\,d\theta$ (the general form
of an arclength) in the tangential direction.

The original double Cartesian integral runs over the entire $x$-$y$ plane, so the new double
polar integral must also run over the entire plane. The polar limits of integration are
therefore 0 to $\infty$ for $r$, and 0 to $2\pi$ for $\theta$. The above integral then becomes

$$I^2 = \frac{b}{\pi}\int_0^{2\pi}\int_0^{\infty} e^{-br^2}\, r\,dr\,d\theta. \tag{4.116}$$

The $\theta$ integral simply gives $2\pi$. The indefinite $r$ integral is $-e^{-br^2}/2b$, as you can
verify by differentiating this. The factor of $r$ in the area element is what makes this
integral doable, unlike the original Cartesian integral. We therefore have

$$I^2 = \frac{b}{\pi} \cdot 2\pi \cdot \left(-\frac{e^{-br^2}}{2b}\right)\Bigg|_0^{\infty} = \frac{b}{\pi} \cdot 2\pi \cdot \frac{1}{2b} = 1. \tag{4.117}$$


So $I = \sqrt{1} = 1$, as desired. Note that if we didn’t have the factor of $\sqrt{b/\pi}$ in the
distribution, we would have ended up with

$$\int_{-\infty}^{\infty} e^{-bx^2}\,dx = \sqrt{\frac{\pi}{b}}. \tag{4.118}$$


This is a useful general result. 

The above change-of-coordinates trick works if we’re integrating over a circular region
centered at the origin. (An infinitely large circle covering the entire plane falls into
this category.) If we want to calculate the area under a Gaussian curve with the limits
of the $x$ integral being arbitrary finite numbers, then our only option is to
evaluate the integral numerically. (The change-of-coordinates trick doesn’t help with
the rectangular region that arises in this case.) For example, if we want the limits to be
$\pm\sigma = \pm 1/\sqrt{2b}$, then we must resort to numerics to show that the area is approximately
68% of the total area.
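
Here is a minimal sketch of what “resorting to numerics” can look like in Python. It uses a simple midpoint rule, which is only one of many possible numerical methods; the function name `gaussian_area`, the number of steps, and the choice $b = 1/2$ (which makes $\sigma = 1$) are assumptions made for this illustration.

```python
import math

def gaussian_area(b, lo, hi, steps=100_000):
    """Midpoint-rule area under sqrt(b/pi) * exp(-b x^2) between lo and hi."""
    dx = (hi - lo) / steps
    total = sum(math.exp(-b * (lo + (i + 0.5) * dx) ** 2) for i in range(steps))
    return math.sqrt(b / math.pi) * total * dx

b = 0.5                               # arbitrary choice; gives sigma = 1
sigma = 1 / math.sqrt(2 * b)
print(gaussian_area(b, -sigma, sigma))            # area within one sigma
print(gaussian_area(b, -10 * sigma, 10 * sigma))  # essentially the whole area
```

The first number is the fraction of the area within one standard deviation of the mean, and the second confirms that the distribution is normalized, consistent with Eq. (4.117).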





4.23. Variance of the Gaussian distribution 

First solution: With $\mu = 0$, the variance of the second expression for $f(x)$ in
Eq. (4.42) is

$$E(X^2) = \int_{-\infty}^{\infty} x^2 f(x)\,dx = \frac{1}{\sqrt{2\pi\sigma^2}}\int_{-\infty}^{\infty} x^2 e^{-x^2/2\sigma^2}\,dx. \tag{4.119}$$

We can evaluate this integral by using integration by parts. That is, $\int f g' = fg - \int f' g$.
If we write the $x^2$ factor as $x \cdot x$, then with $f = x$ and $g' = x e^{-x^2/2\sigma^2}$, we can
integrate $g'$ to obtain $g = -\sigma^2 e^{-x^2/2\sigma^2}$. So we have


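$$\int_{-\infty}^{\infty} x \cdot x e^{-x^2/2\sigma^2}\,dx = x\left(-\sigma^2 e^{-x^2/2\sigma^2}\right)\Big|_{-\infty}^{\infty} - \int_{-\infty}^{\infty} 1 \cdot \left(-\sigma^2 e^{-x^2/2\sigma^2}\right)dx = 0 + \sigma^2\int_{-\infty}^{\infty} e^{-x^2/2\sigma^2}\,dx. \tag{4.120}$$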


The 0 comes from the fact that the smallness of $e^{-\infty^2}$ wins out over the largeness of
the factor of $\infty$ out front. The remaining integral can be evaluated by invoking the
general result in Eq. (4.118). With $b = 1/2\sigma^2$ the integral is $\sqrt{2\pi\sigma^2}$. So Eq. (4.120)
gives

$$\int_{-\infty}^{\infty} x^2 e^{-x^2/2\sigma^2}\,dx = \sigma^2\sqrt{2\pi\sigma^2}. \tag{4.121}$$

Plugging this into Eq. (4.119) then gives

$$E(X^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \cdot \sigma^2\sqrt{2\pi\sigma^2} = \sigma^2, \tag{4.122}$$

as desired.


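Second solution: This solution involves a handy trick for calculating integrals of the
form $\int_{-\infty}^{\infty} x^{2n} e^{-bx^2}\,dx$. Using the $\int_{-\infty}^{\infty} e^{-bx^2}\,dx = \sqrt{\pi}\,b^{-1/2}$ result from Eq. (4.118)
and successively differentiating both sides with respect to $b$, we obtain

$$\begin{aligned}
\int_{-\infty}^{\infty} e^{-bx^2}\,dx &= \sqrt{\pi}\,b^{-1/2},\\
\int_{-\infty}^{\infty} x^2 e^{-bx^2}\,dx &= \frac{1}{2}\sqrt{\pi}\,b^{-3/2},\\
\int_{-\infty}^{\infty} x^4 e^{-bx^2}\,dx &= \frac{3}{4}\sqrt{\pi}\,b^{-5/2},
\end{aligned} \tag{4.123}$$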


and so on. On the lefthand side, it is indeed legal to differentiate the integrand (the 
expression inside the integral) with respect to b. If you have your doubts about this, 
you can imagine writing the integral as a sum over, say, a million terms. It is then 
certainly legal to differentiate each of the million terms with respect to b. In short, the 
derivative of the sum is the sum of the derivatives. 

The second line in Eq. (4.123) is exactly the integral we need when calculating the
variance. With $b = 1/2\sigma^2$, the second line gives

$$\int_{-\infty}^{\infty} x^2 e^{-x^2/2\sigma^2}\,dx = \frac{1}{2}\sqrt{\pi}\,(2\sigma^2)^{3/2} = \sqrt{2\pi}\,\sigma^3, \tag{4.124}$$

in agreement with Eq. (4.121).
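
If you want reassurance that differentiating under the integral sign gives the right answers, you can compare the three closed forms in Eq. (4.123) with a direct numerical integration. The Python sketch below does this with a midpoint rule. The function `moment_integral`, the truncation of the infinite range to a finite interval, and the value $\sigma = 1.5$ are all assumptions made for illustration.

```python
import math

def moment_integral(power, b, half_width=12.0, steps=200_000):
    """Midpoint-rule estimate of the integral of x^power * exp(-b x^2).

    The infinite range is truncated to [-half_width, half_width], which is
    an assumption that holds when exp(-b * half_width^2) is negligible.
    """
    dx = 2 * half_width / steps
    xs = (-half_width + (i + 0.5) * dx for i in range(steps))
    return sum(x**power * math.exp(-b * x * x) for x in xs) * dx

sigma = 1.5                      # arbitrary illustrative value
b = 1 / (2 * sigma**2)
rows = [
    (0, math.sqrt(math.pi) * b**-0.5),          # first line of Eq. (4.123)
    (2, 0.5 * math.sqrt(math.pi) * b**-1.5),    # second line
    (4, 0.75 * math.sqrt(math.pi) * b**-2.5),   # third line
]
for power, closed_form in rows:
    print(power, moment_integral(power, b), closed_form)

variance = moment_integral(2, b) / math.sqrt(2 * math.pi * sigma**2)
print(variance, sigma**2)
```

The final line divides the second integral by the normalization factor $\sqrt{2\pi\sigma^2}$, as in Eq. (4.119), and recovers $\sigma^2$.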



Chapter 5 

Gaussian approximations 


In this chapter we will concentrate on three of the distributions we studied in Chapter 4,
namely the binomial, Poisson, and Gaussian distributions. In Section 5.1 we
show how a binomial distribution reduces to a Gaussian distribution when the numbers
involved are large. Section 5.2 covers the law of large numbers, which says
that in a very large number of trials, the observed fraction of events will be very
close to the theoretical probability. In Section 5.3 we show how a Poisson distribution
reduces to a Gaussian distribution when the numbers involved are large. In
Section 5.4 we tie everything together. This leads us in Section 5.5 to the central
limit theorem, which is the statement that no matter what distribution you start with,
the sum (or average) of the outcomes of many trials will be approximately Gaussian.
As in Chapter 4, parts of this chapter are a bit mathematical, but there’s no
way around this if we want to do things properly. We will invoke some results from
Appendix C.


5.1 Binomial and Gaussian 

In Section 4.5 we discussed the binomial distribution, in particular the binomial 
distribution that arises from a series of coin flips. The probability distribution for 
the total number of Heads in, say, 30 flips takes the form of the left plot in Fig. 4.10. 
The shape of this plot looks suspiciously similar to the shape of the Gaussian plot 
in Fig. 4.25, so you might wonder if the binomial distribution is actually a Gaussian 
distribution (or more precisely, if the discrete binomial points lie on a continuous 
Gaussian curve). It turns out that for small numbers of coin flips, this isn’t quite 
true. But for large numbers of flips, a binomial distribution takes essentially the 
form of a Gaussian distribution. The larger the number of flips, the closer it comes 
to a Gaussian. 

For three different numbers of coin flips (2, 6, and 20), Fig. 5.1 shows the comparison
between the exact binomial distribution (the dots) and the Gaussian approximation
(the curve), which we’ll derive below in Eq. (5.13). The coordinate on the $x$
axis is the number of Heads relative to the expected value (which is half the number
of flips). So for $n$ flips, the possible $x$ values range from $-n/2$ to $n/2$. The Gaussian
approximation is clearly very good for 20 flips. And it gets even better for larger
numbers of flips.

Figure 5.1: Comparison of the binomial distribution and the Gaussian approximation, for
various numbers of coin flips. $x$ is the number of Heads relative to the expected number.

We will now demonstrate why the binomial distribution takes essentially the 
form of a Gaussian distribution when the number of flips is large. For convenience, 
we’ll let the number of flips be $2n$, just to keep some factors of 1/2 from cluttering
things up. We will assume that n is large. 

We’ll need two bits of mathematical machinery for this derivation. The first is
Stirling’s formula, which we introduced in Section 2.6. It says that if $n$ is large, then
$n!$ is approximately given by

$$n! \approx n^n e^{-n}\sqrt{2\pi n}. \tag{5.1}$$

It’s a good idea at this point to go back and review Section 2.6. The second thing
we’ll need is the approximation in Eq. (7.15) in Appendix C:

$$(1 + a)^m \approx e^{ma} e^{-ma^2/2}. \tag{5.2}$$

We’re using $m$ instead of $n$ here, because we’ve already reserved $n$ for half the
number of flips. You are encouraged to read Appendix C at this point (after reading
Appendix B), to see where this approximation comes from. However, feel free to
just accept it for now if you want. But in that case, you should at least verify with a
calculator that it works fairly well for, say, $a = 0.1$ and $m = 30$.
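
In place of a calculator, a few lines of Python will do for this check. The function names below are illustrative choices; the third printed value is the cruder $e^{ma}$ approximation (without the second exponential factor), included only for comparison.

```python
import math

def exact(a, m):
    """Left-hand side of Eq. (5.2)."""
    return (1 + a) ** m

def approx(a, m):
    """Right-hand side of Eq. (5.2)."""
    return math.exp(m * a) * math.exp(-m * a**2 / 2)

a, m = 0.1, 30
print(exact(a, m), approx(a, m), math.exp(m * a))
```

For these values the two sides of Eq. (5.2) differ by roughly one percent, whereas dropping the $e^{-ma^2/2}$ factor gives an error of about fifteen percent.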

The following derivation is a bit mathematical, but the result (that a binomial
distribution can be approximated by a Gaussian distribution) is well worth it. We’ll
demonstrate this result just for coin flips (that is, a binomial distribution with $p = 1/2$),
but it actually holds for any $p$; see the discussion following the remarks below.

We’ll start with the binomial distribution in Eq. (4.8), which gives the probability
of obtaining $k$ Heads in $n$ coin flips. However, since we’re letting the number of
coin flips be $2n$ here, the $n$ in Eq. (4.8) gets replaced by $2n$. Also, let’s replace $k$ by
$n + x$, which just means that we’re defining $x$ to be the number of Heads relative to
the expected number (which is $n$). Writing the number of Heads as $n + x$ will make
our calculations much simpler than if we had stuck with $k$. With these adjustments,
Eq. (4.8) becomes (with the subscript B for binomial)

$$P_B(x) = \frac{1}{2^{2n}}\binom{2n}{n+x} \qquad \text{(for $2n$ coin flips)} \tag{5.3}$$

We will now show that if n is large, $P_B(x)$ takes the approximate form,

$$P_B(x) \approx \frac{e^{-x^2/n}}{\sqrt{\pi n}}, \tag{5.4}$$

which is the desired Gaussian. This takes the same form as the first Gaussian
expression in Eq. (4.42), with b = 1/n and μ = 0.

So here we go - get ready for some math! But it’s nice math, in the sense that 
a huge messy equation will undergo massive cancelations and yield a nice simple 
result. The first step is to use Stirling’s approximation to rewrite each of the three 
factorials in the binomial coefficient in Eq. (5.3). This gives 


$$\binom{2n}{n+x} = \frac{(2n)!}{(n+x)!\,(n-x)!} \approx \frac{(2n)^{2n} e^{-2n} \sqrt{2\pi(2n)}}{\left[(n+x)^{n+x} e^{-(n+x)} \sqrt{2\pi(n+x)}\right] \cdot \left[(n-x)^{n-x} e^{-(n-x)} \sqrt{2\pi(n-x)}\right]}. \tag{5.5}$$


Canceling all the e’s and a few other factors gives 

$$\binom{2n}{n+x} \approx \frac{(2n)^{2n} \sqrt{n}}{(n+x)^{n+x} (n-x)^{n-x} \sqrt{\pi} \sqrt{n^2 - x^2}}. \tag{5.6}$$


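Let’s now divide both the numerator and denominator by $n^{2n}$. In the denominator,
we’ll do this by dividing the first and second factors by $n^{n+x}$ and $n^{n-x}$, respectively.
The result is

$$\binom{2n}{n+x} \approx \frac{2^{2n} \sqrt{n}}{\left(1 + \frac{x}{n}\right)^{n+x} \left(1 - \frac{x}{n}\right)^{n-x} \sqrt{\pi} \sqrt{n^2 - x^2}}. \tag{5.7}$$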


It’s now time to apply the approximation in Eq. (5.2). With the a and m in that
relation defined to be a = x/n and m = n + x, we have (using the notation exp(y)
for $e^y$, to avoid writing lengthy exponents)

$$\left(1 + \frac{x}{n}\right)^{n+x} \approx \exp\left((n+x)\left(\frac{x}{n}\right) - \frac{1}{2}(n+x)\left(\frac{x}{n}\right)^2\right). \tag{5.8}$$

When we multiply things out here, we find that there is a $-x^3/2n^2$ term. However,
we’ll see below that the x’s we’ll be dealing with are much smaller than n, which
means that the $-x^3/2n^2$ term is much smaller than the other terms. So we’ll ignore
it. We are then left with

$$\left(1 + \frac{x}{n}\right)^{n+x} \approx \exp\left(x + \frac{x^2}{2n}\right). \tag{5.9}$$

Although the $x^2/2n$ term here is much smaller than the x term (assuming x ≪ n),
we will in fact need to keep it, because the x term will cancel in Eq. (5.11) below.
(The $-x^3/2n^2$ term would actually cancel too, for the same reason.) In a similar
manner, we obtain

$$\left(1 - \frac{x}{n}\right)^{n-x} \approx \exp\left(-x + \frac{x^2}{2n}\right). \tag{5.10}$$


Using these results in Eq. (5.7), we find

$$\binom{2n}{n+x} \approx \frac{2^{2n} \sqrt{n}}{\exp\left(x + \frac{x^2}{2n}\right) \exp\left(-x + \frac{x^2}{2n}\right) \sqrt{\pi} \sqrt{n^2 - x^2}}. \tag{5.11}$$

When combining (adding) the exponents, the x and −x cancel. Also, under the
assumption that x ≪ n, we can say that $\sqrt{n^2 - x^2} \approx \sqrt{n^2 - 0} = n$. (As with any
approximation claim, if you don’t trust this, you can simply plug in some numbers
and see how well it works. For example, you can let n = 10,000 and x = 100,
which satisfy the x ≪ n relation.) Eq. (5.11) then becomes

$$\binom{2n}{n+x} \approx \frac{2^{2n}}{e^{x^2/n} \sqrt{\pi n}}. \tag{5.12}$$
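
For readers who would rather let a computer plug in the numbers, here is a small Python sketch of that check, using the suggested values n = 10,000 and x = 100. It first compares $\sqrt{n^2 - x^2}$ with n, and then compares the exact binomial coefficient with the approximation in Eq. (5.12). The comparison is done with logarithms (our choice, not part of the derivation), because both numbers are far too large to store as ordinary floating-point values.

```python
import math

n, x = 10_000, 100   # the values suggested in the text; x is much smaller than n

# Step 1: sqrt(n^2 - x^2) versus n.
root = math.sqrt(n**2 - x**2)
print(f"sqrt(n^2 - x^2) = {root:.2f}, compared with n = {n}")

# Step 2: exact binomial coefficient versus Eq. (5.12), compared via logs.
log_exact = math.log(math.comb(2 * n, n + x))
log_approx = 2 * n * math.log(2) - x**2 / n - 0.5 * math.log(math.pi * n)
ratio = math.exp(log_exact - log_approx)
print(f"exact / approximate = {ratio:.6f}")
```

The square root comes out at about 9999.5, and the ratio of the exact coefficient to the approximate one is very close to 1.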

Finally, if we substitute Eq. (5.12) into Eq. (5.3), the $2^{2n}$ factors cancel, and we are
left with the desired result (with the subscript G for Gaussian),

$$P_B(x) \approx \frac{e^{-x^2/n}}{\sqrt{\pi n}} \equiv P_G(x) \qquad \text{(for $2n$ coin flips)} \tag{5.13}$$


This is the probability of obtaining n + x Heads in 2n coin flips. If we want to switch
back to having the number of flips be n instead of 2n, then we just need to replace
n with n/2 in Eq. (5.13). The result is (with x now being the deviation from n/2
Heads)

$$P_B(x) \approx \frac{e^{-2x^2/n}}{\sqrt{\pi n/2}} \equiv P_G(x) \qquad \text{(for $n$ coin flips)} \tag{5.14}$$

Whether you use Eq. (5.13) or Eq. (5.14), the coefficient of π and the inverse of the
coefficient of $x^2$ are both equal to half the number of flips.

If you want to write the above results in terms of the actual number k of Heads,
instead of the number x of Heads relative to the expected number, you can just
replace x with either k − n in Eq. (5.13), or k − n/2 in Eq. (5.14).
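
To get a feel for how good the approximation is, the following Python sketch compares the exact binomial probabilities with the Gaussian values from Eq. (5.14) for n = 20 coin flips. The function names are ours, and the choice of n = 20 simply matches the largest case mentioned for Fig. 5.1.

```python
import math

def binomial_exact(n, k):
    """Exact probability of k Heads in n fair coin flips, as in Eq. (4.8)."""
    return math.comb(n, k) / 2**n

def gaussian_approx(n, x):
    """Gaussian approximation in Eq. (5.14); x is the deviation from n/2 Heads."""
    return math.exp(-2 * x**2 / n) / math.sqrt(math.pi * n / 2)

n = 20
for x in range(0, 7):
    k = n // 2 + x
    print(f"x = {x}:  exact = {binomial_exact(n, k):.4f}   "
          f"Gaussian = {gaussian_approx(n, x):.4f}")
```

Because both expressions are symmetric in x, only nonnegative values of x are listed. For these values the two columns differ by less than 0.005.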
The most important part of the above results is the n in the denominator of the
exponent, because this determines the width of the distribution. We’ll talk about
this in Section 5.2, but first some remarks.

Remarks: 

1. In the above derivation, we claimed that if n is large (as we are assuming), then any
values of x that we are concerned with are much smaller than n. This allowed us to
simplify various expressions by ignoring certain terms. Let’s be explicit about how
the logic of the x ≪ n assumption proceeds.

What we showed above (assuming n is large) is that if the x ≪ n condition is satisfied,
then Eq. (5.13) is valid. And the fact of the matter is that if n is large, we’ll never be
interested in values of x that don’t satisfy x ≪ n (and hence for which Eq. (5.13) might
not be valid), because the associated probabilities are negligible. This is true because
if, for example, $x = 10\sqrt{n}$ (which certainly satisfies x ≪ n if n is large, which means
that Eq. (5.13) is indeed valid), then the $e^{-x^2/n}$ exponential factor in Eq. (5.13) equals
$e^{-10^2} = e^{-100} \approx 4 \cdot 10^{-44}$, which is completely negligible. (Even if x is only $2\sqrt{n}$, the
$e^{-x^2/n}$ factor equals $e^{-2^2} = e^{-4} \approx 0.02$.) Larger values of x will yield even smaller
probabilities, because we know that the binomial coefficient in Eq. (5.3) decreases as
x gets farther from zero; recall Pascal’s triangle in Section 1.8.1. These probabilities
might not satisfy Eq. (5.13), but we don’t care, because they’re so small.

2. In the terminology of Eq. (5.14) where the number of coin flips is n, the plots in
Fig. 5.1 correspond to n equalling 2, 6, and 20. So in the third plot, for example, the
continuous curve is a plot of $P_G(x) = e^{-x^2/10}/\sqrt{10\pi}$.

3. $P_G(x)$ is an even function of x. That is, x and −x yield the same value of the function;
it is symmetric around x = 0. This is true because x appears only through its square.
This evenness makes intuitive sense, because we’re just as likely to get, say, four
Heads above the average as four Heads below the average.

4. We saw in Eq. (2.66) in Section 2.6 that the probability that exactly half (that is, n) of
2n coin flips come up Heads equals $1/\sqrt{\pi n}$. This result is a special case of the $P_G(x)$
result in Eq. (5.13), because if we plug x = 0 (which corresponds to n Heads) into
Eq. (5.13), we obtain $P_G(0) = e^{-0}/\sqrt{\pi n} = 1/\sqrt{\pi n}$.

5. Note that we really did need the $e^{-ma^2/2}$ factor in the approximation in Eq. (5.2). If
we had used the less accurate version, $(1 + a)^m \approx e^{ma}$ from Eq. (7.14) in Appendix C,
we would have had incorrect $x^2/n$ terms in Eqs. (5.9) and (5.10), instead of the correct
$x^2/2n$ terms.

6. If we compare the Gaussian result in Eq. (5.14) with the second of the Gaussian
expressions in Eq. (4.42), we see that they agree if $\sigma = \sqrt{n/4}$. This correspondence
makes both the prefactor and the coefficient of $x^2$ in the exponent agree. The standard
deviation of our Gaussian approximation in Eq. (5.14) (for the binomial distribution
for n coin flips) is therefore $\sigma = \sqrt{n/4}$. This agrees (as it must) with the exact
binomial standard deviation we obtained in Eq. (3.48).

Before going through the above derivation, it certainly wasn’t obvious that a binomial
should reduce to a Gaussian when n is large. However, the previous paragraph
shows that if it reduces to a Gaussian, then the n’s must appear exactly as they do in
Eq. (5.14), because we know that the standard deviation (which is the σ in Eq. (4.42))
must agree with the $\sqrt{n/4}$ value that we already found in Eq. (3.48).

7. Since the area (probability) under the Gaussian distribution in Eq. (4.42) is 1 (see
Problem 4.22), and since Eq. (5.14) takes the same form as Eq. (4.42), the area under
the distribution in Eq. (5.14) must likewise be 1. Of course, we already knew
this, because Eq. (5.14) is an approximation to the binomial distribution, whose total
probability is 1. ♣
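
Several of the numerical claims in the remarks above are easy to check directly. The sketch below (in Python, with n = 100 flips chosen arbitrarily for illustration) computes the exact binomial standard deviation and compares it with $\sqrt{n/4}$, adds up the Gaussian values of Eq. (5.14) over all integer deviations x, and evaluates the two exponential factors quoted in the first remark.

```python
import math

n = 100  # number of fair coin flips (an arbitrary illustrative choice)

# Exact binomial probabilities for k = 0, ..., n Heads.
probs = [math.comb(n, k) / 2**n for k in range(n + 1)]
mean = sum(k * p for k, p in enumerate(probs))
sigma_exact = math.sqrt(sum((k - mean) ** 2 * p for k, p in enumerate(probs)))
print(f"exact sigma = {sigma_exact:.4f},  sqrt(n/4) = {math.sqrt(n / 4):.4f}")

# Sum of the Gaussian values in Eq. (5.14) over all integer deviations x.
gauss_total = sum(math.exp(-2 * x**2 / n) / math.sqrt(math.pi * n / 2)
                  for x in range(-n // 2, n // 2 + 1))
print(f"sum of Gaussian values = {gauss_total:.6f}")

# The exponential factors quoted in the first remark.
print(f"e^-100 = {math.exp(-100):.1e},  e^-4 = {math.exp(-4):.3f}")
```

Summing over integer values of x is only a stand-in for the area under the continuous curve, but for a curve this wide the two are essentially the same.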


If the two probabilities involved in a binomial distribution are p and 1 − p instead
of the two 1/2’s in the case of a coin toss, then the probability of k successes in n
trials is given in Eq. (4.6) as $P(k) = \binom{n}{k} p^k (1 - p)^{n-k}$. (We’ve gone back to using n
to represent the total number of trials, instead of the 2n we used in Eq. (5.3).) For
example, if we’re concerned with the number of 5’s we obtain in n rolls of a die,
then p = 1/6.

It turns out that for large n, the binomial distribution P(k) is essentially a Gaussian
distribution for any value of p, not just the p = 1/2 value we discussed above.
The Gaussian is centered around the expected value of k (namely pn), as you would
expect. The derivation of this Gaussian form follows the same steps as above. But
it gets rather messy, so we’ll just state the result: For large n, the probability of
obtaining k = pn + x successes in n trials is approximately equal to

$$P_B(x) \approx \frac{e^{-x^2/[2np(1-p)]}}{\sqrt{2\pi np(1-p)}} \equiv P_G(x) \qquad \text{(for $n$ biased coin flips)} \tag{5.15}$$

If p = 1/2, this reduces to the result in Eq. (5.14), as it should.

Eq. (5.15) implies that the bump in the plot of $P_G(x)$ is symmetric around x = 0
(or equivalently, around k = pn) for any p, not just p = 1/2. This isn’t so obvious,
because for p ≠ 1/2, the bump isn’t centered around n/2. That is, the location of
the bump is lopsided with respect to n/2. So you might think that the shape of the
bump should be lopsided too. But it isn’t. (Well, the tail extends farther to one side,
but $P_G(x)$ is essentially zero in the tails.) Fig. 5.2 shows a plot of Eq. (5.15) for
p = 1/6 and n = 60, which corresponds to rolling a die 60 times and seeing how
many, say, 5’s you get. The x = 0 point corresponds to having pn = (1/6)(60) = 10
rolls of a 5. The bump is quite symmetric (although technically not exactly). This is
consistent with what we noted about the binomial distribution in the remark at the
end of the example in Section 3.4.

Figure 5.2: The probability distribution for the number of 5’s in 60 dice rolls. x is the
deviation from the expected number (which is 10). The bump in the distribution is essentially
symmetric.
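
The dice example can be checked numerically. The following Python sketch (function names are ours) compares the exact binomial probabilities for p = 1/6 and n = 60 with the Gaussian values from Eq. (5.15), at a few deviations x on either side of the expected number pn = 10.

```python
import math

def binomial_exact(n, p, k):
    """Exact probability of k successes in n trials, as in Eq. (4.6)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def gaussian_approx(n, p, x):
    """Gaussian form in Eq. (5.15); x is the deviation from the expected number pn."""
    var = n * p * (1 - p)
    return math.exp(-x**2 / (2 * var)) / math.sqrt(2 * math.pi * var)

n, p = 60, 1 / 6          # 60 dice rolls, counting 5's
center = round(n * p)     # expected number of 5's, pn = 10
for x in range(-6, 7, 2):
    k = center + x
    print(f"x = {x:+d}:  exact = {binomial_exact(n, p, k):.4f}   "
          f"Gaussian = {gaussian_approx(n, p, x):.4f}")
```

The Gaussian column is exactly symmetric in x by construction. The exact column is slightly lopsided, which is the "technically not exactly" symmetric behavior mentioned above.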

As in the sixth remark above, if we compare the Gaussian result in Eq. (5.15)
with the second of the Gaussian expressions in Eq. (4.42), we see that they agree
if $\sigma = \sqrt{np(1-p)} = \sqrt{npq}$. Again, this correspondence makes both the prefactor
and the coefficient of $x^2$ in the exponent agree. The standard deviation of our
Gaussian approximation in Eq. (5.15) (for the binomial distribution for n biased
coin flips) is therefore $\sigma = \sqrt{npq}$. This agrees (as it must) with the exact binomial
standard deviation we obtained in Eq. (3.47).

As we also noted in the sixth remark, if someone claims (correctly) that a general
binomial distribution involving probability p reduces to a Gaussian, then they must
also claim that the np(1 − p) factors appear exactly as they do in Eq. (5.15), because
we know that the standard deviation (which is the σ in Eq. (4.42)) must agree with
the $\sqrt{np(1-p)} = \sqrt{npq}$ value that we already found in Eq. (3.47).


5.2 The law of large numbers 

The law of large numbers is, in a sense, the law that makes the subject of probability 
a useful one, in that it allows us to make meaningful predictive statements about 
future outcomes. The law can be stated in various ways, but we’ll go with: 

• Law of large numbers:

If you repeat a random process a very large number of times, then the observed
fraction of times that a certain event occurs will be very close to the
theoretical probability.

More precisely, consider the probability, $p_d$ (with the “d” for “differ”), that the
observed fraction differs from the theoretical probability by more than a specified
small number, say δ = 0.01 or 0.001. Then the law of large numbers says that $p_d$
goes to zero as the number of trials becomes large. Said in a more down-to-earth
way, if you perform enough trials, the observed fraction will be pretty much what it
“should” be.

Remark: The probability $p_d$ in the preceding paragraph deals with the results of a large
number (call it $n_1$) of trials of a given random process (such as a coin flip). If you want to
experimentally measure $p_d$, then you need to perform a large number (call it $n_2$) of sets, each
of which consists of a large number $n_1$ of coin flips (or whatever). For example, we might
be concerned with the fraction of Heads that show up in $n_1$ = 10,000 coin flips. If we ask
for the probability $p_d$ that this fraction differs from 50% by more than 1%, then we could
do, say, $n_2$ = 100,000 sets of $n_1$ = 10,000 flips (which means a billion flips in all!) and
then make a list or a histogram of the resulting $n_2$ observed fractions. The fraction of these
fractions that are smaller than 49% or larger than 51% is our desired probability $p_d$. The
larger $n_2$ is, the closer our result for $p_d$ will be to its true value (which happens to be 5%; see
Problem 5.3). This is how you experimentally determine $p_d$. The law of large numbers says
that if you make $n_1$ larger and larger (which means that you need to make $n_2$ larger too), then
$p_d$ approaches zero. ♣
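
The experiment described in this remark can be imitated on a computer. The Python sketch below is one possible way to do it, under a few assumptions of ours: it uses a much smaller number of sets ($n_2$ = 2,000 instead of 100,000) so that it runs quickly, it simulates each set of $n_1$ flips by drawing $n_1$ random bits at once, and it fixes the random seed so that the run is repeatable. The function names are hypothetical.

```python
import random

def heads_in_flips(n1, rng):
    """Number of Heads in n1 fair coin flips (each random bit is one flip)."""
    return bin(rng.getrandbits(n1)).count("1")

def estimate_p_d(n1, n2, max_dev, rng):
    """Fraction of n2 sets in which the Heads count is more than max_dev from n1/2."""
    outside = 0
    for _ in range(n2):
        if abs(heads_in_flips(n1, rng) - n1 // 2) > max_dev:
            outside += 1
    return outside / n2

rng = random.Random(0)       # fixed seed so the run is repeatable
n1, n2 = 10_000, 2_000       # flips per set, and number of sets
max_dev = 100                # 1% of n1, so "outside" means below 49% or above 51%
p_d = estimate_p_d(n1, n2, max_dev, rng)
print(f"estimated p_d = {p_d:.3f}")
```

With only 2,000 sets the estimate is rough, but it should land in the neighborhood of the 5% value quoted above. Increasing $n_2$ tightens the estimate, at the cost of a longer run.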

The clause “a very large number of times” is critical in the law. If you flip a coin
only, say, 10 times, then you of course cannot be nearly certain that you will obtain
Heads half (or very close to half) of the time. In fact, the probability of obtaining
exactly five Heads is only $\binom{10}{5}/2^{10} \approx 25\%$.

You will note that the above statement of the law of large numbers is essentially
the same as the definition of probability presented at the beginning of Section 2.1.
Things therefore seem a bit circular. Is the law of large numbers a theorem or a
definition? This problem can be (somewhat) remedied by stating the law as, “If you 
repeat a random process a very large number of times, then the observed fraction of 
times that a certain event occurs will approach a definite value.” Then, given that 
a definite value is approached, we can define that value to be the probability. The 
law of large numbers is therefore what allows probability to be well defined. (If this 
procedure doesn’t allay your concerns about circularity, rest assured, it shouldn’t. 
See the “On average” subsection in Appendix A for some discussion of this.) 

We won’t give a formal proof of the law, but we’ll look at a coin-flipping setup
in detail. This should convince you of the truth of the law. We’ll basically do the
same type of analysis here that we did in Section 3.4, where we discussed the standard
deviation of the mean. But now we’ll work with Gaussian distributions, in
particular the one in Eq. (5.13), where the number of flips is 2n. Comparing the
Gaussian expression $P_G(x)$ in Eq. (5.13) with the second of the Gaussian expressions
in Eq. (4.42), we see that the standard deviation when the number of flips is
2n equals $\sigma = \sqrt{n/2}$. This is consistent with the fact that the standard deviation
when the number of flips is n equals $\sigma = \sqrt{n/4}$.

Fig. 5.3 shows plots of $P_G(x)$ for n = 10, 100, and 1000. So the numbers of coin
flips are 20, 200, and 2000. As n gets larger, the curve’s height shrinks, because
Eq. (5.13) says that the height is proportional to $1/\sqrt{n}$. And the width expands,
because σ is proportional to $\sqrt{n}$. Because these two factors are reciprocals of each
other, this combination of shrinking and expanding doesn’t change the area under
the curve. This is consistent with the fact that the area is always equal to 1.

Figure 5.3: Illustration of how the Gaussian distribution in Eq. (5.13) depends on n. The
number of coin flips is 2n, and x is the deviation from the expected number of Heads (which
is n).

The critical fact about the $\sqrt{n}$ expansion factor in the width is that although it
increases as n increases, it doesn’t increase as fast as n does. In fact, compared with
n, it actually decreases by a factor of $1/\sqrt{n}$. This means that if we plot $P_G(x)$ with
the horizontal axis running from −n to n (instead of it being fixed as in Fig. 5.3),
then the width of the curve actually shrinks by a factor of $1/\sqrt{n}$ (relative to n).
Fig. 5.4 shows this effect. In this figure, both the width (relative to n) and the height
of the curves are proportional to $1/\sqrt{n}$ (the height behaves the same as in Fig. 5.3),
so all of the curves have the same shape. They just have different sizes; they differ
successively by a factor of $1/\sqrt{10}$. The area under each curve is still equal to 1,
though, because of the different scales on the x axis.

[Three panels: n = 10, n = 100, n = 1000. Note different scales on x axis.]

Figure 5.4: Repeat of the curves in Fig. 5.3, but now with the full range of possible values
on the x axis.


A slightly more informative curve to plot is the ratio of $P_G(x)$ to its maximum
height at x = 0. This modified plot makes it easier to see what’s happening with
the width. Since the maximum value of the Gaussian distribution in Eq. (5.13) is
$1/\sqrt{\pi n}$, we’re now just plotting $e^{-x^2/n}$. So all of the curves have the same value
of 1 at x = 0. If we let the horizontal axis run from −n to n as in Fig. 5.4, we
obtain the plots shown in Fig. 5.5. These are simply the plots in Fig. 5.4, except that
they’re stretched in the vertical direction so that they all have the same height. We
see that the bump gets thinner and thinner (on the scale of n) as n increases. (Each
successive bump is thinner by a factor $1/\sqrt{10}$.) This implies that the percentage
deviation from the average of n Heads gets smaller and smaller as n increases.

Figure 5.5: Repeat of the curves in Fig. 5.4, measured relative to the maximum height.


We can now understand why the law of large numbers holds. Equivalently (for 
the case of coin flips), we can now understand the reason behind the claim we made 
at the end of Section 2.1, when we said that the observed fraction of Heads gets 
closer and closer to the actual probability of 1/2, as the number of trials gets larger 
and larger. We stated that if you flip a coin 100 times (which corresponds to n = 50
here), the probability of obtaining 49, 50, or 51 Heads is only about 24%. This is 
roughly 3 times the 8% result in Eq. (2.67), because the probabilities for 49 and 51 
are roughly the same as for 50. 

This probability of 24% is consistent with the first plot in Fig. 5.6, where we’ve
indicated the 49% and 51% tick marks (which correspond to x = ±1) on the x axis.
If we make a histogram of the probabilities (in order to interpret the probability as
an area), then the natural thing to do is to have the “bin” for 49 go from 48.5 to
49.5, etc. So if we’re looking at 49, 50, or 51 Heads, we’re actually concerned with
(approximately) the area between 48.5 and 51.5. This is the shaded area shown,
with a width of 3. (This shaded area might not look like it is 24% of the total area,
but it really is!) The distinction between the tick marks and the shaded area (which
extends 0.5 beyond the tick marks) matters in the present n = 50 case, but it is
inconsequential when n is large, because the distribution is effectively continuous.

Figure 5.6: Illustration of the law of large numbers. If the number of coin flips is very large
(as it is in the second plot), then the percentage of Heads is nearly certain to be very close to
50%.


At the end of Section 2.1, we also stated that if you flip a coin 100,000 times
(which corresponds to n = 50,000), the probability of obtaining Heads between
49% and 51% of the time is 99.999999975%. This is consistent with the second
plot in Fig. 5.6, because essentially all of the area under the curve lies between the
49% and 51% marks (which correspond to x = ±1000). The standard deviation for
n = 50,000 is $\sqrt{n/2} = \sqrt{25{,}000} \approx 158$. So the 51% mark corresponds to about
six standard deviations from the mean. There is virtually no chance of obtaining a
result more than 6σ from the mean. In contrast, in the case with n = 50, the standard
deviation is $\sqrt{n/2} = \sqrt{25} = 5$. Most of the area under the curve lies outside the
49% and 51% marks (where x = ±1), or rather, the 48.5% and 51.5% marks.
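
Both of the quoted probabilities can be reproduced from the exact binomial distribution. The Python sketch below is one way to do this. As an implementation choice of ours, it works with logarithms of factorials (via the log-gamma function) so that the huge binomial coefficients for 100,000 flips never need to be formed, and for the large case it adds up the tiny tail probabilities directly rather than subtracting from 1. The function names are hypothetical.

```python
import math

def log_prob(n, k):
    """Natural log of the probability of k Heads in n fair coin flips."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1)
            - math.lgamma(n - k + 1) - n * math.log(2))

def prob_outside(n, lo, hi):
    """Probability that the number of Heads is below lo or above hi."""
    ks = list(range(0, lo)) + list(range(hi + 1, n + 1))
    return math.fsum(math.exp(log_prob(n, k)) for k in ks)

# 100 flips: 49, 50, or 51 Heads.
inside_100 = 1 - prob_outside(100, 49, 51)
print(f"100 flips, 49 to 51 Heads:       {inside_100:.4f}")

# 100,000 flips: fewer than 49,000 or more than 51,000 Heads.
outside_100k = prob_outside(100_000, 49_000, 51_000)
print(f"100,000 flips, outside 49%-51%:  {outside_100k:.3e}")

# Number of standard deviations from the mean to the 51% mark.
sigma = math.sqrt(100_000 / 4)
print(f"sigma = {sigma:.1f}, so 1000 Heads is {1000 / sigma:.2f} sigma")
```

The first number should come out close to the 24% quoted above, and the second should be a few times $10^{-10}$, in line with the 99.999999975% figure.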

The law of large numbers states that if $p_d$ is the probability that the observed
fraction differs from the theoretical probability by more than a specified small number
δ, then $p_d$ goes to zero as the number of trials becomes large. The right plot
in Fig. 5.6 demonstrates this for δ = 1% = 0.01 and 100,000 flips. From the
99.999999975% probability mentioned above, there is only a $p_d$ = 0.000000025%
probability of ending up outside the 49%-51% range. Although we’ve demonstrated
the law of large numbers only in the case of coin flips (a binomial process
with p = 1/2), it holds for any random process that is performed a large number of
times.

The law of large numbers is an extremely important result, and it all comes down
to the fact that although the standard deviation of our Gaussian coin-flip distribution
grows with n (it is $\sigma = \sqrt{n/2}$ for 2n flips), it grows only like the square root of n,
so it shrinks in comparison with the full spread of outcomes (which is 2n). Said
a different way, although the width of the distribution grows in an additive sense
(this is sometimes called an “absolute” sense), it decreases in a multiplicative sense
(compared with n). It is the latter of these effects that is relevant when calculating
percentages.
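
The two competing trends are easy to tabulate. The short Python sketch below (the particular values of n are our choice) lists the standard deviation $\sqrt{n/2}$ for 2n flips alongside the same quantity divided by the number of flips.

```python
import math

rows = []
for n in (10, 100, 1000, 10_000, 100_000):
    flips = 2 * n
    sigma = math.sqrt(n / 2)          # standard deviation for 2n fair coin flips
    rows.append((flips, sigma, sigma / flips))

print(f"{'flips':>8} {'sigma':>9} {'sigma/flips':>12}")
for flips, sigma, ratio in rows:
    print(f"{flips:>8} {sigma:>9.2f} {ratio:>12.5f}")
```

The middle column grows with n while the last column shrinks, which is the additive versus multiplicative distinction described above.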

This is exactly the same observation that we made back in Section 3.4 when we 
discussed the standard deviation of the mean. This makes sense, of course, because 
the percentage of Heads we’ve been talking about here is exactly the same thing 
as the average number of Heads per flip that we talked about in Section 3.4. So 
technically everything in this section is just a repeat of what we did in Section 3.4. But
it never hurts to see something twice! 

The law of large numbers is what makes polls more accurate if more people are 
interviewed, and why casinos nearly always come out ahead. It is what makes it 
prohibitively unlikely for all of the air molecules in a room to end up on one side of 
the room, and why a piece of paper on your desk doesn’t spontaneously combust. 
The list of applications is essentially endless, and it would be an understatement to 
say that the world would be a very different place without the law of large numbers. 


5.3 Poisson and Gaussian 

We showed in Section 5.1 that the binomial distribution in Eq. (5.3) becomes the 
Gaussian distribution in Eq. (5.13) in the limit where the number of trials is large. 
We will now show that the Poisson distribution in Eq. (4.40) becomes a Gaussian 
distribution in the limit of large a, where a is the expected number of successes in a
given interval (of time, space, or whatever). 

Note that it wouldn’t make sense to take the limit of a large number of trials 
here, as we did in the binomial case, because the number of trials isn’t specified in 
the Poisson distribution. The only parameter that appears is the expected number 
of successes, a. However, in the binomial case, a large number n of trials implies 
a large expected number of successes (because the expected number pn grows with 
n). So the large-a limit in the Poisson case is analogous to the large-n limit in the 
binomial case. 

As in the binomial case, we will need to use the two approximations in Eqs. (5.1)
and (5.2). Applying Stirling’s formula to the k! in Eq. (4.40) gives (with the subscript
P for Poisson)

$$\begin{aligned} P_P(k) &= \frac{a^k e^{-a}}{k!} \\ &\approx \frac{a^k e^{-a}}{k^k e^{-k} \sqrt{2\pi k}}. \end{aligned} \tag{5.16}$$


The result in Problem 4.10 is that the maximum of $P_P(k)$ occurs at a (or technically
between a − 1 and a, but for large a this distinction is inconsequential). So
let’s see how $P_P(k)$ behaves near k = a. To this end, we’ll define x by k = a + x.
So x is the number of successes relative to the average, a. This is analogous to the
k = n + x definition we used in Section 5.1. As it did there, working with x here
will make our calculations much simpler. In terms of x, Eq. (5.16) becomes

$$P_P(x) \approx \frac{a^{a+x} e^{-a}}{(a+x)^{a+x} e^{-a-x} \sqrt{2\pi(a+x)}}. \tag{5.17}$$


We can cancel a factor of $e^{-a}$. And we can divide both the numerator and denominator
by $a^{a+x}$. Furthermore, we can ignore the x in the square root, because we’ll
find below that the x’s we’re concerned with are small compared with a. The result
is

$$P_P(x) \approx \frac{1}{\left(1 + \frac{x}{a}\right)^{a+x} e^{-x} \sqrt{2\pi a}}. \tag{5.18}$$


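It’s now time to use the approximation in Eq. (5.2). With the a in Eq. (5.2) defined
to be x/a here, and with the m defined to be a + x, Eq. (5.2) gives

$$\left(1 + \frac{x}{a}\right)^{a+x} \approx \exp\left((a+x)\left(\frac{x}{a}\right) - \frac{1}{2}(a+x)\left(\frac{x}{a}\right)^2\right). \tag{5.19}$$

Multiplying this out and ignoring the small $-x^3/2a^2$ term (because we’ll find below
that x ≪ a), we obtain

$$\left(1 + \frac{x}{a}\right)^{a+x} \approx \exp\left(x + \frac{x^2}{2a}\right). \tag{5.20}$$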


This is just Eq. (5.9) with n → a. Substituting Eq. (5.20) into Eq. (5.18) gives

$$P_P(x) \approx \frac{1}{e^{x} e^{x^2/2a} e^{-x} \sqrt{2\pi a}}, \tag{5.21}$$

which simplifies to

$$P_P(x) \approx \frac{e^{-x^2/2a}}{\sqrt{2\pi a}} \equiv P_G(x). \tag{5.22}$$


This is the desired Gaussian. If you want to write this result in terms of the actual
number k of successes, instead of the number x of successes relative to the average,
then the definition k = a + x gives x = k − a, so we have

$$P_P(k) \approx \frac{e^{-(k-a)^2/2a}}{\sqrt{2\pi a}} \equiv P_G(k). \tag{5.23}$$


As we noted in the last remark in Section 4.8, the Poisson distribution (and hence 
the Gaussian approximation to it) depends on only one parameter, a. And as with the 
Gaussian approximation to the binomial distribution, the Gaussian approximation to 
the Poisson distribution is symmetric around x = 0 (equivalently, k = a).

Fig. 5.7 shows a comparison between the exact $P_P(k)$ function in the first line of
Eq. (5.16), and the approximate $P_G(k)$ function in Eq. (5.23). The approximation
works quite well for a = 20 and extremely well for a = 100; the curve is barely
noticeable behind the dots.

[Plots of P(k) versus k. Note different scales on axes. Dots = exact Poisson; solid curve = approximate Gaussian.]

Figure 5.7: Comparison of the Poisson distribution in the first line of Eq. (5.16), and the
Gaussian approximation in Eq. (5.23), for different values of a.
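
The same comparison can be made numerically. The Python sketch below (function names are ours) evaluates the exact Poisson probability and the Gaussian approximation in Eq. (5.23) at the peak and at roughly one standard deviation on either side, for a = 20 and a = 100. The exact Poisson value is computed through logarithms, an implementation choice of ours that avoids very large intermediate numbers.

```python
import math

def poisson_exact(a, k):
    """Exact Poisson probability a^k e^{-a} / k!, computed via logarithms."""
    return math.exp(k * math.log(a) - a - math.lgamma(k + 1))

def gaussian_approx(a, k):
    """Gaussian approximation in Eq. (5.23)."""
    return math.exp(-(k - a) ** 2 / (2 * a)) / math.sqrt(2 * math.pi * a)

for a in (20, 100):
    sigma = math.sqrt(a)
    print(f"a = {a}")
    for k in (round(a - sigma), a, round(a + sigma)):
        print(f"  k = {k:3d}:  Poisson = {poisson_exact(a, k):.4f}   "
              f"Gaussian = {gaussian_approx(a, k):.4f}")
```

The agreement at the peak improves noticeably in going from a = 20 to a = 100, consistent with the plots described above.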


If we compare the Gaussian distribution in Eq. (5.23) with the second expression
in Eq. (4.42), we see that the Gaussian is centered at μ = a (of course) and that the
standard deviation is $\sigma = \sqrt{a}$. Again, since the Poisson distribution depends on
only the one parameter a, we already knew that the standard deviation has to be a
function of a. But it takes some work to show that it equals $\sqrt{a}$. Of course, as in
the sixth remark in Section 5.1, we know that if the Poisson distribution reduces to
a Gaussian, then the a’s must appear exactly as they do in Eq. (5.22), because we
know that the standard deviation (which is the σ in Eq. (4.42)) must agree with the
$\sqrt{a}$ value that we already found in Problem 4.13.

Note that although $\sqrt{a}$ grows with a, it doesn’t grow as fast as a itself. So as a
grows, the width of the bump in a Poisson distribution becomes thinner compared
with the distance a from the origin to the center of the bump. This is indicated in
Fig. 5.8, where we show the Poisson distributions for a = 100 and a = 1000. Note
the different scales on the axes.

Figure 5.8: As a grows, the Poisson bump’s width (which is proportional to $\sqrt{a}$) becomes
thinner compared with the distance a from the origin to the center of the bump.


We claimed at a few points in the above derivation that if a is large (as we are
assuming), then any values of x that we are concerned with are much smaller than
a. The logic behind this statement is exactly the same as the logic in the first remark
in Section 5.1, because a appears in Eq. (5.22) in basically the same way that n
appears in Eq. (5.13). In a nutshell, for large a, the only values of x for which the
$P_G(x)$ in Eq. (5.22) is nonnegligible are ones that are much smaller than a. Values
of x that are larger than this might lead to probabilities that don’t satisfy Eq. (5.22),
but we don’t care, because these probabilities are so small.


5.4 Binomial, Poisson, and Gaussian 

We have seen how the binomial, Poisson, and Gaussian distributions are related to 
each other in various limits. In Section 4.7.2 we showed how the binomial leads 
to the Poisson in the small-p and large-n limit. In Section 5.1 we showed how the
binomial reduces to the Gaussian in the large-n limit. And in Section 5.3 we showed 
how the Poisson reduces to the Gaussian in the large-a limit. The summary of these 
relations is shown in Fig. 5.9. 


[Diagram: Binomial → Poisson, labeled “continuum limit (small p, large n)”.]

Figure 5.9: How the binomial, Poisson, and Gaussian distributions relate in various limits.


The detailed descriptions of the three relations are the following. 

• (Section 4.7.2) The vertical arrow on the left side of Fig. 5.9 indicates that
the Poisson distribution is obtained from the binomial distribution by taking
the continuum limit. By this we mean the following. Consider a given time
(or space, etc.) interval t. Imagine that instead of having trials take place at a
rate of $n_0$ per time t (each with probability $p_0$ of success), we have them take
place at a rate of $10n_0$ per time t (each with probability $p_0/10$ of success),
or at a rate of $100n_0$ per time t (each with probability $p_0/100$ of success).
And so on, with larger rates n and smaller probabilities p, with the product
pn held fixed at $p_0 n_0$. All of these scenarios have the same average of $a = p_0 n_0$
successes occurring per time t. And all of them are governed by the
binomial distribution. But the more that time is subdivided (that is, the more
continuously that the trials take place), the closer the probability distribution
(for the number of successes per time t) comes to the Poisson distribution
given in Eq. (4.40), with $a = p_0 n_0$. We can imagine taking the n → ∞ and
p → 0 limits, with the product pn held fixed at a.

• (Section 5.1) The upper-right diagonal arrow in Fig. 5.9 indicates that the
Gaussian distribution is obtained from the binomial distribution by taking the
large-n limit, where n is the number of trials performed. In Eq. (5.14) we
derived this result for p = 1/2, and then in Eq. (5.15) we stated the result for
a general value of p.

However, for general p, the condition under which the binomial reduces to
the Gaussian turns out to be not just that n is large, but rather that both pn and
(1 − p)n are large. The need for this stricter condition becomes apparent if
you work through the (rather messy) derivation of the more general result in
Eq. (5.15).

Note that since p is at most 1, large pn (or (1 − p)n) necessarily implies large
n. But the converse isn’t true. That is, large n doesn’t necessarily imply large
pn and (1 − p)n. If p is an “everyday” number (such as p = 1/2 for a coin
flip), then large n does in fact imply large pn and (1 − p)n. But if p is very
small, then n needs to be extremely large (“doubly” large, in a sense), in order
to make pn large. For example, if $p = 10^{-3}$, then $n = 10^3$ doesn’t make pn
large. We need n to be much larger, say, $10^5$ or $10^6$. A similar statement holds
with p replaced with 1 − p.

Since pn and (1 − p)n are the expected numbers of successes and failures in the
binomial process involving n Bernoulli trials, we see that the condition under
which the binomial reduces to the Gaussian is that both of these expected
values are large. If neither p nor 1 − p is exceedingly small, then this condition
reduces to the condition that n is large.

• (Section 5.3) The lower-right diagonal arrow in Fig. 5.9 indicates that the 
Gaussian distribution is obtained from the Poisson distribution by taking the 
large-a limit, where a is the expected number of events that happen during the 
particular interval (of time, space, etc.) that you are considering. We derived 
this result in Eq. (5.22). The large-a limit in the Poisson-to-Gaussian case is 
consistent with the large-pn (and (1 - p)n) limit in the binomial-to-Gaussian 
case, because both a and pn are the expected number of events/successes. 
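
The stricter condition (both pn and (1 - p)n large) can be checked numerically. The following Python sketch is our own illustration, not part of the derivation above, and its function names are hypothetical. It compares the exact binomial probabilities with a Gaussian of mean pn and standard deviation √(np(1 - p)), which we assume is the content of Eq. (5.15), and reports the worst discrepancy as a fraction of the height of the binomial peak.

```python
import math

def binomial_pmf(n, p, k):
    """Exact binomial probability of k successes in n trials (via logs, to avoid overflow)."""
    log_prob = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_prob)

def gaussian_pmf(n, p, k):
    """Gaussian approximation with mean pn and variance np(1 - p)."""
    mean, var = n * p, n * p * (1 - p)
    return math.exp(-(k - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def worst_relative_error(n, p):
    """Largest |binomial - Gaussian| within 6 standard deviations of the mean,
    divided by the height of the binomial peak."""
    mean, sd = n * p, math.sqrt(n * p * (1 - p))
    lo = max(0, math.floor(mean - 6 * sd))
    hi = min(n, math.ceil(mean + 6 * sd))
    exact = [binomial_pmf(n, p, k) for k in range(lo, hi + 1)]
    approx = [gaussian_pmf(n, p, k) for k in range(lo, hi + 1)]
    return max(abs(e - a) for e, a in zip(exact, approx)) / max(exact)

for n, p in [(10**3, 0.5), (10**3, 1e-3), (10**6, 1e-3)]:
    err = worst_relative_error(n, p)
    print(f"n = {n:>7}, p = {p}: pn = {n * p:g}, worst relative error = {err:.3f}")
```

Under these assumptions, the first and third cases should agree with the Gaussian to within a percent or so, while the middle case (pn = 1) should be off by tens of percent, even though n = 10^3 is large.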


5.5 The central limit theorem 

There are two paths in Fig. 5.9 that go from the binomial distribution to the Gaussian 
distribution. One goes directly by taking the large-pn and (1 - p)n limits (which are 
simply the large-n limit if p isn’t extremely close to 0 or 1). The other goes via the 
Poisson distribution by first taking the continuum limit, and then taking the large-a 
limit. The fact that all of the arrows in Fig. 5.9 eventually end up at the Gaussian 
(equivalently, that no arrows point away from the Gaussian) is consistent with the 
central limit theorem. There are different forms of this theorem, but in the most 
common form, it says that under a reasonable set of assumptions: 

• Central limit theorem: 

If you perform a large number of trials of a random process, then the probability 
distribution for the sum (or average) of the outcomes is approximately 
a Gaussian (or “normal”) distribution. The greater the number of trials, the 
better the Gaussian approximation. 

The formal proof of this theorem involves some heavy-duty math, so we won’t give 
it here. We’ll instead just look at some examples that hopefully will convince you 
of the theorem’s validity. 

Let’s start with the coin-flipping scenarios in Fig. 5.1. The central limit theorem 
requires that the trials have numerical values. So technically the outcomes of Heads 
and Tails aren’t applicable. But if we assign the value 1 to Heads and 0 to Tails, then 
we have a Bernoulli process with proper numerical values. The sum of the outcomes 
of many of these Bernoulli trials is then simply the number of Heads, which is just 
what appears on the x axis (relative to the expected number) in Fig. 5.1. For two 
trials (flips), the probability distribution doesn’t match up too well with a Gaussian. 
But for six flips, it matches up reasonably well. And for 20 flips, it matches up 
extremely well. 

So far, there is nothing new here. Coin flips are governed by the binomial distribution 
(which arises from the sum of n Bernoulli trials), and we already know that a 
binomial distribution reduces to a Gaussian distribution when n is large. The power 
of the central limit theorem comes from the fact that we can start with any arbitrary 
distribution (not just a Bernoulli one), and if we perform a large number of trials, 
the sum will be approximately Gaussian distributed. 

For example, imagine rolling a large number of dice and looking at the probability 
distribution for their sum.¹ The probability distribution for a single die consists 
of six points on a horizontal line, because all six numbers have equal probabilities 
of 1/6. But the central limit theorem says that if we roll 100 dice, the distribution 
for the sum will be (essentially) a Gaussian centered around 350, since the average 
for each roll is 3.5. We can therefore start with a flat-line distribution, and then if 
we perform enough trials, we get a Gaussian distribution for the sum. If you want to 
experimentally verify this, you will need to consider a large number of sets of trials, 
with each set consisting of 100 trials (rolls). This is a task best left for a computer 
and a random number generator! 
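
As a minimal sketch of such a computer experiment (our own illustration, with hypothetical function and parameter names), the following Python code rolls many sets of 100 dice and summarizes the resulting sums. We use only 5000 sets to keep the run time short, so the histogram would be noisier than one built from a much larger number of sets.

```python
import random
import statistics

def simulate_sums(num_sets, rolls_per_set, seed=0):
    """Return a list of num_sets sums, each the total of rolls_per_set fair dice."""
    rng = random.Random(seed)
    return [sum(rng.randint(1, 6) for _ in range(rolls_per_set))
            for _ in range(num_sets)]

sums = simulate_sums(num_sets=5000, rolls_per_set=100)
mean = statistics.mean(sums)
sd = statistics.pstdev(sums)

# For a Gaussian, roughly 68% of the values lie within one standard
# deviation of the mean, so this fraction is a quick check of the shape.
fraction_within_one_sd = sum(abs(s - mean) <= sd for s in sums) / len(sums)

print(f"mean of the sums: {mean:.1f} (the text predicts 350)")
print(f"standard deviation of the sums: {sd:.1f}")
print(f"fraction within one standard deviation: {fraction_within_one_sd:.2f}")
```

Binning the list `sums` into a histogram would show the bump shape directly. The summary numbers above are only a rough stand-in for that picture.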

Note that (as stated in the theorem) we need the number of trials (die rolls, coin 
flips, etc.) to be large. If you roll only one die, then the plot of the probability 
distribution for the sum (which is just the single number showing) simply consists 
of six points on a horizontal line. This row of six points certainly does not look like 
a Gaussian curve. If you instead roll two dice, then as an exercise you can show that 
Table 1.5 implies that the distribution for the sum takes the shape of a triangle that 
is peaked at 2 • 3.5 = 7. This triangle isn’t a Gaussian either. But it’s closer to a 
Gaussian than a flat line. If you roll three dice, the distribution for the sum (which is 
peaked at 3 • 3.5 = 10.5) takes a curved shape that starts to look like a Gaussian; see 
Fig. 5.10.² With 10 dice, the distribution takes a Gaussian shape, for all practical 
purposes. The meaning of the word “large” in the first line of the statement of the 
central limit theorem depends on the process at hand. But in most cases, 10 or 20 
trials of the process (rolls here) are plenty sufficient to yield an essentially Gaussian 
distribution. 

¹ We’re now doing something new. With the exception of a brief mention of the sum of two dice 
on page 11, all of our previous encounters with dice in this book have involved the number of times a 
particular face comes up. We generally haven’t dealt with the sum of the dice. 

² These histograms were generated numerically. Each bin is associated with the value at its lower end. 
Technically these histograms aren’t probability distributions, because we’re plotting the actual number 
of times each sum occurs, instead of the probability that it occurs. But the probability that each sum 
occurs is obtained by just dividing the number of times it occurs by the 10^6 sets of rolls. 




[Figure 5.10: histograms of S. Surviving panel titles: “S = roll of 1 die” and “S = sum of 2 dice.”] 

Figure 5.10: Illustration of the central limit theorem. These histograms were generated 
numerically with n_s = 10^6 sets of n_t dice rolls, for n_t = 1, 2, 3, 10. Each bin in the histograms 
is associated with the value at its lower end. If the number n_t of trials in each set is reasonably 
large, then the probability distribution is essentially Gaussian, as illustrated by the last 
histogram. 
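
The shapes described above (flat line, triangle, curved bump) can also be obtained exactly, without any random numbers. The following Python sketch is our own illustration with a hypothetical function name. It builds the distribution for the sum of several dice by adding one die at a time, using exact fractions.

```python
from fractions import Fraction

def dice_sum_distribution(num_dice):
    """Exact distribution {sum: probability} for the sum of num_dice fair dice."""
    dist = {0: Fraction(1)}          # with no dice, the sum is certainly 0
    for _ in range(num_dice):
        new_dist = {}
        # Adding one more die spreads each existing sum over six new sums.
        for total, prob in dist.items():
            for face in range(1, 7):
                new_dist[total + face] = new_dist.get(total + face, Fraction(0)) + prob / 6
        dist = new_dist
    return dist

for num_dice in (1, 2, 3, 10):
    dist = dice_sum_distribution(num_dice)
    peak = max(dist.values())
    peak_sums = [s for s, prob in dist.items() if prob == peak]
    print(f"{num_dice} dice: most likely sum(s) {peak_sums}, probability {float(peak):.4f}")
```

For two dice the probabilities rise linearly up to the sum 7 and then fall linearly, which is the triangle mentioned in the text. For three dice the sums 10 and 11 tie for the peak, consistent with a distribution centered at 10.5.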

The above examples dealt with the sum of the values of the random variable. 
But the central limit theorem holds for the average of the values too, of course, 
because the average is obtained by dividing the sum by a particular number, namely 
the number n_t of trials (dice rolls, etc.). So if the histogram of the sum takes a 
Gaussian form, then so does the histogram of the average. The numbers on the x 
axis are simply reduced by a factor of n_t. If we work with averages, the histograms 
in Fig. 5.10 will all be centered around 3.5. 

The numbers n_t and n_s 

We should clarify the distinction between the two types of large numbers that arise 
when talking about the central limit theorem: 

• The first is the number n_t of trials that generate each data point. (Each data 
point is the sum or average of the results of the n_t trials.) For example, n_t 
might be 10 dice rolls, or 50 coin flips. The distribution for the sum of the 
random variables associated with these n_t trials has an (approximately) Gaussian 
shape if n_t is large. Usually n_t ~ 20 is sufficiently large. 

• The second is the number n_s of sets, each consisting of n_t trials, that you must 
consider if you want to experimentally measure the distribution of the data 
points. Each set generates a data point (which is the sum of the results of 
the n_t trials in that particular set). If n_s isn’t large, then you won’t get good 
statistics; the measured distribution will be choppy. For all of the numerically 
generated histograms in Fig. 5.10, we used n_s = 10^6. This number was large 
enough so that all of the histograms have essentially the same shape as the 
actual theoretical probability distribution. 


An important difference between n_t and n_s is the following. The true (theoretical) 
probability distribution for the sum of n_t trials depends on n_t, of course (along 
with the specifics of what each trial involves - coins, dice, or whatever). However, 
the true distribution has nothing to do with n_s. This number is simply the number 
of sets, each consisting of n_t trials, that you are considering if you are trying to 
experimentally determine the true distribution for the sum of the n_t trials. But the true 
distribution (which depends on n_t) exists whether or not you try to determine it by 
considering an arbitrary number n_s of sets. 

As an example of why n_s must be large (if you want to accurately determine the 
true distribution), consider n_t = 10 dice rolls. The probability distribution for the 
sum of the 10 dice is essentially a Gaussian (even though 10 isn’t a terribly large 
number) that is centered at 10 · 3.5 = 35, as we saw in the fourth plot in Fig. 5.10. 
If you want to experimentally verify that this is indeed the distribution, it won’t do 
much good to consider only n_s = 100 sets of n_t = 10 rolls. The distribution of the 
100 observed data points (sums of 10 dice) might look like the first histogram in 
Fig. 5.11. (As in Fig. 5.10, each bin in these histograms is associated with the value 
at its lower end.) This isn’t much of a Gaussian. But if we increase the number 
of sets to n_s = 1000, 10,000, or 100,000, we obtain the three other histograms 
shown, which progressively look more and more like a Gaussian. We see that a 
nice Gaussian is obtained with n_t = 10 (which isn’t that large) and n_s = 100,000 
(which is quite large). So perhaps the numbers n_t and n_s can be better described 
with, respectively, the words “at least medium-ish” and “large.” Note that since the 
n_s = 10^5 plot in Fig. 5.11 is already quite smooth, nothing much was gained by 
increasing n_s to 10^6 in the fourth plot (with n_t = 10) in Fig. 5.10. These two plots 
are essentially the same (up to a factor of 10 on the vertical axis). See Problem 5.6 
for the exact shape. 
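
The effect of n_s can be quantified with a short experiment. The following Python sketch is our own illustration, and the function names are hypothetical. It keeps n_t = 10 fixed and measures how far the observed frequencies are from the true distribution for several values of n_s. As a measure of “choppiness” we use the total variation distance (half the sum of the absolute differences between observed and true probabilities), which is our choice and not something used in the text.

```python
import random

def exact_sum_distribution(n_t):
    """True distribution {sum: probability} for the sum of n_t fair dice."""
    dist = {0: 1.0}
    for _ in range(n_t):
        new_dist = {}
        for total, prob in dist.items():
            for face in range(1, 7):
                new_dist[total + face] = new_dist.get(total + face, 0.0) + prob / 6
        dist = new_dist
    return dist

def observed_frequencies(n_t, n_s, rng):
    """Fraction of the n_s sets in which each sum of n_t dice was observed."""
    counts = {}
    for _ in range(n_s):
        s = sum(rng.randint(1, 6) for _ in range(n_t))
        counts[s] = counts.get(s, 0) + 1
    return {s: c / n_s for s, c in counts.items()}

def total_variation_distance(n_t, n_s, seed=0):
    """Half the summed |observed - true| probability; 0 means a perfect match."""
    rng = random.Random(seed)
    exact = exact_sum_distribution(n_t)
    observed = observed_frequencies(n_t, n_s, rng)
    return 0.5 * sum(abs(observed.get(s, 0.0) - p) for s, p in exact.items())

for n_s in (100, 1000, 10_000):
    print(f"n_s = {n_s:>6}: distance from the true distribution = "
          f"{total_variation_distance(10, n_s):.3f}")
```

The distance should shrink roughly like 1/√n_s, which is why each factor of 10 in n_s makes the histogram noticeably (but not dramatically) smoother. The true distribution itself never changes, because it depends only on n_t.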

One more important point: Figs. 5.10 and 5.11 both show a progression of histograms 
that become more and more Gaussian, so we should reiterate exactly what 
each figure illustrates. In Fig. 5.10, the progression of histograms is the statement 
of the central limit theorem: the probability distribution approaches a Gaussian as 
the number of trials n_t (whose sum or average we are taking) grows. Because the 
n_s = 10^6 value we used is so large, all of the histograms have essentially the same 
shape as the actual probability distributions. In contrast, in Fig. 5.11 the progression 
of histograms is simply the statement that we need to consider a large number 
n_s of data points if we want to produce a good (not noisy) approximation to the 
actual probability distribution, which in the present case happens to be essentially 
Gaussian, due to (1) the central limit theorem and (2) the reasonably large number 
n_t = 10. 

[Figure 5.11: histograms of S = sum of n_t = 10 dice, where n_s = number of sets of n_t = 10 dice. Surviving panel titles: n_s = 100 and n_s = 1000.] 

Figure 5.11: Illustration of why n_s needs to be large. If n_s isn’t large, then the observed 
distribution doesn’t look like the actual probability distribution. This figure is not an illustration 
of the central limit theorem. It is an illustration of the fact that the data is “noisy” when n_s is 
small. Each bin in the histograms is associated with the value at its lower end. 


Two more examples 

The central limit theorem holds for any underlying probability distribution (subject 
to some reasonable assumptions). We know from Fig. 5.10 that the theorem holds 
for the sum (or average) of many dice rolls, where the underlying distribution is a flat 
line of six points. And we also know from our binomial-to-Gaussian derivation in 
Section 5.1 that the theorem holds for the sum (or average) of the number of Heads 
that appear in many coin flips, where the underlying distribution is a Bernoulli one. 
But the theorem also holds for other underlying probability distributions that don’t 
look as nice. For example, consider the discrete distribution shown in Fig. 5.12. 
The probabilities for the three possible outcomes are p(2) = 0.6, p(3.2) = 0.1, and 
p(7) = 0.3. 

You can quickly show that the expectation value of this distribution is 3.62. The 
central limit theorem says that the probability distribution for the average of, say, 
100 numbers chosen from the distribution is a Gaussian centered at 3.62. And indeed, 
Fig. 5.13 shows a Gaussian histogram of n_s = 100,000 numerically generated 
data points, each of which is the average of n_t = 100 numbers chosen from the 
distribution. The histogram is centered at about 3.6. 
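
An experiment of this kind might be sketched in Python as follows. This is our own illustration, not the code used for Fig. 5.13. The function name is hypothetical, and we use a smaller n_s than in the figure so that it runs quickly.

```python
import random
import statistics

values = [2, 3.2, 7]
probabilities = [0.6, 0.1, 0.3]

# Expectation value of the three-outcome distribution.
expectation = sum(v * p for v, p in zip(values, probabilities))

def simulate_averages(n_t, n_s, seed=0):
    """Return n_s data points, each the average of n_t numbers drawn from the distribution."""
    rng = random.Random(seed)
    return [sum(rng.choices(values, weights=probabilities, k=n_t)) / n_t
            for _ in range(n_s)]

averages = simulate_averages(n_t=100, n_s=2000)

print(f"expectation value of the distribution: {expectation:.2f}")
print(f"mean of the simulated averages: {statistics.mean(averages):.2f}")
print(f"standard deviation of the simulated averages: {statistics.pstdev(averages):.3f}")
```

Even though the underlying distribution is lopsided, with only three possible values, a histogram of `averages` should show a single symmetric bump centered near the expectation value.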

[Figure 5.12: plot of P(x) versus x.] 

Figure 5.12: An arbitrary probability distribution with three possible outcomes. 

[Figure 5.13: histogram; horizontal axis: A = average of n_t = 100 numbers from the distribution in Fig. 5.12.] 

Figure 5.13: A histogram of n_s = 100,000 averages of n_t = 100 numbers chosen from the 
distribution in Fig. 5.12. 

All of the examples so far in this section have involved discrete distributions. 
But the central limit theorem holds for continuous distributions too. Fig. 5.14 shows 
a Gaussian histogram of n_s = 100,000 numerically generated data points, each of 
which is the average of n_t = 50 numbers taken from a uniform distribution ranging 
from 0 to 1. The average of this distribution is simply 0.5, which is correctly where 
the Gaussian is centered. The task of Problem 5.7 is to verify that the histograms in 
Figs. 5.13 and 5.14 have the correct standard deviations. 

We mentioned above right after the statement of the central limit theorem that 
due to the math involved, we haven’t included a proof. But hopefully the above 
examples have convinced you of the theorem’s validity. 


[Figure 5.14: histogram; horizontal axis: A = average of n_t = 50 numbers from a uniform distribution.] 

Figure 5.14: A histogram of n_s = 100,000 averages of n_t = 50 numbers chosen from 
a uniform distribution (from 0 to 1). 


5.6 Summary 

• For a large number of trials, n, a binomial distribution reduces to a Gaussian 
distribution. We showed this for coin flips, but it also holds for a binomial 
distribution governed by a general probability p. The standard deviation of 
the Gaussian is √(np(1 - p)). 

• The law of large numbers states that the measured probability over a large 
number of trials will be essentially equal to the theoretical probability. This 
law is a consequence of the fact that a Gaussian distribution has the property 
that the larger the number of trials, the thinner the distribution’s bump, relative 
to the whole span of possible outcomes. 

• In the limit of a large expected number of events, a, a Poisson distribution 
reduces to a Gaussian distribution. The standard deviation of the Gaussian is 
√a. 

• The central limit theorem says (in its most common form) that if you perform 
a large number of trials of a random process, the probability distribution for 
the sum (or average) of the outcomes is approximately a Gaussian distribution. 


5.7 Exercises 

See www.people.fas.harvard.edu/~djmorin/book.html for a supply of problems 
without included solutions. 


5.8 Problems 

Section 5.1: Binomial and Gaussian 

5.1. Equal percentages ** 

In the last paragraph of Section 2.1, the same percentage, 99.999999975%, 
appeared twice. Explain why you know that these two percentages must be 
the same, even if you don’t know what the common value is. 

5.2. Rolling sixes ** 

In the solution to Problem 2.13 (known as the Newton-Pepys problem), we 
noted that the answer to the question, “If 6n dice are rolled, what is the probability 
of obtaining at least n 6’s?,” approaches 1/2 in the n → ∞ limit. 
Explain why this is the case. 

5.3. Coin flips ** 

If you flip 10^4 coins, how surprised would you be if the observed percentage 
of Heads differs from the expected value of 50% by more than 1%? Answer 
the same question for 10^6 coins. (These numbers are large enough so that the 
binomial distribution can be approximated by a Gaussian.) 

5.4. Identical distributions ** 

A thousand dice are rolled. Fig. 5.15 shows the probability distribution (given 
by Eq. (5.15)) for the number of 6’s that appear, relative to the expected number 
(which is 167). How many coins should you flip if you want the probability 
distribution for the number of Heads that appear (relative to the expected 
number) to look exactly like the distribution in Fig. 5.15 (at least in the Gaussian 
approximation)? 


[Figure 5.15: plot of P(x) versus x.] 



Figure 5.15: The probability distribution for the number of 6’s in 1000 dice rolls, 
relative to the expected number, 167. 


Section 5.2: The law of large numbers 

5.5. Gambler’s fallacy * 

Assume that after 20 coin flips, you have obtained only five Heads. The 
probability of this happening is small (about 1.5%, since (20 choose 5)/2^20 = 0.0148), 
but not negligible. Since the law of large numbers says that the fraction of 
Heads approaches 50% as the number of flips gets large, should you expect 
to see more Heads than Tails in future flips? 

Section 5.5: The central limit theorem 

5.6. Finding the Gaussian ** 

What is the explicit form of the Gaussian function f(x) that matches up with 
the fourth histogram in Fig. 5.11? Assume that n_t = 10 is large enough so 
that the Gaussian approximation does indeed hold. 

5.7. Standard deviations ** 

Calculate the theoretically predicted standard deviations of the histograms in 
Figs. 5.13 and 5.14, and check that your results are consistent with a visual 
inspection of the histograms. You will need the result from Problem 4.3 for 
Fig. 5.14. 


5.9 Solutions 


5.1. Equal percentages 

Looking back at the given paragraph, our goal is to show that the probability of obtaining 
Heads between 49% and 51% of the time in 10^5 coin flips equals the probability 
of obtaining Heads between 49.99% and 50.01% of the time in 10^9 coin flips. In the 
first case, the 1% deviation from average corresponds to 10^5/10^2 = 10^3 Heads. In the 
second case, the 0.01% deviation from average corresponds to 10^9/10^4 = 10^5 Heads. 
Eq. (3.48) gives the standard deviation of the number of Heads that show up in n 
tosses of a fair coin as σ = √(n/4). In the present two cases, this yields standard deviations 
of √(10^5/4) = 158 and √(10^9/4) = 15,800. The numbers of standard deviations 
corresponding to the above two spreads of 10^3 and 10^5 Heads are therefore 


$$\frac{10^3}{158} = 6.3 \qquad \text{and} \qquad \frac{10^5}{15{,}800} = 6.3. \tag{5.24}$$


Since these numbers are equal, the probabilities of lying within the two specified 
ranges must be equal. We have used the fact that the Gaussian approximation is valid 
in both scenarios, which implies that the distribution relative to the mean is completely 
determined by the standard deviation. 

If you want to show that the common probability equals 99.999999975%, you can 
numerically either add up the exact binomial probabilities in the given ranges, or 
integrate the Gaussian approximations over the given ranges. A computer will be 
necessary for either option, of course. (Performing the sums or integrals over the 
complementary regions outside the given ranges works just as well, or even better.) 
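
As a sketch of the first option for the 10^5 case, the following Python code (our own illustration, with hypothetical function names) adds up the exact binomial probabilities in the complementary region and compares the result with the Gaussian tail. We assume that “between 49% and 51%” includes the endpoints 49,000 and 51,000. The text doesn’t specify this, and it changes the answer only slightly.

```python
import math

def log_binomial_pmf(n, k):
    """Log of the probability of exactly k Heads in n fair coin flips."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            - n * math.log(2))

def exact_prob_outside(n, low):
    """Exact probability of fewer than low Heads or more than n - low Heads."""
    lower_tail = math.fsum(math.exp(log_binomial_pmf(n, k)) for k in range(low))
    return 2 * lower_tail   # the two tails are equal for a fair coin

def gaussian_prob_outside(num_sigmas):
    """Two-sided Gaussian probability of being more than num_sigmas from the mean."""
    return math.erfc(num_sigmas / math.sqrt(2))

n = 10**5
sigma = math.sqrt(n / 4)                        # about 158
exact = exact_prob_outside(n, 49_000)
approx = gaussian_prob_outside(1000 / sigma)    # 1000 Heads is about 6.3 sigma

print(f"exact binomial probability outside the range: {exact:.3g}")
print(f"Gaussian approximation:                       {approx:.3g}")
print(f"probability inside the range: {100 * (1 - exact):.9f}%")
```

Both numbers should come out near 2.5 · 10^-10, that is, 0.000000025%, which is the complement of the percentage quoted in the problem.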

5.2. Rolling sixes 

From Eq. (5.15) and the surrounding discussion, we know that for a large number 
of rolls, the binomial distribution for the number of 6’s that appear is essentially a 
Gaussian distribution centered at the mean (which is n, if there are 6n rolls). Since the 
Gaussian distribution is symmetric around the mean, we conclude that there is a 1/2 
chance that the number of 6’s is greater than or equal to the mean value, n, as desired. 
Technically, the probability is slightly larger than 1/2. This is true because if we split 
the Gaussian distribution exactly down the middle, then we’re including only half of 
the probability of obtaining exactly n 6’s. We need to include all of this probability, 
because we’re concerned with the probability of at least n 6’s. The probability of 
obtaining at least n 6’s in 6n rolls therefore equals 1/2 plus half of the probability of 
obtaining exactly n 6’s. But if n is large, the probability of obtaining exactly n 6’s is 
small, so it doesn’t matter much if we ignore half of it. 

If n is small, the above logic doesn’t hold. This is consistent with the fact that the 
probability of obtaining at least n 6’s can be appreciably more than 1/2, as we saw in 
Problem 2.13. For small n, the above logic breaks down partly because the probability 
of obtaining exactly n 6’s is appreciable, and partly because the Gaussian approximation 
doesn’t hold for small n. 

5.3. Coin flips 

Eq. (3.48) gives the standard deviation of the number of Heads that show up in n 
tosses of a fair coin as σ = √(n/4). For n = 10^4 and 10^6 this yields σ = 50 and 500. 
And since 1% of 10^4 and 10^6 equals 10^2 and 10^4, these ±1% spreads are equal to 
10^2/50 = 2 and 10^4/500 = 20 standard deviations. 

As noted in Fig. 4.26, the probability of being within 2σ of the mean is 95%. So in 
the case of 10^4 coins, there is a 5% probability that the percentage of Heads differs 
from the expected value of 50% by more than 1%. 5% is small but not negligible, so 
if you observe a deviation larger than 1%, you will probably be mildly surprised. 

In contrast, the probability of being within 20σ of the mean is exactly 1, for all practical 
purposes. We mentioned near the end of Section 4.8 that the probability of being 
within five standard deviations of the mean is about 0.9999994. Tables of these probabilities 
don’t even bother going anywhere near 20σ, because the probability is so 
close to 1. So in the case of 10^6 coins, there is a 0% probability that the percentage 
of Heads differs from the expected value of 50% by more than 1%. Therefore, even if 
you said that you would be “extremely, outrageously, massively surprised” if the deviation 
from the mean exceeded 1%, that still doesn’t do justice to the unlikelihood. 
You are simply not going to end up 20σ from the mean, period; see the remark below. 
The law of large numbers is a powerful thing! 

What about 10^5 coins, which is between the above two cases? From Problem 5.1, we 
know that the probability that the percentage of Heads differs from 50% by more than 
1% equals 0.000000025%. So if we increase the number of coins from 10^4 to 10^5, the 
probability of being outside the ±1% marks drops from a reasonable 5% to essentially 
zero. And then for 10^6 coins, the probability is exactly zero for all practical purposes. 


Remark: Let’s produce a (very rough) upper bound on the probability of being outside 
20σ. If x = 20σ, then the second expression in Eq. (4.42) gives a probability density 
of 

$$f(20\sigma) = \frac{e^{-(20\sigma)^2/2\sigma^2}}{\sqrt{2\pi\sigma^2}} = \frac{e^{-200}}{\sigma\sqrt{2\pi}} \approx \frac{10^{-87}}{\sigma\sqrt{2\pi}}. \tag{5.25}$$


If x = 21σ, you can show that the above 10^-87 factor becomes 10^-96. So f(21σ) is 
completely negligible compared with f(20σ). We can therefore assume that f(21σ) 
is exactly zero. To obtain an upper bound on the area of the distribution that lies 
outside 20σ, we can assume that f(x) takes on the constant value of f(20σ) between 
x = 20σ and x = 21σ, and then suddenly drops to zero. Of course, it doesn’t take 
on this constant value; it decreases fairly quickly to nearly zero. But all we care about 
here is obtaining an upper bound on the area; a significant overestimate is fine for our 
purposes. So assuming a constant value of f(20σ) between x = 20σ and x = 21σ, the 
area in this span of one standard deviation is σ · f(20σ), which from Eq. (5.25) equals 
10^-87/√(2π). Doubling this (to account for the span between -20σ and -21σ) gives 
√(2/π) · 10^-87 as an upper bound on the area. We can therefore say that a (generous) 
upper bound on the probability of being outside 20σ is 10^-87. The actual probability 
obtained numerically from the exact binomial distribution is 5.5 · 10^-89, which is about 
20 times smaller than 10^-87. 


To get an idea of how ridiculously small this probability is, imagine (quite hypothetically) 
gathering together as many people as there are protons and neutrons in the earth 
(roughly 4 · 10^51), and imagine each person running the given experiment (flipping 10^6 
coins) once a second for the entire age of the universe (roughly 4 · 10^17 seconds). And 
then repeat this whole process a quintillion (10^18) times. This will yield 1.6 · 10^87 runs 
of the experiment, in which case (working with our high 10^-87 estimate) you might 
expect one or two runs to have percentages of Heads that differ from 50% by more 
than 1%. But again, this is a high estimate, given the actual probability of 5.5 · 10^-89. 
Although most people might think that there is a nonnegligible probability of obtaining 
more than 510,000 or fewer than 490,000 Heads in 10^6 coin flips, the probability is in 
fact zero, for all practical purposes. (As another example, you can show that the same 
20σ result applies when flipping a trillion coins and ending up outside the 49.999% 
to 50.001% range.) The above probability of 10^-87 isn’t just small; it is ridiculously 
small. The moral of all this is that unless you think in terms of the standard deviation 
(which is proportional to √n), it’s hard to get any intuition for these types of setups. 
People have a tendency to think linearly, that is, to assume that a reasonable deviation 
from the mean might be, say, n/10 or n/100, independent of the size of n. This linear 
thinking will lead you astray. ♣ 
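
The numbers in this solution can be reproduced with a few lines of Python. The sketch below is our own illustration with hypothetical function names. It works entirely in the Gaussian approximation, so its 20σ probability is the Gaussian tail area and not the exact binomial sum quoted in the remark.

```python
import math

def sigmas_for_one_percent(n):
    """Number of standard deviations in a 1% deviation of the Heads percentage, for n fair coins."""
    return 0.01 * n / math.sqrt(n / 4)

def prob_outside(num_sigmas):
    """Two-sided Gaussian probability of landing more than num_sigmas from the mean."""
    return math.erfc(num_sigmas / math.sqrt(2))

for n in (10**4, 10**5, 10**6):
    z = sigmas_for_one_percent(n)
    print(f"n = {n:>7}: 1% is {z:4.1f} standard deviations, "
          f"P(outside) = {prob_outside(z):.2g}")

# Crude bound from the remark: twice sigma * f(20 sigma) = sqrt(2/pi) * e^(-200).
crude_bound = math.sqrt(2 / math.pi) * math.exp(-200)
gaussian_tail = prob_outside(20)
print(f"crude upper bound: {crude_bound:.2g}, Gaussian tail: {gaussian_tail:.2g}")

# Number of runs in the thought experiment: people * seconds * repetitions.
runs = 4e51 * 4e17 * 1e18
print(f"runs of the experiment: {runs:.2g}")
```

The crude bound should overestimate the Gaussian tail by a factor of roughly 20, in line with the comparison made in the remark.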

5.4. Identical distributions 

Fig. 5.15 is a plot of the P(x) in Eq. (5.15), with n_d = 1000 and p_d = 1/6. (The “d” 
here is for dice.) P(x) is completely determined by the product np(1 - p), because this 
product appears in both the exponent and the denominator in Eq. (5.15). We therefore 
want to find the value of n_c (with “c” for coin) such that 

$$n_c p_c (1 - p_c) = n_d p_d (1 - p_d). \tag{5.26}$$

Since p_c = 1/2, this gives 

$$n_c \cdot \frac{1}{2} \cdot \frac{1}{2} = n_d \cdot \frac{1}{6} \cdot \frac{5}{6} \quad \Longrightarrow \quad n_c = \frac{5}{9}\, n_d. \tag{5.27}$$

In the given case with n_d = 1000, this yields n_c = 556. The exact binomial distributions 
for the two processes aren’t exactly identical, of course, but they are both 
extremely close to the common Gaussian approximation in Eq. (5.15). 

The common value of np(1 - p) is 139. The standard deviation is the square root 
of this (by comparing Eq. (5.15) with Eq. (4.42)), so σ ≈ 12. This is consistent 
with a visual inspection of Fig. 5.15. Note that the expected number of Heads is 
556/2 = 278, but this number is irrelevant here, because we’re concerned only with 
the distribution relative to the average. The means p_d n_d = 167 and p_c n_c = 278 are 
necessarily different, because there is no way to simultaneously make the np(1 - p) 
values equal and the pn values equal, since these quantities differ by the factor 1 - p. 

Remark: At first glance, it might not be obvious that an n_c should exist that yields 
the same distribution relative to the mean. But it is clear once you realize that both 
distributions are Gaussians, and that (ignoring the μ in Eq. (4.42) since we’re looking 
at the distributions relative to the mean) the Gaussians depend on only one parameter, 
σ. So if we can generate the same σ, then we can generate the same distribution. And 
we can indeed generate the same σ, because σ_coin = √(n/4) from Eq. (3.48), so we 
just need to pick the appropriate n. ♣ 
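
The arithmetic in this solution is short enough to check by hand, but as a sketch (our own, with hypothetical variable names mirroring the subscripts “d” and “c”), it can be written in Python with exact fractions:

```python
import math
from fractions import Fraction

n_d, p_d = 1000, Fraction(1, 6)   # dice: probability of rolling a 6
p_c = Fraction(1, 2)              # coins: probability of Heads

# Match the products np(1 - p), as in Eq. (5.26).
common_product = n_d * p_d * (1 - p_d)
n_c = common_product / (p_c * (1 - p_c))
sigma = math.sqrt(common_product)

print(f"np(1 - p) = {float(common_product):.1f}")
print(f"n_c = {float(n_c):.1f}, which rounds to {round(n_c)} coins")
print(f"standard deviation = {sigma:.1f}")
print(f"expected 6's = {float(n_d * p_d):.0f}, expected Heads = {float(n_c * p_c):.0f}")
```

Since n_c must be a whole number, the match between the two Gaussians is only approximate, but the mismatch from rounding 555.6 up to 556 is far too small to see in a plot.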

5.5. Gambler’s fallacy 

No. Each coin flip is independent of the flips that have already taken place. There¬ 
fore, there is no reason to expect more Heads than Tails in future flips. Past flips are 
irrelevant. 

This incorrect interpretation (that there will be more Heads than Tails in future flips) of 
the law of large numbers arises from the confusion between additive and multiplicative 
differences. If you obtain five Heads in 20 flips, then you are five Heads below average. 
If you keep flipping more coins, then on average you will always be five Heads below 
average. However, if you are worried about remaining below average, these five Heads 
are the least of your worries if you end up flipping a million coins. The standard 
deviation for n = 10^6 coin flips is √(n/4) = 500, so you are nearly certain to end up 
much farther than five Heads from average (above or below). 

The deficiency of five Heads will always be there (on average, if you have many clones 
of this scenario), but it represents a smaller and smaller fraction of the number of 
Heads, as the number of flips gets larger and larger. The numbers of Heads and Tails do 
not somehow conspire to equalize in the long run. On the contrary, the numbers tend 
to diverge, with their difference generally being on the order of the standard deviation 
(proportional to √n). However, this difference becomes a smaller and smaller fraction 
of the number of flips (namely n), which means that the fractions of Heads and Tails 
both approach 1/2. But there is certainly no conspiring going on. All future outcomes 
are independent of all past outcomes. 

The incorrect interpretation (that there will be more Heads than Tails in future flips) 
of the law of large numbers is known as the gambler’s fallacy. Alternatively, it can 
simply be called wishful thinking. 
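
The “many clones” picture can be simulated directly. In the following Python sketch (our own illustration, with a hypothetical function name), every clone starts with five Heads in 20 flips and then flips some number of additional fair coins. We track the average Heads deficit, the overall fraction of Heads, and the fraction of Heads in the new flips alone.

```python
import math
import random

# Probability of exactly five Heads in 20 flips, as quoted in the problem.
prob_five_heads = math.comb(20, 5) / 2**20

def average_outcome(extra_flips, clones=10_000, seed=0):
    """Average over many clones that each start with 5 Heads in 20 flips
    and then flip extra_flips more fair coins.

    Returns (mean Heads deficit relative to half the total number of flips,
             mean overall fraction of Heads,
             mean fraction of Heads in the new flips alone).
    """
    rng = random.Random(seed)
    total_flips = 20 + extra_flips
    sum_new_heads = 0
    for _ in range(clones):
        # Each bit of a random integer with extra_flips bits is one fair coin flip.
        sum_new_heads += bin(rng.getrandbits(extra_flips)).count("1")
    mean_new_heads = sum_new_heads / clones
    mean_heads = 5 + mean_new_heads
    return (mean_heads - total_flips / 2,
            mean_heads / total_flips,
            mean_new_heads / extra_flips)

print(f"P(5 Heads in 20 flips) = {prob_five_heads:.4f}")
for extra in (100, 1000, 10_000):
    deficit, fraction, new_fraction = average_outcome(extra)
    print(f"{extra:>6} more flips: mean deficit {deficit:+.1f}, "
          f"overall fraction {fraction:.4f}, fraction in new flips {new_fraction:.4f}")
```

The average deficit should stay near five Heads no matter how many flips are added, and the new flips should come up Heads half the time. The overall fraction approaches 1/2 only because the fixed deficit is being divided by a larger and larger number of flips.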

5.6. Finding the Gaussian 

From the first example in Section 3.2, we know that the variance of a single die roll is 
2.92. The standard deviation is therefore √2.92 = 1.71. Eq. (3.45) then tells us that 
the standard deviation of the sum of the rolls of n_t = 10 dice is σ = √10 (1.71) = 5.4. 
This is the σ that appears in the second Gaussian expression in Eq. (4.42). And the 
mean μ of the sum of 10 dice rolls is 10 · 3.5 = 35. So the desired Gaussian function 
is 

$$f(x) = (100{,}000) \sqrt{\frac{1}{2\pi(5.4)^2}}\; e^{-(x-35)^2/2(5.4)^2}. \tag{5.28}$$

The factor of 100,000 out front arises because the histograms in Fig. 5.11 deal with 
the actual number of outcomes. So we need to multiply the probability distribution in 
Eq. (4.42) by n_s = 100,000 to obtain the histogram. 

However, if we want to be picky, we must remember that each histogram bin in 
Fig. 5.11 is associated with the value at its lower end. And since each bin has width 
1, the histogram is shifted by 0.5 to the right from where it would be if each bin were 
centered on the associated value of x. So we actually want the μ in Eq. (4.42) to be 
35.5. (This correction has nothing to do with the probability concepts we’re covering 
here. It’s just a figment of the way we plotted the histograms. The true mean is simply 
μ = 35.) The function that matches up with the histogram is therefore 

$$f(x) = (100{,}000) \sqrt{\frac{1}{2\pi(5.4)^2}}\; e^{-(x-35.5)^2/2(5.4)^2}. \tag{5.29}$$

Fig. 5.16 shows a plot of this function superimposed on the histogram. The other 
three histograms in Fig. 5.11 come from the same underlying probability distribution 
(because they all involve 10 dice). But they’re less smooth because the smaller n_s 
values allow a few random fluctuations to strongly influence the histograms. 
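
The ingredients of this Gaussian can be computed from scratch, as in the following Python sketch (our own illustration; the function name and the `shift` parameter are hypothetical). It assumes, as in the solution, that the variance of a sum of independent rolls is the sum of the individual variances.

```python
import math

die_faces = range(1, 7)
die_mean = sum(die_faces) / 6                                     # 3.5
die_variance = sum((f - die_mean) ** 2 for f in die_faces) / 6    # about 2.92

n_t, n_s = 10, 100_000
mu = n_t * die_mean                        # mean of the sum of 10 dice
sigma = math.sqrt(n_t * die_variance)      # standard deviation of the sum

def histogram_gaussian(x, shift=0.5):
    """Expected histogram height at x: n_s times a Gaussian with mean mu + shift.

    shift = 0.5 accounts for each bin being labeled by its lower end;
    shift = 0 gives the unshifted curve.
    """
    return (n_s * math.exp(-(x - mu - shift) ** 2 / (2 * sigma ** 2))
            / math.sqrt(2 * math.pi * sigma ** 2))

print(f"variance of one die = {die_variance:.2f}")
print(f"mu = {mu}, sigma = {sigma:.2f}")
print(f"peak height of the histogram curve = {histogram_gaussian(mu + 0.5):.0f}")
```

The peak height tells you roughly how many of the 100,000 sets should land in the tallest bin, which is an easy number to compare against the vertical axis of the histogram.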

5.7. Standard deviations 

Consider Fig. 5.13 first. The underlying distribution has probabilities p(2) = 0.6,
p(3.2) = 0.1, and p(7) = 0.3. The mean is therefore

$$(0.6)(2) + (0.1)(3.2) + (0.3)(7) = 3.62. \tag{5.30}$$

Figure 5.16: The Gaussian approximation to the fourth histogram in Fig. 5.11.


The standard deviation is then 

$$\sqrt{(0.6)(2 - 3.62)^2 + (0.1)(3.2 - 3.62)^2 + (0.3)(7 - 3.62)^2} = 2.24. \tag{5.31}$$

From Eq. (3.53) the standard deviation of the average of 100 numbers taken from this
distribution is $\sigma_{\rm avg} = (2.24)/\sqrt{100} = 0.224$. This is consistent with Fig. 5.13 (the
spacing between tick marks on the x axis is 0.05). Remember from Fig. 4.25 that at
one standard deviation from the mean, the distribution is $1/\sqrt{e} = 0.61$ as tall as the
peak.

Now consider Fig. 5.14. From Problem 4.3, the standard deviation of the uniform
distribution from 0 to 1 is $\sigma = 1/\sqrt{12} = 0.29$. And then from Eq. (3.53) the standard
deviation of the average of 50 numbers taken from this distribution is
$\sigma_{\rm avg} = (0.29)/\sqrt{50} = 0.041$. This is consistent with Fig. 5.14 (the spacing between tick
marks on the x axis is 0.01).
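
The arithmetic in this solution can be reproduced with a few lines. This is only a sketch; the helper name `mean_and_std` and the dictionary layout are assumptions made for illustration.

```python
import math


def mean_and_std(dist):
    """Mean and standard deviation of a discrete distribution.

    `dist` maps each possible value to its probability.
    """
    mean = sum(p * x for x, p in dist.items())
    variance = sum(p * (x - mean) ** 2 for x, p in dist.items())
    return mean, math.sqrt(variance)


# Distribution underlying Fig. 5.13.
mean, sigma = mean_and_std({2: 0.6, 3.2: 0.1, 7: 0.3})
sigma_avg_100 = sigma / math.sqrt(100)    # average of 100 numbers

# Uniform distribution from 0 to 1, underlying Fig. 5.14.
sigma_uniform = 1 / math.sqrt(12)
sigma_avg_50 = sigma_uniform / math.sqrt(50)   # average of 50 numbers
```

Rounded to the precision used in the text, these come out to 3.62, 2.24, 0.224, 0.29, and 0.041.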





Chapter 6 

Correlation and regression 


In this chapter, we will consider how two different random variables may be related,
or correlated. In Section 6.1 we give some examples of what it means for
two variables to be correlated or uncorrelated. In Section 6.2 we present a model
for how two variables can be correlated, based on the given underlying probability
distributions. We then get quantitative in Section 6.3 and derive expressions for the
correlation coefficient, r. One of these expressions involves the covariance of the
two variables. We show in Section 6.4 how we can take advantage of a correlation
to make an improved prediction for the Y value associated with a given X value.
In Section 6.5 we calculate the joint probability density p(x,y) in terms of $\sigma_x$, $\sigma_y$,
and r, in the case where the underlying distributions are Gaussian. We find that the
curves of constant p(x,y) are ellipses. We analyze these ellipses in Section 6.6.

In Section 6.7 we discuss the all-important regression lines, which give the expected
value of Y, given X (or the expected value of X, given Y). We then present in
Section 6.8 two examples on the use of regression lines. A ubiquitous effect here is
regression toward the mean. Finally, in Section 6.9 we analyze the best-fit (or
least-squares) line. We find that this line is none other than the regression line. Indeed,
the regression line is often defined as the least-squares line. We have chosen to take 
a different route in this chapter and introduce the regression line by considering the 
underlying probability distributions that produce the random variable Y. This route 
makes it easier to see what’s going on “under the hood.” But it’s good to see that 
we end up with the same regression line, independent of what route we take. 


6.1 The concept of correlation 

Consider a pair of random variables X and Y. For example, X might be an object’s 
mass measured in kilograms, and Y might be its mass measured in grams. Or X 
might be a person’s height, and Y might be his/her shoe size. Or X might be the 
alphabetical placement of the second letter in a person’s last name (A = 1, B = 2,
etc.), and Y might be his/her cholesterol level. 

One of the main issues we will address in this chapter is the degree to which 
knowledge of X helps predict Y (or vice versa). Equivalently, we will address the
degree to which two variables are correlated. The larger the correlation, the more
that one variable helps predict the other. We’ll be precise about this in Section 6.3 
when we define the correlation coefficient, usually denoted by the letter r. To get 
a qualitative feel for what a correlation means, let’s consider the three examples 
mentioned above, which range from perfect correlation to no correlation at all. 

• Perfect correlation: An example of perfect correlation is the mass of an 
object expressed in kilograms or grams. If we know the mass X in kilograms, 
then we also know the mass Y in grams. We simply need to multiply by 
1000. That is, Y = 1000X. One kilogram equals 1000 grams, 2.73 kilograms
equals 2730 grams, etc. Knowledge of the mass in kilograms allows us to 
state exactly what the mass is in grams. (The converse is also true, of course. 
Knowledge of the mass in grams allows us to state exactly what the mass 
is in kilograms. Just divide by 1000.) If we take a group of objects and 
determine their masses in kilograms and grams, and then plot the results, we 
will obtain something like the plot shown in Fig. 6.1. (We’ll assume for the 
present purpose that any measurement errors are negligible.) All of the points 
lie on a straight line. This is a consequence of the perfect correlation. 


Figure 6.1: The mass in grams is perfectly correlated with the mass in kilograms.

• Some correlation: An example of nonzero but imperfect correlation is the 
second example mentioned above, involving height and shoe size. (Men’s 
and women’s shoe sizes use different scales, so let’s just look at men’s sizes 
here. Also, some manufacturer sizes run large or small, but we’ll ignore 
that issue.) We certainly don’t expect perfect correlation between height and 
shoe size, because that would mean we would be able to exactly predict a 
person’s shoe size based on height (or vice versa). This isn’t possible, of 
course, because all people who are six feet tall certainly don’t have the same 
shoe size. Additionally, there can’t be perfect correlation because shoe sizes 
use a discrete scale, whereas heights are continuous. 

But is there at least some correlation? That is, does knowledge of a person’s 
height allow us to make a better guess of his shoe size, compared with our 
guess if we had no knowledge of the height? Well, 6-footers certainly have a 
larger shoe size than 5-footers on average, so the answer should be yes. Of 
course, we might well find a 5-footer whose feet are larger than a 6-footer’s. 
But on average, a person’s shoe size increases with height. A scatter plot
of some data is shown in Fig. 6.2. (I asked a sampling of students for their
height and shoe size. Height is measured to the nearest inch. Since men’s and 
women’s sizes use different scales, I used only the data for 26 male students.) 
From the data, the average shoe size of all 26 people is 10.4, whereas the 
average shoe size of a 6-footer is 11.4. So if you want to make a guess for 
the shoe size of a 6-footer, you’ll do better by guessing 11.4 than 10.4. After 
we introduce the correlation coefficient in Section 6.3, we’ll be able to be 
quantitative in Section 6.4 about how much better the guess is (at least for a 
large number of data points). 



Figure 6.2: A scatter plot of shoe size versus height (in inches). 


• Zero correlation: An example of zero correlation is the third example mentioned
above, involving the alphabetical placement (A = 1, B = 2, etc.) of the
second letter in the last name, along with cholesterol level. It is highly doubtful
that there is much of a correlation here. Would knowing that the second
letter of a last name is “i” help you in predicting the cholesterol level? Negligibly,
at best. Of course, certain names (Murphy, Smith, Li, etc.) are common
in certain ethnicities, and it is undoubtedly the case that different ethnicities 
have slightly different cholesterol levels (on average) due to differing genes 
and diet. But let’s assume that this effect is small and is washed out by other 
effects. So for the sake of argument, we’ll assume that there is no correlation 
here. However, this example should convince you that small (or perhaps even 
large) correlations might pop up in situations where at first glance it’s hard to 
imagine any correlation! 

The first two of the above examples involve a positive correlation; an increase 
in X corresponds to an increase in Y (on average). The line (or general blob) of 
points in the scatter plot has an upward slope. It is also possible to have a negative 
correlation, where an increase in X corresponds to a decrease in Y (on average). 
The line (or general blob) of points in the scatter plot will then have a downward 
slope. An example of negative correlation is vitamin C intake and the incidence of 
scurvy. The more vitamin C you take, the less likely you are to have scurvy - at 
least on the low end of the intake scale; on the upper end it doesn’t matter how much 
you take.

Note that correlation does not necessarily imply causation. In the case of vitamin C
and scurvy, there does happen to be causation; taking more vitamin C helps
prevent you from getting scurvy. But in the case of height and shoe size, it isn’t 
that being tall causes your feet to be larger, any more than having large feet causes 
you to be taller. (The situation is symmetrical, so if you want to argue causation, 
you’ll be hard pressed to say which is causing which.) Instead, what’s going on 
is that there is a third thing, namely genetics (and diet too), that causes both your 
height and foot size to be larger or smaller (on average). Another example along 
these lines consists of the number of times that people in a given town on any given 
day put on sunglasses, along with the number of times they apply sunscreen. There 
is a positive correlation between these two things, but neither one causes the other. 
Instead, they are both caused by a third thing - sunshine! 

We’ll deal only with linear correlation in this chapter, although there are certainly
examples of nonlinear correlation. A simple example is the relation between
the area of a square and its side length: area = (side length)$^2$. This relation is
quadratic, not linear. Another example is the relation between job income and a 
person’s age. Three-year-olds don’t earn much from working a job, and neither do 
100-year-olds (usually). So the plot of average income vs. age must start at zero, 
then increase to some maximum, and then decrease back to zero. 


6.2 A model for correlation 

Let’s now try to understand the general way in which two random variables can be 
correlated. This understanding will lead us to the correlation coefficient r in
Section 6.3. For the purpose of making some pretty plots, we’ll assume in the present
discussion that our two random variables each have Gaussian (normal) distributions.
This assumption isn’t necessary; our mathematical results will hold for any
distributions. Indeed, when dealing with actual real-world data, it is often the case that
one or both of the variables are not normally distributed. The correlation coefficient
is still defined perfectly well by Eq. (6.6) or Eq. (6.9) below. However, due
to the central limit theorem (see Section 5.5), many real-life random variables are
approximately normally distributed.

Consider a random variable X that is normally distributed with mean zero and
standard deviation $\sigma_x$:

$$X:\quad \mu = 0, \qquad \sigma = \sigma_x. \tag{6.1}$$

We have chosen the mean to be zero just to make our calculations and figures 
cleaner. All of the results below hold more generally for any mean. 

Consider another random variable Y that is correlated (to some extent) with 
X. By this we mean that Y is partially determined (in a linear manner) by X and 
partially determined by another random variable Z (assumed to be normally distributed)
that is independent of X. Z can in turn be the sum of many other random
variables, all independent of X. We’re lumping the effect of all these variables into 
one variable Z. We can be quantitative about the dependence of Y on X and Z by 
writing Y as 


$$Y = mX + Z \tag{6.2}$$

where m is a numerical factor. To keep things simple, we will assume that the mean
of Z is also zero. So if the standard deviation of Z is $\sigma_z$, we have

$$Z:\quad \mu = 0, \qquad \sigma = \sigma_z. \tag{6.3}$$

Note that if we take the mean (expectation value) of Eq. (6.2), we see that the various
means are related by

$$\mu_y = m\mu_x + \mu_z, \tag{6.4}$$

where we have used the fact that the expectation value of the sum equals the sum
of the expectation values; see Eq. (3.7). Since we are assuming $\mu_x = \mu_z = 0$ here,
Eq. (6.4) implies that $\mu_y$ is also equal to zero.

In Eq. (6.2), we are producing Y from two known (and independent) distributions
X and Z. To be explicit, the meaning of Eq. (6.2) is the following. Pick an
x value of the random variable X and multiply the result by m to obtain mx. Then
pick a z value of the random variable Z and add it to mx to obtain y = mx + z. This
is the desired value of y. We can label this ordered pair of (X,Y) values as $(x_1, y_1)$.
We then repeat the process with new values of X and Z to obtain a second (X,Y)
pair $(x_2, y_2)$. And so on, for as many pairs as we like.
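
The recipe just described can be sketched in a few lines of Python. This is an illustration only: the function name `sample_pairs`, the fixed seed, and the choice of Gaussian X and Z with zero means are assumptions made for the sketch, not part of the model itself.

```python
import random


def sample_pairs(m, sigma_x, sigma_z, n, seed=0):
    """Generate n pairs (x, y) with y = m*x + z, as in Eq. (6.2).

    X and Z are drawn independently from zero-mean Gaussians with
    standard deviations sigma_x and sigma_z.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, sigma_x)   # pick a value of X
        z = rng.gauss(0.0, sigma_z)   # pick an independent value of Z
        pairs.append((x, m * x + z))  # form y = m*x + z
    return pairs


# Example parameters; any values of m, sigma_x, sigma_z could be used.
pairs = sample_pairs(m=0.5, sigma_x=1.0, sigma_z=0.1, n=200)
```

Setting `sigma_z=0` makes every point fall exactly on the line y = mx, and setting `m=0` makes y independent of x.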

As an example, Y could be the measured weight of an object, X could be the true 
weight, and Z could be the error introduced by the measurement process (reading 
the scale, behavior of the scale depending on a slightly lopsided placement of the 
object on it, etc.). These variables might not have Gaussian distributions, but again, 
that assumption isn’t critical in our discussion. In this example, m = 1. 

We should mention that although Eq. (6.2) is the starting point for deriving most 
of the correlation results in this chapter, rarely is it the starting point in practice. 
That is, rarely are you given the underlying X and Z distributions. Instead, you are 
invariably given some data, and you need to calculate the correlation coefficient r 
via Eq. (6.9) below. But the key to deriving Eq. (6.9) is realizing that we can write Y 
as mX + Z (at least in the case of linear correlation), even if we don’t know exactly 
what X and Z are. 

To see what sort of correlation Eq. (6.2) produces between X and Y, let’s consider
two special cases, in order to get a general idea of the effects of m and Z.

Perfect correlation ($\sigma_z = 0$)

If the standard deviation of Z is $\sigma_z = 0$, then Z always just takes on the value z = 0,
because we’re assuming that the mean of Z is zero. (More generally, Z takes on a
constant value $z_0$.) So Eq. (6.2) reduces to Y = mX. That is, Y is a fixed number
m times X; all values of x and y are related by y = mx. This means that all of the
(x,y) points in the scatter plot lie on the straight line y = mx, as shown in Fig. 6.3
for 100 random points generated numerically from a Gaussian distribution X. We
have arbitrarily chosen m = 0.5 and $\sigma_x = 1$. In the present case of a straight line,
we say that X and Y are perfectly (or completely) correlated. The value of Y is
completely determined by the value of X. There is no additional random variable Z
to mess up this complete determination.

Figure 6.3: Perfect correlation.

In the case where $\sigma_z$ is small but nonzero, we obtain a strong but not perfect
correlation. Fig. 6.4 shows a plot of 200 points in the case where $\sigma_z$ equals $(0.1)\sigma_x$.
We have again chosen m = 0.5 and $\sigma_x = 1$ (and hence $\sigma_z = 0.1$). We have
generated the points by picking 200 random values from each of the Gaussian
distributions X and Z, and then forming Y = mX + Z. In the present case of small
$\sigma_z$, knowledge of X is very helpful in predicting Y, although it doesn’t predict Y
exactly.

Figure 6.4: Strong correlation.


Zero correlation (m = 0) 

If m = 0, then Eq. (6.2) reduces to Y = Z. And since Z is independent of X, this 
means that Y is also independent of X. Fig. 6.5 shows a plot of 2000 points in the 
case where m = 0. We have arbitrarily chosen $\sigma_x = 2$ and $\sigma_z = 1$. We have
generated the points by picking 2000 random values from each of the Gaussian 
distributions X and Z, and then setting Y equal to Z. 

It is clear from Fig. 6.5 that X and Y are completely uncorrelated. The distribution
for Y is independent of the value of X. That is, for any given value of X, the
Y values are normally distributed around Y = 0, with the same standard deviation
(which equals $\sigma_z$). In other words, the probability (or rather, the probability density)
of obtaining a certain value of Y, given a particular value of X, is independent
of the X value. This probability is given by the Gaussian distribution for Z, since
Y = Z in the present case where m = 0.

Figure 6.5: Zero correlation.

If we imagine drawing vertical shaded strips at two different values of X, as 
shown in Fig. 6.6 (which is the same as Fig. 6.5, except with 10,000 points), then 
the distributions of Y values in these two strips are the same, except for an overall 
scaling factor. This scaling factor is simply the probability (or rather, the probability 
density) of obtaining each of the given values of X. Larger values of |X| are less
likely, due to the $e^{-x^2/2\sigma_x^2}$ factor in the Gaussian distribution. So there are fewer
dots in the right strip. But given a value of X, the probability distribution for Y (in 
this m = 0 case) is simply the probability distribution for Z, which is independent 
of X. 



Figure 6.6: If m = 0, the distribution of Y values within a vertical strip is independent (aside 
from an overall scaling factor) of the location of the strip.

In the case where m is small but nonzero, we obtain a weak correlation. Fig. 6.7
shows a plot of 2000 points in the case where m = 0.2, again with $\sigma_x = 2$ and
$\sigma_z = 1$. In this case, knowledge of X helps a little bit in predicting the Y value. It
doesn’t help much in the region near the origin; the plot doesn’t display much of 
a tilt there (it looks basically the same as Fig. 6.5 near the origin). But for larger 
values of X, there is a clear bias in the values of Y. More points lie above the X axis 
on the right side of the plot, and more points lie below the X axis on the left side. 



Figure 6.7: Weak correlation. 


Remarks: 

1. All of the above scatter plots are centered at the origin because we assumed that the
means of X and Z are zero, which implies that the mean of Y is zero, from Eq. (6.4).
If Z instead had a nonzero mean $\mu_z$, then the blob of points would be shifted upward
by $\mu_z$. If X had a nonzero mean $\mu_x$, then the blob would be shifted rightward by $\mu_x$
and also upward by $m\mu_x$.

2. In the above discussions, we treated X as the independent variable and Y as the
dependent variable, and we looked at the extent to which X determined Y. However, if
someone gives you one of the above scatter plots, you could quite reasonably tilt your
head sideways and consider X to be a “function” of Y, and then look at the extent to
which Y determines X. We will discuss this alternative way of relating the variables
in Sections 6.5 and 6.7.

3. We noted in Fig. 6.6 that the relative distribution of Y values within a vertical strip
is independent of the location of the strip. This fact holds not only in Fig. 6.6 where
there is zero correlation, but also (in a slightly modified sense) in the case of nonzero
correlation, even when there is strong correlation as in Fig. 6.4. Although it might
seem like the spread (the standard deviation) of Y values gets smaller out in the tails of
the plot in Fig. 6.4, the spread is in fact the same for all values of X. The Y = mX + Z
expression tells us that for any given value of X, the Y values are centered at mX
(instead of zero; this is the aforementioned slight modification) and have the same
standard deviation of $\sigma_z$ around this value. The spread seems to be larger in the
middle of the plot, but only because there are more points there. ♣

6.3 The correlation coefficient, r

We will now show how to produce the correlation coefficient r from quantities associated
with the righthand side of Eq. (6.2), namely m, $\sigma_x$, and $\sigma_z$. To do this, we
will need to determine the standard deviation of Y = mX + Z. We know that mX
has a standard deviation of $m\sigma_x$ and Z has a standard deviation of $\sigma_z$. And we know
from Eq. (3.42) that the standard deviation of the sum of two independent variables
(as X and Z are) is obtained by adding the two standard deviations in quadrature.
(The variables need not be Gaussian for this to be true.) Therefore, Y is described
by:

$$Y:\quad \mu = 0, \qquad \sigma_y = \sqrt{m^2\sigma_x^2 + \sigma_z^2}. \tag{6.5}$$

The $\mu = 0$ value follows from Eq. (6.4), since we are assuming $\mu_x = \mu_z = 0$.

Let’s check some limiting cases of Eq. (6.5). In one extreme where $\sigma_z = 0$
(complete correlation between X and Y), we have $\sigma_y = m\sigma_x$. All of the standard
deviation of Y comes from X; none of it comes from Z. In the other extreme where
m = 0 (no correlation between X and Y), we have $\sigma_y = \sigma_z$. All of the standard
deviation of Y comes from Z; none of it comes from X.

For general values of m and $\sigma_z$, we define the correlation coefficient r to be the
fraction of $\sigma_y$ that can be attributed to X (assuming a linear model). Since the part
of $\sigma_y$ that can be attributed to X is $m\sigma_x$, this fraction is

$$r = \frac{m\sigma_x}{\sigma_y} = \frac{m\sigma_x}{\sqrt{m^2\sigma_x^2 + \sigma_z^2}} \qquad \text{(correlation coefficient)} \tag{6.6}$$

Equivalently, $r^2$ equals the fraction of the variance of Y that can be attributed to X.

The use of the expression for r in Eq. (6.6) requires knowledge of m, along with
$\sigma_x$ and either $\sigma_y$ or $\sigma_z$. If we are given m and the underlying X and Z distributions
that make up Y, then we can use Eq. (6.6) to find r. But as mentioned earlier, we are
usually just given a collection of data points in the x-y plane, without being given
m or the exact X and Z distributions. How do we find r in that case?


Covariance 


To find r if we are given a collection of data points, we need to define the covariance
of two random variables. The covariance of X and Y is denoted by Cov(X,Y) and
is defined to be

$$\mathrm{Cov}(X,Y) = E[(X - \mu_x)(Y - \mu_y)]. \tag{6.7}$$


Note that if we set Y equal to X, then the covariance of X and Y (= X) simplifies
to $E[(X - \mu_x)^2]$, which from Eq. (3.19) is simply the variance of X. The covariance
can therefore be thought of as a generalization of the variance. Like the correlation,
the covariance gives a measure of how much two variables linearly depend on each
other. In one extreme where Y = X, we have $\mathrm{Cov}(X,Y) = \sigma_x^2$. In the other extreme
where X and Y are independent variables, we have Cov(X,Y) = 0. This is true
because for independent variables, Eq. (3.16) tells us that the expectation value in
Eq. (6.7) equals the product of the expectation values $E(X - \mu_x)$ and $E(Y - \mu_y)$. And
these are equal to $E(X) - \mu_x$ and $E(Y) - \mu_y$, which are both zero by the definition
of the $\mu$’s. There is actually yet a further extreme, namely Y = -X. In this case we
have $\mathrm{Cov}(X,Y) = -\sigma_x^2$.

In the situations we’ll be dealing with, we’ll usually take the means $\mu_x$ and $\mu_y$
to be zero, in which case Eq. (6.7) reduces to

$$\mathrm{Cov}(X,Y) = E(XY) \qquad (\text{if } \mu_x = \mu_y = 0). \tag{6.8}$$


Having defined the covariance, we now claim that the definition of r in Eq. (6.6)
is equivalent to

$$r = \frac{\mathrm{Cov}(X,Y)}{\sigma_x\sigma_y}. \tag{6.9}$$

To demonstrate this equivalence, we just need to replace Y with mX + Z in Eq. (6.9).
We’ll assume that $\mu_x$ and $\mu_y$ (and hence $\mu_z$, from Eq. (6.4)) are zero; the general
case proceeds similarly. We obtain


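$$\begin{aligned}
\frac{\mathrm{Cov}(X,Y)}{\sigma_x\sigma_y} &= \frac{\mathrm{Cov}(X,\,mX+Z)}{\sigma_x\sigma_y} = \frac{E[X(mX+Z)]}{\sigma_x\sigma_y} \\
&= \frac{mE(X^2) + E(XZ)}{\sigma_x\sigma_y} = \frac{m\sigma_x^2 + 0}{\sigma_x\sigma_y} = \frac{m\sigma_x}{\sigma_y},
\end{aligned} \tag{6.10}$$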


which is the expression for r in Eq. (6.6). We have used the fact that since X and
Z are independent, Eq. (3.16) allows us to write $E(XZ) = E(X)E(Z) = 0 \cdot 0 = 0$.
We have also used Eq. (3.50) to say that $E(X^2) = \sigma_x^2$, since $\mu_x = 0$. The upshot
here is that Eq. (6.9) reduces to Eq. (6.6) because Cov(X,Y) picks out the part of Y
that comes from X and gets rid of the part that comes from Z. This leaves us with
$m\sigma_x^2$, so dividing by $\sigma_x\sigma_y$ gives the desired ratio of standard deviations in Eq. (6.6).
Note that neither Eq. (6.6) nor Eq. (6.9) requires that the underlying distributions be
Gaussian.
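
The equivalence of Eq. (6.6) and Eq. (6.9) can also be checked numerically. The sketch below is an illustration under stated assumptions: the function names are hypothetical, X and Z are taken to be zero-mean Gaussians, and the covariance is estimated with the zero-mean form in Eq. (6.8), so the two values agree only up to statistical fluctuations.

```python
import math
import random


def model_r(m, sigma_x, sigma_z):
    """Correlation coefficient from the model parameters, Eq. (6.6)."""
    return m * sigma_x / math.sqrt(m ** 2 * sigma_x ** 2 + sigma_z ** 2)


def simulated_r(m, sigma_x, sigma_z, n=100_000, seed=1):
    """Estimate Cov(X,Y)/(sigma_x sigma_y), Eq. (6.9), from simulated pairs."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, sigma_x) for _ in range(n)]
    ys = [m * x + rng.gauss(0.0, sigma_z) for x in xs]
    # The true means are zero, so averages of products estimate
    # E(XY), E(X^2), and E(Y^2) directly.
    cov = sum(x * y for x, y in zip(xs, ys)) / n
    s_x = math.sqrt(sum(x * x for x in xs) / n)
    s_y = math.sqrt(sum(y * y for y in ys) / n)
    return cov / (s_x * s_y)


r_model = model_r(0.2, 2.0, 1.0)
r_sim = simulated_r(0.2, 2.0, 1.0)
```

With a large number of points, `r_sim` should land close to `r_model`; the agreement improves as `n` grows.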

Compared with Eq. (6.6), the advantage of Eq. (6.9) is that it doesn’t involve
m. Eq. (6.9) is therefore the one you want to use if you are simply given a set of
data points $(x_i, y_i)$ instead of the underlying distributions in Eq. (6.2). Although
we defined Cov(X,Y) in Eqs. (6.7) and (6.8) for known distributions, Cov(x,y)
can also be defined for a set of data points. It’s just that instead of talking about
the expectation value of XY (assuming that the means are zero), we talk about the
average value of the $x_i y_i$ products, where the average is taken over all of the given
$(x_i, y_i)$ data points. If we have n points $(x_i, y_i)$, then the covariance in the general
case of nonzero means is


$$\mathrm{Cov}(x,y) \equiv \frac{1}{n}\sum (x_i - \bar{x})(y_i - \bar{y}) \qquad \text{(for data points)} \tag{6.11}$$


If the averages $\bar{x}$ and $\bar{y}$ are zero, then the covariance is just the average of the
products $x_i y_i$, that is, $\mathrm{Cov}(x,y) = (1/n)\sum x_i y_i$.

In defining r for a set of data points, the $\sigma_x$ and $\sigma_y$ standard deviations in
Eq. (6.9) are replaced with the $\tilde{s}_x$ and $\tilde{s}_y$ standard deviations from Eq. (3.60),
calculated for the specific sets of points, $x_i$ and $y_i$. So the correlation coefficient is given
by

$$r = \frac{\mathrm{Cov}(x,y)}{\tilde{s}_x \tilde{s}_y} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2}\,\sqrt{\sum (y_i - \bar{y})^2}} \qquad \text{(for data points)} \tag{6.12}$$


Note that no factors of n remain in this expression, because the factor of n in
Cov(x,y) (see Eq. (6.11)) cancels with the factors of $\sqrt{n}$ in each of $\tilde{s}_x$ and $\tilde{s}_y$
(see Eq. (3.60)).

If you are considering the n data points to be a subset of a larger population,
then it is more appropriate to use the sample standard deviations $s_x$ and $s_y$ instead
of $\tilde{s}_x$ and $\tilde{s}_y$. The sample standard deviations are defined via the sample variance
$s^2$ in Eq. (3.73) with an n - 1 in the denominator. Likewise, it is more appropriate
to use the sample covariance, defined analogously with an n - 1 instead of an n
in the denominator of Eq. (6.11). However, using these “sample” quantities (with
n - 1 instead of n) doesn’t affect the final result in Eq. (6.12), because the n - 1
factors cancel, just as the n factors did. The expression for r on the righthand side
of Eq. (6.12) is therefore valid in any case. We’ll do an example involving r and
Cov(x, y) below on page 290.
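
The cancellation of the n (or n - 1) factors can be seen in a small sketch. The function name `r_with_denominator`, its `ddof` parameter, and the data values are all made up for illustration; `ddof=0` puts n in every denominator and `ddof=1` puts n - 1 there.

```python
import math


def r_with_denominator(xs, ys, ddof):
    """Correlation coefficient with (n - ddof) in every denominator."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    d = n - ddof
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / d
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / d)
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / d)
    return cov / (s_x * s_y)


# Made-up data, used only to exercise the formulas.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 1.9, 3.6, 3.9, 5.2]

r_n = r_with_denominator(xs, ys, ddof=0)            # n in the denominators
r_n_minus_1 = r_with_denominator(xs, ys, ddof=1)    # n - 1 in the denominators
```

The two results agree up to floating-point rounding, which is the statement that Eq. (6.12) is valid in either case.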

Remarks: 

1. We chose to initially define r by Eq. (6.6) instead of by Eq. (6.9) (which is more
common in practice), because Eq. (6.6) makes it clear what the meaning of r is. It is
the fraction of $\sigma_y$ that can be attributed to X. If most of $\sigma_y$ comes from X and not Z,
then X and Y have a high correlation. If most of $\sigma_y$ comes from Z and not X, then X
and Y have a low correlation.

2. The correlation coefficient r is independent of the means of X, Y, and Z. This follows
from the fact that none of the quantities in Eq. (6.6) or Eq. (6.9) (m, $\sigma_x$, $\sigma_y$, $\sigma_z$, or
Cov(X,Y)) depend on the means. Changing the means simply shifts the whole blob
of points around in the X-Y plane.

3. The correlation coefficient r doesn’t depend on a uniform scaling of X or Y. That is, r
doesn’t depend on a uniform stretching of the X or Y axes. This is true because if we
define new variables X' = aX and Y' = bY (which imply $\mu_{x'} = a\mu_x$ and $\mu_{y'} = b\mu_y$),
then you can quickly use Eq. (6.7) to show that Cov(X',Y') is larger than Cov(X,Y)
by the factor ab. Likewise, $\sigma_{x'}\sigma_{y'}$ is larger than $\sigma_x\sigma_y$ by the same factor ab, from
two applications of Eq. (3.41). The r in Eq. (6.9) therefore doesn't change. Basically,
stretching each of the axes in a scatter plot by arbitrary amounts doesn’t change how
well the value of X helps predict the value of Y.

4. Eq. (6.9) is symmetric in X and Y. This means that if we switch the independent 
and dependent variables in a scatter plot and imagine X being partially dependent on 
Y (instead of Y being partially dependent on X), then the correlation coefficient is 
the same. This isn't terribly obvious, given the lack of symmetry in the relation in 
Eq. (6.2), where Z is independent of X, not Y. We’ll have more to say about this 
symmetry in Sections 6.5 and 6.7 below. 

5. From Eq. (6.6) we see that m can be written as $m = r\sigma_y/\sigma_x$. In terms of the
covariance, m is therefore

$$m = \frac{\mathrm{Cov}(X,Y)}{\sigma_x\sigma_y}\cdot\frac{\sigma_y}{\sigma_x} = \frac{\mathrm{Cov}(X,Y)}{\sigma_x^2}. \tag{6.13}$$

6. An alternative expression for the covariance in Eq. (6.7) can be derived by expanding
the product $(X - \mu_x)(Y - \mu_y)$. Using $E(X) = \mu_x$ and $E(Y) = \mu_y$, we have

$$\begin{aligned}
\mathrm{Cov}(X,Y) &= E[(X - \mu_x)(Y - \mu_y)] \\
&= E[XY - \mu_y X - \mu_x Y + \mu_x\mu_y] \\
&= E(XY) - \mu_y E(X) - \mu_x E(Y) + \mu_x\mu_y \\
&= E(XY) - \mu_y\mu_x - \mu_x\mu_y + \mu_x\mu_y \\
&= E(XY) - \mu_x\mu_y.
\end{aligned} \tag{6.14}$$

This reduces to Eq. (3.34) when X = Y. ♣
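
Several of the above remarks can be checked on a small data set. The sketch below uses made-up data and hypothetical helper names (`cov`, `cov_expanded`, `corr`), and it applies the data-point form of the covariance in Eq. (6.11). The scale factors a and b are taken to be positive here.

```python
import math


def mean(values):
    return sum(values) / len(values)


def cov(xs, ys):
    """Covariance of paired data with n in the denominator, as in Eq. (6.11)."""
    x_bar, y_bar = mean(xs), mean(ys)
    return mean([(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)])


def cov_expanded(xs, ys):
    """The same covariance via the expanded form in Eq. (6.14)."""
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)


def corr(xs, ys):
    """Correlation coefficient, Eq. (6.9) applied to data points."""
    return cov(xs, ys) / math.sqrt(cov(xs, xs) * cov(ys, ys))


# Made-up data, used only to exercise the formulas.
xs = [1.0, 2.0, 4.0, 7.0]
ys = [3.0, 1.0, 5.0, 6.0]
a, b = 2.5, 40.0   # positive stretch factors for the two axes

# Remark 6: both forms of the covariance agree.
same_cov = math.isclose(cov(xs, ys), cov_expanded(xs, ys))
# Remark 2: shifting the means leaves r unchanged.
same_r_shifted = math.isclose(corr(xs, ys), corr([x + 10 for x in xs], [y - 3 for y in ys]))
# Remark 3: stretching the axes leaves r unchanged.
same_r_scaled = math.isclose(corr(xs, ys), corr([a * x for x in xs], [b * y for y in ys]))
# Remark 4: r is symmetric in the two variables.
same_r_swapped = math.isclose(corr(xs, ys), corr(ys, xs))
```

All four comparisons come out true for this data set, up to floating-point rounding.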


Examples with various r values 

Fig. 6.8 shows examples of scatter plots for six different values of r. All of the
(numerically generated) plots have $\sigma_x = 2$ and $\sigma_y = 1$, and there are 1000 points in
each. Note that it takes a sizeable r to obtain a scatter plot that looks significantly
different from the r = 0 case; the r = 0.3 plot looks roughly the same. The plots in
this figure give you a visual sense of what a particular r means, so you should keep
them in mind whenever you’re given an r value. If someone says, “The r value is
0.7, and that seems pretty high, so I can be fairly certain of what Y will be, given
X,” then you will know that this person is mistaken. When r = 0.7, there is still a
sizeable spread in the Y values for a given X.

What is considered to be a “good” or “high” value of r? Well, that depends on
what data you’re dealing with. If you’re a social scientist and you find an r = 0.7
correlation between a certain characteristic and, say, the number of months that a person
has been unemployed, then that is a very significant result. You have just found
a characteristic that helps substantially in predicting the length of unemployment.
(But keep in mind that correlation does not necessarily imply causation. Although
you have found something that helps in predicting, it might not help in explaining.)
However, if you’re a physicist and you find an r = 0.7 correlation between the distance
d an object falls (in vacuum, dropped from rest) and the square of the falling
time t, then that is a terrible result. Something has gone severely wrong, because
the data points should (at least up to small experimental errors) lie on the straight
line given by $d = (g/2)t^2$, where g is the acceleration due to gravity.

All of the plots in Fig. 6.8 have positive values of r. The plots for negative values 
look the same except that the blobs of points have downward slopes. For example, a 
scatter plot with r = -0.7 is shown in Fig. 6.9. Since r is negative, Eq. (6.6) implies 
that m is also, so Eq. (6.2) tells us that an increase in X yields a decrease in Y (on 
average). Hence the negative slope. 

In Figs. 6.3 through 6.7, the three specified parameters that were used to numerically
generate the plots were

$$\sigma_x, \quad \sigma_z, \quad m, \tag{6.15}$$

whereas in Figs. 6.8 and 6.9 the three specified parameters were

$$\sigma_x, \quad \sigma_y, \quad r. \tag{6.16}$$

Figure 6.8: Scatter plots for various values of the correlation coefficient r.

Figure 6.9: A scatter plot with negative correlation.


Both sets of parameters contain the same information, expressed in different ways
(although both sets contain $\sigma_x$). It is easy to go from one set to the other. Given the
set in Eq. (6.15), the $\sigma_y$ and r values in Eq. (6.16) can be found via Eq. (6.6):

$$\sigma_y = \sqrt{m^2\sigma_x^2 + \sigma_z^2} \qquad \text{and} \qquad r = \frac{m\sigma_x}{\sqrt{m^2\sigma_x^2 + \sigma_z^2}}. \tag{6.17}$$

For example, Fig. 6.7 was generated from the parameters m = 0.2, $\sigma_x = 2$, and
$\sigma_z = 1$. So you can quickly show that Eq. (6.17) gives $\sigma_y = 1.08$ and r = 0.37.

To go the other way, the above expression for $\sigma_y$ can be rewritten as $\sigma_z^2 = \sigma_y^2 - m^2\sigma_x^2$.
But Eq. (6.6) tells us that $m^2\sigma_x^2 = r^2\sigma_y^2$, so we obtain $\sigma_z^2 = (1 - r^2)\sigma_y^2$.
Therefore, given the set of parameters in Eq. (6.16), the $\sigma_z$ and m values in
Eq. (6.15) can be found via

$$\sigma_z = \sigma_y\sqrt{1 - r^2} \qquad \text{and} \qquad m = \frac{r\sigma_y}{\sigma_x}. \tag{6.18}$$

For example, the r = 0.3 plot in Fig. 6.8 was generated from the parameters r = 0.3,
$\sigma_x = 2$, and $\sigma_y = 1$. So Eq. (6.18) gives $\sigma_z = 0.95$ and m = 0.15.
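
If you want to check the two worked examples above on a computer, the following is a minimal Python sketch of Eqs. (6.17) and (6.18). The function names are made up for this illustration; only the formulas and the input numbers come from the text.

```python
import math

def sigma_y_and_r(m, sigma_x, sigma_z):
    """Eq. (6.17): go from (sigma_x, sigma_z, m) to (sigma_y, r)."""
    sigma_y = math.sqrt(m**2 * sigma_x**2 + sigma_z**2)
    r = m * sigma_x / sigma_y
    return sigma_y, r

def sigma_z_and_m(r, sigma_x, sigma_y):
    """Eq. (6.18): go from (sigma_x, sigma_y, r) to (sigma_z, m)."""
    sigma_z = sigma_y * math.sqrt(1 - r**2)
    m = r * sigma_y / sigma_x
    return sigma_z, m

# Fig. 6.7 parameters quoted in the text: m = 0.2, sigma_x = 2, sigma_z = 1.
sigma_y, r = sigma_y_and_r(0.2, 2.0, 1.0)
print(round(sigma_y, 2), round(r, 2))    # 1.08 0.37

# The r = 0.3 plot in Fig. 6.8: r = 0.3, sigma_x = 2, sigma_y = 1.
sigma_z, m = sigma_z_and_m(0.3, 2.0, 1.0)
print(round(sigma_z, 2), round(m, 2))    # 0.95 0.15
```

Feeding the output of one function into the other returns the parameters you started with, which is just the statement that the two sets carry the same information.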

From here on, we will usually describe scatter plots in terms of r (and $\sigma_x$ and
$\sigma_y$) instead of m (and $\sigma_x$ and $\sigma_z$). But you can always switch back and forth
between r and m by using Eqs. (6.17) and (6.18). However, we are by no means
finished with m. This quantity is extremely important, in that it is the slope of the
regression line, which is the topic of Section 6.7.

As mentioned above, it is more common to be given a scatter plot, or equivalently
a list of $(x_i, y_i)$ pairs, than it is to be given Eq. (6.2) along with the underlying
distributions X and Z. So let’s explicitly list out the procedure for finding all of the
parameters you might want to know, given a scatter plot of points. We’ll be general
here and not assume that the means of X and Y are zero. Here are the steps:

1. Calculate the means $\bar{x}$ and $\bar{y}$ of the $x_i$ and $y_i$ data points.

2. Calculate the standard deviations $s_x$ and $s_y$ via Eq. (3.60).

3. Calculate the covariance via Eq. (6.11).

4. Calculate r via Eq. (6.12).

5. Calculate m from Eq. (6.18), with the $\sigma$’s replaced with s’s.


Example: Consider the 20 points (X, Y) listed in Table 6.1 and plotted in Fig. 6.10.
(These points don’t have any significance; I just made them up.) What is the correlation
coefficient between X and Y?


X   12   7  10   3  18  13  17   6   9  12
Y   10  13   6   4  25  14  20   7  14  15

X   13  14   5   7  16  11   8  13  15   9
Y   18   9   7  15  26  16  12  12  17  10

Table 6.1: 20 points (X,Y).

Figure 6.10: The scatter plot of the points in Table 6.1.


Solution: The quickest way to analyze the data is to use Excel or something similar, 
by making a column for the X values and another column for the Y values. Now, 
technically if we’re concerned only with the correlation coefficient r, then one Excel 
function, CORREL, gets the job done. But let’s pretend we don’t know that. Then to 
calculate r, we need to find the standard deviations $s_x$ and $s_y$ by using the STDEV.P
function (they turn out to be 4.02 and 5.72), and also the covariance by using the 
COVARIANCE.P function (it turns out to be 17.45). The correlation coefficient r is 
then found via Eq. (6.12) to be r = 0.76. Eq. (6.18) then gives m = 1.08. The “.P” in 
these functions stands for “population.” 

Alternatively, you can use the STDEV.S and COVARIANCE.S functions, which have 
factors of n - 1 instead of n in the denominator. (The “.S” stands for “sample.”) You 
will obtain the same result for r, because all the (n - 1)’s cancel out in Eq. (6.12), just 
as the n’s do when using the “.P” functions. 

If you have access to only pencil/paper or a basic calculator, then the process will of 
course take longer. You will need to work through the whole list of steps preceding 
this example. The means $\bar{x}$ and $\bar{y}$ happen to be 10.9 and 13.5.
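
If you would rather not use a spreadsheet, the five steps can also be sketched in a few lines of Python. This is only an illustration: the variable names are made up, and the "population" forms (dividing by n) are used so that the numbers line up with the STDEV.P and COVARIANCE.P values quoted above.

```python
import math

# The 20 (X, Y) pairs from Table 6.1, read column by column.
xs = [12, 7, 10, 3, 18, 13, 17, 6, 9, 12, 13, 14, 5, 7, 16, 11, 8, 13, 15, 9]
ys = [10, 13, 6, 4, 25, 14, 20, 7, 14, 15, 18, 9, 7, 15, 26, 16, 12, 12, 17, 10]
n = len(xs)

# Step 1: the means.
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Step 2: the standard deviations (dividing by n, like STDEV.P).
s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / n)

# Step 3: the covariance (dividing by n, like COVARIANCE.P).
cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n

# Step 4: the correlation coefficient.
r = cov / (s_x * s_y)

# Step 5: the slope m, from Eq. (6.18) with the sigma's replaced by s's.
m = r * s_y / s_x

print(f"means: {x_bar:.1f}, {y_bar:.1f}")          # means: 10.9, 13.5
print(f"s_x = {s_x:.2f}, s_y = {s_y:.2f}")         # s_x = 4.02, s_y = 5.72
print(f"cov = {cov:.2f}, r = {r:.2f}, m = {m:.2f}")  # cov = 17.45, r = 0.76, m = 1.08
```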


6.4 Improving the prediction for Y 

As we have seen in a number of plots, the larger the correlation coefficient r is, 
the more the knowledge of a particular X value helps in predicting the associated Y 
value. In this section we will be quantitative about this. We will determine exactly 
how much the prediction is improved, given r. In the following discussion, we will 
assume that the X, Y, and Z distributions in the Y = mX + Z relation are all known. 

If we want to predict the value of Y without taking into account the associated
value of X, then intuitively the most reasonable prediction is the mean value $\mu_y$ of
the entire Y distribution. We’ll justify this choice below in Eq. (6.22). However,
if we do take into account the associated value of X, and if there is a nonzero
correlation between X and Y, then we can make a prediction for Y that is better
than the above $\mu_y$ prediction. What is the value of this better prediction, and how
much better is it than $\mu_y$? In answering this, we’ll need to define what we mean by
how “good” a prediction is. We’ll do this by instead defining how “bad” a prediction
is.


Considering the entire Y distribution 


Consider first the case where we are looking at the entire Y distribution. That is,
we are not looking at a specific value of X. Imagine that we state our prediction
for Y (call it $y_p$) and then pick a large number n of actual Y values $y_i$. We then
calculate the variance of these values, measured relative to our prediction. That is,
we calculate

$$\frac{1}{n}\sum_{i=1}^{n}(y_i - y_p)^2. \tag{6.19}$$

In the limit where we pick an infinite number of $y_i$ values, the preceding expression
becomes an expectation value,

$$\sigma_p^2 = E\big[(Y - y_p)^2\big] \qquad \text{(badness of prediction)} \tag{6.20}$$

We will take this variance as a measure of how bad our prediction is. The larger the
variance, the worse the prediction. If $y_p$ isn’t anywhere near the various $y_i$ values,
then our prediction is clearly a poor one. And consistent with this, the variance $\sigma_p^2$
is large. We could, of course, choose a different definition of badness, but the one
involving the variance in Eq. (6.20) is the standard definition. See the remark in the
solution to Problem 6.10 for some discussion of this.

Given the above definition of badness, the best $y_p$ prediction is the one that
minimizes $\sigma_p^2$. To find this $y_p$ value, we’ll need to do a little calculus and take the
derivative of $\sigma_p^2$ with respect to $y_p$, and then set the result equal to zero. If we
expand the square in Eq. (6.20), we obtain

$$\sigma_p^2 = E[Y^2] - 2E[Y]\,y_p + y_p^2. \tag{6.21}$$

Setting the derivative with respect to $y_p$ equal to zero then gives

$$-2E[Y] + 2y_p = 0 \implies y_p = E[Y] = \mu_y. \tag{6.22}$$

We see that $\sigma_p^2$ is minimized when $y_p$ equals the expectation value E[Y], that is, the
mean $\mu_y$. We therefore want to choose the mean $\mu_y$ as our prediction. As mentioned
above, this is probably what your intuition would have told you to do anyway!

$$\text{Best prediction} = \text{Mean} \tag{6.23}$$
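
If you’d like to see Eq. (6.23) at work without any calculus, here is a small Python sketch. It is only an illustration under stated assumptions: the Y values of Table 6.1 are borrowed as a convenient sample, the "badness" is the finite-n expression in Eq. (6.19), and the grid of candidate predictions is an arbitrary choice.

```python
# The Y values from Table 6.1, used here only as a convenient sample.
ys = [10, 13, 6, 4, 25, 14, 20, 7, 14, 15, 18, 9, 7, 15, 26, 16, 12, 12, 17, 10]

def badness(y_p, values):
    """Mean squared deviation of the values from the prediction y_p, as in Eq. (6.19)."""
    return sum((y - y_p) ** 2 for y in values) / len(values)

mean_y = sum(ys) / len(ys)

# Scan a grid of candidate predictions and keep the one with the smallest badness.
candidates = [k / 10 for k in range(0, 301)]    # 0.0, 0.1, ..., 30.0
best = min(candidates, key=lambda y_p: badness(y_p, ys))

print(mean_y, best)           # 13.5 13.5
print(badness(mean_y, ys))    # 32.75
print(badness(12.0, ys))      # 35.0
```

The badness at the mean, 32.75, is the square of the 5.72 standard deviation found in the example in Section 6.3. Any other prediction does worse; for instance, guessing 12 instead of 13.5 raises the badness by $(13.5 - 12)^2 = 2.25$.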


In the case of the $y_p = \mu_y$ best prediction, the variance $\sigma_p^2$ in Eq. (6.20) is simply
the actual variance $\sigma_y^2$ of the Y distribution. So $\sigma_y^2$ is a measure of how “bad” our
best prediction is if we don’t take into account any information about the X value:

$$\sigma_y^2 = \text{badness of best guess } \mu_y, \text{ with no knowledge of } X. \tag{6.24}$$


Considering a specific value of X

Now consider the case where we do take into account a specific value of X; call it $x_0$.
Given $x_0$, the only Y values $y_i$ that we can possibly pick when forming the
variance in Eq. (6.19) are the ones in the shaded strip in Fig. 6.11. (In principle, the
strip should be very thin.) In drawing a scatter plot that is centered at the origin like
this, we are tacitly assuming that the means of X and Y are zero. This can always
be accomplished by measuring X and Y relative to their means, if they happen to be
nonzero. Eq. (6.4) then implies that Z also has a mean of zero.

Figure 6.11: Given that X equals $x_0$, the best prediction for Y is $mx_0$ (the upper-right white
dot). This is a better prediction than the naive Y = 0 prediction (the lower-left white dot)
relevant to the entire distribution.

The mean of the Y values in the shaded strip is $mx_0$, because Y = mX + Z,
and because $\mu_z = 0$. The best $y_p$ prediction is therefore $mx_0$. This is true because
the general result in Eq. (6.23) still applies. The logic leading up to that equation
remains valid; it’s just that we’re now taking the expectation value of only the part of
the Y distribution that lies in the shaded strip, instead of the complete Y distribution
(the entire blob of points). In the present case where we are incorporating our
knowledge of $x_0$, the Y in Eq. (6.20) should technically be replaced with $Y_{x_0}$, or
some similar notation, to indicate that we are concerned only with Y values that are
associated with (or nearly with) $x_0$.

In the shaded strip in Fig. 6.11, the variance $\sigma_p^2$ in Eq. (6.20) (measured relative
to our $y_p$ prediction of $mx_0$) equals $E[(Y - mx_0)^2]$. But $Y - mx_0$ equals Z in the
shaded strip, because Y = mX + Z. The variance $\sigma_p^2$ is therefore simply $E[Z^2] = \sigma_z^2$.
So $\sigma_z^2$ is a measure of how “bad” our best prediction is if we do take into
account the particular value of X:

$$\sigma_z^2 = \text{badness of best guess } mx_0, \text{ knowing that } X = x_0. \tag{6.25}$$

$mx_0$ is the Y value associated with the upper-right white dot in Fig. 6.11. Our earlier
prediction of 0 (or more generally $\mu_y$) is the Y value associated with the lower-left
white dot.

Given our definition of badness in terms of the variance $\sigma_p^2$ in Eq. (6.20), the
ratio of the variances associated with the above two predictions in Eqs. (6.24) and
(6.25) is a measure of how much better one prediction is than another. That is, the
ratio

$$\frac{\mathrm{Var}(Z)}{\mathrm{Var}(Y)} = \frac{\sigma_z^2}{\sigma_y^2} \tag{6.26}$$

is the desired measure of how much better our prediction is if we take into account
the particular value of X. From Eq. (6.18) we know that $\sigma_z = \sigma_y\sqrt{1 - r^2}$, so the
ratio in Eq. (6.26) equals

$$\frac{\sigma_z^2}{\sigma_y^2} = 1 - r^2 \qquad \text{(improvement of prediction)} \tag{6.27}$$

This ratio is the factor by which the variance of a large number of data points (measured
relative to our prediction) is reduced if we use our knowledge of X. For
example, if r = 1, then the factor is 0. This makes sense. With perfect correlation,
our prediction is perfect, so the variance is reduced to nothing. In the other extreme
where r = 0, the factor is 1. This also makes sense. With no correlation, knowledge
of the X value doesn’t help in predicting the Y value, so the variance isn’t reduced
at all. If, say, r = 0.5, then $1 - r^2 = 0.75$, which means that our prediction is
only slightly improved (that is, the variance is only slightly reduced) if we use our
knowledge of X.

Note that since Eq. (6.27) involves the square of r, the sign of r doesn’t matter.
The improvement factor $1 - r^2$ is the same for, say, r = -0.5 and r = 0.5. This is
clear from looking at a scatter plot. The only difference is that a positive-r blob of
points tilts upward while a negative-r blob tilts downward.
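
The $1 - r^2$ factor can also be checked numerically. The following Python sketch is a rough simulation under assumptions of our own choosing: Gaussian X and Z, the Fig. 6.7 parameters (m = 0.2, $\sigma_x = 2$, $\sigma_z = 1$), a fixed random seed, and 200,000 points. It compares the badness of the "blind" prediction (the mean of Y) with the badness of the prediction mx, and then compares their ratio with $1 - r^2$.

```python
import math
import random

random.seed(0)    # fixed seed so that the run is repeatable

# Illustrative parameters (the Fig. 6.7 values quoted earlier in the text).
m, sigma_x, sigma_z = 0.2, 2.0, 1.0
n = 200_000

xs = [random.gauss(0.0, sigma_x) for _ in range(n)]
ys = [m * x + random.gauss(0.0, sigma_z) for x in xs]

def mean(values):
    return sum(values) / len(values)

# Badness of the X-blind prediction: squared deviations of Y from its mean.
y_bar = mean(ys)
bad_without_x = mean([(y - y_bar) ** 2 for y in ys])

# Badness of the X-aware prediction m*x: squared deviations of Y from m*x.
bad_with_x = mean([(y - m * x) ** 2 for x, y in zip(xs, ys)])

# Sample correlation coefficient, r = Cov(X, Y) / (s_x * s_y).
x_bar = mean(xs)
cov = mean([(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)])
s_x = math.sqrt(mean([(x - x_bar) ** 2 for x in xs]))
s_y = math.sqrt(bad_without_x)
r = cov / (s_x * s_y)

ratio = bad_with_x / bad_without_x
print(ratio, 1 - r**2)
```

For these parameters Eq. (6.17) gives $\sigma_y^2 = 1.16$, so both printed numbers should land near $1/1.16 \approx 0.86$, up to the usual statistical fluctuations.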


6.5 Calculating p(x, y)

The scatter plots in Fig. 6.8, along with most of the other scatter plots in this chapter, 
were generated numerically by using Gaussian distributions for X and Z. (From 
Problem 6.4, it follows that Y is also Gaussian.) A quick glance at the plots in 
Fig. 6.8 indicates that all of the blobs of points have ellipse-like shapes. And indeed, 
if X and Z are Gaussians, then the probability distributions in the plane are in fact 
exactly elliptical. By this we mean that if we look at all points in the x-y plane 
that have the same probability density p(x,y), then these points all lie on an ellipse. 
Since we’re dealing with a 2-D plane, p(x,y) is a probability density per unit area. 
That is, if we multiply p(x, y) by a small area in the plane, we obtain the probability 
of lying in that small region.

Let’s rigorously demonstrate the above ellipse claim. In this section, unlike in
previous sections, the Gaussian assumption for X and Z will be necessary. We’ll
start with the Y = mX + Z relation. Imagine picking a random value from the X
distribution, along with a random value from the Z distribution. If we end up with a
particular point (x, y) in the plane, then we must have picked an X value of x and a
Z value of y - mx. Since X and Z are independent variables, the joint probability of
these two outcomes is simply the product of the individual probabilities. Of course,
the probability of obtaining exactly a specific value of X or Z is zero, because we’re
dealing with continuous distributions. We should therefore really be talking about
probability densities and tiny areas in the plane. So to be formal, we can say that
the probability of obtaining X and Y values that lie in a tiny area dx dy around the
point (x, y) is

$$p(x,y)\,dx\,dy = \big(p(X = x)\,dx\big)\cdot\big(p(Z = y - mx)\,dz\big), \tag{6.28}$$

where dz is the interval of z that corresponds to the dy interval of y. But since the
coefficients of Y and Z in the relation Y = mX + Z are equal, dz is simply equal to
dy (for a given x). Using the second Gaussian expression in Eq. (4.42) for p(x) and
p(z), and assuming $\mu_x = \mu_y = 0$, Eq. (6.28) becomes

$$p(x,y)\,dx\,dy = \frac{1}{\sqrt{2\pi}\,\sigma_x}\,e^{-x^2/2\sigma_x^2}\,dx \cdot \frac{1}{\sqrt{2\pi}\,\sigma_z}\,e^{-(y-mx)^2/2\sigma_z^2}\,dy. \tag{6.29}$$


Our goal is to produce an expression for p(x,y) that involves only x and y,
without any reference to z. So we must get rid of the two $\sigma_z$’s in Eq. (6.29). Additionally,
let’s get rid of m in favor of the correlation coefficient r. We can rewrite
Eq. (6.29) as

$$p(x,y) = \frac{1}{2\pi\sigma_x\sigma_z}\exp\left(-\frac{x^2}{2\sigma_x^2} - \frac{(y - mx)^2}{2\sigma_z^2}\right). \tag{6.30}$$

From Eq. (6.18) we know that $\sigma_z = \sigma_y\sqrt{1 - r^2}$ and $m = r\sigma_y/\sigma_x$. So

$$p(x,y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1 - r^2}}\exp\left(-\frac{x^2}{2\sigma_x^2} - \frac{\big(y - (r\sigma_y/\sigma_x)x\big)^2}{2(1 - r^2)\sigma_y^2}\right). \tag{6.31}$$

Let’s simplify the exponent here. We can rewrite it as

$$-\frac{1}{2(1 - r^2)}\left(\frac{(1 - r^2)x^2}{\sigma_x^2} + \frac{y^2 - 2rxy\,\sigma_y/\sigma_x + r^2x^2\sigma_y^2/\sigma_x^2}{\sigma_y^2}\right). \tag{6.32}$$

The $r^2x^2/\sigma_x^2$ terms cancel, yielding

$$-\frac{1}{2(1 - r^2)}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} - \frac{2rxy}{\sigma_x\sigma_y}\right). \tag{6.33}$$

Our final result for the joint probability density p(x,y) is therefore

$$p(x,y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1 - r^2}}\exp\left[-\frac{1}{2(1 - r^2)}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} - \frac{2rxy}{\sigma_x\sigma_y}\right)\right]. \tag{6.34}$$

If we multiply this p(x,y) by a small area in the plane, we obtain the probability
that a randomly chosen (X,Y) point lies in that area. The double integral of p(x,y)
over the entire x-y plane must be 1, because the total probability is 1. If you want
to explicitly verify this, you can “complete the square” of the quadratic function
of y (or x) in the exponent of p(x,y). Of course, when you do this, you’ll just be
working backward through the above steps, which means that you’ll end up with
the p(x,y) in Eq. (6.29). But the double integral of that p(x,y) over the entire x-y
plane is certainly 1, because it involves a Gaussian dx integral and a Gaussian dy
integral, each of which we know is 1. (With the form in Eq. (6.29), the dy integral
should be done first.)
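
As a sanity check on the algebra, here is a short Python sketch that evaluates both the product form in Eq. (6.30) and the final form in Eq. (6.34) at a few points, and then approximates the double integral with a crude sum over a grid. The parameter values, the test points, the grid spacing, and the cutoff at 8 standard deviations are all arbitrary choices made for this illustration.

```python
import math

def p_product(x, y, sigma_x, sigma_z, m):
    """Eq. (6.30): product of the Gaussian densities for X = x and Z = y - m*x."""
    norm = 1.0 / (2 * math.pi * sigma_x * sigma_z)
    return norm * math.exp(-x**2 / (2 * sigma_x**2) - (y - m * x)**2 / (2 * sigma_z**2))

def p_joint(x, y, sigma_x, sigma_y, r):
    """Eq. (6.34): the same density written in terms of sigma_x, sigma_y, and r."""
    norm = 1.0 / (2 * math.pi * sigma_x * sigma_y * math.sqrt(1 - r**2))
    q = x**2 / sigma_x**2 + y**2 / sigma_y**2 - 2 * r * x * y / (sigma_x * sigma_y)
    return norm * math.exp(-q / (2 * (1 - r**2)))

# Illustrative parameters (the r = 0.3 case from Fig. 6.8).
sigma_x, sigma_y, r = 2.0, 1.0, 0.3
sigma_z = sigma_y * math.sqrt(1 - r**2)    # Eq. (6.18)
m = r * sigma_y / sigma_x                  # Eq. (6.18)

# The two forms should agree at any point in the plane.
test_points = [(0.0, 0.0), (1.0, 0.5), (-2.0, 1.5), (3.0, -1.0)]
for x, y in test_points:
    print(p_product(x, y, sigma_x, sigma_z, m), p_joint(x, y, sigma_x, sigma_y, r))

# Crude grid sum of p_joint over a box reaching 8 standard deviations each way.
step = 0.05
nx = round(8 * sigma_x / step)
ny = round(8 * sigma_y / step)
total = sum(
    p_joint(i * step, j * step, sigma_x, sigma_y, r)
    for i in range(-nx, nx + 1)
    for j in range(-ny, ny + 1)
) * step**2
print(total)    # should be very close to 1
```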

The result for p(x,y) in Eq. (6.34) contains the complete information of our
setup. Everything we might want to figure out can be determined from p(x,y). It
contains exactly the same (complete) information as our original description of the
setup, namely that Y is given by mX + Z, where X and Z are Gaussian distributions
with means of zero and standard deviations of $\sigma_x$ and $\sigma_z$.

Eq. (6.34) tells us that the curves of constant probability density are ellipses.
This is true because the exponent in p(x,y) (which contains all of the x and y
dependence; there is none in the prefactor) takes the form of $Ax^2 + By^2 + Cxy$. And
we’ll just accept here the well-known fact that a curve described by the equation
$Ax^2 + By^2 + Cxy = D$ is an ellipse (provided that $C^2 < 4AB$, which holds here
whenever |r| < 1). If C = 0, then the axes of the ellipse are parallel
to the coordinate axes. But if C is nonzero, then the ellipse is tilted. Since $C \propto r$,
we see that the ellipse is tilted whenever there is a nonzero correlation between X
and Y.

If the distributions for X and Z aren’t Gaussian, then the constant-p(x,y) curves
aren’t ellipses. So whenever we talk about ellipses in the following sections, we are
assuming that the underlying distributions are Gaussian.

We now come to a very important point, which is so important that we’ll put it 
in a box: 


The probability density p(x,y) in Eq. (6.34) is symmetric in x and y. 


More precisely, if x and $\sigma_x$ are switched with y and $\sigma_y$, then p(x,y) is unchanged.
(We have used the fact that the expression for r in terms of the covariance, given in
Eq. (6.9), is symmetric in x and y.) This symmetry of p(x,y) is by no means obvious
from looking at our original Y = mX + Z expression, because Z is independent
of X and not Y, which makes things appear asymmetric.

But given that we now know that p(x,y) is symmetric, let’s switch x and y in
the Y = mX + Z relation and see what we get. The point here is that whatever
relation we get, it must have the same probability distribution p(x,y) (that is, the
same shape of the blob of points in the x-y plane), because p(x,y) is symmetric in
x and y. To switch x and y in the relation Y = mX + Z, we must first make the x
and y dependences explicit by writing m as $r\sigma_y/\sigma_x$, and $\sigma_z$ as $\sigma_y\sqrt{1 - r^2}$, from
Eq. (6.18). The relation Y = mX + Z can then be written in the more explicit form,

$$Y = \left(\frac{r\sigma_y}{\sigma_x}\right)X + Z_{X\text{-ind}}^{\sigma_y\sqrt{1 - r^2}}, \tag{6.35}$$

where we have indicated the standard deviation of the (X-independent) distribution
Z. Switching the x’s and y’s gives (again using the fact that r is symmetric in x and y)

$$X = \left(\frac{r\sigma_x}{\sigma_y}\right)Y + Z_{Y\text{-ind}}^{\sigma_x\sqrt{1 - r^2}}. \tag{6.36}$$

The (new and different) Z here is independent of Y and has a standard deviation of
$\sigma_x\sqrt{1 - r^2}$. The above Z notation might seem a bit awkward, but it is important
to indicate the two ways in which the two Z’s differ (standard deviation, and which
other variable they are independent of).

So what did we just show? All three of the equations Eq. (6.34), Eq. (6.35), and
Eq. (6.36) have equivalent information. Eq. (6.34) puts X and Y on equal footing,
whereas Eq. (6.35) treats Y as being dependent on X and $Z_{X\text{-ind}}^{\sigma_y\sqrt{1 - r^2}}$, and Eq. (6.36)
treats X as being dependent on Y and $Z_{Y\text{-ind}}^{\sigma_x\sqrt{1 - r^2}}$. But they all say the same thing,
and they all produce the same probability density p(x,y) and hence the same shape
of the blob of points in the x-y plane.

If you don’t trust the above symmetry reasoning, you can show that Eq. (6.34) 
follows from Eq. (6.36) by starting with Eq. (6.36) and then working through the 
same steps as in Eq. (6.29) through Eq. (6.34). Of course, you will quickly discover 
that redoing the algebra is unnecessary, because all you’re doing is switching x and 
y. Since the final result for p(x,y ) is symmetric in x and y, the switch doesn’t affect 
p(x,y). The expressions in Eqs. (6.35) and (6.36) will be critical when we discuss 
the regression lines in Section 6.7. 
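
If you’d like some numerical reassurance that Eqs. (6.35) and (6.36) describe the same blob of points, the following Python sketch generates points both ways and compares the resulting standard deviations and correlation coefficients. The parameter values, seed, and number of points are arbitrary choices for this illustration, and matching these three summary numbers is of course only a partial check on the full distribution.

```python
import math
import random

random.seed(1)    # fixed seed so that the run is repeatable

sigma_x, sigma_y, r = 2.0, 1.0, 0.6    # illustrative values, not from the text
n = 100_000
root = math.sqrt(1 - r**2)

def summarize(points):
    """Return (s_x, s_y, r) for a list of (x, y) pairs, dividing by the number of points."""
    k = len(points)
    x_bar = sum(x for x, _ in points) / k
    y_bar = sum(y for _, y in points) / k
    var_x = sum((x - x_bar) ** 2 for x, _ in points) / k
    var_y = sum((y - y_bar) ** 2 for _, y in points) / k
    cov = sum((x - x_bar) * (y - y_bar) for x, y in points) / k
    return math.sqrt(var_x), math.sqrt(var_y), cov / math.sqrt(var_x * var_y)

# Eq. (6.35): pick X first, then add an X-independent Z to (r sigma_y / sigma_x) X.
via_x = []
for _ in range(n):
    x = random.gauss(0.0, sigma_x)
    y = (r * sigma_y / sigma_x) * x + random.gauss(0.0, sigma_y * root)
    via_x.append((x, y))

# Eq. (6.36): pick Y first, then add a Y-independent Z to (r sigma_x / sigma_y) Y.
via_y = []
for _ in range(n):
    y = random.gauss(0.0, sigma_y)
    x = (r * sigma_x / sigma_y) * y + random.gauss(0.0, sigma_x * root)
    via_y.append((x, y))

summary_x = summarize(via_x)
summary_y = summarize(via_y)
print(summary_x)
print(summary_y)
```

Both printed triples should come out close to (2, 1, 0.6), up to statistical fluctuations.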


6.6 The standard-deviation box 

Assuming that all of our variables are Gaussians with means of zero, consider the 
constant-p(x, y) ellipse shown in Fig. 6.12. If we are given nothing but the tilted 
ellipse in the figure, can we determine the value of m in the Y = mX + Z relation 
that produces this ellipse? Indeed we can, in the following manner. 


Figure 6.12: For a given value of X, the Y values on a constant-p ellipse are symmetrically
located above and below the Y = mX point.

Recall that for a given value of X, the Y values are symmetrically distributed
above and below the Y = mX point, because Y = mX + Z, and because Z is a
Gaussian with mean zero. Since this holds for all values of X, the Y = mX line
must be the line that bisects the vertical span between any two vertically-aligned
points on the ellipse, as indicated in Fig. 6.12.¹ The Y = mX line must therefore
intersect the ellipse at the ellipse’s leftmost and rightmost points, where the slope
is vertical. This is true because if the slope weren’t vertical at an intersection point,
then the distance d either above or below the intersection point would be zero, while
the other distance below or above the point would be nonzero.

We can check that the numbers work out in Fig. 6.12. We generated this ellipse
by arbitrarily choosing $\sigma_x = (1.5)\sigma_y$ and r = 0.56. If you plug these values
into Eq. (6.34) and then set the exponent in p(x,y) equal to an arbitrary (negative)
constant, you will produce an ellipse with the shape shown. (The common value
of p(x,y) associated with the ellipse doesn’t matter, because that just determines
the overall size of the ellipse, and not the shape.) From Eq. (6.18) we know that
$m = r\sigma_y/\sigma_x$, which gives m = 0.37 here. And this is indeed the slope of the tilted
line in Fig. 6.12.

Let’s now consider an ellipse with a particular value of p(x,y), namely
$p(x,y) = e^{-1/2}\,p(0,0)$. We are now interested in the actual size of the ellipse. From Eq. (6.34)
we know that $p(0,0) = 1/(2\pi\sigma_x\sigma_y\sqrt{1 - r^2})$, although this exact value won’t concern
us. Only the relative factor of $e^{-1/2}$ will matter here. For all points on the
$p(x,y) = e^{-1/2}\,p(0,0)$ ellipse (we’ll call this the “$e^{-1/2}$ ellipse”), p(x,y) is smaller
than its value at the origin by a factor of $e^{-1/2}$. The exponent in Eq. (6.30) or
Eq. (6.34) therefore equals -1/2. Any other factor would serve the purpose here
just as well, but we’re picking $e^{-1/2}$ because it parallels the one-standard-deviation
probability density for the single-variable Gaussian distribution, $e^{-x^2/2\sigma^2}/\sqrt{2\pi}\,\sigma$.
If $x = \sigma$ then $p(x) = e^{-1/2}\,p(0)$.

What is the value of x at the rightmost point on the $e^{-1/2}$ ellipse? Since we
know from above that the line y = mx passes through this point, the second term
in the exponent in Eq. (6.30) equals zero. The first term must therefore equal -1/2,
which means that x is simply $\sigma_x$. The same reasoning holds for the leftmost point,
so the $e^{-1/2}$ ellipse ranges from $x = -\sigma_x$ to $x = \sigma_x$. We will now take advantage
of the fact that Eq. (6.34) is symmetric in x and y. This means that any statement
we can make about x, we can also make about y. Therefore, by the same reasoning
with x and y switched, the highest point on the ellipse has a y value of $\sigma_y$, and
the lowest point has a y value of $-\sigma_y$. So the “bounding box” around the $e^{-1/2}$
ellipse is described by the lines $x = \pm\sigma_x$ and $y = \pm\sigma_y$. This box is called the
“standard-deviation box” and is shown in Fig. 6.13.²
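
The bounding-box result can be checked numerically. The sketch below traces out the $e^{-1/2}$ ellipse for a few values of r and reports the largest x and y values found. Note that the parametrization used to trace the ellipse is an assumption of this sketch, not something taken from the text; the code verifies that each traced point does make the exponent in Eq. (6.34) equal to -1/2.

```python
import math

def ellipse_point(t, sigma_x, sigma_y, r):
    """A point on the e^{-1/2} ellipse of Eq. (6.34), for a parameter t in [0, 2 pi).

    This parametrization is an assumption of the sketch. Substituting it into
    x^2/sx^2 + y^2/sy^2 - 2 r x y/(sx sy) gives 1 - r^2 for every t.
    """
    x = sigma_x * math.cos(t)
    y = sigma_y * (r * math.cos(t) + math.sqrt(1 - r**2) * math.sin(t))
    return x, y

def exponent(x, y, sigma_x, sigma_y, r):
    """The exponent in Eq. (6.34)."""
    q = x**2 / sigma_x**2 + y**2 / sigma_y**2 - 2 * r * x * y / (sigma_x * sigma_y)
    return -q / (2 * (1 - r**2))

sigma_x, sigma_y = 1.5, 1.0    # sigma_x = (1.5) sigma_y, as in Fig. 6.12
n_points = 20_000
results = {}
for r in (0.19, 0.56, 0.93):
    pts = [ellipse_point(2 * math.pi * k / n_points, sigma_x, sigma_y, r)
           for k in range(n_points)]
    # Largest deviation of the exponent from -1/2 over all traced points.
    worst = max(abs(exponent(x, y, sigma_x, sigma_y, r) + 0.5) for x, y in pts)
    x_max = max(x for x, _ in pts)
    y_max = max(y for _, y in pts)
    results[r] = (x_max, y_max, worst)
    print(r, round(x_max, 4), round(y_max, 4), worst < 1e-9)
```

For each r, the largest x and y values should come out at $\sigma_x = 1.5$ and $\sigma_y = 1$ (to the precision of the grid), even though the three ellipses have quite different shapes.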

¹ It isn’t so obvious that given an arbitrary tilted ellipse, the locus of points with this property is
in fact a line. But the derivation of p(x,y) in Section 6.5 basically proves it. Just work backwards
starting with the elliptical distribution in Eq. (6.34), and you will find in Eq. (6.30) that p(x,y) decreases
symmetrically above and below the y = mx line.

² Alternatively, you can find the bounding box by using calculus and taking the differential of the
exponent in Eq. (6.34). Setting the result equal to zero gives a relation between nearby points on a given
ellipse. Setting dx = 0 then gives a relation between x and y at the leftmost and rightmost points, where
the slope is infinite. You can plug this relation back into the expression for the ellipse to find the
x values at the leftmost and rightmost points. Similar reasoning with dy = 0 gives the y values at the
highest and lowest points.

Figure 6.13: The standard-deviation box.


Note that nowhere in the preceding paragraph did we make use of any specific
value of the correlation coefficient r (or equivalently, of m). This means that for
given values of $\sigma_x$ and $\sigma_y$, the $e^{-1/2}$ ellipse is always bounded by the same box,
for any value of r. Different values of r simply determine the shape of the ellipse
inside the box. Two examples are shown in Fig. 6.14. They both have the same $\sigma_x$
and $\sigma_y$ values (where $\sigma_x = (1.5)\sigma_y$), but the r for the thin ellipse is about 0.93,
while the r for the wide ellipse is about 0.19. We will discuss in the next section
how to determine r from an ellipse and its standard-deviation box. Note that the
two ellipses in Fig. 6.14 have different values of $\sigma_z$; they have the same $\sigma_y$, so the
different values of r lead to different values of $\sigma_z = \sigma_y\sqrt{1 - r^2}$, from Eq. (6.18).
The thin ellipse, which has a larger r, has a smaller $\sigma_z$.


Figure 6.14: The standard-deviation box for an ellipse depends only on $\sigma_x$ and $\sigma_y$, and not
on r. Different values of r yield different shapes of ellipses inside the box.


6.7 The regression lines

Given a $p(x,y) = e^{-1/2}\,p(0,0)$ ellipse, there are a number of lines that we might
reasonably want to draw, as shown in Fig. 6.15.

• We can draw the y = mx line passing through the leftmost and rightmost
points on the ellipse, along with the analogous line passing through the highest
and lowest points. These are the solid lines shown, and they are called regression
lines. The reason for this name will become apparent in Section 6.8.
These lines are very important.

• We can draw the standard-deviation line passing through the corners of the
standard-deviation box. This is the long-dashed line, with slope $\sigma_y/\sigma_x$. This
line is somewhat important.

• We can draw the line along the major axis of the ellipse. This is the short-dashed
line. It might seem like this line should have some importance, being
a symmetry axis of the ellipse. However, it actually doesn’t have much to do
with anything in probability, so we won’t be concerned with it.



Figure 6.15: The two regression lines (solid), the standard-deviation line (long-dashed), and 
the unimportant symmetry axis of the ellipse (short-dashed). 

How do the slopes of the two regression lines relate to the slope $\sigma_y/\sigma_x$ of the
standard-deviation line? The slope of the lower³ regression line is simply m, which
equals $r\sigma_y/\sigma_x$ from Eq. (6.18). Equivalently, the slope is the $r\sigma_y/\sigma_x$ coefficient
of X in Eq. (6.35). This slope is just r times the $\sigma_y/\sigma_x$ slope of the standard-deviation
line, which is about as simple a result as we could hope for. Similarly,
Eq. (6.36) tells us that if we tilt our head sideways (so that the X axis is now vertical
and the Y axis is horizontal), then the slope of the upper regression line is $r\sigma_x/\sigma_y$
(ignoring the sign), because this is the coefficient of Y in Eq. (6.36). This slope
(with our head tilted) is simply r times the $\sigma_x/\sigma_y$ slope (with our head tilted) of the
standard-deviation line, ignoring the sign.

³ By “lower” we mean the line that is lower in the first quadrant. Likewise for the “upper” regression
line. Of course, these adjectives are reversed in the third quadrant. So perhaps we should be labeling the
lines as “shallower” and “steeper.” But we’ll go with lower and upper, and you’ll know what we mean.

The two regression lines pass through the points of tangency between the ellipse 
and the standard-deviation box. So from the previous paragraph, we see that the 
tangency points are the same fraction r of the way from each of the coordinate axes 
to the upper-right corner of the box. This is shown in Fig. 6.16. Determining either 
of these (identical) fractions therefore gives us the correlation coefficient r. This 
conclusion checks in the extreme cases of r = 1 (perfect correlation, thin ellipse) 
and r = 0 (zero correlation, wide ellipse with axes parallel to the coordinate axes). 


Figure 6.16: The points of tangency between the ellipse and the standard-deviation box are
the same fraction r of the way from each of the coordinate axes to the upper-right corner of
the box.

Note that the slope of the standard-deviation line is the geometric mean of the
slopes of the two regression lines as they appear on the paper (with no head tilting).
This is true because the slope of the upper regression line (with no head tilting) is
the reciprocal of the slope with a tilted head, so it equals $\sigma_y/r\sigma_x$; this is indicated
above in Fig. 6.15. The geometric mean of the two slopes as they appear on the
paper is then

$$\sqrt{\frac{r\sigma_y}{\sigma_x}\cdot\frac{\sigma_y}{r\sigma_x}} = \frac{\sigma_y}{\sigma_x},$$

which is the slope of the standard-deviation line, as desired.

Another way to determine r is the following. What is the y-intercept of the
$p(x,y) = e^{-1/2}\,p(0,0)$ ellipse in Fig. 6.16? To answer this, we can use the fact
that x equals zero on the y axis. So if we want the exponent in Eq. (6.34) to equal
-1/2, as it does for all points on the ellipse, then since x = 0 we see that we
need $y = \sigma_y\sqrt{1 - r^2}$. This makes sense, because this is simply $\sigma_z$ for our original
random variable Z, which we labeled as $Z_{X\text{-ind}}^{\sigma_y\sqrt{1 - r^2}}$ in Eq. (6.35). Said in another
way, when x = 0 the exponent in Eq. (6.30) equals -1/2 when $y = \sigma_z$.

By the same reasoning, the x-intercept of the $e^{-1/2}$ ellipse is $x = \sigma_x\sqrt{1 - r^2}$,
which is the $\sigma_z$ for the random variable $Z_{Y\text{-ind}}^{\sigma_x\sqrt{1 - r^2}}$ in Eq. (6.36). The intercepts
are indicated in Fig. 6.17. Measuring either of these intercepts and dividing by the
standard deviation along the corresponding axis gives $\sqrt{1 - r^2}$, which gives r. This
conclusion checks in the extreme cases of r = 1 and r = 0.
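
Both ways of reading r off the ellipse can be tried out numerically. The sketch below treats the $e^{-1/2}$ condition from Eq. (6.34) as a quadratic equation in y for a given x, and then recovers r from (a) the tangency point on the right side of the box and (b) the y-intercept. The helper function and its name are made up for this illustration; the parameter values are the ones quoted for Fig. 6.12.

```python
import math

def y_on_ellipse(x, sigma_x, sigma_y, r):
    """Solve the e^{-1/2} ellipse condition of Eq. (6.34) for y at a given x.

    The condition x^2/sx^2 + y^2/sy^2 - 2 r x y/(sx sy) = 1 - r^2 is a quadratic
    in y. This returns its two real roots as (lower, upper), or None if the
    vertical line at x misses the ellipse.
    """
    u = x / sigma_x
    disc = (1 - r**2) * (1 - u**2)    # discriminant, up to a positive factor
    if disc < 0:
        return None
    root = math.sqrt(disc)
    return sigma_y * (r * u - root), sigma_y * (r * u + root)

sigma_x, sigma_y, r = 1.5, 1.0, 0.56    # the values quoted for Fig. 6.12

# (a) Tangency with the right side of the box, x = sigma_x: the two roots coincide.
low, high = y_on_ellipse(sigma_x, sigma_x, sigma_y, r)
r_from_tangency = high / sigma_y

# (b) Intercept with the y axis, x = 0.
_, y_intercept = y_on_ellipse(0.0, sigma_x, sigma_y, r)
r_from_intercept = math.sqrt(1 - (y_intercept / sigma_y) ** 2)

print(low == high, r_from_tangency, r_from_intercept)
```

The tangency point sits at $y = r\sigma_y$, so the line from the origin through it has slope $r\sigma_y/\sigma_x \approx 0.37$, in agreement with the value of m found for Fig. 6.12.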



Figure 6.17: The intersection points between the ellipse and the coordinate axes are the same
fraction $\sqrt{1 - r^2}$ of the way from the origin to the sides of the standard-deviation box.

The critical property of the regression lines is that they are the “lines of averages.”
We saw in Fig. 6.12 that the lower regression line bisects the vertical span
between any two vertically aligned points on a constant-p ellipse. This follows from
the fact that the $Z_{X\text{-ind}}^{\sigma_y\sqrt{1 - r^2}}$ distribution in Eq. (6.35) is a (symmetric) Gaussian with
zero mean. Similarly, the upper regression line bisects the horizontal span between
any two horizontally aligned points on a constant-p ellipse. This follows from the
fact that the $Z_{Y\text{-ind}}^{\sigma_x\sqrt{1 - r^2}}$ distribution in Eq. (6.36) is a (symmetric) Gaussian with
zero mean. Two vertical and two horizontal pairs of equal distances are shown in
Fig. 6.18.

We’ve been drawing constant-p(x,y) ellipses for a while now, so let’s return
to a scatter plot (generated numerically from Gaussian distributions). Fig. 6.19 
illustrates the same idea that Fig. 6.18 does. If we look at an arbitrary vertical strip 
of points, the distribution within the strip is symmetric around the intersection of the 
strip with the lower regression line (at least in the limit of a large number of points). 
And if we look at an arbitrary horizontal strip of points, the distribution within the 
strip is symmetric around the intersection of the strip with the upper regression line 
(for a large number of points). The intersections are indicated by the large white 
dots in the figure. 

When dealing with, say, vertical strips, remember that it is the same $Z_{X\text{-ind}}^{\sigma_y\sqrt{1 - r^2}}$
distribution that holds for all strips. This follows from Eq. (6.35). The only reason
why the spread of points (relative to the regression line) appears to be smaller in the
extremes of the plot is that there are fewer points with values of X out there. But
given a value of X, the value of Y has the same distribution $Z_{X\text{-ind}}^{\sigma_y\sqrt{1 - r^2}}$ relative to
the Y = mX point on the lower regression line. That is, the probability of obtaining
a certain value of $Z_{X\text{-ind}}^{\sigma_y\sqrt{1 - r^2}}$ is independent of X, because this Z is assumed to be
independent of X.

Figure 6.18: The regression lines bisect the horizontal or vertical spans between any two
horizontally or vertically aligned points on the ellipse.

Figure 6.19: The distribution of points within a vertical strip is symmetric around the
intersection of the strip and the lower regression line. Likewise for a horizontal strip and the
upper regression line.

Let’s reiterate where each of the two regression lines is relevant: 

• The lower regression line is relevant if you are considering X as the independent
variable, with Y being dependent on X; see Eq. (6.35). The lower
regression line gives the average value of Y associated with each value of X.

• Conversely, the upper regression line is relevant if you are considering Y as
the independent variable, with X being dependent on Y; see Eq. (6.36). The
upper regression line gives the average value of X associated with each value
of Y.
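
The roles of the two lines can be checked numerically. The following is only a minimal sketch, not the procedure used to make Fig. 6.19. It assumes the model Y = mX + Z with Gaussian X and Z, and the function name, strip locations, strip width, and sample size are arbitrary choices made for illustration.

```python
import math
import random


def strip_means(m=1.0, sigma_x=1.0, sigma_z=1.0, n=200_000, seed=0):
    """Compare strip averages with the two regression lines for Y = m*X + Z.

    All parameter values are illustrative assumptions, not values from the text.
    """
    rng = random.Random(seed)
    sigma_y = math.sqrt(m**2 * sigma_x**2 + sigma_z**2)
    r = m * sigma_x / sigma_y

    xs = [rng.gauss(0.0, sigma_x) for _ in range(n)]
    ys = [m * x + rng.gauss(0.0, sigma_z) for x in xs]

    x0, y0, half_width = 1.0, 1.0, 0.05  # hypothetical strip locations
    vertical = [y for x, y in zip(xs, ys) if abs(x - x0) < half_width]
    horizontal = [x for x, y in zip(xs, ys) if abs(y - y0) < half_width]

    return {
        # Average Y in the vertical strip vs. the lower regression line.
        "mean_y_in_vertical_strip": sum(vertical) / len(vertical),
        "lower_line_y": (r * sigma_y / sigma_x) * x0,  # equals m * x0
        # Average X in the horizontal strip vs. the upper regression line.
        "mean_x_in_horizontal_strip": sum(horizontal) / len(horizontal),
        "upper_line_x": (r * sigma_x / sigma_y) * y0,
    }


result = strip_means()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

With these choices the vertical-strip average should land near the lower line and the horizontal-strip average near the upper line, up to statistical fluctuations.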


Remarks: 

1. You might think it odd that the line that cuts through the middle of the vertical strips 
in Fig. 6.19 is different from the line that cuts through the middle of the horizontal 
strips. It might seem like a single line (perhaps the short-dashed symmetry axis in 
Fig. 6.15) should do the trick for both types of strips. And indeed, when there is a 
high correlation ($r \approx 1$), the two regression lines are nearly the same, so one line
essentially does the trick. But in the small-correlation limit ($r \approx 0$), it is clear that
two lines are needed. In the r = 0 case in Fig. 6.5, the lower regression line is the 
x axis (which cuts through the middle of any vertical strip), and the upper regression 
line is the y axis (which cuts through the middle of any horizontal strip). These are as 
different as two lines can be, being perpendicular. There is no way that a single line 
can cut through the middle of both the vertical and horizontal strips. This fact is true 
for all r except r = ±1, although it is most obvious for small r. 

2. Consider the lower regression line in Fig. 6.18 or Fig. 6.19. (The upper line would 
serve just as well.) At first glance, this line might look incorrect, because it doesn’t 
look “balanced” properly, in the sense that the symmetry axis of the ellipse is balanced, 
and this line is different from the symmetry axis. But that is fine. The important thing 
is that any vertical strip of points is cut in half by the line. This is not the case with the 
symmetry axis of the elliptical blob of points, which is (irrelevantly) balanced within 
the ellipse. 

3. If you are presented with some real data of $(x_i, y_i)$ points in a scatter plot (as opposed
to the above numerically-generated plots), you can draw the regression lines by calculating
the various quantities in the steps enumerated on page 290. The slopes of the
regression lines are given in Fig. 6.15 as $r\sigma_y/\sigma_x$ and $\sigma_y/(r\sigma_x)$, except with the $\sigma$’s
replaced with the s’s from Eq. (3.60).

4. In all of the above figures, we have been assuming that the means $\mu_x$ and $\mu_y$ are zero.
In the more general case where the means are nonzero, the regression lines intersect
at the point $(\mu_x, \mu_y)$, that is, at the middle of the blob of points. Equivalently, you can
define new variables by $X' = X - \mu_x$ and $Y' = Y - \mu_y$. The $(\mu_x, \mu_y)$ point in the X-Y
plane becomes the origin in the X'-Y' plane. ♣
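
As a rough illustration of the third and fourth remarks, here is a minimal sketch that computes both regression lines from a handful of data points. The function name and the four data points are made up for the example, and the sketch assumes the points are not all at the same x or y value.

```python
import math


def regression_lines(points):
    """Return the center, r, and the slopes of both regression lines."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n

    # Dividing by n or by n - 1 gives the same slopes, as long as one
    # convention is used for all three quantities.
    var_x = sum((x - mean_x) ** 2 for x, _ in points) / n
    var_y = sum((y - mean_y) ** 2 for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points) / n

    s_x, s_y = math.sqrt(var_x), math.sqrt(var_y)
    r = cov / (s_x * s_y)
    lower_slope = r * s_y / s_x
    # With r = 0 the upper line is vertical, so its slope is infinite.
    upper_slope = s_y / (r * s_x) if r != 0 else math.inf
    return (mean_x, mean_y), r, lower_slope, upper_slope


data = [(0, 0), (1, 1), (2, 1), (3, 3)]  # made-up points
center, r, lower_slope, upper_slope = regression_lines(data)
print(center, r, lower_slope, upper_slope)
```

Both lines are drawn through the returned center point, which plays the role of $(\mu_x, \mu_y)$ for the data.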

6.8 Two regression examples

To get some practice with regression lines, let’s do two examples. In both examples,
we’ve chosen the X and Y variables to be IQ (Intelligence Quotient) scores.
We’ve done this partly because IQ scores are easy and standard things to talk about, 
and partly because we want to draw some analogies between the two examples. 
However, in dwelling on IQs, we certainly don’t mean to imply that they’re terribly 
important. If you want to think of IQ as standing for something else like “Interesting 
Qualities” or “Illuminati Qualifications,” then by all means do! 


6.8.1 Example 1: Retaking a test 

A specific example 

Imagine that a large number of people have just taken an IQ test. Assume that the 
average score is 100, which is how an IQ test is designed. Consider all of the people 
who scored 130. The standard deviation of an IQ test is designed to be 15, so 130 
is two standard deviations above the mean. If this group of people takes another IQ 
test (or the same test, if we somehow arrange for them to have amnesia), is their 
average score expected to be higher than, lower than, or equal to 130? 

In answering this question, let’s make a model and specify the (reasonable) assumptions
of the model. We’ll assume that each person’s score is partially determined
by his/her innate ability and partially determined by random effects (misreading
a question, lucky guess, bad day, etc.). Although it’s hard to define “innate
ability,” let’s just take it to be a person’s average score on a large number of tests.
Our model therefore gives a person’s actual score Y on the test as their innate score
X, plus a random contribution Z which we’ll assume is independent of X. So the
distribution of Z is the same for everyone. A person’s score is then given by Y = X + Z.
This is just our old friend Eq. (6.2), with m = 1. For the sake of making some nice
scatter plots, we’ll assume that X and Z (and hence Y, from Problem 6.4) are Gaussian.

If we take the average of the equation Y = X + Z over a large number of tests
taken by a given person whose innate ability has the value $X = x_0$, we obtain
$\mu_y = x_0 + \mu_z$, where the $\mu_y$ here stands for the average of the given person. But the
person’s innate ability $x_0$ is defined simply as their average score $\mu_y$ over a large
number of tests. We therefore conclude (basically by definition) that the mean of
Z is $\mu_z = 0$. And we might as well measure X and Y relative to their population
means (which are both 100). So all of X, Y, and Z now have zero means.

Given all of these (quite reasonable) assumptions, what is the answer to the 
question we posed above? Will the people who scored 130 on their first test score 
(on average) higher, lower, or the same on their second test? To answer this, let’s 
look at a scatter plot of some numerically-generated X and Y values for the first test, 
shown in Fig. 6.20. We have plotted 5000 points, relative to the $\mu_x = \mu_y = 100$
averages. Since we’ve assumed continuous Gaussian distributions for X and Z, our
(X,Y) points in the plane don’t have integer values, as they would on an actual test.
But this won’t change our general results. 

[Figure 6.20 axes: X, innate ability (horizontal); Y, test score (vertical). The upper regression line is labeled Y = 2X.]

Figure 6.20: The average X value of the people in the shaded strip corresponds to the vertical
black line (defined by the intersection of the shaded strip and the upper regression line). The
average score of these people on a second test is given by the Y value of the white dot shown
on the lower regression line.


To generate the plot in Fig. 6.20, we have arbitrarily assumed $\sigma_z = \sigma_x$. (This is
probably too large a value for $\sigma_z$. In real life, the standard deviation of the random
contribution Z is likely a fair bit smaller than the standard deviation of the innate
ability X. But using a large value for $\sigma_z$ makes things clearer in the plot.) From
Eq. (6.5) with m = 1 and $\sigma_z = \sigma_x$, we obtain $\sigma_y = \sqrt{2}\,\sigma_x$. Since IQ tests are
designed to have a standard deviation of 15, we have set $\sigma_y = 15$. This then implies
that $\sigma_x$ and $\sigma_z$ are both equal to $15/\sqrt{2} \approx 10.6$. These values, along with m = 1,
were used to generate the plot. From Eq. (6.6), the correlation coefficient is

$$r = \frac{m\sigma_x}{\sigma_y} = \frac{1\cdot\sigma_x}{\sqrt{2}\,\sigma_x} = \frac{1}{\sqrt{2}} \approx 0.71. \tag{6.38}$$


The two regression lines are drawn. The slope of the lower line is just m = 1.
And from Fig. 6.15 the slope of the upper line is $\sigma_y/(r\sigma_x) = \sqrt{2}/(1/\sqrt{2}) = 2$. As a
double check, the geometric mean of these two slopes is $\sqrt{1\cdot 2}$, which is correctly
the slope of the standard-deviation line, $\sigma_y/\sigma_x = \sqrt{2}$, as we noted in Eq. (6.37).
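
The numbers in this setup are easy to reproduce. The short sketch below just redoes the arithmetic; the variable names are our own choices, and the relation $\sigma_y^2 = m^2\sigma_x^2 + \sigma_z^2$ is used in the form quoted above for m = 1 and $\sigma_z = \sigma_x$.

```python
import math

m = 1.0
sigma_y = 15.0
# With m = 1 and sigma_z = sigma_x, sigma_y = sqrt(2) * sigma_x.
sigma_x = sigma_z = sigma_y / math.sqrt(2)

r = m * sigma_x / sigma_y                # correlation coefficient, Eq. (6.38)
lower_slope = r * sigma_y / sigma_x      # slope of the lower regression line
upper_slope = sigma_y / (r * sigma_x)    # slope of the upper regression line
sd_slope = sigma_y / sigma_x             # slope of the standard-deviation line

print(f"sigma_x = sigma_z = {sigma_x:.1f}")
print(f"r = {r:.2f}")
print(f"slopes: lower = {lower_slope:.2f}, upper = {upper_slope:.2f}")
print(f"geometric mean = {math.sqrt(lower_slope * upper_slope):.4f}, "
      f"sd line = {sd_slope:.4f}")
```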

Now consider all of the people who scored 130 on the first test. Since we’ve
subtracted off the mean score of 100 from the Y values, a score of 130 corresponds
to $y_1 = 30$, or equivalently $y_1 = 2\sigma_y$. (The subscript “1” refers to the first test.)
Since the Y values in our model take on a continuum of values, no one will have
a score of exactly 30, so let’s look at a small interval of scores around 30. This
interval is indicated by the shaded strip in Fig. 6.20. There are about 70 points in
this strip. So our goal is to predict the average score of the 70 people associated
with these points when they take a second test.

What is the average X value (innate ability) of these 70 points? Answering
this question is exactly what the upper regression line is good for. As we noted in
Fig. 6.19, the upper regression line passes through the mean of any horizontal strip
of data points (on average). We are therefore concerned with the X value of the
intersection of the horizontal strip and the upper regression line. The slope of the
upper regression line is 2, so the intersection point $(x_{\text{avg}}, y_1)$ satisfies $y_1/x_{\text{avg}} = 2$.
Since $y_1 = 30$, this gives $x_{\text{avg}} = 15$. Therefore, 15 (or really 115) is the average
innate ability X of the 70 people who scored 130 on the first test.

We now claim that when these 70 people take a second test, their average score
will simply be their average innate ability, 115. This is true because if we take the
average of the Y = X + Z relation over the 70 people in the group, the Z values
average out to zero (or rather, the expectation value is zero). So we are left with
$y_{\text{avg}} = x_{\text{avg}}$. (We should probably be using the notation $y_{2,\text{avg}}$, to make it clear that
we’re talking about the second test, but we won’t bother writing the 2.)

In the more general case where Y = mX + Z, taking the average of this relation
yields

$$y_{\text{avg}} = m x_{\text{avg}}. \tag{6.39}$$

But $m x_{\text{avg}}$ is the height of the point on the lower regression line with an X value
of $x_{\text{avg}}$. To obtain this result graphically, just draw a vertical line through $(x_{\text{avg}}, y_1)$
and look at the intersection of this line with the lower regression line, indicated by
the white dot in Fig. 6.20. The Y value of this dot is the desired average score on
the second test. (Having determined the average second score of the 70 people, we
might also want to determine the distribution of their scores. This is the task of
Problem 6.6.)

The answer to the question posed at the beginning of this section is therefore
“lower than 130.” Additionally, given the various parameters we arbitrarily chose,
we can be quantitative about how much lower the new average score of 115 is.
Additively, it is 15 points lower. Multiplicatively, it is 1/2 as high above the mean,
100, as the original common score, 130, was. Note that since Eq. (6.38) gave $r = 1/\sqrt{2}$
in our setup, we have $r^2 = 1/2$. The agreement of these factors of 1/2 is no
coincidence, as we will show below.
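
The drop from 130 to 115 can also be seen in a direct simulation. What follows is a minimal sketch under the same assumptions as the example (Y = X + Z, Gaussian X and Z, $\sigma_x = \sigma_z = 15/\sqrt{2}$). The function name, the sample size, and the width of the strip around 130 are our own illustrative choices, so the output will fluctuate a little around the predicted value.

```python
import math
import random


def retest_average(first_score=130.0, mean=100.0, sigma_y=15.0,
                   half_width=1.0, n=200_000, seed=1):
    """Simulate two tests and average over people near `first_score` on test 1."""
    rng = random.Random(seed)
    sigma_x = sigma_z = sigma_y / math.sqrt(2)  # assumption used in the example
    abilities, second_scores = [], []
    for _ in range(n):
        x = rng.gauss(0.0, sigma_x)             # innate ability (relative to mean)
        y1 = x + rng.gauss(0.0, sigma_z)        # first test
        if abs(mean + y1 - first_score) < half_width:
            y2 = x + rng.gauss(0.0, sigma_z)    # second test, fresh random part
            abilities.append(mean + x)
            second_scores.append(mean + y2)
    count = len(abilities)
    return sum(abilities) / count, sum(second_scores) / count, count


avg_ability, avg_second, count = retest_average()
print(f"{count} people scored about 130 on the first test")
print(f"average innate ability:    {avg_ability:.1f}")
print(f"average second-test score: {avg_second:.1f}")
```

Both averages should come out close to 115, in line with the regression-line argument above.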

General discussion 

Looking at Fig. 6.20, it is clear why the 70 people in the shaded horizontal strip 
have an average on the second test that is lower than 130. The upper regression line 
lies to the left of the lower regression line (in the upper righthand quadrant), so the 
intersection of the horizontal strip with the upper regression line lies to the left of its 
intersection with the lower regression line. Translation: the average innate ability 
X of the 70 people (which is given by the intersection of the horizontal strip with 
the upper line) is smaller than the innate ability that would correspond to a score 
of 130 if there were no random Z effect (which is given by the intersection of the
horizontal strip with the lower line). 

In fact, in Fig. 6.20 it happens to be the case that all 70 points in the strip lie 
to the left of, and hence above, the lower regression line. That is, they all involve 
positive contributions from Z. Of course, if we were to generate the plot again, 
there might be some points in the strip that lie to the right of the lower line (with 
negative contributions from Z). Or if we had 50,000 points instead of 5000, we 
would undoubtedly have some such points. But for any (large) total number of 
points, there will be more of them in the shaded strip that have positive Z values 
than negative Z values. 

The preceding observation provides an intuitive way of understanding why the 
average on the second test is lower than 130. Since Y = X + Z, there are two basic
possibilities that lead to a score of Y = 130 on the first test: A person can have an 
innate ability X that is less than 130 and get lucky with a positive value of Z, or they 
can have an innate ability that is greater than 130 and get unlucky with a negative 
value of Z. The first of these possibilities is more likely, on average, because 130 
is greater than the mean of 100, which implies that there are more people with an 
innate ability of $130 - a$ than $130 + a$ (for any positive a), as shown in Fig. 6.21
for a = 10. So more of the 130 scorers have an innate ability that is less than
130, than greater than 130, consistent with what we observed in Fig. 6.20. In the 
end, therefore, the decrease in average score on the second test comes down to the 
obvious fact that a Gaussian has its peak in the middle and falls off on either side. 


Figure 6.21: There are more people with an innate ability of $130 - a$ than $130 + a$. A score of
130 is therefore more likely to come from a person with a lower innate ability who got lucky,
than from a person with a higher innate ability who got unlucky.
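
To get a feel for the size of this imbalance, we can compare the Gaussian density of innate abilities at $130 - a$ and $130 + a$. The sketch below assumes the illustrative value $\sigma_x = 15/\sqrt{2}$ from the example above; the helper function is our own.

```python
import math


def gaussian_density(x, mean, sigma):
    """Density of a Gaussian with the given mean and standard deviation."""
    return (math.exp(-((x - mean) ** 2) / (2 * sigma**2))
            / (sigma * math.sqrt(2 * math.pi)))


sigma_x = 15 / math.sqrt(2)  # innate-ability spread assumed in the example
a = 10
ratio = (gaussian_density(130 - a, 100, sigma_x)
         / gaussian_density(130 + a, 100, sigma_x))
print(f"people at {130 - a} outnumber people at {130 + a} by a factor of {ratio:.0f}")
```

With this (deliberately small) $\sigma_x$, the factor is roughly 200, which is why the 130 scorers are dominated by people whose innate ability is below 130.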

Everything we’ve been saying is relevant to a score that lies above the mean of 
100. If we start with a score that is smaller than 100, say Y = 70, then all of the 
above conclusions are reversed. (Note that the number 70 here has nothing to do 
with the 70 people in the Y = 130 shaded strip in Fig. 6.20!) The average on the 
second test will now increase toward the mean. It will be 85 on average, by all the 
same reasoning. All of the action in Fig. 6.20 will now be in the lower-left quadrant 
instead of the upper-right quadrant. In this new scenario, there are more people in 
the Y = 70 group who have an innate ability X that is greater than 70 but who got 
unlucky, than the other way around. The general conclusion is that for any score on 
the first test, the average score on the second test will be closer to the mean. This
effect is called regression toward the mean.

Let’s now prove a handy little theorem, which we alluded to above. This theorem
allows us to be quantitative about the degree to which averages regress toward
the mean.

Theorem 6.1 Consider the group of people who score $y_1$ points above the mean on
a test. If they take the test (or an equivalent one) again, then their average score on
the second test will be $y_{\text{avg}} = r^2 y_1$ points above the mean (on average), where r is
the correlation coefficient between the actual score and innate ability.

This theorem is consistent with the above example, because the correlation coefficient
was $r = 1/\sqrt{2}$, and the initial score of $y_1 = 30$ points above the mean was
reduced (on average) on the second test to $y_{\text{avg}} = r^2 \cdot 30 = 15$ points above the
mean.

Proof: The proof is quick. We simply need to reapply the reasoning associated
with Fig. 6.20, but now with general parameters instead of given numbers. If the
shaded strip in Fig. 6.20 has a height $y_1$, then from Fig. 6.15 the intersection of the
strip with the upper regression line has an X value of $x_{\text{avg}} = (r\sigma_x/\sigma_y)\,y_1$. This
is the average X value of the people who score $y_1$ on the first test. On the second
test, the Z values will average out to zero, so the lower regression line gives the
desired average second-test score of the “$y_1$ group” of people, via Eq. (6.39). With
$m = r\sigma_y/\sigma_x$ from Fig. 6.15 (we’ll work with a general m, instead of m = 1),
Eq. (6.39) gives

$$y_{\text{avg}} = m x_{\text{avg}} = \left(\frac{r\sigma_y}{\sigma_x}\right) x_{\text{avg}} = \left(\frac{r\sigma_y}{\sigma_x}\right)\left(\frac{r\sigma_x}{\sigma_y}\right) y_1 \implies y_{\text{avg}} = r^2 y_1. \tag{6.40}$$

∎


We stated the theorem in terms of a test-retaking setup with Y = X + Z, where m
equals 1. But as we just showed, the theorem holds more generally with Y = mX + Z,
where m is arbitrary. In such cases, the theorem can be stated (in a less catchy
manner) as, “Given a large set of data points, consider all of the points whose Y
value is $y_1$. Let the average X value of these points be $x_{\text{avg}}$. Then the average
Y value associated with $x_{\text{avg}}$ is $r^2 y_1$.” Or more succinctly, “The average Y value
associated with the average X value of the points with $Y = y_1$ equals $r^2 y_1$.”
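
The two-step reasoning in the proof (go across to the upper regression line, then up or down to the lower one) can be spelled out numerically for any choice of parameters. This is only a sketch; the function name and the second set of parameter values are arbitrary, and the relation $\sigma_y^2 = m^2\sigma_x^2 + \sigma_z^2$ is assumed for independent X and Z.

```python
import math


def second_average(y1, m, sigma_x, sigma_z):
    """Return (r, y_avg) for the group whose first value is y1, with Y = m*X + Z."""
    sigma_y = math.sqrt(m**2 * sigma_x**2 + sigma_z**2)
    r = m * sigma_x / sigma_y
    x_avg = (r * sigma_x / sigma_y) * y1     # upper regression line
    y_avg = (r * sigma_y / sigma_x) * x_avg  # lower regression line (slope m)
    return r, y_avg


# The test-retaking example: m = 1 and sigma_x = sigma_z.
r, y_avg = second_average(30.0, 1.0, 1.0, 1.0)
print(r**2 * 30.0, y_avg)

# An arbitrary choice of parameters, to illustrate the general-m statement.
r_general, y_avg_general = second_average(12.0, 2.5, 3.0, 0.7)
print(r_general**2 * 12.0, y_avg_general)
```

In both cases the two printed numbers agree (up to rounding), as Eq. (6.40) says they must.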

The theorem checks in two extremes. If r = 1, then all scores lie on the Y = X
line, or more generally the Y = mX line. The random Z value is always zero, so
all scores are exactly equal to the innate ability, or more generally exactly equal to
mX. A given person always scores the same every time they take the test. Everyone
who scored a 130 on the first test will therefore score a 130 on the second test
(and all future tests). So $y_{\text{avg}} = (1)^2 y_1$, consistent with Eq. (6.40). In the other
extreme where r = 0, Eq. (6.6) tells us that either m = 0 or $\sigma_x = 0$. So $\sigma_y = \sigma_z$.
Basically, everyone’s score is completely determined by the random contribution Z,
which means that the scores of any given group of people on the second test will be
random and will therefore average out to zero. So $y_{\text{avg}} = (0)^2 y_1$, again consistent
with Eq. (6.40).

The above theorem provides a nice way of determining the correlation coefficient
r between the actual score and innate ability, without doing any heavy calculations.
Just take a group of people who score $y_1$ points above the mean on the test,
and then have them take the test (or an equivalent one) again. If their new average
is $y_{\text{avg}}$, then r is given by

$$y_{\text{avg}} = r^2 y_1 \implies r = \sqrt{\frac{y_{\text{avg}}}{y_1}}. \tag{6.41}$$

It’s rather interesting that r can be determined by simply giving a second test, without
knowing anything about $\sigma_x$, $\sigma_y$, $\sigma_z$, or m!
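
In code, this recipe is a one-liner. The sketch below is a hypothetical helper (the name and the error handling are our own); it assumes both scores are measured relative to the mean, and with real data the result is only an estimate, because the observed second average fluctuates.

```python
import math


def correlation_from_retest(y1, y_avg):
    """Estimate r from a first score y1 and the group's second average y_avg."""
    if y1 == 0:
        raise ValueError("y1 must be nonzero (measured relative to the mean)")
    ratio = y_avg / y1
    if ratio < 0:
        raise ValueError("y_avg and y1 should lie on the same side of the mean")
    return math.sqrt(ratio)


r_estimate = correlation_from_retest(30.0, 15.0)  # numbers from the example above
print(f"r = {r_estimate:.2f}")
```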

The new property of r in Eq. (6.41) is one of many properties/interpretations of 
r that we’ve encountered in this chapter. Let’s collect them all together here. 


1. Eq. (6.6) tells us that r is (by definition) the fraction of $\sigma_y$ that can be attributed
to X.

2. Eq. (6.18) tells us that the slope m of the lower regression line is $m = r\sigma_y/\sigma_x$.
This means that r is the ratio of the slope of the regression line to the slope
of the standard-deviation line. This interpretation of r is evident in Figs. 6.15
and 6.16.

The preceding fact can be restated as: If we consider an X value that is n
times $\sigma_x$ above the mean, then the expected associated Y value is rn times
$\sigma_y$ above the mean.

3. Eq. (6.27) tells us that $1 - r^2$ is the factor by which the “badness” of a prediction
of Y is reduced if we take into account the particular value of X. This is
the same $1 - r^2$ term that appears in the $\sigma_y\sqrt{1 - r^2}$ $(= \sigma_z)$ length in Fig. 6.17.

4. Eq. (6.40) tells us that if we consider the people who score $y_1$ points above
the mean on a test, their average score on a second equivalent test will be
$y_{\text{avg}} = r^2 y_1$ points above the mean (on average).


6.8.2 Example 2: Comparing IQ’s 

Consider the following setup. A particular school has an equal number of girls 
and boys. On a given day, the students form self-selecting girl/boy pairs. Assume 
that there is a nonzero correlation between the IQ scores within each pair. That is, 
students with a high (or low) IQ tend to pair up with other students with a high (or 
low) IQ, on average.⁴ This is plausible, because students who are friends with each
other (and thus apt to pick each other as partners) might have similar priorities and 
study habits (or lack thereof). 

The question we will pose here is the following. Consider all of the girls who 
have a particular IQ score, say 130. Will their boy partners have (on average) an IQ 
score that is higher than, lower than, or equal to 130? 

⁴ By “IQ” or “IQ score” here, we mean a student’s innate ability, or equivalently their average score
on a large number of IQ tests. In this example, we aren’t concerned with the random fluctuations on each 
test, as we were in the above test-retaking example. 

To answer this, let’s pick some parameters and make a scatter plot of the IQ
scores. We’ll assume that both girls and boys have an average IQ of 100 and that the
standard deviation for both is 15. And as usual, we’ll assume that the underlying
distributions are Gaussian.⁵ In order to numerically generate a scatter plot, we’ll
need to pick a value of the correlation coefficient r. Let’s pick 0.6. The qualitative
answer to the above question won’t depend on the exact value. The resulting scatter
plot of 5000 points (it’s a big school!) is shown in Fig. 6.22. Each point is associated
with a girl/boy pair. The x coordinate is the boy’s IQ, and the y coordinate is the
girl’s IQ (relative to the average of 100).


[Figure 6.22: the horizontal axis is the boy’s IQ; the standard-deviation line is labeled in the plot.]

Figure 6.22: IQ scores of girl/boy pairs.


The analysis in the present setup is essentially the same as in the above test-retaking
example. The horizontal shaded strip in Fig. 6.22 indicates all pairs in
which the girl has an IQ within a small range around 130. (The vertical shaded strip
isn’t relevant yet; we’ll use it below.) To determine the average IQ score (that is,
the average x coordinate) of the boys in this group, we simply need to look at the
intersection (indicated by the upper large solid dot) of the horizontal shaded strip
with the upper regression line. As we know well by now, this is exactly what the
upper regression line is good for. This line passes through (on average) the middle
of the distribution of points in any horizontal strip.

⁵ We should emphasize that in real life, despite the central limit theorem (see Section 5.5), things are
often not as clean as we might lead you to believe by always picking nice Gaussian distributions. But the
qualitative results we obtain will generally still hold for messier distributions.

From Fig. 6.15, we know that if we tilt our head sideways, the slope of the 
upper regression line is r times the slope of the standard-deviation line, which is 1 
here, because we are assuming $\sigma_x = \sigma_y$ (= 15). So the (tilted-head) slope of the
upper regression line is 0.6. Its intersection point (the upper large solid dot) with the 
horizontal strip at y = 30 therefore has an x value of (0.6)(30) = 18. Geometrically, 
in the horizontal strip, the solid dot is r as far to the right as the hollow dot. So 18 
(or rather 118) is the desired average IQ of boys who are in pairs where the girl has 
an IQ of 130. The answer to the question posed above is therefore “lower than 130.” 

What do we conclude from this? That girls are smarter than boys? Or that 
girls actively pick boys with a lower IQ? No, neither of these conclusions logically
follows from our result. One way of seeing why is to note that we can apply all of
the above reasoning to pairs where the girl has an IQ that is lower than the overall 
mean of 100. Let’s say 70 instead of 130. The relevant action will then take place 
in the lower-left quadrant, and we will find that the average IQ of boys in this group 
is higher than 70. It is 100 - 18 = 82. 
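
The 118 and 82 results come from the same one-line rule: the partner’s expected IQ lies r times as far from the mean (in units of the respective standard deviations). A minimal sketch, with a function name and argument names that are our own:

```python
def expected_partner_iq(iq, r=0.6, mean=100.0, sigma_given=15.0, sigma_partner=15.0):
    """Average partner IQ for students with the given IQ, via the regression line."""
    return mean + r * (sigma_partner / sigma_given) * (iq - mean)


for girl_iq in (130, 70):
    print(girl_iq, "->", round(expected_partner_iq(girl_iq), 1))
```

The rule is symmetric in the two standard deviations, so the same function serves for either regression line, depending on which partner’s IQ is taken as given.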

The fact of the matter is that there isn’t much we can conclude. The lower/higher 
results that we have found are simply consequences of randomness. There isn’t 
anything deep going on here. The reason why girls with an IQ of 130 are paired 
with boys with lesser IQs (on average) is the same as the reason why, in the above 
test-retaking example, people who scored 130 had (on average) an innate ability less 
than 130. A high number such as 130 is more likely to be the result of a low number 
(innate ability or boy’s IQ) paired with a positive random effect, than a high number 
paired with a negative random effect. And as we noted in Fig. 6.21, this is due to 
the simple fact that a Gaussian has its peak in the middle and falls off on either side. 

The above “lower than 130” answer when r = 0.6 is consistent with the answer 
in the extreme case where r = 0. In this case, if we look at all of the pairs with girls 
who have an IQ of 130 (or any other number, for that matter), the boys in these pairs 
will have an average IQ of 100. This is true because if there is no correlation, then 
knowledge of the girl’s IQ is of no help in predicting the boy’s IQ; it is completely 
random. In the other extreme where r = 1 (perfect correlation), all of the boys in the
“130-girls” pairs will have an IQ of exactly 130, so their average will also be 130. 
The answer to the question is then “equal to 130.” But any degree of non-perfect 
correlation will change the answer to “lower than 130.” 

Let’s take our setup one step further. We found that girls with an IQ of 130 
are paired with boys who have an average IQ of 118. What if we now look at all 
boys with an IQ of 118? Can we use some sort of symmetry reasoning to say that 
these boys will be paired with girls whose IQ is 130, on average? No, because when 
we look at all boys with an IQ of 118 (plus or minus a little), this corresponds to 
looking at the vertical shaded strip in Fig. 6.22. This strip represents a different set 
of pairs from the ones in the horizontal shaded strip, which means that any attempt 
at a symmetry argument is invalid. We’re talking about a different group of pairs, 
so it’s apples vs. oranges. 

The average IQ of the girls in the pairs lying in the x = 18 (or really 118)
vertical shaded strip is given by the intersection (indicated by the lower large solid
dot) of the vertical shaded strip with the lower regression line. Again, this is exactly
what the lower regression line is good for. This line passes through (on average) the
middle of the distribution of points in any vertical strip.

From Fig. 6.15, we know that the slope of the lower regression line is r times 
the slope of the standard-deviation line, which is 1 here. So the slope of the lower 
regression line is 0.6. Its intersection point (the lower large solid dot) with the 
vertical strip at x = 18 therefore has a y value of (0.6)(18) = 10.8. (Note that 10.8
equals $r^2 \cdot 30 = (0.6)^2 \cdot 30$. This is the same factor of $r^2$ that we found above in the
test-retaking example.) Geometrically, in the vertical strip, the lower solid dot has r
times the height of the hollow dot. So 10.8 (or rather 110.8) is the desired average 
IQ of girls who are in pairs where the boy has an IQ of 118. But as above, we 
can’t logically conclude that boys are smarter than girls or that boys actively pick 
girls with a lower IQ. The smaller average is simply a consequence of the partially 
random nature of girl/boy pairings. 
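
Both strip averages can be checked with a simulated school. The sketch below is not the procedure used to produce Fig. 6.22; it uses one standard way of generating correlated Gaussian pairs (mixing two independent Gaussians), and the function names, sample size, and strip width are illustrative assumptions.

```python
import math
import random


def simulate_pairs(r=0.6, sigma=15.0, n=200_000, seed=3):
    """Generate (boy, girl) IQ pairs, relative to the mean, with correlation r."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)
        v = rng.gauss(0.0, 1.0)
        boy = sigma * u
        girl = sigma * (r * u + math.sqrt(1 - r**2) * v)
        pairs.append((boy, girl))
    return pairs


def strip_average(pairs, select, value, report, half_width=1.0):
    """Average coordinate `report` over pairs whose coordinate `select` is near `value`."""
    chosen = [p[report] for p in pairs if abs(p[select] - value) < half_width]
    return sum(chosen) / len(chosen)


pairs = simulate_pairs()
boys_avg = strip_average(pairs, select=1, value=30.0, report=0)   # girls near 130
girls_avg = strip_average(pairs, select=0, value=18.0, report=1)  # boys near 118
print(f"boys paired with 130-girls average  {100 + boys_avg:.1f}")
print(f"girls paired with 118-boys average {100 + girls_avg:.1f}")
```

Up to statistical fluctuations, the two averages should come out near 118 and 110.8, with neither strip "undoing" the other.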

As mentioned above, the calculations in this example are essentially the same 
as in the above test-retaking example. This is evidenced by the fact that Fig. 6.22 
has exactly the same structure as Fig. 6.20, although we didn’t draw the standard-deviation
line (with a slope of $\sqrt{2}$) in Fig. 6.20.


6.9 Least-squares fitting 

In Section 6.4 we saw that the lower regression line yields the best prediction for Y, 
given X. We’ll now present a different (but very much related) interpretation of the 
lower regression line. 

Assume that we are given a collection of n points $(x_i, y_i)$ in the plane, for example,
the 20 points we encountered in Fig. 6.10. How do we determine the “best-fit”
line that passes through the points? That is, how do we pick the line that best describes
the collection of points? Well, the first thing we need to do is define what
we mean by “best.” Depending on what definition we use, we might end up with
any of a variety of lines, for example, any of the four lines in Fig. 6.15.

We’ll go with the following definition: The best-fit line is the line that minimizes 
the sum of the squares of the vertical distances from the given points to the line. For 
example, in Fig. 6.23 the best-fit line of the given 10 points is the line that minimizes
the sum of the squares of the 10 vertical distances shown. Other definitions of the best-fit line
are possible, but this one has many nice properties. The seemingly simpler definition 
involving just the sum of the distances (not squared) has drawbacks; see the remark 
in the solution to Problem 6.11. 

How do we mathematically determine this “least-squares” line, given a set of
n points $(x_i, y_i)$ in the plane? If the line takes the form of y = Ax + B, then the
vertical distances are $|y_i - (Ax_i + B)|$. So our goal is to determine the parameters
A and B that minimize the sum,

$$S = \sum_{i=1}^{n} \left[y_i - (Ax_i + B)\right]^2. \tag{6.42}$$
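
Before minimizing S analytically, it may help to see it as an ordinary function of A and B that can be evaluated for any candidate line. A minimal sketch, with four made-up data points and two arbitrary candidate lines:

```python
def sum_of_squares(points, A, B):
    """S(A, B): sum of squared vertical distances from the points to y = A*x + B."""
    return sum((y - (A * x + B)) ** 2 for x, y in points)


data = [(0, 0), (1, 1), (2, 1), (3, 3)]  # made-up points

for A, B in [(1.0, 0.0), (0.9, -0.1)]:
    print(f"A = {A}, B = {B}: S = {sum_of_squares(data, A, B):.2f}")
```

The second candidate gives the smaller S of the two; the task in what follows is to find the pair (A, B) that beats all others.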


This minimization task involves some straightforward but tedious partial differentiation.
If you don’t know calculus yet, you can just skip to Eq. (6.46) below; the
math here isn’t terribly important. But the end result in Eq. (6.47) is very important.
And it is also rather pleasing; we will find that the “least-squares” line is none other
than the regression line! More precisely, it is the lower regression line, with slope
$m = r\sigma_y/\sigma_x$. See Fig. 6.15.

Figure 6.23: The best-fit line is the line that minimizes the sum of the squares of the vertical
distances from the given points to the line.

Let’s now do the math. If we expand the square in Eq. (6.42), we can write S as

$$\begin{aligned} S &= \sum y_i^2 - 2\sum y_i(Ax_i + B) + \sum (Ax_i + B)^2 \\ &= \sum y_i^2 - 2A\sum x_i y_i - 2B\sum y_i + A^2\sum x_i^2 + 2AB\sum x_i + nB^2. \end{aligned} \tag{6.43}$$

All of the sums go from 1 to n. Since the $(x_i, y_i)$ points are given, these sums are
all known quantities. S is therefore a function of only A and B. To minimize this
function, we must set the partial derivatives of S with respect to A and B equal to
zero. This yields the following two equations:

$$\begin{aligned} \frac{\partial S}{\partial A} = 0 &\implies 0 = -\sum x_i y_i + A\sum x_i^2 + B\sum x_i, \\ \frac{\partial S}{\partial B} = 0 &\implies 0 = -\sum y_i + A\sum x_i + nB. \end{aligned} \tag{6.44}$$

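Solving for B in the second equation and plugging the result into the first gives

$$0 = -\sum x_i y_i + A\sum x_i^2 + \left(\frac{\sum y_i - A\sum x_i}{n}\right)\sum x_i. \tag{6.45}$$

Solving for A (which is the slope of the y = Ax + B line) gives

$$A = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}. \tag{6.46}$$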


We can make this expression look a little nicer by multiplying both the numerator
and denominator by $1/n^2$. We’ll use brackets $\langle\ \rangle$ here to denote an average, so
$(\sum x_i)/n = \langle x\rangle$, $(\sum x_i y_i)/n = \langle xy\rangle$, etc. (We’re changing notation from our usual $\bar{x}$,
$\overline{xy}$, etc. notation for an average, because that would make the following expression
for A look rather messy and confusing.) The factors of n all work out nicely, and
we obtain the clean result,

$$A = \frac{\langle xy\rangle - \langle x\rangle\langle y\rangle}{\langle x^2\rangle - \langle x\rangle^2} = \frac{\text{Cov}(x,y)}{s_x^2} = \frac{r s_y}{s_x}. \tag{6.47}$$


The second expression here follows from the results of Problem 6.1, and the third
expression follows from Eq. (6.12), which yields $\text{Cov}(x,y) = r s_x s_y$. Our result for
A is about as simple as we could have hoped for, given the messiness of the original
expression for S in Eq. (6.43).
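
As a sanity check, the three forms of the slope can be computed side by side for a small data set. This is only a sketch with made-up points; averages are taken with a factor of 1/n throughout, which is the convention under which $\langle x^2\rangle - \langle x\rangle^2$ equals $s_x^2$.

```python
import math


def slope_three_ways(points):
    """Least-squares slope from raw sums, from Cov/s_x^2, and from r*s_y/s_x."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)
    sum_yy = sum(y * y for _, y in points)

    # Form with raw sums, as in Eq. (6.46).
    a_sums = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x**2)

    # Forms with averages, as in Eq. (6.47).
    cov = sum_xy / n - (sum_x / n) * (sum_y / n)
    var_x = sum_xx / n - (sum_x / n) ** 2
    var_y = sum_yy / n - (sum_y / n) ** 2
    a_cov = cov / var_x
    r = cov / math.sqrt(var_x * var_y)
    a_r = r * math.sqrt(var_y) / math.sqrt(var_x)
    return a_sums, a_cov, a_r


data = [(0, 0), (1, 1), (2, 1), (3, 3)]  # made-up points
a_sums, a_cov, a_r = slope_three_ways(data)
print(a_sums, a_cov, a_r)
```

All three numbers agree up to rounding, as they must.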

The slope A in Eq. (6.47) takes the same form as the slope m in Eq. (6.13),
with the distribution covariance Cov(X,Y) replaced with the data-point covariance
Cov(x,y), and with the distribution standard deviations $\sigma_x$ and $\sigma_y$ replaced with
the data-point standard deviations $s_x$ and $s_y$. The difference between these two
results is that the slope m in Eq. (6.13) was based on given distributions for X and Y
(equivalently, it was based on an infinite number of data points), whereas the slope
A in Eq. (6.47) is based on a finite number of data points. But when the number of
points is large, the distributions of x and y values mimic the underlying X and Y
distributions, so Cov(x,y) approaches Cov(X,Y), and the s’s approach the $\sigma$’s. The
slope A of the least-squares line therefore approaches the slope m of the regression
line for the complete distribution, as we promised above.



\[
A \to m \qquad \text{(large number of data points)}
\tag{6.48}
\]


This is a splendid result, although not so surprising. Both the least-squares line 
in the present section and the (lower) regression line in Section 6.4 minimize the 
sum of the squared vertical distances from the line. The latter is true because (as we 
saw in Section 6.4) it is true for any vertical strip. 
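
The limit in Eq. (6.48) can be illustrated with a small simulation. The sketch below assumes the Y = mX + Z model used throughout the chapter, with X and Z drawn as independent Gaussians; the particular parameter values, the seed, and the function names are arbitrary choices made for illustration.

```python
import random

def least_squares_slope(xs, ys):
    """Slope A = Cov(x,y)/s_x^2, as in Eq. (6.47)."""
    n = len(xs)
    avg_x, avg_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - avg_x) * (y - avg_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - avg_x) ** 2 for x in xs) / n
    return cov / var_x

def simulated_slope(n, m, sigma_x, sigma_z, seed=0):
    """Generate n points from Y = mX + Z and return their least-squares slope."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, sigma_x) for _ in range(n)]
    ys = [m * x + rng.gauss(0.0, sigma_z) for x in xs]
    return least_squares_slope(xs, ys)

m_true = 0.7  # assumed slope of the underlying regression line
for n in (10, 1000, 100000):
    print(n, simulated_slope(n, m_true, sigma_x=1.0, sigma_z=1.0))
```

The printed slopes should wander for small n and settle near m_true as n grows, although the exact values depend on the seed.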

To solve for B (the y-intercept of the y = Ax + B line), we can plug the A from
Eq. (6.46) into either of the equations in Eq. (6.44). The second one makes things 
a little easier. After simplifying and rearranging the factors of n as we did when 
producing the A in Eq. (6.47), we obtain 


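\[
B = \frac{\langle y \rangle \langle x^2 \rangle - \langle x \rangle \langle xy \rangle}{\langle x^2 \rangle - \langle x \rangle^2}
  = \langle y \rangle - A \langle x \rangle.
\tag{6.49}
\]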


The second expression here is derived in Problem 6.8. Note that B is zero if the
averages ⟨x⟩ and ⟨y⟩ are zero. Because of this, we usually aren’t too concerned with
B, since we can always arrange for it to be zero by measuring x_i and y_i relative
to their means (as we have generally been doing with the X and Y distributions
throughout this chapter). In this case, the best-fit line passes through the origin, and 
the only parameter needed to describe the line is the slope A. This line is the same 
as the lower regression line with slope m (assuming a large number of data points). 
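
The two forms of B in Eq. (6.49), and the claim that measuring the points relative to their means makes B vanish without changing A, can be checked numerically. The following is a minimal sketch on made-up data; the data and the helper name fit_line are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Return (A, B_first, B_second): the slope from Eq. (6.47) and the
    intercept from the first and second expressions in Eq. (6.49)."""
    n = len(xs)
    avg_x = sum(xs) / n
    avg_y = sum(ys) / n
    avg_xy = sum(x * y for x, y in zip(xs, ys)) / n
    avg_xx = sum(x * x for x in xs) / n
    denom = avg_xx - avg_x ** 2
    slope = (avg_xy - avg_x * avg_y) / denom
    b_first = (avg_y * avg_xx - avg_x * avg_xy) / denom
    b_second = avg_y - slope * avg_x
    return slope, b_first, b_second

# Hypothetical data points, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, b_first, b_second = fit_line(xs, ys)

# Measure the same points relative to their means.
avg_x, avg_y = sum(xs) / len(xs), sum(ys) / len(ys)
xs_centered = [x - avg_x for x in xs]
ys_centered = [y - avg_y for y in ys]
slope_centered, b_centered, _ = fit_line(xs_centered, ys_centered)

print(slope, b_first, b_second)
print(slope_centered, b_centered)
```

Here ⟨x⟩ = 3 and ⟨y⟩ = 6.02, so the second form gives B = 6.02 − (1.99)(3) = 0.05, and the centered data should return the same slope with an intercept of zero (up to rounding).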

In the above derivations of A and B, we treated y as being dependent on x. But 
what if we’re just given a blob of points in the plane, with x and y treated on equal 
footing? There is then no reason why vertical distances should be given preference 
over horizontal distances. It is therefore just as reasonable to define the best-fit line
as the line that minimizes the sum of the squares of the horizontal distances from 
the given points to the line. Due to all the symmetry we’ve seen earlier in this 
chapter, it’s a good bet that this line will turn out to be the upper regression line. 
And indeed, if we describe the best-fit line by the equation x = Cy + D, then the 
sum of the squares of the horizontal distances is 

\[
S = \sum_{i=1}^{n} \big[ x_i - (C y_i + D) \big]^2.
\tag{6.50}
\]


To find the value of C that minimizes this sum, there is no need to go through all
of the above mathematical steps again, because all we’ve done is modify Eq. (6.42)
by interchanging x and y, replacing A with C, and replacing B with D. C is therefore
obtained by simply letting x ↔ y in Eq. (6.47), which gives (using the fact that
Cov(y,x) = Cov(x,y))


\[
C = \frac{\mathrm{Cov}(x,y)}{s_y^2} = \frac{r s_x}{s_y}.
\tag{6.51}
\]


D is found similarly by switching x and y in Eq. (6.49). But if we assume that
the means ⟨x⟩ and ⟨y⟩ are zero, then D equals zero, just as B does. The x = Cy + D
line therefore takes the form of x = (r s_x/s_y)y, which has the same form as the
upper regression line in Figure 6.15, namely X = (rσ_x/σ_y)Y.

So which of the two least-squares lines is the actual best-fit line? Is it the one 
involving vertical distances, or the one involving horizontal distances? Well, if 
you’re considering y to be dependent on x, then the lower regression line is the 
best-fit line. It minimizes the variance of the y_i values measured relative to the
Ax_i + B values on the line. Conversely, if you’re considering x to be dependent
on y, then the upper regression line is the best-fit line. It minimizes the variance
of the x_i values measured relative to the Cy_i + D values on the line. Note that if
you’re given an elliptical blob of points in the x-y plane, you might subconsciously 
think that the best-fit line that serves both of the preceding purposes is the symmetry 
axis of the ellipse (the short-dashed line in Figure 6.15). But this line in fact serves 
neither of the purposes. 
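
To see the two least-squares lines side by side, the sketch below computes A from Eq. (6.47) and C from Eq. (6.51) for a small made-up scatter of points (the data are an assumption for illustration). Since A = r s_y/s_x and C = r s_x/s_y, the product AC should equal r², and the x = Cy + D line, which has slope 1/C when drawn in the x-y plane, should be steeper than the y = Ax + B line whenever 0 < r < 1.

```python
import math

def moments(xs, ys):
    """Return (Cov(x,y), s_x^2, s_y^2) for the data points."""
    n = len(xs)
    avg_x, avg_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - avg_x) * (y - avg_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - avg_x) ** 2 for x in xs) / n
    var_y = sum((y - avg_y) ** 2 for y in ys) / n
    return cov, var_x, var_y

# Hypothetical data points, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]

cov, var_x, var_y = moments(xs, ys)
slope_a = cov / var_x                # y = Ax + B, vertical distances
slope_c = cov / var_y                # x = Cy + D, horizontal distances
r = cov / math.sqrt(var_x * var_y)

print("slope of the y = Ax + B line:", slope_a)
print("slope of the x = Cy + D line in the x-y plane:", 1 / slope_c)
print("A*C =", slope_a * slope_c, " r^2 =", r ** 2)
```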


Remark: Continuing the discussion following Eq. (6.49), let’s talk a little more about
measuring x_i and y_i relative to their means. Since the second expression for B in Eq. (6.49) tells
us that the y = Ax + B line takes the form of y = Ax + (⟨y⟩ − A⟨x⟩), it immediately follows
that the point (⟨x⟩,⟨y⟩) satisfies this equation. That is, the point (⟨x⟩,⟨y⟩) lies on the line.
(This is no surprise, and you might have just assumed it was true anyway.) Therefore, if we
shift the origin to the point (⟨x⟩,⟨y⟩), then the least-squares line is the line passing through
the origin with a slope given by the A in Eq. (6.47).

Note that measuring x_i and y_i relative to their means doesn’t affect A, because A is
independent of the averages ⟨x⟩ and ⟨y⟩. This is intuitively clear; shifting the blob of points
and the best-fit line around in the plane doesn’t affect the distances to the line, so it doesn’t
affect our derivation of A. Mathematically, this independence follows from the fact that both
Cov(x,y) and s_x in Eq. (6.47) are independent of ⟨x⟩ and ⟨y⟩. This is true because the
expressions in Eq. (6.11) and Eq. (3.60) involve only the differences between the x_i values and
their mean x̄ (likewise for y). And shifting all of the x_i values by a fixed amount changes x̄
by this same amount, so the differences x_i − x̄ are unaffected. ♣

6.10 Summary


• Let Y be given by Y = mX + Z, where X and Z are independent variables.
Then the correlation coefficient r between X and Y is defined as the fraction 
of the standard deviation of Y that comes from X. It is given by 


\[
r = \frac{m \sigma_x}{\sigma_y} = \frac{m \sigma_x}{\sqrt{m^2 \sigma_x^2 + \sigma_z^2}}.
\tag{6.52}
\]


It can also be written as 

\[
r = \frac{\mathrm{Cov}(X,Y)}{\sigma_x \sigma_y},
\tag{6.53}
\]

where the covariance is defined as

\[
\mathrm{Cov}(X,Y) \equiv E\big[ (X - \mu_x)(Y - \mu_y) \big].
\tag{6.54}
\]


• If you are instead just given a collection of data points in the x-y plane, without
knowing the underlying distributions, then Eq. (6.53) turns into

\[
r = \frac{\mathrm{Cov}(x,y)}{s_x s_y}
  = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2}\, \sqrt{\sum (y_i - \bar{y})^2}}.
\tag{6.55}
\]


• The higher the correlation, the greater the degree to which knowledge of X 
helps predict Y. A measure of how much better the prediction is (compared 
with the naive guess of the mean of Y) is 


\[
\frac{\sigma_z^2}{\sigma_y^2} = 1 - r^2.
\tag{6.56}
\]


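• Given r, along with σ_x and σ_y, the probability density in the x-y plane is

\[
p(x,y) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1 - r^2}}
  \exp\left[ -\frac{1}{2(1 - r^2)} \left( \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} - \frac{2rxy}{\sigma_x \sigma_y} \right) \right].
\tag{6.57}
\]

This density is symmetric in x and y (and σ_x and σ_y).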


• There are two regression lines. If Y is considered to be dependent on X, then
the lower regression line is relevant. This line gives the average value of Y
for any given X. Its slope is m = rσ_y/σ_x. If instead X is considered to be
dependent on Y, then the upper regression line is relevant. This line gives the
average value of X for any given Y. Its slope is σ_y/(rσ_x).


• If a group of people score y_1 points above the mean on a test, and if they take
the test (or an equivalent one) again, then their average score on the second
test will be y_avg = r² y_1 points above the mean (for a large number of data
points), where r is the correlation coefficient between the actual score and
innate ability. Since r ≤ 1, the average new score is closer to the mean than
the old score was (except in the r = 1 case of perfect correlation, where it is
the same). This effect is known as regression toward the mean.

• Given a set of data points in the x-y plane, the best-fit line is customarily
defined as the least-squares line. This line has slope

\[
A = \frac{\mathrm{Cov}(x,y)}{s_x^2},
\tag{6.58}
\]


which takes the same form as the slope m given in Eq. (6.13) for the lower 
regression line. 


6.11 Exercises 

See www.people.fas.harvard.edu/~djmorin/book.html for a supply of problems 
without included solutions. 


6.12 Problems 

Section 6.3: The correlation coefficient r 

6.1. Alternative forms of Cov(x,y) and s² *

(a) Show that the Cov(x,y) defined in Eq. (6.11) can be written as ⟨xy⟩ −
⟨x⟩⟨y⟩. (⟨x⟩ means the same thing as x̄.)

(b) Show that the s² defined in Eq. (3.60) can be written as ⟨x²⟩ − ⟨x⟩².

6.2. Rescaling X ** 

Using Eq. (6.9), we showed in the third remark on page 287 that the correlation
coefficient r doesn’t change with a uniform scaling of X or Y. Demonstrate
this again here by using the expression for r in Eq. (6.6).

6.3. Uncorrelated vs. independent ** 

If two random variables X and Y are independent, are they necessarily also 
uncorrelated? If they are uncorrelated, are they necessarily also independent? 

Section 6.5: Calculating p(x,y)

6.4. Sum of two Gaussians *** (calculus) 

Given two independent Gaussian distributions X and Y with standard deviations
σ_x and σ_y, show that the sum Z = X + Y is a Gaussian distribution with
standard deviation √(σ_x² + σ_y²). You may assume without loss of generality
that the means are zero.

6.5. Maximum p(x,y) * (calculus)

For a given y_0, what value of x maximizes the probability density p(x, y_0) in
Eq. (6.34)?

Section 6.8: Two regression examples

6.6. Distribution on a second test **

Consider the 70 people who scored (roughly) 130 on the IQ test in the example
in Section 6.8.1. If these people take a second test, describe the distribution
of the results. You can do this by finding the σ_x, σ_y, σ_z, m, r, μ_x, and
μ_y values associated with the scatter plot of the (X,Y) values.

6.7. One standard deviation above the mean ** 

Assume that for a particular test, the correlation coefficient between the score 
Y and innate ability X is r. Consider a person with an X value that is one 
standard deviation σ_x above the mean. What is the probability that this person
scores at least one standard deviation σ_y above the mean? Assume that all
distributions are Gaussian. (To give a numerical answer to this problem, you 
would need to be given r. And you would need to use a table or a computer. 
It suffices here to state the value of the standard deviation multiple that you 
would plug into the table or computer.) 

Section 6.9: Least-squares fitting 

6.8. Alternate form of B *

Show that the second expression for B in Eq. (6.49) equals the first. 

6.9. Finding all the quantities ** 

Given five (X,Y) points with values (2,1), (3,1), (3,3), (5,4), (7,6), calculate
(with a calculator) all of the quantities referred to in the five steps listed
on page 290. Also calculate the B in Eq. (6.49), and make a rough plot of the 
five given points along with the regression (least-squares) line. 

6.10. Equal distances ** (calculus) 

In Section 6.9 we defined the best-fit line as the line that minimizes the sum 
of the squares of the vertical distances from the given points to the line. Let’s 
kick things down a dimension and look at the 1-D case where we have n 
values x_i lying on the x axis. We’ll define the “best-fit” point as the value of
x (call it x_b) that minimizes the sum of the squares of the distances from the
n given x_i points to the x_b point.

(a) Show that x_b is the mean of the x_i values.

(b) Show that the sum of all the distances from x_b to the points with x_i > x_b
equals the sum of all the distances from x_b to the points with x_i < x_b.

6.11. Equal distances again ** (calculus) 

Returning to 2-D, show that the sum of all the vertical distances from the 
least-squares line to the points above it equals the sum of all the vertical distances
from the line to the points below it. Hint: Consider an appropriate
partial derivative of the sum S in Eq. (6.42).

6.13 Solutions


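6.1. Alternative forms of Cov(x,y) and s²


(a) Starting with the definition in Eq. (6.11), we have

\[
\begin{aligned}
\mathrm{Cov}(x,y) &= \frac{1}{n} \sum (x_i - \langle x \rangle)(y_i - \langle y \rangle) \\
&= \frac{1}{n} \Big( \sum x_i y_i - \sum x_i \langle y \rangle - \sum y_i \langle x \rangle + n \langle x \rangle \langle y \rangle \Big) \\
&= \frac{\sum x_i y_i}{n} - \frac{\sum x_i}{n} \langle y \rangle - \frac{\sum y_i}{n} \langle x \rangle + \langle x \rangle \langle y \rangle \\
&= \langle xy \rangle - \langle x \rangle \langle y \rangle - \langle y \rangle \langle x \rangle + \langle x \rangle \langle y \rangle \\
&= \langle xy \rangle - \langle x \rangle \langle y \rangle,
\end{aligned}
\tag{6.59}
\]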


as desired. In the limit of a very large number of data points, the above averages
reduce to the expectation values for the underlying distributions. That is,
⟨xy⟩ → E(XY), ⟨x⟩ → E(X) = μ_x, and ⟨y⟩ → E(Y) = μ_y. The above result
therefore reduces to Eq. (6.14).

(b) Starting with the definition in Eq. (3.60), we have 


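\[
\begin{aligned}
s^2 &= \frac{1}{n} \sum (x_i - \langle x \rangle)^2 \\
&= \frac{1}{n} \Big( \sum x_i^2 - 2 \sum x_i \langle x \rangle + n \langle x \rangle^2 \Big) \\
&= \frac{\sum x_i^2}{n} - 2 \frac{\sum x_i}{n} \langle x \rangle + \langle x \rangle^2 \\
&= \langle x^2 \rangle - 2 \langle x \rangle^2 + \langle x \rangle^2 \\
&= \langle x^2 \rangle - \langle x \rangle^2,
\end{aligned}
\tag{6.60}
\]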


as desired. As in part (a), in the limit of a very large number of data points, the
above averages reduce to the expectation values for the underlying distributions.
The above result therefore reduces to σ² = E(X²) − μ², which is equivalent
to Eq. (3.50). Eq. (6.60) is a special case of Eq. (6.59), when x = y. More
precisely, when each y_i equals the corresponding x_i, the covariance reduces to
the variance.
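
Both identities are easy to confirm numerically. The sketch below evaluates each side of Eqs. (6.59) and (6.60) on a small made-up data set, using the 1/n normalization that appears in the derivations above; the data values are assumptions chosen so that the averages come out to round numbers.

```python
def average(values):
    """Plain average, i.e. the bracket notation used in the text."""
    return sum(values) / len(values)

# Hypothetical data points, chosen only for illustration.
xs = [2.0, 4.0, 4.0, 7.0, 8.0]
ys = [1.0, 3.0, 2.0, 6.0, 8.0]
avg_x, avg_y = average(xs), average(ys)

# Covariance: definition versus the <xy> - <x><y> form of Eq. (6.59).
cov_definition = average([(x - avg_x) * (y - avg_y) for x, y in zip(xs, ys)])
cov_shortcut = average([x * y for x, y in zip(xs, ys)]) - avg_x * avg_y

# Variance: definition versus the <x^2> - <x>^2 form of Eq. (6.60).
var_definition = average([(x - avg_x) ** 2 for x in xs])
var_shortcut = average([x * x for x in xs]) - avg_x ** 2

print(cov_definition, cov_shortcut)
print(var_definition, var_shortcut)
```

With these points, ⟨x⟩ = 5 and ⟨y⟩ = 4, and both forms give a covariance of 28/5 = 5.6 and a variance of 24/5 = 4.8.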


6.2. Rescaling X 

If we let X' = aX and Y' = bY, what form does the Y = mX + Z relation in Eq. (6.2)
take when written in terms of X' and Y'? We need to generate some X' and Y' (that
is, some aX and bY) terms, so let’s multiply Y = mX + Z through by b, and let’s also
multiply the mX term by 1 in the form of a/a. This gives

\[
bY = \frac{bm}{a}\, aX + bZ \implies Y' = \frac{bm}{a} X' + bZ \implies Y' = m'X' + bZ,
\tag{6.61}
\]

where m' = bm/a. Note that Eq. (3.41) tells us that σ_x' = aσ_x and σ_y' = bσ_y.
Using the expression for r in Eq. (6.6), the correlation coefficient r' between X' and
Y' is then

\[
r' = \frac{m' \sigma_{x'}}{\sigma_{y'}} = \frac{(bm/a)(a \sigma_x)}{b \sigma_y} = \frac{m \sigma_x}{\sigma_y} = r,
\tag{6.62}
\]

as desired. Fig. 6.24 shows a scenario with a = 2 and b = 1. In the first plot, we have
chosen σ_x = 1, σ_y = 1, with r = 0.8. So the second plot has σ_x = 2, σ_y = 1, with r
again equaling 0.8.
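
The same invariance can be seen with data points. The sketch below uses the data-point form of r in Eq. (6.55), rather than the distribution form in Eq. (6.6), and rescales a made-up data set by a = 2 and b = 1, matching the stretch described for Fig. 6.24. The data and the function name are assumptions for illustration, and positive scale factors are assumed.

```python
import math

def correlation(xs, ys):
    """Data-point correlation coefficient, in the form of Eq. (6.55)."""
    n = len(xs)
    avg_x, avg_y = sum(xs) / n, sum(ys) / n
    num = sum((x - avg_x) * (y - avg_y) for x, y in zip(xs, ys))
    den_x = math.sqrt(sum((x - avg_x) ** 2 for x in xs))
    den_y = math.sqrt(sum((y - avg_y) ** 2 for y in ys))
    return num / (den_x * den_y)

# Hypothetical data points, chosen only for illustration.
xs = [-1.5, -0.5, 0.0, 0.4, 1.1, 2.0]
ys = [-1.0, -0.9, 0.3, 0.1, 0.8, 1.6]

a, b = 2.0, 1.0  # positive scale factors, as in the Fig. 6.24 scenario
xs_scaled = [a * x for x in xs]
ys_scaled = [b * y for y in ys]

r_original = correlation(xs, ys)
r_scaled = correlation(xs_scaled, ys_scaled)
print(r_original, r_scaled)
```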

Figure 6.24: The second plot is obtained by stretching the first plot by a factor of 2
in the X direction. The correlation coefficients for the two plots are the same.


6.3. Uncorrelated vs. independent 

Assume that the two variables are independent. Then we know from Eq. (3.16) that
the expectation value of the product equals the product of the expectation values. The
covariance in Eq. (6.14) (the expression in Eq. (6.7) would work just as well) therefore
becomes

\[
\mathrm{Cov}(X,Y) = E(XY) - \mu_x \mu_y = E(X)E(Y) - \mu_x \mu_y = 0,
\tag{6.63}
\]

because μ_x = E(X) and μ_y = E(Y). The correlation coefficient r in Eq. (6.9) is
then zero. The answer to the first question posed in this problem is therefore “yes.”
That is, if two random variables X and Y are independent, then they are necessarily
also uncorrelated. In short, the logic comes down to the fact that P(x,y) = P(x)P(y)
(which is the condition for independence; see Eq. (3.10)) implies via Theorem 3.2 that
E(XY) = E(X)E(Y) (which is the condition for Cov(X,Y) = 0; see Eq. (6.14)).

Now assume that the two variables are uncorrelated. It turns out that they are not 
necessarily independent. That is, E(XY) = E(X)E(Y) does not imply P(x,y) =
P(x)P(y). The quickest way to see why this is the case is to generate a counterexample.
Let X be a discrete random variable taking on the three values of −1, 0, and
1 with equal probabilities of 1/3. And let Y = |X|. Then the three points in the X-Y
plane shown in Fig. 6.25 all occur with equal probabilities of 1/3.


[Plot for Figure 6.25: the three points (−1, 1), (0, 0), and (1, 1) in the X-Y plane.]

Figure 6.25: If the three points shown have equal probabilities of 1/3, then X and Y
are uncorrelated and dependent.

You can quickly show that E(XY) and E(X) = μ_x are both equal to zero, which means
that the Cov(X,Y) in Eq. (6.63) is zero. (Consistent with this, we have E(XY) =
E(X)E(Y), with the common value being zero.) Therefore r = 0. However, the
P(x,y) = P(x)P(y) condition for independence is not satisfied, because, for example,
P(0,0) = 1/3 whereas P(0)P(0) = (1/3)(1/3) = 1/9. Intuitively, the variables
are clearly dependent, because if X = 0 then Y is guaranteed to be 0; so Y certainly
depends on X. The variables X and Y are therefore (linearly) uncorrelated but dependent.

For a counterexample involving continuous variables, let X be uniformly distributed
from −1 to 1, and let Y = X². Then you can quickly show that E(XY) and E(X) are
both equal to zero, which implies that r = 0. But X and Y certainly depend on each
other.
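
The discrete counterexample is small enough to check exactly. The sketch below encodes the three equally likely points with Y = |X| as a joint distribution, using exact fractions, and compares the covariance with the independence condition at the point (0, 0). The dictionary layout and helper name are illustrative choices.

```python
from fractions import Fraction

third = Fraction(1, 3)
# Joint distribution of the discrete counterexample: X in {-1, 0, 1}, Y = |X|.
joint = {(-1, 1): third, (0, 0): third, (1, 1): third}

def expectation(func):
    """Expectation value of func(x, y) over the joint distribution."""
    return sum(p * func(x, y) for (x, y), p in joint.items())

e_x = expectation(lambda x, y: x)
e_y = expectation(lambda x, y: y)
e_xy = expectation(lambda x, y: x * y)
covariance = e_xy - e_x * e_y

# Marginal probabilities of X = 0 and Y = 0, and the joint probability of both.
p_x0 = sum(p for (x, y), p in joint.items() if x == 0)
p_y0 = sum(p for (x, y), p in joint.items() if y == 0)
p_joint_00 = joint.get((0, 0), Fraction(0))

print("Cov(X,Y) =", covariance)
print("P(0,0) =", p_joint_00, " P(X=0)P(Y=0) =", p_x0 * p_y0)
```

The covariance is exactly zero, while P(0,0) = 1/3 differs from P(0)P(0) = 1/9, so the variables are uncorrelated but dependent.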

To sum up: If two random variables are independent, then they are uncorrelated. The 
contrapositive of this statement is also true: If two random variables are correlated, 
then they are dependent. However, the converses of the preceding two statements 
are not valid. That is, if two random variables are uncorrelated, then they are not 
necessarily independent. And if two random variables are dependent, then they are 
not necessarily correlated. These results are summarized in Table 6.2, which indicates
which combinations are possible. The only impossible combination is
correlated/independent. Remember that throughout this chapter, we are always talking
about linear correlation. 


                 Independent    Dependent

Uncorrelated     YES            YES

Correlated       NO             YES


Table 6.2: Relations between (un)correlation and (in)dependence. 


6.4. Sum of two Gaussians 

There is some overlap between this calculation and the one we did in Section 6.5 
when we derived p(x,y). We could actually make use of that result to save us some 
time here, but let’s work things out from scratch to get some practice. The solution 
we'll give here is a standard one involving integration. We’ll be a bit pedantic. Many 
treatments skip the initial material here and effectively just start with Eq. (6.68); see 
the fourth remark below. If you don't like the following (somewhat involved) solution, 
we'll present a slick geometric solution in the fifth remark below. 

Since X and Y are independent variables, the joint probability of picking an X value 
that lies in a little span dx around x and a Y value that lies in a little span dy around y 
equals the product of the probabilities, that is, (p_x(x) dx)(p_y(y) dy). In other words,
the probability p(x,y) dx dy (by the definition of p(x,y)) of picking X and Y values
that lie in a little area dx dy around the point (x,y) equals

\[
p(x,y)\,dx\,dy = p_x(x)\,p_y(y)\,dx\,dy.
\tag{6.64}
\]

Now, a line described by the equation x + y = C, where C is a constant, has a slope 
of -1. Therefore, the shaded strip in Fig. 6.26(a) shows the values of X and Y that 
yield values of Z = X + Y that lie within a range Δz around a given value of z. (We’ll
assume that z corresponds to the value at the middle of the strip, although it doesn’t
matter exactly how z is defined if Δz is small.) The total probability of obtaining a
point (x,y) that lies in the shaded strip is found by integrating the above expression
for p(x,y) over the strip:

\[
P(\text{lie in strip}) = \int_{\text{strip}} p(x,y)\,dx\,dy = \int_{\text{strip}} p_x(x)\,p_y(y)\,dx\,dy.
\tag{6.65}
\]



Figure 6.26: (a) The shaded strip indicates the values of X and Y that yield values of
Z = X + Y that lie within a range Δz around z. (b) A zoomed-in view of the shaded
area, divided into thin rectangles with width dx and height Δz.


Since the probability of obtaining an (x,y) point that lies in the strip is the same as the 
probability of obtaining a Z value that lies within a range Δz around z, and since the
latter probability is p_z(z) Δz by definition (assuming Δz is small), we have

\[
P(\text{lie in strip}) = p_z(z)\,\Delta z.
\tag{6.66}
\]

Our goal is therefore to calculate the integral in Eq. (6.65) and then equate it with
p_z(z) Δz. This will give us the distribution p_z(z), which we will find has the desired
Gaussian form.

In Fig. 6.26(b) we have divided the shaded strip into thin rectangles, each with width 
dx and height Δz. We will assume here that dx is much smaller than Δz, so in the
dx → 0 limit the thin rectangles exactly cover the strip. Since Z = X + Y, the y
value in a given rectangle is y = z − x. The y value actually varies by Δz within each
rectangle, but since Δz is small, we can say that y is essentially equal to z − x over
the whole (tiny) rectangle. The integral of p(x,y) over each tiny rectangle is therefore
equal to the (essentially) uniform value of p_x(x)p_y(z − x) times the area dx Δz:

\[
\int_{\text{rectangle}} p(x,y)\,dx\,dy = p_x(x)\,p_y(z - x)\,dx\,\Delta z.
\tag{6.67}
\]

In other words, the integration over y is simply a multiplication by Δz, at least for the
way we sliced up the strip into thin vertical rectangles.

We now need to perform the integration over x. That is, we need to integrate the result 
in Eq. (6.67) over all the little rectangles. This will give us the integral over the entire 
shaded strip, that is, it will give us p_z(z) Δz. Using the explicit Gaussian form of the
p’s from Eq. (4.42), with the means equal to zero, we obtain

\[
\begin{aligned}
p_z(z)\,\Delta z &= \int_{-\infty}^{\infty} p_x(x)\,p_y(z - x)\,dx\,\Delta z \\
\implies p_z(z) &= \frac{1}{\sqrt{2\pi}\,\sigma_x \sqrt{2\pi}\,\sigma_y}
  \int_{-\infty}^{\infty} e^{-x^2/2\sigma_x^2}\, e^{-(z-x)^2/2\sigma_y^2}\,dx.
\end{aligned}
\tag{6.68}
\]


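To evaluate this integral, we will complete the square in the exponent. This will require
some algebraic manipulation. With S ≡ σ_x² + σ_y², the exponent equals

\[
\begin{aligned}
-\frac{1}{2} \left( \frac{x^2}{\sigma_x^2} + \frac{(z - x)^2}{\sigma_y^2} \right)
&= -\frac{1}{2 \sigma_x^2 \sigma_y^2} \Big( (\sigma_x^2 + \sigma_y^2) x^2 - 2 \sigma_x^2 z x + \sigma_x^2 z^2 \Big) \\
&= -\frac{S}{2 \sigma_x^2 \sigma_y^2} \left( x^2 - \frac{2 \sigma_x^2 z}{S} x + \frac{\sigma_x^2 z^2}{S} \right) \\
&= -\frac{S}{2 \sigma_x^2 \sigma_y^2} \left( \left( x - \frac{\sigma_x^2 z}{S} \right)^2 + \frac{\sigma_x^2 z^2}{S} - \frac{\sigma_x^4 z^2}{S^2} \right) \\
&= -\frac{S}{2 \sigma_x^2 \sigma_y^2} \left( x - \frac{\sigma_x^2 z}{S} \right)^2 - \frac{z^2}{2S},
\end{aligned}
\tag{6.69}
\]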


as you can verify. When we plug this expression for the exponent back into Eq. (6.68),
the z²/2S term is a constant, as far as the x integration is concerned, so we can take
it outside the integral. The remaining x integral is a standard Gaussian integral given
by Eq. (4.118) in Problem 4.22, with b = S/(2σ_x²σ_y²). (The integral is centered at
σ_x²z/S instead of zero, but that doesn’t matter, because the limits are ±∞.) Eq. (6.68)
therefore becomes

\[
p_z(z) = \frac{1}{2\pi \sigma_x \sigma_y}\, e^{-z^2/2S} \sqrt{\frac{\pi}{S/(2 \sigma_x^2 \sigma_y^2)}}
       = \frac{1}{\sqrt{2\pi} \sqrt{\sigma_x^2 + \sigma_y^2}}\, e^{-z^2/2(\sigma_x^2 + \sigma_y^2)}.
\tag{6.70}
\]


This is a Gaussian distribution with standard deviation √(σ_x² + σ_y²), as desired.
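
The integral in the second line of Eq. (6.68) can also be evaluated numerically and compared with the Gaussian result. The sketch below uses a simple midpoint rule; the integration range, step count, and the values σ_x = 1.5 and σ_y = 2 (which give σ_z = 2.5) are arbitrary assumptions chosen for illustration.

```python
import math

def gaussian(x, sigma):
    """Zero-mean Gaussian density with standard deviation sigma."""
    return math.exp(-x * x / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def pz_by_integration(z, sigma_x, sigma_y, half_width=30.0, steps=6000):
    """Midpoint-rule estimate of the x integral of p_x(x) * p_y(z - x)."""
    dx = 2 * half_width / steps
    total = 0.0
    for k in range(steps):
        x = -half_width + (k + 0.5) * dx
        total += gaussian(x, sigma_x) * gaussian(z - x, sigma_y)
    return total * dx

sigma_x, sigma_y = 1.5, 2.0
sigma_z = math.sqrt(sigma_x ** 2 + sigma_y ** 2)  # 2.5 for these values

for z in (0.0, 1.0, 3.0):
    print(z, pz_by_integration(z, sigma_x, sigma_y), gaussian(z, sigma_z))
```

For each z, the two printed numbers should agree closely, which is consistent with Z being a Gaussian with standard deviation √(σ_x² + σ_y²).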


Remarks: 

1. If the means μ_x and μ_y aren’t zero, we can define new variables X' = X − μ_x
and Y' = Y − μ_y. These have zero means, so by the above reasoning, the sum
Z' = X' + Y' is a Gaussian with zero mean. The sum

\[
Z = X + Y = (X' + \mu_x) + (Y' + \mu_y) = Z' + (\mu_x + \mu_y)
\tag{6.71}
\]

is therefore a Gaussian with mean μ_x + μ_y.

2. Without doing any work, we already knew from Eq. (3.42) that the standard
deviation of Z is given by σ_z² = σ_x² + σ_y². (Standard deviations add in quadrature,
for independent variables.) But it took the above calculation to show that the
shape of the Z distribution is actually a Gaussian.

3. The result of this problem also holds for the difference of two Gaussians. That is,
if X and Y are independent Gaussians with standard deviations σ_x and σ_y, then
Z = X − Y is a Gaussian with standard deviation √(σ_x² + σ_y²). This follows from
writing Z as X + (−Y) and noting that −Y has the same standard deviation as Y,
namely σ_y. Note that the standard deviation of Z = X − Y is not √(σ_x² − σ_y²).

Consider the special case where Z is the difference between two independent
and identically distributed variables X_1 and X_2, each with standard deviation
σ_x. Then the preceding paragraph tells us that Z is a Gaussian with standard
deviation √2 σ_x. The incorrect √(σ_x² − σ_y²) answer mentioned above would
yield σ_z = 0, which certainly can’t be correct, because it would mean that Z is
guaranteed to take on one specific value.

4. A quicker and less rigorous solution to this problem is to say that if the sum 
X + Y takes on the particular value z, and if we are given x, then y must equal 
z - x. Integrating over x (to account for all of the different ways to obtain z) 
yields the second line in Eq. (6.68). So we can basically just start the solution 
with that equation. However, we chose to include all of the reasoning leading up 
to Eq. (6.68), because things can get confusing if you don’t clearly distinguish 
between probability densities, such as p_x(x) and p(x,y), and actual probabilities,
such as p_x(x) dx and p(x,y) dx dy. It can also get confusing if you don’t
distinguish between the different roles of dx and Δz. The former is an infinitesimal
integration variable, while the latter is the vertical width of the shaded strip
in Fig. 6.26. Although technically the definition of the probability density p_z(z)
in Eq. (4.2) requires that Δz be infinitesimal, we often think of it as simply being
small. 

5. There is a slick alternative geometric argument that shows why the sum Z of 
two independent Gaussian distributions X and Y is again a Gaussian. We’ll just 
sketch the idea here; you can fill in the gaps. Consider first the case where X 
and Y have the same standard deviation cr. Then 

\[
p(x,y) = p_x(x)\,p_y(y) \propto e^{-(x^2 + y^2)/2\sigma^2} = e^{-r^2/2\sigma^2},
\tag{6.72}
\]

where r is the radius in the x-y plane. Since p(x,y) depends only on r (and not
on the angle θ in the plane), we see that p(x,y) has circular symmetry.

As in our original solution, the values of p_z(z) for different values of z are
proportional to the integrals of p(x,y) over the various thin strips tilted at a 45°
angle shown in Fig. 6.27(a). We now note that due to the circular symmetry
of p(x,y), the integrals over the strips are unchanged if we rotate the figure
around the origin so that we end up with the vertical strips shown in Fig. 6.27(b).
But we know that the integrals over these vertical strips are proportional to the
original p_x(x) values, because integrating over all the y values in a strip just
leaves us with p_x(x) dx, by definition. Therefore, since p_x(x) is a Gaussian,
p_z(z) must be also. To determine σ_z, you can simply invoke Eq. (3.42), or you
can use the following reasoning. If the circle shown in Fig. 6.27 has x and y
intercepts of ±σ, then the rightmost strip in the right figure corresponds to one
standard deviation σ_z of the Z distribution, because this strip corresponds to one
standard deviation σ_x of the X distribution. But from the left figure, this strip is
associated with the z value of √2 σ, because the point (x,y) = (σ/√2, σ/√2)
lies in the strip, which means that the corresponding z value is z = x + y = √2 σ.
Hence σ_z = √2 σ.

Figure 6.27: (a) Each value of z corresponds to a particular shaded strip. (b)
Due to the circular symmetry of p(x,y), the integrals of p(x,y) over the strips
aren’t affected by a rotation in the plane. Therefore, since the vertical strips
in the right figure yield the Gaussian p_x(x) distribution, the diagonal strips
associated with p_z(z) in the left figure must also yield a Gaussian distribution.


More generally, if X and Y have different standard deviations, then we have 
elliptical instead of circular symmetry in the plane. But if we stretch/squash one 
of the axes by the appropriate factor, we obtain circular symmetry, whereupon 
the above argument holds. It takes a little work to show geometrically that 
σ_z² = σ_x² + σ_y². But again, you can find σ_z by simply invoking Eq. (3.42). ♣


6.5. Maximum p(x,y) 


First solution: Given y, we can maximize p(x,y) by taking the partial derivative
with respect to x. The exponent in Eq. (6.34) contains all of the dependence on x and
y. Taking the partial derivative of the exponent with respect to x, and setting the result
equal to zero, gives (after dividing out the overall constant factor in the exponent)

\[
0 = \frac{2x}{\sigma_x^2} - \frac{2ry}{\sigma_x \sigma_y} \implies x = \frac{r \sigma_x}{\sigma_y}\, y.
\tag{6.73}
\]

In the case at hand where y = y_0, we see that p(x,y) is maximized when x =
(rσ_x/σ_y)y_0.
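
The location of the maximum can be confirmed with a brute-force search. The sketch below assumes that the density in Eq. (6.34) has the zero-mean form written in Eq. (6.57), and scans a grid of x values along the line y = y_0. The parameter values, grid range, and grid spacing are arbitrary assumptions.

```python
import math

def density(x, y, sigma_x, sigma_y, r):
    """Joint density in the form of Eq. (6.57), with zero means."""
    norm = 2 * math.pi * sigma_x * sigma_y * math.sqrt(1 - r ** 2)
    quad = (x ** 2 / sigma_x ** 2 + y ** 2 / sigma_y ** 2
            - 2 * r * x * y / (sigma_x * sigma_y))
    return math.exp(-quad / (2 * (1 - r ** 2))) / norm

# Hypothetical parameter values, chosen only for illustration.
sigma_x, sigma_y, r, y0 = 2.0, 1.0, 0.6, 1.5

# Scan x from -10 to 10 in steps of 0.001 along the line y = y0.
grid = [-10 + 0.001 * k for k in range(20001)]
x_best = max(grid, key=lambda x: density(x, y0, sigma_x, sigma_y, r))

x_formula = r * sigma_x / sigma_y * y0  # Eq. (6.73) with y = y0
print(x_best, x_formula)
```

For these values Eq. (6.73) gives x = (0.6)(2)(1.5)/1 = 1.8, and the grid search should land on the grid point closest to that value.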


Second solution: We claim that the desired value of x is given by the intersection
of the horizontal y = y_0 line with the upper regression line. Since we know from
Fig. 6.15 that the equation for this line is y = (σ_y/rσ_x)x, we obtain

\[
y_0 = \frac{\sigma_y}{r \sigma_x}\, x \implies x = \frac{r \sigma_x}{\sigma_y}\, y_0,
\tag{6.74}
\]

in agreement with the first solution.

The above claim can be justified as follows. As we saw in Section 6.5, the curves of 
constant p(x,y) are ellipses. Two are shown in Fig. 6.28. The larger the ellipse, the 
smaller the value of p(x,y). The smallest ellipse that contains a point with a y value of
y_0 is the inner ellipse shown in the figure. This ellipse is tangent to the horizontal line
y = y_0. The value of p(x,y) at the point B shown is larger than the value at points A
and C, because these points lie on a larger ellipse. The point B therefore has the largest
value of p(x,y) among all points on the horizontal line y = y_0; all other points lie on
ellipses that are larger than the “B” ellipse. Our goal is therefore to find the x value
of the point B. But point B, being the highest point on the ellipse on which it lies, is
located on the upper regression line, because this line passes through the highest and
lowest points of every ellipse. That is how we defined the upper regression line at the
beginning of Section 6.7. This proves the above claim.


Figure 6.28: The value of p(x,y) at B is larger than at A and C, because B lies on a
smaller ellipse. B therefore has the largest p(x,y) among all points on the line y = y_0.

Since we know from Section 6.7 that the upper regression line gives the average value 
of x associated with a given y, we can phrase the result of this problem as: For a given
y_0, the value of x that maximizes p(x,y_0) is the average value of x associated with y_0.

6.6. Distribution on a second test 

Fig. 6.29 shows what the scores on a second test might look like. We have increased 
the number of points from 70 to 700, just to smooth out the scatter plot. (You can 
pretend that there were 50,000 points in Fig. 6.20.) Although the 700 people all scored 
the same on the first test, they certainly won’t all score the same on the second test. 
However, if these 700 people were to take a third test (or any number of additional 
tests), their scores would look the same (on average) as they do in Fig. 6.29. We found 
in Section 6.8.1 that both the average innate ability and the average score of this group 
of people on the second test is 115. So if we measure X and Y relative to 100, the blob 
of points is centered at (μ_x, μ_y) = (15,15).

What standard deviations did we use in numerically generating Fig. 6.29? σ_z still
has a value of 15/√2 = 10.6 (from the paragraph preceding Eq. (6.38)). That never
changes, since Z is independent of X. But σ_x^new now equals σ_x^old √(1 − r²), because
our 70 (or 700) points all came from a horizontal strip in Fig. 6.20, and the superscript
on the Z in Eq. (6.36) tells us that σ_x^old √(1 − r²) is the standard deviation of the spread
of points in any horizontal strip. Since we know from Section 6.8.1 that r = 1/√2
and σ_x^old = 15/√2, we have

\[
\sigma_x^{\text{new}} = \sigma_x^{\text{old}} \sqrt{1 - r^2} = \frac{15}{\sqrt{2}} \sqrt{1 - \frac{1}{2}} = 7.5.
\tag{6.75}
\]


Figure 6.29: The second-test scores of 700 people with the same distribution of innate
abilities as the 70 people in the shaded strip in Fig. 6.20.

So the values we used in generating Fig. 6.29 were σ_x^new = 7.5 and σ_z = 10.6. And
m still equals 1, because the relation Y = X + Z still holds; the change in the spread
of the X values of the people we happen to be looking at doesn’t affect this relation.
What is the standard deviation of the Y values in Fig. 6.29? From Eq. (6.5) we have


\[
\sigma_y = \sqrt{m^2 (\sigma_x^{\text{new}})^2 + \sigma_z^2} = \sqrt{(1)^2 (7.5)^2 + (10.6)^2} = 13.
\tag{6.76}
\]


This is smaller than the σ_y = 15 value in Fig. 6.20, because the smaller spread in the X
values affects Eq. (6.5) via σ_x. From Eq. (6.6) the correlation coefficient for Fig. 6.29
is

\[
r = \frac{m \sigma_x^{\text{new}}}{\sigma_y} = \frac{(1)(7.5)}{13} = 0.58.
\tag{6.77}
\]


If you work out the numbers exactly, r turns out to be $1/\sqrt{3}$. This is smaller than the
$r = 1/\sqrt{2}$ value in Eq. (6.38) for Fig. 6.20, because a smaller fraction of $\sigma_y$ comes
from $\sigma_x$ (since $\sigma_x$ is smaller). A larger fraction of $\sigma_y$ comes from $\sigma_z$ (which is
unchanged). To summarize, in addition to $(\mu_x, \mu_y) = (15, 15)$, the values associated
with Fig. 6.29 are

$$\sigma_x^{\rm new} = 7.5, \quad \sigma_y^{\rm new} = 13, \quad \sigma_z = 10.6, \quad m = 1, \quad r^{\rm new} = 0.58. \qquad (6.78)$$
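The numbers in Eqs. (6.75) through (6.78) can be checked with a few lines of Python. This is only a sketch of the arithmetic above; the variable names are our own, and the input values are the ones quoted in the text for Fig. 6.20.

```python
import math

# Values quoted in the text for Fig. 6.20 (scores measured relative to 100).
r_old = 1 / math.sqrt(2)
sigma_x_old = 15 / math.sqrt(2)
sigma_z = 15 / math.sqrt(2)   # unchanged, since Z is independent of X
m = 1.0                       # slope in the relation Y = X + Z

# Spread of X within a horizontal strip, as in Eq. (6.75).
sigma_x_new = sigma_x_old * math.sqrt(1 - r_old**2)
# Standard deviation of Y, as in Eq. (6.76).
sigma_y_new = math.sqrt(m**2 * sigma_x_new**2 + sigma_z**2)
# Correlation coefficient, as in Eq. (6.77).
r_new = m * sigma_x_new / sigma_y_new

print(round(sigma_x_new, 2), round(sigma_y_new, 2), round(r_new, 2))
print(abs(r_new - 1 / math.sqrt(3)) < 1e-12)   # checks the exact value 1/sqrt(3)
```

Working with the unrounded inputs shows why the exact correlation coefficient is $1/\sqrt{3}$ rather than the rounded 0.58.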


Remark: The regression line shown in Fig. 6.29 passes through the origin. Although 
this isn’t the case in general when the blob of points isn’t centered at the origin, it is 
the case here for the following reason. We know that the center of the blob lies on the 
lower regression line in Fig. 6.20; that’s how we found the center, after all. And the 
regression line in Fig. 6.29 has the same slope (namely m = 1) as the lower regression 
line in Fig. 6.20, because both plots are governed by the same relation, Y = X + Z. So 
the line must pass through the origin in Fig. 6.29. Another way to see why this is true 
is to recall that the regression line gives the average score for each value of X. And
the average score of an X = 0 person is still Y = 0 (where 0 really means 100 here),
because the Z values average out to zero; this doesn’t depend on which figure we’re 
looking at. * 

6.7. One standard deviation above the mean 

The expected score of a person with any particular value of X is given by the associated
point on the lower regression line. This line takes the form of Y = mX, where
$m = r\sigma_y/\sigma_x$ from Fig. 6.15. (We’ll work with a general m here, even though m = 1 in our
test-taking setups with Y = X + Z.) The expected score (relative to the mean score) of
someone with an X value of $\sigma_x$ (relative to the mean innate ability) is therefore

$$Y = \frac{r\sigma_y}{\sigma_x}\cdot\sigma_x = r\sigma_y. \qquad (6.79)$$

This is just the $r\sigma_y$ vertical distance shown in Fig. 6.16. To find the probability that
the person achieves a score of at least $\sigma_y$, note that $\sigma_y$ exceeds the expected test score
of $r\sigma_y$ by

$$\sigma_y - r\sigma_y = \sigma_y(1 - r). \qquad (6.80)$$

This is indicated in Fig. 6.30. We have drawn the standard-deviation box for clarity.

Figure 6.30: The expected Y value associated with an X value of $\sigma_x$ is $Y = r\sigma_y$, and
$\sigma_y$ exceeds this expected Y value by $(1 - r)\sigma_y$.

Now, since Y = mX + Z, the probability distribution of anyone’s score is centered on
the associated point on the lower regression line and has a standard deviation of $\sigma_z$.
But $\sigma_z = \sigma_y\sqrt{1 - r^2}$ from Eq. (6.18). So for our given person with $X = \sigma_x$, a score
of $\sigma_y$ exceeds the expected score of $r\sigma_y$ by

$$\frac{\sigma_y(1 - r)}{\sigma_y\sqrt{1 - r^2}} = \sqrt{\frac{1 - r}{1 + r}} \qquad (6.81)$$

of the $\sigma_z$ standard deviations.

To produce a numerical answer to this problem, we must be given a numerical value
for r. For example, if r = 0.5, then $\sqrt{(1 - r)/(1 + r)} = 0.58$. From a table or
computer, it can be shown that the probability of lying outside of 0.58 standard
deviations from the mean is 0.56 (assuming a Gaussian distribution). But we must
divide by 2 because we’re concerned only with the upper tail of the Gaussian. So the
desired probability is 0.28. The situation is shown in Fig. 6.31. If r = 0.5 then
$\sigma_z = \sigma_y\sqrt{1 - (0.5)^2} = (0.87)\sigma_y$. We have arbitrarily chosen $\sigma_y = \sigma_x$ in the figure
(and we have set them both equal to 1; the standard-deviation box is shown), but this
doesn’t affect our results. If r = 0.5, then $\sigma_z = (0.87)\sigma_y$, no matter how $\sigma_x$ and
$\sigma_y$ are related. The $\sigma_z = 0.87$ standard deviation is indicated by the heavy arrows,
centered on the expected value given by the lower regression line. A visual inspection
of the figure is consistent with the fact that 28% of the dots in the vertical shaded strip
are expected to lie above the white dot with height $Y = \sigma_y$. In the present example
with r = 0.5, both of the $r\sigma_y$ and $(1 - r)\sigma_y$ vertical distances in Fig. 6.30 are equal
to $(0.5)\sigma_y$. The upper of these (identical) distances is therefore (0.5)/(0.87) = 0.58
times $\sigma_z$, as we found above by plugging r = 0.5 into Eq. (6.81).



Figure 6.31: If r = 0.5, then $\sigma_z = (0.87)\sigma_y$. This standard deviation is indicated by
the heavy arrows, centered on the lower regression line. A person with $X = \sigma_x$ has a
28% chance of scoring above $Y = \sigma_y$, indicated by the white dot. The square is the
standard-deviation box, with $\sigma_x$ and $\sigma_y$ arbitrarily chosen to be 1.


Remark: We can check some limits of the $\sqrt{(1 - r)/(1 + r)}$ result in Eq. (6.81). If
r = 0 (no correlation), then Eq. (6.81) reduces to 1 of the $\sigma_z$ standard deviations. This
makes sense, because the Y = mX + Z relation reduces to Y = Z when the correlation
coefficient r is zero (which comes about by having $m \to 0$ or $\sigma_z \gg \sigma_x$ in Eq. (6.6)).
In this case, the lower regression line has slope zero, which means that it is simply
the X axis. So a score of $\sigma_y$ ($= \sigma_z$) above the overall mean of Y (which is zero)
is the same as a score of one $\sigma_z$ standard deviation above the regression line (the X
axis). The desired probability is then 0.16, because this is half of the 1 - 0.68 = 0.32
probability of lying outside of one standard deviation.

If r = 1 (perfect correlation), then Eq. (6.81) reduces to 0 of the $\sigma_z$ standard
deviations. The desired probability is therefore 1/2, because that is the probability of
exceeding the mean of a Gaussian distribution (which is equivalent to lying outside
of zero standard deviations from the mean). This result isn’t so obvious, because the
two relevant quantities in Eq. (6.81) (namely, the distance $\sigma_y(1 - r)$ in Fig. 6.30, and
the standard deviation $\sigma_z = \sigma_y\sqrt{1 - r^2}$) both go to zero as r approaches 1. But in
the $r \to 1$ limit, the distance $\sigma_y(1 - r)$ goes to zero faster than the standard deviation
$\sigma_z = \sigma_y\sqrt{1 - r^2}$. So in Fig. 6.30, $\sigma_y$ exceeds $r\sigma_y$ by essentially zero of the $\sigma_z$
standard deviations. *
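As a rough check of the three cases discussed in this solution (r = 0, r = 0.5, and r = 1), the following sketch evaluates Eq. (6.81) and the corresponding upper-tail Gaussian probability. The function names are our own, and the Gaussian assumption is the same one made in the text; the tail probability is computed from the standard library's complementary error function rather than from a table.

```python
import math

def upper_tail(k):
    """Probability that a standard Gaussian variable exceeds k."""
    return 0.5 * math.erfc(k / math.sqrt(2))

def prob_at_least_sigma_y(r):
    """Probability that a person with X = sigma_x scores at least sigma_y.

    Uses Eq. (6.81): the score sigma_y lies sqrt((1 - r)/(1 + r)) of the
    sigma_z standard deviations above the expected score. Requires r > -1.
    """
    k = math.sqrt((1 - r) / (1 + r))
    return upper_tail(k)

for r in (0.0, 0.5, 1.0):
    print(r, round(prob_at_least_sigma_y(r), 2))
```

The three printed probabilities can be compared with the 0.16, 0.28, and 1/2 values obtained above.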

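6.8. Alternate form of B

If we plug the first expression for A from Eq. (6.47) into the second expression for B
in Eq. (6.49), we obtain

$$B = \langle y\rangle - A\langle x\rangle$$
$$= \langle y\rangle - \left(\frac{\langle xy\rangle - \langle x\rangle\langle y\rangle}{\langle x^2\rangle - \langle x\rangle^2}\right)\langle x\rangle$$
$$= \frac{\big(\langle y\rangle\langle x^2\rangle - \langle y\rangle\langle x\rangle^2\big) - \big(\langle xy\rangle\langle x\rangle - \langle x\rangle^2\langle y\rangle\big)}{\langle x^2\rangle - \langle x\rangle^2}$$
$$= \frac{\langle y\rangle\langle x^2\rangle - \langle x\rangle\langle xy\rangle}{\langle x^2\rangle - \langle x\rangle^2}, \qquad (6.82)$$

which is correctly the first expression for B in Eq. (6.49).

6.9. Finding all the quantities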


The means are

$$\bar{x} = \frac{2 + 3 + 3 + 5 + 7}{5} = 4 \quad \text{and} \quad \bar{y} = \frac{1 + 1 + 3 + 4 + 6}{5} = 3. \qquad (6.83)$$

The standard deviations are then

$$s_x = \sqrt{\frac{(2-4)^2 + (3-4)^2 + (3-4)^2 + (5-4)^2 + (7-4)^2}{5}} = 1.79,$$
$$s_y = \sqrt{\frac{(1-3)^2 + (1-3)^2 + (3-3)^2 + (4-3)^2 + (6-3)^2}{5}} = 1.90. \qquad (6.84)$$

The covariance is

$$\text{Cov}(x, y) = \frac{(2-4)(1-3) + (3-4)(1-3) + (3-4)(3-3) + (5-4)(4-3) + (7-4)(6-3)}{5} = 3.2. \qquad (6.85)$$

The correlation coefficient r is then

$$r = \frac{\text{Cov}(x, y)}{s_x s_y} = \frac{3.2}{(1.79)(1.90)} = 0.94. \qquad (6.86)$$

The slope m of the lower regression line is

$$m = \frac{r s_y}{s_x} = \frac{(0.94)(1.9)}{1.79} = 1.0. \qquad (6.87)$$

Equivalently, Eq. (6.47) gives the slope A (which equals m) as

$$A = \frac{\text{Cov}(x, y)}{s_x^2} = \frac{3.2}{(1.79)^2} = 1.0. \qquad (6.88)$$

It turns out that the A = m slope of the regression (least-squares) line is exactly equal
to 1, as we will see below.


If we want to use the first expression for B in Eq. (6.49), we must calculate $\langle x^2\rangle$ and
$\langle xy\rangle$. You can quickly show that these values are 19.2 and 15.2, respectively. So B
equals

$$B = \frac{\langle y\rangle\langle x^2\rangle - \langle x\rangle\langle xy\rangle}{\langle x^2\rangle - \langle x\rangle^2} = \frac{(3)(19.2) - (4)(15.2)}{19.2 - 4^2} = -1. \qquad (6.89)$$

This result is exact. Alternatively and more quickly, the second expression for B
in Eq. (6.49) gives $B = \langle y\rangle - A\langle x\rangle = 3 - (1)(4) = -1$. Fig. 6.32 shows the line
$y = Ax + B \Longrightarrow y = x - 1$ superimposed on the plot of the five given points. We see
that the line passes through three of the points. In retrospect, it is clear that we can’t 
do any better than this line when minimizing the sum of the squares of the vertical 
distances from the points to the line. This is true because for the three points on the 
line, we can’t do any better than zero distance. And for the two points (3,1) and (3,3) 
off the line, we can’t do any better than having the line pass through the point (3,2) 
midway between them. (As an exercise, you can prove this.) In most setups, however, 
the location of the least-squares line isn’t so obvious. The small number of points in 
this problem just happened to be located very nicely with respect to each other. 
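The quantities in this solution can be reproduced numerically. The sketch below uses only the five data points that appear in the sums above; the helper name `mean` and the variable names are our own. Like the formulas above, it divides by n = 5 (not n - 1) in the standard deviations and the covariance.

```python
import math

# The five given points, read off from the sums in Eqs. (6.83)-(6.85).
xs = [2, 3, 3, 5, 7]
ys = [1, 1, 3, 4, 6]

def mean(values):
    return sum(values) / len(values)

x_bar, y_bar = mean(xs), mean(ys)
s_x = math.sqrt(mean([(x - x_bar) ** 2 for x in xs]))
s_y = math.sqrt(mean([(y - y_bar) ** 2 for y in ys]))
cov = mean([(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)])

r = cov / (s_x * s_y)    # correlation coefficient
A = cov / s_x**2         # slope of the least-squares line
B = y_bar - A * x_bar    # intercept, from the second expression in Eq. (6.49)

print(x_bar, y_bar, round(s_x, 2), round(s_y, 2), round(cov, 2))
print(round(r, 2), round(A, 6), round(B, 6))
```

Because the script keeps full precision instead of the rounded 1.79 and 1.90, the slope comes out as 1 and the intercept as -1 up to floating-point error.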


Figure 6.32: The five given points, along with the regression (least-squares) line.
(Axes: x and y; the line is labeled y = Ax + B = x - 1.)


6.10. Equal distances 

(a) Given the $x_i$ values, we want to find the value of $x_b$ that minimizes the sum,

$$S = \sum_{i=1}^{n} (x_i - x_b)^2. \qquad (6.90)$$

To do this, we just need to set the derivative $dS/dx_b$ equal to zero. This gives

$$0 = \frac{dS}{dx_b} = -2\sum (x_i - x_b)$$
$$\Longrightarrow \quad 0 = -\Big(\sum x_i\Big) + n x_b$$
$$\Longrightarrow \quad x_b = \frac{\sum x_i}{n} \equiv \bar{x}. \qquad (6.91)$$


(b) The first line in Eq. (6.91) tells us that $\sum (x_i - x_b) = 0$. In words, it tells us that
the sum of the signed differences from $x_b$ to all of the $x_i$ points equals zero. The
points with $x_i > x_b$ yield positive differences, while the points with $x_i < x_b$
yield negative differences. If the sum of the former set of differences is d, then
the sum of the latter must be -d, so that the sum of all the differences is zero.
If we now convert the previous sentence to a statement about distances (which
are the absolute values of the signed differences, and hence always positive), we
see that d is the sum of the distances from $x_b$ to the points with $x_i > x_b$, and d
is also the sum of the distances from $x_b$ to the points with $x_i < x_b$. These two
sums are therefore equal, as desired.

Combining the results in parts (a) and (b), we see that the mean $\bar{x}$ has two
important properties: (1) the sum of the squares of the distances from $\bar{x}$ to the
n given points is smaller for $\bar{x}$ than for any other value, and (2) the sum of the
distances from $\bar{x}$ to the points above it equals the sum of the distances from $\bar{x}$ to
the points below it.

Note that our definition of the “best-fit” point in terms of the minimum sum of
the squared distances is essentially the same as the definition we used in Section 6.4
for the “badness” of a prediction. Both definitions involve the variance.
But they differ in that the badness definition involves an expectation value over
points that you will pick in the future, whereas the best-fit point involves an average
over points that you have already picked; there is no expecting going on
in this case. However, if you pick a very large number of points from a given
distribution, then the best-fit point $\bar{x}$ will be very close to the mean $\mu$ of the
distribution (which is the point with the least badness).
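The two properties of the mean found in parts (a) and (b) can be illustrated numerically. In the sketch below, the sample values and the grid of candidate points are arbitrary choices of ours, made only for illustration; a grid search is of course not a proof, just a sanity check of the calculus result.

```python
xs = [2, 3, 3, 5, 7]                 # arbitrary sample values for illustration
x_bar = sum(xs) / len(xs)

def sum_of_squares(x_b):
    """Sum of the squared distances from x_b to the given points."""
    return sum((x - x_b) ** 2 for x in xs)

# Property (1): among candidates 0.00, 0.01, ..., 10.00, the mean gives the
# smallest sum of squared distances.
candidates = [i / 100 for i in range(0, 1001)]
best = min(candidates, key=sum_of_squares)

# Property (2): total distance to points above the mean equals total
# distance to points below it.
above = sum(x - x_bar for x in xs if x > x_bar)
below = sum(x_bar - x for x in xs if x < x_bar)

print(x_bar, best, above, below)
```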


Remark: Why did we define the best-fit point to be the point that minimizes 
the sum of the squares of the distances? Why not define it to be the point that 
just minimizes the sum of the distances (not squared)? There are two reasons 
why this latter definition isn’t ideal. First, distances involve absolute values like
$|x_i - x_b|$, and absolute values are somewhat messy to deal with mathematically.
They involve two cases: If z is positive then |z| = z, but if z is negative then
|z| = -z. Squares, on the other hand, are automatically positive (or zero).
Second, the point that minimizes the sum of the distances is simply not the point
that most people would consider to be the best-fit point, because this point turns
out not to be the mean, but rather the median (see below). The median is defined
to be the point for which half of the $x_i$ lie above it and half lie below it, with
no regard for how far the various points are above or below. For example, if
we have five $x_i$ values, 2, 3, 5, 90, 100, then the median is 5 and the mean is
40. Most people would probably say that the mean 40 is more indicative of 
these five numbers than the median 5. The median doesn't take into account the 
spacing between the numbers. 

To show that the above minimum-distance-sum (not squared) definition of the
best-fit point leads to the median, we can give a quick proof by contradiction.
Assume that the best-fit point $x_b$ is not the median. Then there are $n_1$ points
below $x_b$ and $n_2$ points above $x_b$, where $n_1 \neq n_2$. If $n_1 > n_2$ (the $n_1 < n_2$ case
proceeds similarly), we can decrease the sum of all the distances by decreasing
$x_b$ slightly by, say, d. This will decrease $n_1$ distances by d and increase $n_2$
distances by d. And since $n_1 > n_2$, the overall sum of the distances will decrease.
This contradicts the fact that $x_b$ was assumed to yield the minimum sum. The
only way to escape this contradiction is for $n_1$ to equal $n_2$. That is, $x_b$ is the
median. If the number of points is odd, then $x_b$ equals the middle point. If the
number is even, then $x_b$ can lie anywhere between the middle two points. *
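The claim in this remark can be checked on the five values 2, 3, 5, 90, 100 used above. The following sketch scans integer candidate points (an arbitrary choice of ours, sufficient here because the median is an integer) and finds the one with the smallest sum of distances.

```python
import statistics

xs = [2, 3, 5, 90, 100]

def sum_of_distances(x_b):
    """Sum of the (unsquared) distances from x_b to the given points."""
    return sum(abs(x - x_b) for x in xs)

# Scan the integer candidates 0, 1, ..., 100.
best = min(range(0, 101), key=sum_of_distances)

print(best, statistics.median(xs), statistics.mean(xs))
print(sum_of_distances(5), sum_of_distances(40))
```

The minimizing point is the median 5, with a distance sum of 185, whereas the mean 40 gives the larger sum 220.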

6.11. Equal distances again 

If we take the partial derivative of the sum in Eq. (6.42) with respect to B, we obtain

$$0 = \frac{\partial S}{\partial B} = -2\sum \big[y_i - (Ax_i + B)\big]. \qquad (6.92)$$

The $y_i - (Ax_i + B)$ terms here are the signed vertical differences between the given
points and the line. The above equation therefore says that the sum of these signed
distances is zero. This is exactly analogous to the fact that the sum $\sum (x_i - x_b)$ equaled
zero in part (b) of Problem 6.10. So by the same reasoning we used there, we see that
the sum of the vertical distances above the line equals the sum of the vertical distances
below the line.

Note that the partial derivative of S with respect to A is $-2\sum x_i\,[y_i - (Ax_i + B)]$. We
can’t conclude much from this, due to the $x_i$ factor, which makes the terms in the sum
not be the signed vertical differences.
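As a concrete illustration, we can reuse the five points and the least-squares line y = x - 1 from Problem 6.9 and check both derivative conditions directly. This is only a sketch; the variable names are our own.

```python
# The five points and least-squares line (A = 1, B = -1) from Problem 6.9.
xs = [2, 3, 3, 5, 7]
ys = [1, 1, 3, 4, 6]
A, B = 1, -1

# Signed vertical differences between the points and the line.
residuals = [y - (A * x + B) for x, y in zip(xs, ys)]

above = sum(d for d in residuals if d > 0)    # total distance above the line
below = -sum(d for d in residuals if d < 0)   # total distance below the line

print(residuals)                                      # signed differences
print(sum(residuals))                                 # condition from dS/dB = 0
print(sum(x * d for x, d in zip(xs, residuals)))      # condition from dS/dA = 0
print(above, below)
```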

Remark: As in the remark in the solution to Problem 6.10, minimizing the sum of the 
distances (instead of their squares) is generally an inferior way to define the best-fit 
line. By the same reasoning we used in the 1-D case, this definition leads to a line that 
has half of the given points above it, and half below it, with no regard for how far the 
various points are above or below. Most people wouldn't consider such a line to be 
the line that best describes the given set of points. *

Appendices


7.1 Appendix A: Subtleties about probability 

In this appendix we will discuss a number of subtle issues with probability. This 
material isn’t necessary for the content in this book, so it can be skipped on a first 
reading. 

Determining probabilities 

How do you determine the probability that a given event will occur? There are two 
ways: You can calculate it theoretically, or you can estimate it experimentally by 
performing a large number of trials of the process. 

We can use a theoretical argument to determine, for example, the probability of 
obtaining Heads on a coin toss. There is no need to actually perform a coin toss, 
because it suffices to just think about it and note that the two possibilities of Heads 
and Tails are equally likely (assuming a fair coin). Each possibility must therefore 
occur half the time, which means that the probability of each is 1/2. Similar reasoning
gives probabilities of 1/6 for each of the six possible rolls of a die (assuming a
fair die). 

However, there are certainly many situations where we don’t have enough information
to calculate the probability by theoretical means. In these cases we have no
choice but to perform a large number of trials and then assume that the true probability
is roughly equal to the fraction of events that occurred. For example, let’s
say that you take a bus to school or work, and that sometimes the bus is early and 
sometimes it’s late. What is the probability that it is early? There are countless 
things that influence the bus’s timing: traffic, weather, engine issues, delays caused 
by other passengers, slow service at a restaurant the night before which caused the 
driver to see a later movie than planned which caused him to go to bed later than 
usual and hence get up later than usual which caused him to start the route two minutes
late, and so on and so forth. It is clearly hopeless to try to incorporate all of
these effects into some sort of theoretical reasoning that produces a result that can
be trusted. The only option is to observe what happens during a reasonably large
number of days, and to assume that the fraction of early arrivals that you observe is
roughly the desired probability. If the bus is early on 20 out of 50 days, then we can
say that the probability of being early is most likely somewhere around 40%.

Of course, having generated this result of 40%, it just might happen that a construction
project on the route starts the next day, which makes the bus late every day
for the next three months. So probabilities based on observation should be taken 
with a grain of salt! 

A similar situation arises with, say, basketball free-throw percentages. There 
is absolutely no hope of theoretically calculating the probability of a certain player 
hitting a free throw, because it would require knowing everything that’s going on 
from the thoughts in her head to the muscles in her fingers to the air currents on the 
way to the basket. All we can say is that if the player has hit a certain fraction of the 
free throws she’s already attempted, then that’s our best guess for the probability of 
hitting free throws in the future. 

True randomness 

We stated above that the probability of a coin toss resulting in Heads is 1/2. The 
reasoning was that Heads and Tails should have equal probabilities if everything is 
random, which means that they must each be 1/2. But is the toss truly random? 
What if we know the exact torque and force that you apply to the coin? We can then 
know exactly how fast it spins and how long it stays in the air (assuming that we 
know the density and viscosity of the air, etc.). And if we know the makeups of the 
ground and the coin, we can figure out exactly how the coin bounces, which will 
allow us to determine which side will land facing up. And even if we don’t know all 
these things, they all have definite values, independent of our knowledge of them. 
So once the coin leaves our hand, the side that will land facing up is completely 
determined. The “random” nature of the toss is therefore nothing more than a result 
of our ignorance of the properties of the coin and its surroundings. 

The question then arises: How do we create a process that is truly random? It’s 
a good bet that if you try to create a random process, you’ll discover that it actually 
isn’t random. Instead, it just appears to be random due to your lack of knowledge of 
various inputs at the start of the process. You might try to make a coin toss random 
by having a machine flip the coin, where the force and torque that it applies to the 
coin take on random values. But how do we make these things random? All we’ve 
done is shift the burden of proof back a step, so we haven’t really accomplished 
anything. 

This state of affairs is particularly relevant when computers are used to generate 
random numbers. By various processes, computers can produce numbers that seem 
to be random. However, there is no way that they can be truly random, because the 
output is completely determined by the input. If the input isn’t random (we’re assuming
it isn’t, because otherwise we wouldn’t need a random number generator!),
then the output isn’t random either. 
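The statement that a computer's output is completely determined by its input can be seen in a short experiment with Python's standard `random` module. This is only an illustrative sketch; the seed value 12345 and the choice of simulated die rolls are arbitrary assumptions of ours.

```python
import random

# Two generators started from the same input (the seed). The value 12345
# is arbitrary; any fixed seed shows the same effect.
gen_a = random.Random(12345)
gen_b = random.Random(12345)

rolls_a = [gen_a.randint(1, 6) for _ in range(10)]   # ten simulated die rolls
rolls_b = [gen_b.randint(1, 6) for _ in range(10)]

print(rolls_a)
print(rolls_b)
print(rolls_a == rolls_b)   # identical input gives identical "random" output
```

The rolls look random, but the second list simply repeats the first, because both generators received the same input.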

In the above coin-toss scenarios, the issue at hand is that our definition of probability
in Section 2.1 involved the phrase, “a very large number of identical trials.”
In none of the coin-toss scenarios are the trials identical. They all have (slightly) 
different inputs. So it’s no surprise that things aren’t truly random. 

This then brings up the question: If we have truly identical processes, then
shouldn’t they give exactly identical results? If we flip a coin in exactly the same
manner each time, then we should get exactly the same outcome each time. So our 
definition of probability seems to preclude true randomness! This makes us wonder 
if there are actually any processes that can be truly identical and at the same time 
yield different results. 

Indeed there are. It turns out that in quantum mechanics, this is exactly what
happens. It is possible to have two exactly identical processes that yield different results.
Things are truly random; you can’t trace the different outcomes to different
inputs. A great deal of effort has gone into investigating this randomness, and unless
our view of the universe is way off-base, there are processes in quantum mechanics
that involve true randomness.[1] If you think about this hard enough, it should
make your head hurt. Our experiences in everyday life tell us that things happen because
other things happened. But not so in quantum mechanics. There is no causal
structure in certain settings. Some things just happen. Period.

But even without quantum mechanics, there are plenty of physical processes in 
the world that are essentially random, for all practical purposes. The ingredient that 
makes these processes essentially random is generally either (1) the sheer largeness 
of the numbers (of molecules, for example) involved, or (2) the phenomenon of 
“chaos,” which turns small uncertainties into huge ones. Using these ingredients, it 
is possible to create methods for generating nearly random numbers. For example, 
the noise in the radio frequency range in the atmosphere generates randomness due 
to the absurdly large number of input bits of data (see www.random.org). And the 
pingpong balls bouncing around in a box used for picking lottery numbers generate 
randomness due to the chaotic nature of the ball collisions. 


Different information 

Let’s say that I flip a coin and then look at the result and see a Heads, but I don’t 
show you. Then for you, the probability of the coin being Heads is 1/2. But for 
me, the probability is 1. So if someone asks for the probability of the coin showing 
Heads, which number is it, 1/2 or 1? Well, there isn’t a unique answer to this 
question, because the question is an incomplete one. The correct question to ask is, 
“What is the probability of the coin showing Heads, as measured by such-and-such 
a person?” You have to state who is calculating the probability, because different 
people have different information, and this affects the probability. 

However, you might argue that it’s the same process, so it should have a uniquely-defined
probability, independent of who is measuring it. But it actually isn’t the
same process for the two of us. The process for me involves looking at the coin,
whereas the process for you doesn’t. Said in another way, our definition of probability
involved the phrase, “a very large number of identical trials.” As far as you’re
concerned, if we do 1000 trials of this process, they’re all identical to you. But
they certainly aren’t identical to me, because for some of them I observe Heads, and
for some I observe Tails. This is about as nonidentical as they can be. Said in yet
another way, we are talking about two fundamentally different probabilities. One
is the probability that the coin shows Heads, given no other information; this probability
is 1/2. The other is the conditional probability that the coin shows Heads,
given that it is observed to be Heads; this probability is 1.

[1] Of course, based on induction over the millennia, our view of the universe probably is way off-base.
But let’s not get into that here.

“On average” 

We now come to the most troublesome issue with probability. At the beginning of
Section 2.1, we stated our definition of probability: “Consider a very large number
N of identical trials of a certain process. If the probability of a particular event
occurring is p, then the event will occur in a fraction p of the trials, on average.”
There are two related issues here: What do we mean by a “very large” number N
of trials, and what do we mean by “on average”? Is $N = 10^9$ (one billion) large?
It seems large when talking about coin flips, but it isn’t large when talking about
an event with $p = 1/10^9$. It turns out that the largeness of N actually isn’t a huge
issue, due to the words “on average.” We can simply consider a very large number 
N' of sets, each consisting of N trials. (Of course, we’re using the words “very 
large” again here.) However, the words “on average” introduce the following more 
problematic issue. 

First, note that the definition of probability wouldn’t make any sense without the 
words “on average,” because there is no guarantee that an event will occur in exactly 
a fraction p of the trials. (Relaxing the condition to involve a small interval around 
p doesn’t help, because there is still no guarantee of ending up in that interval.) 
Second, given that the words “on average” must appear, we see that we must take an 
average over a large number N' of sets, each consisting of N trials. (This averaging 
must be done, independent of the size of N .) In each of the N' sets, the event 
will occur in a certain fraction of the N trials. If we take the average of these 
N' fractions, we will obtain p, on average. But since we just said the words “on 
average” again, we now need to consider a large number N" of sets, each consisting 
of N' sets, each consisting of N trials of the process. If we take the average of N" 
numbers, each of which is the average of N' fractions (the fractions for the different 
groups of N trials), then we should obtain p ... on average! 

You can see where we’re going here. There is no way to end the process. We 
can never be certain that we will end up with an average of p. Or more precisely, we 
can never be certain that we will end up with an average that is within, say, 0.0001 
(or some other small number of our choosing) of p. Every statement we can make 
will always end with the words “on average.” So we must always tack on one more 
iteration. Every time we say “on average,” we shift the burden of proof to the next 
step. Our definition of probability is therefore circular. Or perhaps “a never-ending
linear chain” would be a more accurate description.

Note that when considering N" sets of N' sets of N trials, we’re simply performing
N"N'N trials. So instead of thinking in terms of sets of sets of trials, etc.,
we can consider one extremely large set of N"N'N trials. It’s the same overall set
of trials, so we will observe the same fraction of trials in which an event occurs.
However, any statement we make about this set will still end with the words “on average.”
So we’re still going to need to consider N'" sets of the number of preceding
trials, regardless of how we feel like subdividing that number. At any stage, we will
always need to consider a large number of sets of the number of trials we’ve already
done.

Now, you might think that this is all a bit silly, because everyone knows that 
the probability of a fair coin showing Heads is 1/2. You can produce evidence 
for this statement by flipping a million coins and checking that the percentage of 
Heads lies between, say, 49% and 51%. Or you can flip a trillion coins and check 
that the percentage of Heads lies between, say, 49.999% and 50.001%. Or you 
can flip a larger number of coins and specify a narrower range. In the two cases just 
mentioned, the calculated probabilities of lying in the given range are the same, with 
the common value being essentially equal to 1. More precisely, the probability of
lying outside the range is the ridiculously small number $5 \cdot 10^{-89}$. See Problem 5.3
to get an idea of how small this number really is.
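For readers who want to see where a number of this size comes from, here is a rough sketch. It assumes (our assumption, not something stated above) that the fraction of Heads in n flips can be treated as Gaussian with mean 1/2 and standard deviation $0.5/\sqrt{n}$; the function name is also our own.

```python
import math

def prob_outside(n_flips, half_width):
    """Approximate probability that the fraction of Heads differs from 1/2
    by more than half_width, using a Gaussian approximation."""
    sigma = 0.5 / math.sqrt(n_flips)      # standard deviation of the fraction
    k = half_width / sigma                # half-width in standard deviations
    return math.erfc(k / math.sqrt(2))    # two-sided tail probability

p_million = prob_outside(10**6, 0.01)         # range 49% to 51%
p_trillion = prob_outside(10**12, 0.00001)    # range 49.999% to 50.001%

print(p_million, p_trillion)
```

In both cases the edge of the range lies 20 standard deviations from the mean, which is why the two probabilities agree and why they are of order $10^{-89}$.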

However, even with such a small probability, you might get Heads more than 
50.001% of the time in a trillion flips. It’s certainly unlikely, and to show that it 
is indeed unlikely, you could consider a large number of sets, each consisting of a 
trillion coin flips. You will likely find that an extremely small fraction of these sets 
have Heads occurring more than 50.001% of the time. But since we just said the 
word “likely,” it is understood that we need to consider a large number of sets, each 
consisting of a large number of sets, each consisting of a trillion trials. And so on. 
The point is that no matter how many trials you do, you can never be absolutely sure 
that you haven’t simply had bad (or good) luck. And, unfortunately, the preceding 
sentence is one thing you can be sure about. There will never be a magical large 
number for which things abruptly turn from probable to definite. So in that sense, 
an extremely large number like $10^{1000}$ is no better than an everyday number like 10.
They are fundamentally the same. Any differences are theoretically just a matter of 
degree. 

Having said all this, it would be a monumental mistake to discard the entire 
theory of probability, just because there are some philosophical issues with its underpinnings
(which we have certainly not resolved here; our goal in this section
was only to make you aware of them). The fact of the matter is that, in practice, 
probability works. Day after day, it proves invaluable in everything from finance 
to sports to politics to the fact that we don’t all spontaneously combust. Therefore, 
in this book we will take a practical approach, where we intuitively know that the 
probability of getting Heads on a coin flip is 1/2, the probability of rolling a 5 on a
die is 1/6, and so on. Feel free to ponder the philosophy of probability, but don’t let
that stop you from using probability! 


7.2 Appendix B: Euler’s number, e 

7.2.1 Definition of e 

Consider the expression,

$$\left(1 + \frac{1}{n}\right)^n. \qquad (7.1)$$

Admittedly, this comes a bit out of the blue, but let’s not worry about the motivation
for now. After we derive a number of interesting results below, you’ll see why we
chose to consider this particular expression. Table 7.1 gives the values of $(1 + 1/n)^n$
for various integer values of n. (Non-integers are fine to consider, too.)


n            | 1 | 2    | 5    | 10   | 10^2  | 10^3  | 10^4    | 10^5    | 10^6
(1 + 1/n)^n  | 2 | 2.25 | 2.49 | 2.59 | 2.705 | 2.717 | 2.71815 | 2.71827 | 2.7182805

Table 7.1: The values of $(1 + 1/n)^n$ approach a definite number, approximately 2.71828,
which we call e.


Apparently, the values converge to a number somewhere around 2.71828. This can
also be seen in Fig. 7.1, which shows a plot of $(1 + 1/n)^n$ vs. log(n). The log(n)
here means that the “0” on the x axis corresponds to $n = 10^0 = 1$, the “1” corresponds
to $n = 10^1 = 10$, the “2” corresponds to $n = 10^2 = 100$, and so on.


Figure 7.1: The plot of $(1 + 1/n)^n$ approaches e. (Vertical axis: $(1 + 1/n)^n$; horizontal axis: $\log_{10}(n)$.)


It is clear that even before we reach the “6” (that is, $n = 10^6 = 1{,}000{,}000$),
the curve has essentially leveled off to a constant value. This value happens to be
2.7182818284.... It turns out that the digits in this number go on forever, with no 
overall pattern. However, the fortuitous double appearance of the “1828” makes it 
fairly easy to remember to 10 digits, although you’ll rarely ever need more accuracy 
than 2.718. The exact number is known as Euler’s number, and it is denoted by the 
letter e. The precise definition of e in terms of the expression in Eq. (7.1) is 


$$e = \lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n \approx 2.71828. \tag{7.2}$$


The “lim” notation simply means that we’re taking the limit of this expression as n
approaches infinity. If you don’t like dealing with limits or infinity, just set n equal
to a very large number like $10^{10}$, and then you pretty much have the value of e.
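
If you'd rather let a computer do the plugging in, the following minimal Python sketch reproduces the kind of numbers in Table 7.1. The function name `one_plus_inverse_power` is just our illustrative choice; it isn't from the text. One caveat: for a huge n such as $10^{10}$, forming 1 + 1/n directly in floating-point arithmetic rounds off some digits, so the sketch evaluates the same quantity as exp(n log(1 + 1/n)) with `math.log1p`, which is built for this situation.

```python
import math

def one_plus_inverse_power(n):
    """Return (1 + 1/n)**n, computed as exp(n * log(1 + 1/n)).

    math.log1p(y) evaluates log(1 + y) accurately for tiny y, which avoids
    the rounding error in forming 1 + 1/n directly when n is huge.
    """
    return math.exp(n * math.log1p(1.0 / n))

# The values of n from Table 7.1, plus the "very large" n = 10**10.
for n in [1, 2, 5, 10, 10**2, 10**3, 10**4, 10**5, 10**6, 10**10]:
    print(n, one_plus_inverse_power(n))

print("math.e =", math.e)
```

The printed values should creep up toward `math.e` as n grows, in line with the table.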

Remember that Eq. (7.2) is a definition. There’s no actual content in it. All
we did was take the quantity $(1 + 1/n)^n$ and look at what value it approaches as n
becomes very large, and then we decided to call the result “e.” We will, however,
derive some actual results below, which aren’t just definitions.

Remark: If we didn’t use a log plot in Fig. 7.1 and instead just graphed $(1 + 1/n)^n$ vs. n, the
plot would stretch far out to the right if we wanted to go up to a large number like $n = 10^6$.
Of course, we could shrink the plot in the horizontal direction, but then the region of small
values of n would be squeezed down to essentially nothing. For example, the region up to
n = 100 would take up only 0.01% of the plot. We would therefore be left with basically just
a horizontal line. Even if we go up to only $n = 10^4$, we end up with the essentially horizontal
straight line shown in Fig. 7.2, preceded by an essentially vertical jump from 2.0 to 2.718.


[Figure: plot of $(1 + 1/n)^n$ (vertical axis, from 2.0 to 2.8) vs. n (horizontal axis, from 0 to 10000).]

Figure 7.2: The plot of $(1 + 1/n)^n$ vs. n, with n measured on a linear scale.


The features in the left part of the plot in Fig. 7.1 aren’t so visible in Fig. 7.2. You can barely 
see the bend in the curve. Log plots are used to prevent the larger numbers from dominating 
the plot, as they do in Fig. 7.2. This issue isn’t so critical here, since we’re concerned only 
with what $(1 + 1/n)^n$ looks like for large n, but nevertheless it’s often more informative to
use a log plot in certain settings. * 

It is quite interesting that $(1 + 1/n)^n$ approaches a definite finite value as n
gets larger and larger. On one hand, you might think that because the 1/n term
gets smaller and smaller (which means that (1 + 1/n) gets closer and closer to 1),
the whole expression should get closer and closer to 1, because 1 raised to any
power is 1. On the other hand, you might think that because the exponent n gets
larger and larger, the whole expression should get larger and larger and approach
infinity, because we’re raising something to an ever-increasing power. It turns out
that $(1 + 1/n)^n$ does neither of these things. Instead, these two effects cancel, and
the result ends up somewhere between 1 and $\infty$, at the particular value of about
2.71828.

As mentioned above, we introduced $(1 + 1/n)^n$ a bit out of the blue. But we’ve
already found one interesting feature of it, namely that it approaches a definite
finite number (which we labeled as “e”) as n goes to $\infty$. And there are many other
features; so many, in fact, that e ends up being arguably the most important number
in mathematics, with the possible exception of $\pi$. (But my vote is for e!) From
the nearly endless list of interesting facts about e, we include three in the following
three subsections.


7.2.2 Raising e to a power 

What do we get when we raise e to a power? That is, what is the value of $e^x$? There
are (at least) two ways to answer this. The simple way is to just use your calculator
to raise e = 2.71828 to the power x. A number will pop out, and that’s that.

However, there is another way which turns out to be immensely useful in the
study of probability. If we relabel the n in Eq. (7.2) as m (for convenience), and if
we then define n = mx in the fourth line below, we obtain


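$$\begin{aligned}
e^x &= \lim_{m \to \infty} \left[\left(1 + \frac{1}{m}\right)^{m}\right]^{x} && \text{(using $m$ instead of $n$ in Eq. (7.2))} \\
&= \lim_{m \to \infty} \left(1 + \frac{1}{m}\right)^{mx} && \text{(multiplying exponents)} \\
&= \lim_{m \to \infty} \left(1 + \frac{x}{mx}\right)^{mx} && \text{(multiplying by 1 in the form of $x/x$)} \\
&= \lim_{n \to \infty} \left(1 + \frac{x}{n}\right)^{n} && \text{(defining $n = mx$)}
\end{aligned} \tag{7.3}$$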


In the case where n is large but not infinite, we can replace the “=” sign with a
“≈” sign:

$$e^x \approx \left(1 + \frac{x}{n}\right)^n \qquad \text{(for large $n$)} \tag{7.4}$$


The bigger the n, the better the approximation. The condition under which the
approximation is a good one is

$$x \ll \sqrt{n}. \tag{7.5}$$

This will usually hold in the situations we’ll be dealing with (although there are
a few exceptions in Chapter 5). We’ll just accept this condition here, but see the
second bullet-point case (the $na^2 \ll 1$ one) in Appendix C if you want to know
where it comes from.

Eq. (7.4) is a rather nice result. The x that appears in the numerator of the
fraction is simply the exponent of e. It almost seems like too simple a generalization
of Eq. (7.2) to be correct. (Eq. (7.2) is a special case of Eq. (7.4), with x = 1.) Let’s
check that Eq. (7.4) does indeed hold for, say, x = 2. If we pick $n = 10^6$ (which
certainly satisfies the $x \ll \sqrt{n}$ condition), we obtain $(1 + x/n)^n = (1 + 2/10^6)^{10^6} \approx
7.389041$. This is very close to the true value of $e^2$, which is about 7.389056. Larger
values of n will make it even closer.
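
Here is a small Python sketch of the same check, in case you want to try other values of x and n. The helper name `exp_via_limit` is hypothetical (our own label, not the book's), and the values of n are just sample choices.

```python
import math

def exp_via_limit(x, n):
    """Approximate e**x by (1 + x/n)**n, as in Eq. (7.4)."""
    return (1.0 + x / n) ** n

x = 2
for n in [10, 10**2, 10**4, 10**6]:
    approx = exp_via_limit(x, n)
    # The last column is the ratio to the true value; it should approach 1.
    print(n, approx, approx / math.exp(x))
```

Note that n = 10 doesn't satisfy the condition in Eq. (7.5) very well for x = 2, so the ratio in that row should be noticeably below 1.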


Example 1 (Compound interest): Assume that you have a bank account for which 
the interest rate is 5% per year. If this 5% is simply applied as a one-time addition at 
the end of the year, then after one year you will have 1.05 times the amount of money 
you started with. However, another way for the interest to be applied is for it to be 
compounded (applied) daily, with (5%)/365 being the daily rate (which happens to 
be about 0.014%). That is, your money at the end of each day equals 1 + (0.05)/365 
times what you had at the beginning of that day. In this scenario, by what factor does 
your money increase after one year? 


Solution: Your money gets multiplied by a factor of 1 + (0.05)/365 each day, so at
the end of one year (365 days), it has increased by the factor

$$\left(1 + \frac{0.05}{365}\right)^{365}. \tag{7.6}$$

But this has exactly the same form as the expression in Eq. (7.4), with x = 0.05 and
n = 365 (which certainly satisfies the $x \ll \sqrt{n}$ condition). So Eq. (7.4) tells us that
after one year, your money increases by the factor $e^{0.05} \approx 1.051$. (Of course, you
can also just plug the original expression $(1 + 0.05/365)^{365}$ into your calculator. The
result is essentially the same.) Since your money increases by a factor of 1.051, the
effective yearly interest rate is 5.1%. That is, someone who has a 5.1% interest rate
that is applied as a one-time addition at the end of the year will end up with the same
amount of money as you (assuming that the starting amounts were the same).

This effective interest rate of 5.1% is called the yield. So an annual rate of 5% has a 
yield of 5.1%. This yield is larger than 5% because the interest rate each day is being 
applied not only to your initial amount, but also to all the interest you’ve received in 
the preceding days. In short, you’re earning interest on your interest. 

The increase by 0.1% isn’t so much. But if the annual interest rate is instead 10%, and
if it is compounded daily, then the above reasoning implies that you will end up with
a yearly factor of $e^{0.10} \approx 1.105$, which means that the yield is 10.5%. And an annual
rate of 20% (admittedly rather unrealistic) produces a yearly factor of $e^{0.20} \approx 1.22$,
which means that the yield is 22%.
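
The yields quoted in this example can be reproduced with a few lines of Python. This is only a sketch; the function name `yearly_factor` and its `periods` parameter are our own illustrative choices.

```python
import math

def yearly_factor(rate, periods=365):
    """Factor by which money grows in one year when the annual rate
    `rate` (e.g. 0.05 for 5%) is compounded `periods` times per year."""
    return (1.0 + rate / periods) ** periods

for rate in [0.05, 0.10, 0.20]:
    daily = yearly_factor(rate)       # compounded daily, as in Eq. (7.6)
    via_e = math.exp(rate)            # the e**rate value from Eq. (7.4)
    print(rate, daily, via_e, "yield in percent:", 100 * (daily - 1))
```

Setting `periods=1` gives the one-time addition at the end of the year, so you can compare the two ways of applying the interest directly.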


Example 2 (Doubling your money): In the 5% scenario in the above example, the effect
of compound interest (that is, earning interest on the interest) over one year could
pretty much be ignored, because it was only 0.1%. However, the effect of compound
interest cannot be ignored in the following question: If the annual interest rate is 5%,
and if it is compounded daily, how many years will it take to double your money?

Solution: First, note the following incorrect line of reasoning: If you start with N
dollars, then doubling your money means that you eventually need to increase it by
another N dollars. Since it increases by about (0.05)N each year, you need about 20
of these increases (because 20 · (0.05) = 1) to obtain the desired increase of N. So it
takes 20 years. However, this is incorrect, because it ignores the fact that you have
more money in each successive year and are hence earning interest on a larger and
larger amount of money. The “since it increases by about (0.05)N each year” clause
above is therefore incorrect. Even the slightly more correct figure of (0.051)N is still
plenty wrong. The correct line of reasoning is the following.

We saw in the previous example that at the end of each year, your money increases
by a factor of $e^{0.05}$ compared with what it was at the beginning of the year. So after
n years it increases by n of these factors, that is, by $(e^{0.05})^n$, which equals $e^{(0.05)n}$.
We want to find the value of n for which this overall factor equals 2. A little trial
and error in your calculator shows that $e^{0.7} \approx 2$. (In the language of logs, this is the
statement that $\log_e 2 \approx 0.7$, or equivalently $\ln 2 \approx 0.7$. But this terminology isn’t
important here.) So we need the (0.05)n exponent to equal 0.7, which in turn implies
that n = (0.7)/(0.05) = 14. It therefore takes 14 years to double your money.

You can think of this result for n as 70 divided by 5. For a general yearly interest rate 
of r%, the same reasoning we used above shows that the number of years required to 
double your money is 70 divided by r. For example, with a 10% rate, your money will 
double in 7 years. In remembering this general rule, you just need to remember one 
number: 70. Equivalently, the time it takes to double your money is 70% of the naive 
answer that ignores the effect of compound interest. From the first paragraph above, 
this naive answer is 100 divided by r. 

Unlike the previous example where the interest earned was small (because we were
considering only one year), the interest earned in this example is large; it equals N
dollars by the end. So the effect of earning interest on your interest (that is, the effect
of compound interest) cannot be ignored.

Note that even if you don’t compound the interest daily (that is, even if you simply 
apply the 5% at the end of each year), it will still take essentially 14 years to double 
your money, because $(1.05)^{14} \approx 1.98 \approx 2$. The extra 0.1% earned each year when the
interest is compounded daily doesn’t make much of a difference here. * 
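
If you'd like to see how good the “70 divided by r” rule is, here is a short Python sketch. It uses logs (which, as mentioned above, aren't essential to the discussion) to solve for the doubling time directly. The two function names are hypothetical, chosen only for this illustration.

```python
import math

def doubling_time_continuous(rate_percent):
    """Years to double, from the condition e**(r*n) = 2, so n = ln(2)/r.
    This is the daily-compounding result in the e**x approximation."""
    return math.log(2) / (rate_percent / 100.0)

def doubling_time_yearly(rate_percent):
    """Years to double if the interest is applied once at the end of each
    year: solve (1 + r)**n = 2 for n."""
    return math.log(2) / math.log(1.0 + rate_percent / 100.0)

for r in [2, 5, 10]:
    # Columns: rate, rule-of-70 estimate, daily compounding, yearly compounding.
    print(r, 70 / r, doubling_time_continuous(r), doubling_time_yearly(r))
```

The three columns of years should be close to each other for modest rates, which is why remembering the single number 70 is good enough in practice.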


7.2.3 The infinite series for $e^x$

Eq. (7.3), or equivalently Eq. (7.4), gives an expression for $e^x$. Another rather
interesting expression for $e^x$ that we can derive is

$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \cdots \tag{7.7}$$


The first two terms here can be written as $x^0/0!$ and $x^1/1!$, so all of the terms take
the form of $x^n/n!$, where n runs from zero to infinity. In calculus language, Eq. (7.7)
is known as the Taylor series for $e^x$. But that’s just a name, so ignore it if you’ve
never heard of it. We’ll give a derivation of Eq. (7.7) below, but let’s first look at a
few of its consequences.

A special case of Eq. (7.7) occurs when x = 1, which yields 


$$e = 1 + 1 + \frac{1}{2!} + \frac{1}{3!} + \frac{1}{4!} + \cdots \tag{7.8}$$


These terms get very small very quickly, so you don’t need to include many of 
them to get a good approximation to e. Even just going out to the 10! term gives 
$e \approx 2.71828180$, which is accurate to the seventh digit beyond the decimal point.
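
To watch how quickly the sum in Eq. (7.8) settles down, you can add up the terms with a short Python sketch like the following. The function name `exp_series` and the parameter `last_power` are our own labels, introduced only for this illustration.

```python
import math

def exp_series(x, last_power):
    """Sum the terms x**k / k! of Eq. (7.7) for k = 0, 1, ..., last_power."""
    total = 0.0
    term = 1.0                 # the k = 0 term, x**0 / 0!
    for k in range(last_power + 1):
        total += term
        term *= x / (k + 1)    # turn x**k / k! into x**(k+1) / (k+1)!
    return total

# Partial sums of Eq. (7.8), which is Eq. (7.7) with x = 1.
for last_power in [2, 4, 6, 8, 10]:
    print(last_power, exp_series(1.0, last_power))
print("math.e =", math.e)
```

The same function works for any x, so it can also be used for the x = 2 check suggested later in this section.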

A quick corollary to Eq. (7.7) is that if x is small, we can write 


$$e^x \approx 1 + x. \tag{7.9}$$

This is true because if x is small then the $x^2/2!$ term, along with all the higher
powers of x in Eq. (7.7), are small compared with x. We can therefore ignore them.
You should verify with a calculator that Eq. (7.9) is a good approximation for small
x. You can let x be 0.01 or 0.001, etc. The number e is the one special number for
which Eq. (7.9) holds. It is not the case that $2^x \approx 1 + x$ or $10^x \approx 1 + x$, as you can
verify.

Of course, we can also say (by using the exact same reasoning we just used) 
that if x is small then the x term in Eq. (7.7), along with all the higher powers of x, 
are small compared with 1. If we ignore all these terms, we obtain the very coarse 
approximation: $e^x \approx 1$. This is indeed an approximation to $e^x$ for small x, but the
question is whether it is good enough for whatever purpose you have in mind. If it
isn’t, then you need to use the $e^x \approx 1 + x$ expression. And similarly, if that isn’t
good enough for your purposes, then you need to keep the next term in Eq. (7.7) and
write $e^x \approx 1 + x + x^2/2$. And so on. But in many cases the $e^x \approx 1 + x$ approximation
gets the job done.
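
The following Python sketch puts numbers on this hierarchy of approximations. It is only an illustration; the function name `approximation_errors` is hypothetical, and the last entry (for $2^x$) is included just to show that other bases don't behave the same way.

```python
import math

def approximation_errors(x):
    """Return the errors of successive approximations to e**x,
    plus the error of pretending that 2**x is close to 1 + x."""
    exact = math.exp(x)
    return (
        exact - 1.0,                       # error of the coarse e**x ~ 1
        exact - (1.0 + x),                 # error of e**x ~ 1 + x
        exact - (1.0 + x + x**2 / 2.0),    # error after adding x**2/2!
        2.0**x - (1.0 + x),                # 2**x is NOT close to 1 + x
    )

for x in [0.1, 0.01, 0.001]:
    print(x, approximation_errors(x))
```

For small x, each of the first three errors should be roughly the size of the first term that was left out of Eq. (7.7), while the last entry stays comparable to x itself.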

We will now derive Eq. (7.7) by using Eq. (7.3) along with our good old friend, 
the binomial expansion; see Eq. (1.21). We’ll assume that n is an integer here. 
Letting a = 1 and b = x/n in Eq. (1.21), the binomial expansion of Eq. (7.3) gives
(expanding the binomial coefficients and rearranging to obtain the third line) 


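$$\begin{aligned}
e^x &= \lim_{n \to \infty} \left(1 + \frac{x}{n}\right)^n \\
&= \lim_{n \to \infty} \left[1 + \binom{n}{1}\left(\frac{x}{n}\right) + \binom{n}{2}\left(\frac{x}{n}\right)^2 + \binom{n}{3}\left(\frac{x}{n}\right)^3 + \cdots\right] \\
&= \lim_{n \to \infty} \left[1 + x + \frac{x^2}{2!}\left(\frac{n(n-1)}{n^2}\right) + \frac{x^3}{3!}\left(\frac{n(n-1)(n-2)}{n^3}\right) + \cdots\right].
\end{aligned} \tag{7.10}$$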

This looks roughly like what we’re trying to show in Eq. (7.7), if only we could
make the terms in parentheses go away. And indeed we can, because in the $n \to \infty$
limit, all of these terms equal 1. This is true because if $n \to \infty$, then both n - 1 and
n - 2 are essentially equal to n (in a multiplicative sense). More precisely, the ratios
(n - 1)/n and (n - 2)/n are both equal to 1 if $n \to \infty$. So we have


$$\lim_{n \to \infty} \left(\frac{n(n-1)}{n^2}\right) = 1 \qquad \text{and} \qquad \lim_{n \to \infty} \left(\frac{n(n-1)(n-2)}{n^3}\right) = 1, \tag{7.11}$$


and likewise for the terms associated with higher powers of x. Eq. (7.10) therefore 
becomes Eq. (7.7) in the $n \to \infty$ limit.² If you have any doubts that Eq. (7.7) holds,
you should verify with a calculator that it works for, say, x = 2. Going out to the 
10! term should convince you. 
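
To see the factors in parentheses in Eq. (7.10) approach 1, you can compare the binomial-expansion terms for finite n with the terms of Eq. (7.7). Below is a minimal Python sketch; the helper name `binomial_terms` is our own, and `math.comb` requires Python 3.8 or later.

```python
import math

def binomial_terms(x, n, count):
    """First `count` terms of the binomial expansion of (1 + x/n)**n.

    The k-th term is C(n, k) * (x/n)**k, which equals
    (x**k / k!) * [n(n-1)...(n-k+1) / n**k], as in Eq. (7.10).
    """
    return [math.comb(n, k) * (x / n) ** k for k in range(count)]

x = 2.0
series_terms = [x**k / math.factorial(k) for k in range(6)]   # terms of Eq. (7.7)
for n in [10, 1000, 100000]:
    ratios = [b / s for b, s in zip(binomial_terms(x, n, 6), series_terms)]
    print(n, ratios)   # these ratios are the factors in parentheses in Eq. (7.10)
```

For each n, the ratios start at 1 and drift below 1 as the power of x increases, but the whole list should move toward 1 as n grows.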

Remark: Another way to convince yourself that Eq. (7.7) is correct is the following. Consider 
what e x looks like if x is a small number, say, x = 0.0001. We have 

$$e^{0.0001} = 1.0001000050001667\ldots \tag{7.12}$$


This can be written more informatively as 


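$$\begin{aligned}
e^{0.0001} &= 1.0 \\
&\quad + 0.0001 \\
&\quad + 0.000000005 \\
&\quad + 0.0000000000001667\ldots \\
&= 1 + (0.0001) + \frac{(0.0001)^2}{2!} + \frac{(0.0001)^3}{3!} + \cdots,
\end{aligned} \tag{7.13}$$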


in agreement with Eq. (7.7). If you make x even smaller (say, 0.000001), then the same 
pattern will form, but with more zeros between the numbers than in Eq. (7.12). 

² For any large but finite n, the terms in parentheses far out in the series in Eq. (7.10) will eventually
differ from 1, but by that point the factorials in the denominators will make the terms negligible, so we
can ignore them. Even if x is large, so that the powers of x in the numerators become large, the factorials
in the denominators will dominate after a certain point in the series, making the terms negligible. But
we’re assuming $n \to \infty$ anyway, so these issues related to finite n are irrelevant.

Eq. (7.13) shows that if $e^x$ can be expressed as a sum of powers of x (that is, in the form
of $a + bx + cx^2 + dx^3 + \cdots$), then a and b must equal 1, c must equal 1/2, and d must equal
1/6. If you kept more digits in Eq. (7.12), you could verify the $x^4/4!$ and $x^5/5!$, etc., terms
in Eq. (7.7) too. But things aren’t quite as obvious for these, because we don’t have all the
nice zeros that we have among the first 12 digits of Eq. (7.12). *
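
An ordinary calculator may not show enough digits to reveal the pattern in Eq. (7.12). One way around this, sketched below, is Python's standard `decimal` module, which lets you choose the number of significant digits. The choice of 30 digits here is arbitrary, and the variable names are ours.

```python
from decimal import Decimal, getcontext

getcontext().prec = 30                 # work with 30 significant digits

x = Decimal("0.0001")
value = x.exp()                        # e**0.0001 to that precision
print(value)

# The first four terms of Eq. (7.7), for comparison.
four_terms = 1 + x + x**2 / 2 + x**3 / 6
print(four_terms)
```

The two printed numbers should agree through the digits shown in Eq. (7.12), with the $x^4/4!$ term accounting for where they begin to differ.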


7.2.4 The slope of $e^x$

Another interesting and important property of e is that if we plot the function $f(x) =
e^x$, then the slope of the curve³ at any point equals the value of the function at that
point, namely $e^x$. For example, in Fig. 7.3 the slope at x = 0 is $e^0 = 1$, and the slope
at x = 2 is $e^2 \approx 7.39$. (Note the different scales on the x and y axes, which make the
slopes appear smaller than 1 and 7.39.) The number e is the one special number for
which this is true. The same thing is not true for, say, $2^x$ or $10^x$. The derivation of
this property is by no means necessary for an understanding of the material in this
book, but we’ll present it in Appendix D, just for the fun of it.



Figure 7.3: For any value of x, the slope of the $e^x$ curve equals $e^x$. Note the different scales
on the axes. 

More generally, any function of the form $Ae^x$ (where A is a constant) has the
property that the slope at any point equals the value of the function at that point.
This is true because both the value and the slope differ by the same factor of A from
the corresponding quantities in the $e^x$ case. (You can think about why this is true
for the slope.) So if the property holds for $e^x$ (which it does), then it also holds for
$Ae^x$.


7.3 Appendix C: Approximations to $(1 + a)^n$

³ By “slope” we mean the slope of the line that is tangent to the curve at the given point. You can
imagine the curve being made out of an actual piece of wire, and if you press a straight stick up against
it, the stick will form the tangent to the curve at the point of contact.

Expressions of the form $(1 + a)^n$ come up often in mathematics, especially in probability.
It turns out that if a is small enough (which is invariably the case in the
situations we’ll be dealing with), then the following approximate formula holds:


$$(1 + a)^n \approx e^{na}. \tag{7.14}$$


This relation is equivalent to Eq. (7.4) if we let a = x/n.

Eq. (7.14) was critical in our discussion of the exponential and Poisson distributions
in Sections 4.6.3 and 4.7.2. However, when we derived the Gaussian approximations
to the binomial and Poisson distributions in Sections 5.1 and 5.3, we saw
that a more accurate approximation was needed, namely

$$(1 + a)^n \approx e^{na} e^{-na^2/2}. \tag{7.15}$$


In the event that a is sufficiently small, the extra factor of $e^{-na^2/2}$ is irrelevant,
because it is essentially equal to $e^{-0} = 1$. So Eq. (7.15) reduces to Eq. (7.14). But
if a isn’t sufficiently small, then the extra factor of $e^{-na^2/2}$ is necessary if we want
to have a good approximation. Of course, if a is too large, then even the inclusion
of the $e^{-na^2/2}$ factor isn’t enough to yield a good approximation. We must tack on
another factor, or perhaps many factors, as we’ll see in Eq. (7.21) below.

For an example where the $e^{-na^2/2}$ term in Eq. (7.15) is necessary, let’s say we
have n = 100 and a = 1/10. Then


$$(1 + a)^n = (1 + 1/10)^{100} \approx 13{,}781 \qquad \text{and} \qquad e^{na} = e^{10} \approx 22{,}026. \tag{7.16}$$

So the $(1 + a)^n \approx e^{na}$ approximation in Eq. (7.14) is a very bad one. However, the
$e^{-na^2/2}$ factor in this case equals $e^{-1/2} \approx 0.60653$, which yields

$$e^{na} e^{-na^2/2} \approx (22{,}026)(0.60653) \approx 13{,}360. \tag{7.17}$$

The $(1 + a)^n \approx e^{na} e^{-na^2/2}$ approximation in Eq. (7.15) is therefore quite good;
13,360 differs from the actual value of 13,781 by only about 3%.⁴ As an exercise,
you can show that if we had picked more extreme numbers, say, n = 10,000 and a =
1/100, then Eq. (7.14) would be a similarly poor approximation, whereas Eq. (7.15)
would be an excellent one, off by only 0.3%.
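
The numbers in Eqs. (7.16) and (7.17), along with the suggested exercise, can be checked with the following Python sketch. The function name `compare` is hypothetical, chosen only for this illustration.

```python
import math

def compare(n, a):
    """Return the exact (1 + a)**n together with the approximations
    of Eq. (7.14) and Eq. (7.15)."""
    exact = (1.0 + a) ** n
    one_factor = math.exp(n * a)                      # Eq. (7.14)
    two_factors = math.exp(n * a - n * a**2 / 2.0)    # Eq. (7.15)
    return exact, one_factor, two_factors

for n, a in [(100, 1 / 10), (10_000, 1 / 100)]:
    exact, one_factor, two_factors = compare(n, a)
    # The last two columns are ratios to the exact value (1 means perfect).
    print(n, a, exact, one_factor / exact, two_factors / exact)
```

In keeping with the multiplicative sense of “≈” used here, the sketch prints ratios rather than differences.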

There are various ways to derive Eq. (7.15). The easiest way is to use a little
calculus. If you want to avoid using calculus, you can still do the derivation, but it is
rather laborious. Furthermore, if you want to generate better approximations by
incorporating additional terms, the non-calculus method soon becomes intractable. In
contrast, the calculus method gives, in one fell swoop, approximations to whatever
accuracy you desire. We’ll therefore take that route.

We’ll start with the expression for the sum of a geometric series, 

$$1 - a + a^2 - a^3 + a^4 - \cdots = \frac{1}{1 + a}. \tag{7.18}$$

⁴ Whenever we use a “≈” sign, we use it in a multiplicative (equivalently, a ratio) sense, and not an
additive sense. The numbers 13,360 and 13,781 differ by 421, which you might consider to be a large
number, but that doesn’t matter. The ratio of the numbers is close to 1 (it equals 0.97), so they are
“approximately equal” in that sense.

This is valid for |a| < 1. (If you plug in, say, a = 2, you will get an obviously
incorrect statement.) For |a| < 1, if you keep enough terms on the left, the sum
will essentially be equal to 1/(1 + a). If you hypothetically keep an infinite number 
of terms, the sum will be exactly equal to 1/(1 + a). You can verify Eq. (7.18) 
by multiplying both sides by 1 + a. On the lefthand side, the infinite number of 
cross terms cancel in pairs, so only the “1” survives. Or, as always, you can just 
plug a small number like a = 0.01 or 0.001 into your calculator if you want some 
reassurance. 
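
For some reassurance beyond a single calculator check, here is a short Python sketch that adds up the lefthand side of Eq. (7.18) for a few values of a. The function name `alternating_geometric_sum` is our own label. The a = 2 case is included on purpose, to show what goes wrong outside the |a| < 1 range.

```python
def alternating_geometric_sum(a, num_terms):
    """Sum 1 - a + a**2 - a**3 + ... using the first `num_terms` terms."""
    return sum((-a) ** k for k in range(num_terms))

for a in [0.01, 0.5, 2.0]:
    partial_sums = [alternating_geometric_sum(a, m) for m in (5, 10, 20)]
    print(a, partial_sums, "target:", 1.0 / (1.0 + a))
```

For a = 0.01 and a = 0.5 the partial sums should close in on 1/(1 + a), while for a = 2 they swing wildly and never settle down.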

Now is where the calculus comes in. If we integrate both sides of Eq. (7.18) 
with respect to a, we obtain 


$$a - \frac{a^2}{2} + \frac{a^3}{3} - \frac{a^4}{4} + \frac{a^5}{5} - \cdots = \ln(1 + a), \tag{7.19}$$


where ln is the natural log, that is, the log base e. We have used the facts that the
integral of $x^k$ equals $x^{k+1}/(k + 1)$ and the integral of 1/x equals ln(x). Technically
there could be a constant of integration in Eq. (7.19), but it is zero. Eq. (7.19) is the
Taylor series for ln(1 + a), just as Eq. (7.7) is the Taylor series for $e^x$. Eq. (7.19) can
also be derived (as one learns in a calculus class) via the standard way of producing
a Taylor series, which involves taking a bunch of derivatives. But the above method
involving the geometric series is simpler. As with Eq. (7.18), Eq. (7.19) is valid for
|a| < 1.
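
If calculus isn't your thing, you can at least check Eq. (7.19) numerically. The sketch below (with the hypothetical function name `log_series`) adds up the first few terms on the lefthand side and compares the result with Python's built-in natural log.

```python
import math

def log_series(a, num_terms):
    """Sum a - a**2/2 + a**3/3 - ... using the first `num_terms` terms
    of Eq. (7.19)."""
    return sum((-1) ** (k + 1) * a**k / k for k in range(1, num_terms + 1))

for a in [0.01, 0.1, 0.5]:
    # Columns: a, two-term sum, ten-term sum, and ln(1 + a) itself.
    print(a, log_series(a, 2), log_series(a, 10), math.log(1.0 + a))
```

The smaller the value of a, the fewer terms you need, which is the observation that the rest of this appendix builds on.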

If we now exponentiate both sides of Eq. (7.19), then since $e^{\ln(1+a)} = 1 + a$ by
the definition of ln, we obtain (reversing the sides of the equation)

$$1 + a = e^{a} \, e^{-a^2/2} \, e^{a^3/3} \, e^{-a^4/4} \, e^{a^5/5} \cdots, \tag{7.20}$$


which again is valid for |a| < 1. We have used the fact that the exponential of a sum
is the product of the exponentials. Finally, if we raise both sides of Eq. (7.20) to the
nth power, we arrive at

$$(1 + a)^n = e^{na} \, e^{-na^2/2} \, e^{na^3/3} \, e^{-na^4/4} \, e^{na^5/5} \cdots \tag{7.21}$$

This relation is valid for |a| < 1. It is exact if we include an infinite number of the
exponential factors on the righthand side. However, the question we are concerned
with here is how many terms we need to keep in order to obtain a good approximation.
(We’ll leave “good” undefined for the moment.) Under what conditions do we
obtain Eq. (7.14) or Eq. (7.15)? The number of terms we need to keep depends on
both a and n. In the following cases, we will always assume that a is small (more
precisely, much smaller than 1).

• $na \ll 1$

If $na \ll 1$, then all of the exponents on the righthand side of Eq. (7.21) are
much smaller than 1. The first one (namely na) is small, by assumption. The
second one (namely $na^2/2$; we’ll ignore the sign) is also small, because it
is smaller than na by a factor a (and also by a factor 1/2), which we are
assuming is small. Likewise, all of the other exponents in subsequent terms
have additional factors of a and hence are even smaller. Therefore, since all
of the exponents in Eq. (7.21) are much smaller than 1, they are, to a good
approximation, all equal to zero. The exponential factors are therefore all
approximately equal to $e^0 = 1$, so we obtain

$$(1 + a)^n \approx 1 \qquad (\text{valid if } na \ll 1) \tag{7.22}$$

An example of a pair of numbers that satisfies $na \ll 1$ is n = 1 and a = 1/100.
In this case it is a good approximation to say that $(1 + a)^n \approx 1$. And indeed,
the exact value of $(1 + a)^n$ is $(1.01)^1 = 1.01$, so the approximation is smaller
by only 1%.

• $na^2 \ll 1$

What if a isn’t small enough to satisfy $na \ll 1$, but is still small enough to
satisfy $na^2 \ll 1$? In this case we need to keep the $e^{na}$ term in Eq. (7.21), but
we can ignore the $e^{-na^2/2}$ term, because it is approximately equal to $e^{-0} =
1$. The exponents in subsequent terms are all also essentially equal to zero,
because they are suppressed by higher powers of a. So Eq. (7.21) becomes

$$(1 + a)^n \approx e^{na} \qquad (\text{valid if } na^2 \ll 1) \tag{7.23}$$

We have therefore derived Eq. (7.14), which we now see is valid when $na^2 \ll
1$. A pair of numbers that doesn’t satisfy $na \ll 1$ but does satisfy $na^2 \ll 1$
is n = 100 and a = 1/100. In this case it is a good approximation to say that
$(1 + a)^n \approx e^{na} = e^1 \approx 2.718$. And indeed, the exact value of $(1 + a)^n$ is
$(1.01)^{100} \approx 2.705$, so the approximation is larger by only about 0.5%. The
$(1 + a)^n \approx 1$ approximation in Eq. (7.22) is not a good one, being smaller
than the approximation in Eq. (7.23) by a factor of e in the present scenario.

A special case of Eq. (7.23) occurs when n = 1, which yields $1 + a \approx e^a$.
So we have rederived the $e^x \approx 1 + x$ approximation in Eq. (7.9), which we
obtained from Eq. (7.7).

As mentioned right after Eq. (7.14), the relation in Eq. (7.4) is equivalent to
Eq. (7.14)/Eq. (7.23) when a takes on the value x/n. In this case the $na^2 \ll 1$
condition becomes $n(x/n)^2 \ll 1 \Longrightarrow x^2 \ll n \Longrightarrow x \ll \sqrt{n}$, which is the
condition stated in Eq. (7.5). But now we know where that condition comes
from.

• $na^3 \ll 1$

What if a isn’t small enough to satisfy $na^2 \ll 1$, but is still small enough to
satisfy $na^3 \ll 1$? In this case we need to keep the $e^{-na^2/2}$ term in Eq. (7.21),
but we can ignore the $e^{na^3/3}$ term, because it is approximately equal to $e^0 =
1$. The exponents in subsequent terms are all also essentially equal to zero,
because they are suppressed by higher powers of a. So Eq. (7.21) becomes

$$(1 + a)^n \approx e^{na} e^{-na^2/2} \qquad (\text{valid if } na^3 \ll 1) \tag{7.24}$$

We have therefore derived Eq. (7.15), which we now see is valid when $na^3 \ll
1$. A pair of numbers that doesn’t satisfy $na^2 \ll 1$ but does satisfy $na^3 \ll 1$
is n = 10,000 and a = 1/100. In this case it is a good approximation to say
that $(1 + a)^n \approx e^{na} e^{-na^2/2} = e^{100} e^{-1/2} \approx 1.6304 \cdot 10^{43}$. And indeed, the
exact value of $(1 + a)^n$ is $(1.01)^{10{,}000} \approx 1.6358 \cdot 10^{43}$, so the approximation is
smaller by only about 0.3%. The $(1 + a)^n \approx e^{na}$ approximation in Eq. (7.23)
is not a good one, being larger than the approximation in Eq. (7.24) by a factor
of $e^{1/2}$ in the present scenario.

We can continue in this manner. If a isn’t small enough to satisfy $na^3 \ll 1$,
but is still small enough to satisfy $na^4 \ll 1$, then we need to keep the $e^{na^3/3}$ term
in Eq. (7.21), but we can set the $e^{-na^4/4}$ term (and all subsequent terms) equal to
1. And so on and so forth. However, in this book we’ll never need to go beyond
the two terms in Eq. (7.15)/Eq. (7.24). Theoretically though, if, say, $n = 10^{12}$ and
a = 1/100, then we need to keep the terms in Eq. (7.21) out to the $e^{-na^6/6}$ term, but
we can ignore the $e^{na^7/7}$ terms and beyond, to a good approximation.

In any case, the rough size of the (multiplicative) error is the first term in
Eq. (7.21) that is dropped. This is true because however close the first-dropped
term is to $e^0 = 1$, all of the subsequent exponential factors are even closer to $e^0 = 1$.
In the n = 10,000 and a = 1/100 case in the third bullet point above, the multiplicative
error is roughly equal to the $e^{na^3/3}$ factor that we dropped, which in this case
equals $e^{1/300} \approx 1.0033$. This is approximately the factor by which the true answer
is larger than the approximate one.⁵ This agrees with the results we found above,
because $(1.6358)/(1.6304) \approx 1.0033$. The true answer is larger by about 0.3%
(so the approximation is smaller by about 0.3%).

If this factor of 1.0033 is close enough to 1 for whatever purpose we have in
mind, then the approximation is a good one. If it isn’t close enough to 1, then we
need to keep additional terms until it is. In the present example with n = 10,000 and
a = 1/100, if we keep the $e^{na^3/3}$ factor, then the multiplicative error is essentially
equal to the next term in Eq. (7.21), which is $e^{-na^4/4} = e^{-1/40{,}000} \approx 0.999975$. This
is approximately the factor by which the true answer is smaller than the approximate
one. The difference is only 0.0025%.
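
The claim that the multiplicative error is roughly the first dropped factor in Eq. (7.21) can be tested with a short Python sketch. The two function names below are hypothetical, and the sketch handles only the n = 10,000 and a = 1/100 example, although other pairs can be substituted.

```python
import math

def truncated_product(n, a, num_factors):
    """Approximate (1 + a)**n by the first `num_factors` exponential
    factors of Eq. (7.21)."""
    exponent = sum((-1) ** (k + 1) * n * a**k / k
                   for k in range(1, num_factors + 1))
    return math.exp(exponent)

def first_dropped_factor(n, a, num_factors):
    """The first exponential factor of Eq. (7.21) that was left out."""
    k = num_factors + 1
    return math.exp((-1) ** (k + 1) * n * a**k / k)

n, a = 10_000, 1 / 100
exact = (1.0 + a) ** n
for num_factors in [1, 2, 3]:
    approx = truncated_product(n, a, num_factors)
    # Compare the true multiplicative error with the first dropped factor.
    print(num_factors, exact / approx, first_dropped_factor(n, a, num_factors))
```

In each row the two numbers should be close to each other, and they should alternate between being smaller and larger than 1, in line with the alternating signs of the exponents.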


7.4 Appendix D: The slope of $e^x$

(Note: This Appendix is for your entertainment only. The results here won’t be 
needed anywhere in this book. But the derivation of the slope of the $e^x$ function
gives us an excuse to play around with some of the properties of e, and also to 
present some of the foundational concepts of calculus.) 


7.4.1 First derivation 

We stated in Section 7.2.4 that the slope of the $f(x) = e^x$ function at any point
equals the value of the function at that point, namely $e^x$. In the language of calculus,
this is the statement that the derivative (the slope) of $e^x$ equals itself, $e^x$. We will
now show why this is true.

⁵ The exponent here is positive, which means that the factor is slightly larger than 1. But note that half
of the terms in Eq. (7.21) have negative exponents. If one of those terms is the first one that is dropped,
then the factor is slightly smaller than 1. This is approximately the factor by which the true answer is
smaller than the approximate one.

There are two main ingredients in the derivation. The first is Eq. (7.9). To
remind ourselves that the x in that equation is assumed to be small, let’s relabel it as
$\delta$, which is a standard letter that mathematicians use for a small quantity. We then
have

$$e^{\delta} \approx 1 + \delta \qquad \text{(for small $\delta$)} \tag{7.25}$$


The second main ingredient is the strategy of finding the slope of the function
$f(x) = e^x$ (or any function, for that matter) at a given point, by first finding an
approximate slope, and by then making the approximation better and better. This
proceeds as follows.

An easy way to make an approximation to the slope of a function at a particular
value of x, say x = 2, is to find the average slope between x = 2 and a nearby point,
say, x = 2.1. The average slope of the function $f(x) = e^x$ between these two points
is


$$\text{slope} = \frac{\text{rise}}{\text{run}} = \frac{e^{2.1} - e^{2}}{0.1} \approx 7.77. \tag{7.26}$$

From Fig. 7.4, however, we see that this approximate slope is larger than the true
slope.⁶ To produce a better approximation, we can use a closer point, say x = 2.01.
And then an even better approximation can be generated with x = 2.001. These two
particular values of x yield slopes of

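$$\text{slope} = \frac{\text{rise}}{\text{run}} = \frac{e^{2.01} - e^{2}}{0.01} \approx 7.43 \qquad \text{and} \qquad \frac{e^{2.001} - e^{2}}{0.001} \approx 7.393. \tag{7.27}$$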



Figure 7.4: Better and better approximations to the true slope of a curve at a given point. 

If we kept going with smaller and smaller differences from 2, we would find
that the slopes converge to a certain value, which happens to be about 7.389, as you
can verify. It is clear from Fig. 7.4 (which, again, is just a picture of a generic-looking
curve) that the approximate slopes swing down and get closer and closer to
the actual tangent-line slope. This number of 7.389 must therefore be the slope of
the $e^x$ curve at x = 2.

⁶ The curve in this figure is an arbitrary curve and not the specific $e^x$ function, but the general features
are the same. The curve in the figure is concave upward like the $e^x$ function (although the procedure
we’re discussing is independent of this property). The reason we’re not using the actual $e^x$ function here
is that x = 2.1 is so close to x = 2 that we wouldn’t be able to see the important features.

Now, our goal here is to show that the slope of $e^x$ equals $e^x$. We just found that
the slope at x = 2 equals 7.389, so it had better be true that $e^2$ also equals 7.389.
And indeed it does. So at least in the case of x = 2, we have demonstrated that the
slope of $e^x$ equals $e^x$.
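
The shrinking-difference procedure is easy to carry out on a computer. The Python sketch below (with the hypothetical helper `average_slope`) reproduces the approximate slopes in Eqs. (7.26) and (7.27) and then continues with a couple of smaller differences.

```python
import math

def average_slope(f, x, delta):
    """Rise over run between the points x and x + delta."""
    return (f(x + delta) - f(x)) / delta

for delta in [0.1, 0.01, 0.001, 1e-4, 1e-5]:
    print(delta, average_slope(math.exp, 2.0, delta))
print("e**2 =", math.exp(2.0))
```

Don't make the difference too tiny, though. With ordinary floating-point numbers, the subtraction in the numerator eventually loses accuracy, which is a limitation of the arithmetic and not of the math.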

Having learned how to determine the slope at the specific value of x = 2, we can
now address the case of general x. To find the slope, we can imagine taking a small
number $\delta$ and calculating the average slope between x and $x + \delta$ (as we did with 2
and 2.1), and then letting $\delta$ become smaller and smaller. Written out explicitly, the
formal definition of the slope of a general function f(x) at the value x is

$$\text{slope} = \frac{\text{rise}}{\text{run}} = \lim_{\delta \to 0} \left(\frac{f(x + \delta) - f(x)}{\delta}\right). \tag{7.28}$$


This might look a little scary, but it’s simply saying with an equation what Fig. 7.4 
says with a picture: you can get a better and better approximation to the slope by 
looking at the average slope between two points and having the points get closer 
and closer together. 

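For the case at hand where our function $f(x)$ is $e^x$, we have (with the
understanding that we’re concerned with the $\delta \to 0$ limit in all of these steps)

$$\begin{aligned}
\text{slope} = \frac{\text{rise}}{\text{run}} &= \frac{e^{x+\delta} - e^{x}}{\delta} \\
&= e^{x}\left(\frac{e^{\delta} - 1}{\delta}\right) && \text{(factoring out } e^{x}\text{)} \\
&\approx e^{x}\left(\frac{(1 + \delta) - 1}{\delta}\right) && \text{(using Eq. (7.25))} \\
&= e^{x}\left(\frac{\delta}{\delta}\right) \\
&= e^{x},
\end{aligned} \tag{7.29}$$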

as we wanted to show. Since we’re concerned with the $\delta \to 0$ limit (that’s how the
true slope is obtained), the “$\approx$” sign in the third line becomes an “=” sign. So we
are correct in saying that the slope of the $e^x$ curve is exactly equal to $e^x$.

Note that Eq. (7.25) was critical in this derivation. Eq. (7.29) holds only for
the special number $e$, because the $e^\delta \approx 1 + \delta$ result from Eq. (7.25) that we used
in the third line holds only for $e$. The slope of, say, $2^x$ is not equal to $2^x$, because
Eq. (7.25) doesn’t hold if $e$ is replaced by 2 (or any other number).

Given that we’re concerned with the $\delta \to 0$ limit, you might be worried about
having a $\delta$ in the denominator in Eq. (7.29), since division by zero isn’t allowed.
But there is also a $\delta$ in the numerator, so you can cancel them first, and then take
the $\delta \to 0$ limit, which is trivial because no $\delta$’s remain.
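To see why the base $e$ is special, it can help to look at the factor $(a^\delta - 1)/\delta$ that multiplies $a^x$ after factoring, for $a = e$ and for $a = 2$. The sketch below uses a hypothetical helper, `base_factor`, introduced only for illustration. For $a = e$ the factor should approach 1. For $a = 2$ it should settle near 0.693, which is the natural logarithm of 2 (a fact not derived in this appendix), so the slope of $2^x$ comes out as roughly $0.693 \cdot 2^x$ rather than $2^x$.

```python
import math

def base_factor(a, delta):
    """The factor (a^delta - 1) / delta that multiplies a^x in the average slope of a^x."""
    return (a**delta - 1.0) / delta

for delta in (0.1, 0.01, 0.001, 1e-6):
    print(f"delta = {delta:<8g} base e: {base_factor(math.e, delta):.6f}   "
          f"base 2: {base_factor(2.0, delta):.6f}")
```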
7.4.2 Second derivation


In the above derivation, we introduced the strategy of finding the slope by calculating
approximate slopes involving $x$ values that differ by a small number $\delta$. Let’s use
this strategy to find the slope of a general power-law function, $f(x) = x^n$, where $n$
is a nonnegative integer. (Note that $x$ is now the number being raised to a power, as
opposed to the power itself, as was the case with $e^x$.) We’ll then use this result to
give an alternative derivation of the fact that the slope of $e^x$ equals itself, $e^x$.

We claim that for any value of $x$, the slope of the function $x^n$ is given by:

$$\text{slope of } x^n \text{ equals } nx^{n-1}. \tag{7.30}$$


(In the language of calculus, this is the statement that the derivative of $x^n$ equals
$nx^{n-1}$.) You can quickly verify Eq. (7.30) for the cases of $n = 0$ and $n = 1$, where
the slopes are 0 and 1. To demonstrate Eq. (7.30) for a general nonnegative integer
$n$ (although it actually holds for any $n$), we can (as we did in the first derivation
above) find the average slope between $x$ and $x + \delta$, where $\delta$ is small. We can then
find the true slope by taking the $\delta \to 0$ limit; see Eq. (7.28). To get a feel for what’s
going on, let’s start with a specific value of $n$, say, $n = 2$. In the same manner as
in the first derivation, we have (using Eq. (7.28) along with our trusty friend, the
binomial expansion)

$$\begin{aligned}
\text{slope} = \frac{\text{rise}}{\text{run}} &= \frac{(x+\delta)^2 - x^2}{\delta} \\
&= \frac{(x^2 + 2x\delta + \delta^2) - x^2}{\delta} \\
&= \frac{2x\delta + \delta^2}{\delta} \\
&= 2x + \delta.
\end{aligned} \tag{7.31}$$


If we now take the $\delta \to 0$ limit, the $\delta$ term goes away, leaving us with only the
$2x$ term. So we have shown that the slope of the $x^2$ function equals $2x$, which is
consistent with the $nx^{n-1}$ expression in Eq. (7.30).

Let’s try the same thing with $n = 3$. Again using the binomial expansion, we
have

$$\begin{aligned}
\text{slope} = \frac{\text{rise}}{\text{run}} &= \frac{(x+\delta)^3 - x^3}{\delta} \\
&= \frac{(x^3 + 3x^2\delta + 3x\delta^2 + \delta^3) - x^3}{\delta} \\
&= \frac{3x^2\delta + 3x\delta^2 + \delta^3}{\delta} \\
&= 3x^2 + 3x\delta + \delta^2.
\end{aligned} \tag{7.32}$$

When we take the $\delta \to 0$ limit, both of the $3x\delta$ and $\delta^2$ terms go away, leaving us
with only the $3x^2$ term. Basically, anything with a $\delta$ in it goes away when we take
the $\delta \to 0$ limit. So we have shown that the slope of the $x^3$ function equals $3x^2$,
which is again consistent with the $nx^{n-1}$ expression in Eq. (7.30).
You can see how this works for the case of general $n$. The goal is to calculate

$$\text{slope} = \frac{\text{rise}}{\text{run}} = \frac{(x+\delta)^n - x^n}{\delta}. \tag{7.33}$$


Using the binomial expansion, the expressions for $(x + \delta)^n$ for the first few values
of $n$ are (you’ll see below why we’ve added the parentheses in the second terms on
the righthand side):

$$\begin{aligned}
(x+\delta)^0 &= x^0, \\
(x+\delta)^1 &= x^1 + (1)\delta, \\
(x+\delta)^2 &= x^2 + (2x)\delta + \delta^2, \\
(x+\delta)^3 &= x^3 + (3x^2)\delta + 3x\delta^2 + \delta^3, \\
(x+\delta)^4 &= x^4 + (4x^3)\delta + 6x^2\delta^2 + 4x\delta^3 + \delta^4, \\
(x+\delta)^5 &= x^5 + (5x^4)\delta + 10x^3\delta^2 + 10x^2\delta^3 + 5x\delta^4 + \delta^5.
\end{aligned} \tag{7.34}$$


When we substitute these expressions into Eq. (7.33), the first term disappears when
we subtract off the $x^n$. Then when we perform the division by $\delta$, we reduce the
power of $\delta$ by 1 in every term. So at this stage, for each of the expansions in
Eq. (7.34), the first term has disappeared, the second term involves no $\delta$’s, and the
third and higher terms involve at least one power of $\delta$. Therefore, when we take
the $\delta \to 0$ limit, the third and higher terms all go to zero, so we’re left with only
the second term (without the $\delta$). In other words, in each line of Eq. (7.34) we’re
left with only the term in the parentheses. And this term has the form of $nx^{n-1}$, as
desired. We have therefore proved Eq. (7.30). The multiplicative factor of $n$ here is
simply the $\binom{n}{1}$ binomial coefficient, because the general form of all of the $(x + \delta)^n$
expansions in Eq. (7.34) is

$$(x+\delta)^n = \binom{n}{0}x^n + \binom{n}{1}x^{n-1}\delta + \binom{n}{2}x^{n-2}\delta^2 + \cdots. \tag{7.35}$$
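The argument above can be traced numerically. The sketch below (the function name `average_slope_terms` is our own) builds the terms of $[(x+\delta)^n - x^n]/\delta$ directly from the binomial coefficients, so that the first entry is the $nx^{n-1}$ term and every later entry carries at least one power of $\delta$. The values $n = 5$ and $x = 2$ are arbitrary choices for illustration.

```python
from math import comb

def average_slope_terms(n, x, delta):
    """Terms of ((x + delta)^n - x^n) / delta from the binomial expansion.

    Term k (for k = 1..n) is C(n, k) * x^(n-k) * delta^(k-1). The k = 0 term,
    x^n, has already cancelled against the subtracted x^n.
    """
    return [comb(n, k) * x ** (n - k) * delta ** (k - 1) for k in range(1, n + 1)]

n, x = 5, 2.0
for delta in (0.1, 0.001, 1e-6):
    terms = average_slope_terms(n, x, delta)
    print(f"delta = {delta:<8g} average slope = {sum(terms):.6f}")
print("n * x^(n-1) =", n * x ** (n - 1))
```

As delta shrinks, only the first term survives, and the printed average slope should approach $n x^{n-1}$.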


We can now provide a second derivation of the fact that the slope of $e^x$ equals
itself, $e^x$. This derivation involves writing $e^x$ in the form given in Eq. (7.7), which
we’ll copy here:

$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \cdots. \tag{7.36}$$

We’ll find the slope of $e^x$ by applying Eq. (7.30) to each of these $x^n/n!$ terms.


Remark: In order to make use of Eq. (7.36) in our derivation, we’ll need to demonstrate that
the slope of the sum of two functions equals the sum of the slopes of the two functions. And
also that the “two” here can be replaced by any number. This might seem perfectly believable
and not necessary to prove, but let’s prove it anyway. We’re setting off this proof in a remark,
in case you want to ignore it.

Consider a function $F(x)$ that equals the sum of two other functions: $F(x) = f_1(x) + f_2(x)$.
We claim that the slope of $F(x)$ at a particular value of $x$ is the sum of the slopes
of $f_1(x)$ and $f_2(x)$ at that value of $x$. This follows from the expression for the slope in
Eq. (7.28). We have

$$\begin{aligned}
\text{slope of } F(x) = \frac{\text{rise}}{\text{run}} &= \lim_{\delta \to 0} \left( \frac{F(x+\delta) - F(x)}{\delta} \right) \\
&= \lim_{\delta \to 0} \left( \frac{\big(f_1(x+\delta) + f_2(x+\delta)\big) - \big(f_1(x) + f_2(x)\big)}{\delta} \right) \\
&= \lim_{\delta \to 0} \left( \frac{f_1(x+\delta) - f_1(x)}{\delta} \right) + \lim_{\delta \to 0} \left( \frac{f_2(x+\delta) - f_2(x)}{\delta} \right) \\
&= \big(\text{slope of } f_1(x)\big) + \big(\text{slope of } f_2(x)\big).
\end{aligned} \tag{7.37}$$

The main point here is that in the third line we grouped the $f_1$ terms together, and likewise
the $f_2$ terms. We can do this with any number of functions, of course, so that’s why the above
“two” can be replaced with any number. We can even have an infinite number of terms, as is
the case in Eq. (7.36). *


We now know that the slope of $e^x$ equals the sum of the slopes of all the terms
in Eq. (7.36), of which there are an infinite number. And Eq. (7.30) tells us how to
find the slope of each term. Let’s look at the first few.

The slope of the first term in Eq. (7.36) (the 1) is zero. The slope of the second
term (the $x$) is 1. The slope of the third term (the $x^2/2!$) is $(2x)/2! = x$. The
slope of the fourth term (the $x^3/3!$) is $(3x^2)/3! = x^2/2!$. For the third and fourth
terms, we have used the fact that if $A$ is a numerical constant, then the slope of $Ax^n$
equals $Anx^{n-1}$. This quickly follows from Eq. (7.28), because the $A$ can be factored
outside the parentheses.

We see that when finding the slope, each term in Eq. (7.36) turns into the preceding
one; this is due to the factorials in the denominators. So the infinite series
that arises after finding the slope is the same as the original infinite series. In other
words, the derivative of $e^x$ equals itself, $e^x$. Written out explicitly, we have


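$$\begin{aligned}
\text{Slope of } e^x &= \text{Slope of } \left( 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \cdots \right) \\
&= 0 + 1 + \frac{2x}{2!} + \frac{3x^2}{3!} + \frac{4x^3}{4!} + \cdots \\
&= 0 + 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots \\
&= e^x,
\end{aligned} \tag{7.38}$$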


as we wanted to show. 
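The term-by-term argument can also be checked with a truncated version of the series. The sketch below keeps the first 20 terms (an arbitrary cutoff chosen for illustration, with helper names of our own) and applies the $nx^{n-1}$ rule to each one. Each slope term should match the preceding term of the original series, and the sum of the slope terms should land close to $e^x$.

```python
import math

def exp_series_terms(x, n_terms):
    """First n_terms terms x^n / n! of the series for e^x."""
    return [x**n / math.factorial(n) for n in range(n_terms)]

def slope_series_terms(x, n_terms):
    """Slope of each of those terms: 0 for the constant 1, then n x^(n-1) / n!."""
    return [0.0] + [n * x ** (n - 1) / math.factorial(n) for n in range(1, n_terms)]

x, n_terms = 2.0, 20
original = exp_series_terms(x, n_terms)
slopes = slope_series_terms(x, n_terms)

# Each slope term equals the preceding term of the original series.
shifted = all(math.isclose(slopes[k], original[k - 1]) for k in range(1, n_terms))
print("terms shift down by one:", shifted)
print("sum of slope terms:", sum(slopes))
print("e^x:               ", math.exp(x))
```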

The slope (the derivative) of a function $f(x)$ is commonly written as $df/dx$ or
$df(x)/dx$, where the $d$’s indicate infinitesimal (that is, extremely small) changes.
The reason for this notation is the following. The numerator in Eq. (7.28) is the
change in the function $f$ between two $x$ values (namely $x$ and $x + \delta$). The denominator
is the change in the $x$ value. The Greek letter $\Delta$ is generally used to denote the
change in a quantity, so we can write the quotient in Eq. (7.28) as $\Delta f/\Delta x$, where $\Delta x$
is simply the $\delta$ that we have been using. To find the slope as prescribed by Eq. (7.28),
we still need to take the $\delta \to 0$ (or equivalently, the $\Delta x \to 0$) limit. Mathematicians
reserve the letter $d$ for this purpose. While a $\Delta$ can stand for a change of any size, a
$d$ is used when it is understood that the change is infinitesimally small. So we have

$$\text{slope} = \lim_{\Delta x \to 0} \frac{\Delta f}{\Delta x} \equiv \frac{df}{dx}. \tag{7.39}$$


This is just the rise over run, where $df$ is the infinitesimal rise, and $dx$ is the
corresponding infinitesimal run. Both of these quantities are essentially zero, but their
ratio (which is the slope) is generally nonzero. In the derivative notation, our above
results are

$$\frac{d(e^x)}{dx} = e^x \qquad \text{and} \qquad \frac{d(x^n)}{dx} = nx^{n-1}. \tag{7.40}$$


7.5 Appendix E: Important results 

This appendix includes all of the main results in the book. More commentary can 
be found in the Summary section in each chapter. 

Chapter 1 


Permutations: $P_N = N!$

Ordered sets, with repetition: $N^n$

Ordered sets, without repetition: ${}_N P_n = \dfrac{N!}{(N-n)!}$

Unordered sets, without repetition: ${}_N C_n \equiv \dbinom{N}{n} = \dfrac{N!}{n!\,(N-n)!}$

Unordered sets, with repetition: ${}_N U_n = \dbinom{n + (N-1)}{N-1}$

Chapter 2 


Equally likely outcomes: $p = \dfrac{\text{number of desired outcomes}}{\text{total number of possible outcomes}}$

Dependent events: $P(A \text{ and } B) = P(A) \cdot P(B|A)$

Independent events: $P(A \text{ and } B) = P(A) \cdot P(B)$

Nonexclusive events: $P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$

Exclusive events: $P(A \text{ or } B) = P(A) + P(B)$

Independence: $P(B|A) = P(B)$ or $P(A|B) = P(A)$ or $P(A \text{ and } B) = P(A) \cdot P(B)$

Bayes’ theorem (general form): $P(A_k|Z) = \dfrac{P(Z|A_k) \cdot P(A_k)}{\sum_i P(Z|A_i) \cdot P(A_i)}$

Stirling’s formula: $n! \approx n^n e^{-n} \sqrt{2\pi n}$
Chapter 3

Expectation value: $E(X) = p_1 x_1 + p_2 x_2 + \cdots + p_m x_m$
For arbitrary variables: $E(X + Y) = E(X) + E(Y)$
For independent variables: $E(XY) = E(X) \cdot E(Y)$

Standard deviation: $\sigma_X = \sqrt{E[(X - \mu)^2]} = \sqrt{E(X^2) - \mu^2}$
For independent variables: $\sigma_{X+Y}^2 = \sigma_X^2 + \sigma_Y^2$
Biased coin: $\sigma_{\text{Heads}} = \sqrt{np(1-p)} = \sqrt{npq}$
Standard deviation of the mean: $\sigma_{\text{avg}} = \dfrac{\sigma}{\sqrt{n}}$

Variance: $\text{Var}(X) = E[(X - \mu)^2] = E(X^2) - \mu^2$
For independent variables: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$
Biased coin: $\text{Var}(\text{Heads}) = npq$

Variance of a set of numbers: $\tilde{s}^2 \equiv \text{Var}(S) = \dfrac{1}{n} \sum (x_i - \bar{x})^2$

Sample variance: $s^2 = \dfrac{1}{n-1} \sum (x_i - \bar{x})^2$
Chapter 4

Binomial distribution: $P(k) = \dbinom{n}{k} p^k (1-p)^{n-k}$
Exponential distribution: $\rho(t) = \dfrac{e^{-t/\tau}}{\tau}$ or $\lambda e^{-\lambda t}$
Poisson distribution: $P(k) = \dfrac{a^k e^{-a}}{k!}$
Gaussian distribution: $f(x) = \dfrac{e^{-(x-\mu)^2/2\sigma^2}}{\sqrt{2\pi\sigma^2}}$


Chapter 5

Gaussian approx to binomial: $\dfrac{e^{-x^2/[2np(1-p)]}}{\sqrt{2\pi np(1-p)}}$

Gaussian approx to Poisson: $\dfrac{e^{-x^2/2a}}{\sqrt{2\pi a}}$
Chapter 6

Linear relation: $Y = mX + Z$
Covariance: $\text{Cov}(X,Y) = E[(X - \mu_x)(Y - \mu_y)]$
For data points: $\text{Cov}(x,y) = \dfrac{1}{n} \sum (x_i - \bar{x})(y_i - \bar{y})$
Correlation coefficient:
For data points:
Improvement of prediction:
Probability density $\rho(x,y)$:
Lower regression line slope:
Upper regression line slope:
Average retest score:
Slope of least-squares line:
Chapter 7

Euler’s number: $e = \lim_{n \to \infty} \left(1 + \dfrac{1}{n}\right)^n \approx 2.71828$

Taylor series for $e^x$: $e^x = 1 + x + \dfrac{x^2}{2!} + \dfrac{x^3}{3!} + \dfrac{x^4}{4!} + \cdots$

For small $x$: $e^x \approx 1 + x$
An approximation:
A better approximation:
Two derivatives: $\dfrac{d(e^x)}{dx} = e^x$, $\dfrac{d(x^n)}{dx} = nx^{n-1}$
Chapter 1

Factorial: $N! = 1 \cdot 2 \cdot 3 \cdots (N-1) \cdot N$
Permutations: $P_N = N!$
Ordered subgroups: ${}_N P_n = \dfrac{N!}{(N-n)!}$
Unordered subgroups: ${}_N C_n = \dfrac{N!}{n!\,(N-n)!}$
Binomial coefficient: $\dbinom{N}{n} = \dfrac{N!}{n!\,(N-n)!}$
Unordered sets with repetitions: ${}_N U_n = \dbinom{n + (N-1)}{N-1}$
Chapter 2

Probability: $p$
Probability of event $A$: $P(A)$
Intersection (joint) probability: $P(A \text{ and } B)$, $P(A \cap B)$
Conditional probability: $P(B|A)$
Union probability: $P(A \text{ or } B)$, $P(A \cup B)$
Not $A$: $\sim\! A$
Chapter 3

Random variable: $X$ (uppercase)
Value of random variable: $x$ (lowercase)
Expectation value: $E(X)$, $\mu_X$, $\mu_x$, $\mu$
Standard deviation: $\sigma_X$, $\sigma$
Standard deviation of the mean: $\sigma_{\text{avg}}$
Variance: $\text{Var}(X)$, $\sigma_X^2$, $\sigma^2$
Set of numbers: $S$
Mean of set $S$: $\bar{x}$, $\langle x \rangle \equiv \dfrac{1}{n} \sum x_i$
Variance of set $S$: $\text{Var}(S)$, $\tilde{s}^2 \equiv \dfrac{1}{n} \sum (x_i - \bar{x})^2$
Sample variance of set $S$: $s^2 \equiv \dfrac{1}{n-1} \sum (x_i - \bar{x})^2$
Chapter 4

Probability: $P(x)$ (uppercase)
Probability density: $\rho(x)$, $f(x)$, etc. (lowercase)
Much greater than (multiplicatively): $\gg$
Much less than (multiplicatively): $\ll$
Approximately equal (multiplicatively): $\approx$

Chapter 5

Number of trials in an experiment: $n_t$
Number of sets of $n_t$ trials: $n_s$

Chapter 6

Slope of (lower) regression line: $m$
Correlation coefficient: $r$
Covariance of distribution: $\text{Cov}(X,Y)$
Covariance of data points: $\text{Cov}(x,y)$
Joint probability density: $\rho(x,y)$
Slope of least-squares line: $A$
y-intercept of least-squares line: $B$

Chapter 7

Euler’s number: $e \approx 2.71828$
Derivative (slope) of $f(x)$: $\dfrac{df(x)}{dx}$
Introduction

Why I hated calculus but love statistics 


I have always had an uncomfortable relationship with math. I don’t like 
numbers for the sake of numbers. I am not impressed by fancy formulas 
that have no real-world application. I particularly disliked high school 
calculus for the simple reason that no one ever bothered to tell me why I 
needed to learn it. What is the area beneath a parabola? Who cares? 

In fact, one of the great moments of my life occurred during my senior 
year of high school, at the end of the first semester of Advanced Placement 
Calculus. I was working away on the final exam, admittedly less prepared 
for the exam than I ought to have been. (I had been accepted to my first- 
choice college a few weeks earlier, which had drained away what little 
motivation I had for the course.) As I stared at the final exam questions, 
they looked completely unfamiliar. I don’t mean that I was having trouble 
answering the questions. I mean that I didn’t even recognize what was 
being asked. I was no stranger to being unprepared for exams, but, to 
paraphrase Donald Rumsfeld, I usually knew what I didn’t know. This 
exam looked even more Greek than usual. I flipped through the pages of the 
exam for a while and then more or less surrendered. I walked to the front of 
the classroom, where my calculus teacher, whom we’ll call Carol Smith, 
was proctoring the exam. “Mrs. Smith,” I said, “I don’t recognize a lot of 
the stuff on the test.” 

Suffice it to say that Mrs. Smith did not like me a whole lot more than I 
liked her. Yes, I can now admit that I sometimes used my limited powers as 
student association president to schedule all-school assemblies just so that 
Mrs. Smith’s calculus class would be canceled. Yes, my friends and I did 
have flowers delivered to Mrs. Smith during class from “a secret admirer” 
just so that we could chortle away in the back of the room as she looked
around in embarrassment. And yes, I did stop doing any homework at all
once I got in to college.

So when I walked up to Mrs. Smith in the middle of the exam and said 
that the material did not look familiar, she was, well, unsympathetic. 
“Charles,” she said loudly, ostensibly to me but facing the rows of desks to 
make certain that the whole class could hear, “if you had studied, the 
material would look a lot more familiar.” This was a compelling point. 

So I slunk back to my desk. After a few minutes, Brian Arbetter, a far 
better calculus student than I, walked to the front of the room and 
whispered a few things to Mrs. Smith. She whispered back and then a truly 
extraordinary thing happened. “Class, I need your attention,” Mrs. Smith 
announced. “It appears that I have given you the second semester exam by 
mistake.” We were far enough into the test period that the whole exam had 
to be aborted and rescheduled. 

I cannot fully describe my euphoria. I would go on in life to marry a 
wonderful woman. We have three healthy children. I’ve published books 
and visited places like the Taj Mahal and Angkor Wat. Still, the day that my 
calculus teacher got her comeuppance is a top five life moment. (The fact 
that I nearly failed the makeup final exam did not significantly diminish this 
wonderful life experience.) 

The calculus exam incident tells you much of what you need to know 
about my relationship with mathematics—but not everything. Curiously, I 
loved physics in high school, even though physics relies very heavily on the 
very same calculus that I refused to do in Mrs. Smith’s class. Why? Because 
physics has a clear purpose. I distinctly remember my high school physics 
teacher showing us during the World Series how we could use the basic 
formula for acceleration to estimate how far a home run had been hit. That’s 
cool—and the same formula has many more socially significant 
applications. 

Once I arrived in college, I thoroughly enjoyed probability, again because 
it offered insight into interesting real-life situations. In hindsight, I now 
recognize that it wasn’t the math that bothered me in calculus class; it was 
that no one ever saw fit to explain the point of it. If you’re not fascinated by 
the elegance of formulas alone—which I am most emphatically not—then it 
is just a lot of tedious and mechanistic formulas, at least the way it was 
taught to me. 



That brings me to statistics (which, for the purposes of this book, 
includes probability). I love statistics. Statistics can be used to explain 
everything from DNA testing to the idiocy of playing the lottery. Statistics 
can help us identify the factors associated with diseases like cancer and 
heart disease; it can help us spot cheating on standardized tests. Statistics 
can even help you win on game shows. There was a famous program during 
my childhood called Let’s Make a Deal, with its equally famous host, 
Monty Hall. At the end of each day’s show, a successful player would stand 
with Monty facing three big doors: Door no. 1, Door no. 2, and Door no. 3. 
Monty Hall explained to the player that there was a highly desirable prize 
behind one of the doors—something like a new car—and a goat behind the 
other two. The idea was straightforward: the player chose one of the doors 
and would get the contents behind that door. 

As each player stood facing the doors with Monty Hall, he or she had a 1 
in 3 chance of choosing the door that would be opened to reveal the 
valuable prize. But Let’s Make a Deal had a twist, which has delighted 
statisticians ever since (and perplexed everyone else). After the player 
chose a door, Monty Hall would open one of the two remaining doors, 
always revealing a goat. For the sake of example, assume that the player has 
chosen Door no. 1. Monty would then open Door no. 3; the live goat would 
be standing there on stage. Two doors would still be closed, nos. 1 and 2. If 
the valuable prize was behind no. 1, the contestant would win; if it was 
behind no. 2, he would lose. But then things got more interesting: Monty 
would turn to the player and ask whether he would like to change his mind 
and switch doors (from no. 1 to no. 2 in this case). Remember, both doors 
were still closed, and the only new information the contestant had received 
was that a goat showed up behind one of the doors that he didn’t pick. 

Should he switch? 

The answer is yes. Why? That’s in Chapter 5½.
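The reasoning is left for that later chapter, but the claim itself can be tested by brute force. Here is a minimal simulation sketch of the game as described above; the function names, the door numbering from 0 to 2, the number of trials, and the fixed seed are all our own choices. It estimates how often a player wins by always switching versus always staying.

```python
import random

def play(switch, rng):
    """Play one round with doors 0, 1, 2; return True if the player wins the prize."""
    prize = rng.randrange(3)
    choice = rng.randrange(3)
    # Monty opens a door that is neither the player's choice nor the prize door.
    opened = rng.choice([d for d in range(3) if d != choice and d != prize])
    if switch:
        # Move to the one remaining closed door.
        choice = next(d for d in range(3) if d != choice and d != opened)
    return choice == prize

def win_rate(switch, trials=100_000, seed=0):
    """Fraction of rounds won under a fixed strategy (always switch or always stay)."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials

print("always switch:", win_rate(True))
print("always stay:  ", win_rate(False))
```

Under these assumptions the switching strategy should win roughly twice as often as staying.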

The paradox of statistics is that they are everywhere—from batting 
averages to presidential polls—but the discipline itself has a reputation for 
being uninteresting and inaccessible. Many statistics books and classes are 
overly laden with math and jargon. Believe me, the technical details are 
crucial (and interesting)—but it’s just Greek if you don’t understand the 
intuition. And you may not even care about the intuition if you’re not
convinced that there is any reason to learn it. Every chapter in this book
promises to answer the basic question that I asked (to no effect) of my high 
school calculus teacher: What is the point of this? 

This book is about the intuition. It is short on math, equations, and 
graphs; when they are used, I promise that they will have a clear and 
enlightening purpose. Meanwhile, the book is long on examples to convince 
you that there are great reasons to learn this stuff. Statistics can be really 
interesting, and most of it isn't that difficult. 

The idea for this book was born not terribly long after my unfortunate 
experience in Mrs. Smith’s AP Calculus class. I went to graduate school to 
study economics and public policy. Before the program even started, I was 
assigned (not surprisingly) to “math camp” along with the bulk of my 
classmates to prepare us for the quantitative rigors that were to follow. For 
three weeks, we learned math all day in a windowless, basement classroom 
(really). 

On one of those days, I had something very close to a career epiphany. 
Our instructor was trying to teach us the circumstances under which the 
sum of an infinite series converges to a finite number. Stay with me here for 
a minute because this concept will become clear. (Right now you’re 
probably feeling the way I did in that windowless classroom.) An infinite 
series is a pattern of numbers that goes on forever, such as 1 + ½ + ¼ + ⅛ . . .
The three dots mean that the pattern continues to infinity.

This is the part we were having trouble wrapping our heads around. Our 
instructor was trying to convince us, using some proof I’ve long since 
forgotten, that a series of numbers can go on forever and yet still add up 
(roughly) to a finite number. One of my classmates, Will Warshauer, would
have none of it, despite the impressive mathematical proof. (To be honest, I 
was a bit skeptical myself.) How can something that is infinite add up to 
something that is finite? 

Then I got an inspiration, or more accurately, the intuition of what the 
instructor was trying to explain. I turned to Will and talked him through 
what I had just worked out in my head. Imagine that you have positioned 
yourself exactly 2 feet from a wall. 

Now move half the distance to that wall (1 foot), so that you are left 
standing 1 foot away. 



From 1 foot away, move half the distance to the wall once again (6
inches, or ½ a foot). And from 6 inches away, do it again (move 3 inches,
or ¼ of a foot). Then do it again (move 1½ inches, or ⅛ of a foot). And so
on.

You will gradually get pretty darn close to the wall. (For example, when
you are 1/1024th of an inch from the wall, you will move half the distance,
or another 1/2048th of an inch.) But you will never hit the wall, because by
definition each move takes you only half the remaining distance. In other
words, you will get infinitely close to the wall but never hit it. If we
measure your moves in feet, the series can be described as 1 + ½ + ¼ + ⅛ . . .

Therein lies the insight: Even though you will continue moving forever 
—with each move taking you half the remaining distance to the wall—the 
total distance you travel can never be more than 2 feet, which is your 
starting distance from the wall. For mathematical purposes, the total 
distance you travel can be approximated as 2 feet, which turns out to be 
very handy for computation purposes. A mathematician would say that the 
sum of this infinite series 1 ft + ½ ft + ¼ ft + ⅛ ft . . . converges to 2 feet,
which is what our instructor was trying to teach us that day. 
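The walk toward the wall is easy to replay in a few lines. This sketch (the function name and the chosen numbers of moves are ours) adds up the distance covered after a given number of moves, each move being half of whatever distance remains.

```python
def distance_travelled(moves, start=2.0):
    """Total distance covered after `moves` steps, each half the remaining distance."""
    remaining, total = start, 0.0
    for _ in range(moves):
        step = remaining / 2
        total += step
        remaining -= step
    return total

for moves in (1, 2, 3, 4, 10, 20):
    print(moves, distance_travelled(moves))
```

The totals should creep toward 2 feet without reaching it, which is the behavior the paragraph above describes.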

The point is that I convinced Will. I convinced myself. I can’t remember 
the math proving that the sum of an infinite series can converge to a finite 
number, but I can always look that up online. And when I do, it will 
probably make sense. In my experience, the intuition makes the math and 
other technical details more understandable—but not necessarily the other 
way around. 

The point of this book is to make the most important statistical concepts 
more intuitive and more accessible, not just for those of us forced to study 
them in windowless classrooms but for anyone interested in the 
extraordinary power of numbers and data. 

Now, having just made the case that the core tools of statistics are less 
intuitive and accessible than they ought to be, I’m going to make a 
seemingly contradictory point: Statistics can be overly accessible in the 
sense that anyone with data and a computer can do sophisticated statistical 
procedures with a few keystrokes. The problem is that if the data are poor, 
or if the statistical techniques are used improperly, the conclusions can be
wildly misleading and even potentially dangerous. Consider the following
hypothetical Internet news flash: People Who Take Short Breaks at Work 
Are Far More Likely to Die of Cancer. Imagine that headline popping up 
while you are surfing the Web. According to a seemingly impressive study 
of 36,000 office workers (a huge data set!), those workers who reported 
leaving their offices to take regular ten-minute breaks during the workday 
were 41 percent more likely to develop cancer over the next five years than 
workers who don’t leave their offices during the workday. Clearly we need 
to act on this kind of finding—perhaps some kind of national awareness 
campaign to prevent short breaks on the job. 

Or maybe we just need to think more clearly about what many workers 
are doing during that ten-minute break. My professional experience 
suggests that many of those workers who report leaving their offices for 
short breaks are huddled outside the entrance of the building smoking 
cigarettes (creating a haze of smoke through which the rest of us have to 
walk in order to get in or out). I would further infer that it’s probably the 
cigarettes, and not the short breaks from work, that are causing the cancer. 
I’ve made up this example just so that it would be particularly absurd, but I 
can assure you that many real-life statistical abominations are nearly this 
absurd once they are deconstructed. 

Statistics is like a high-caliber weapon: helpful when used correctly and 
potentially disastrous in the wrong hands. This book will not make you a 
statistical expert; it will teach you enough care and respect for the field that 
you don’t do the statistical equivalent of blowing someone’s head off. 

This is not a textbook, which is liberating in terms of the topics that have 
to be covered and the ways in which they can be explained. The book has 
been designed to introduce the statistical concepts with the most relevance 
to everyday life. How do scientists conclude that something causes cancer? 
How does polling work (and what can go wrong)? Who “lies with 
statistics,” and how do they do it? How does your credit card company use 
data on what you are buying to predict if you are likely to miss a payment? 
(Seriously, they can do that.) 

If you want to understand the numbers behind the news and to appreciate 
the extraordinary (and growing) power of data, this is the stuff you need to 
know. In the end, I hope to persuade you of the observation first made by
Swedish mathematician and writer Andrejs Dunkels: It’s easy to lie with
statistics, but it’s hard to tell the truth without them. 

But I have even bolder aspirations than that. I think you might actually 
enjoy statistics. The underlying ideas are fabulously interesting and 
relevant. The key is to separate the important ideas from the arcane 
technical details that can get in the way. That is Naked Statistics. 



CHAPTER 1 


What’s the Point? 


I’ve noticed a curious phenomenon. Students will complain that statistics is 
confusing and irrelevant. Then the same students will leave the classroom 
and happily talk over lunch about batting averages (during the summer) or 
the windchill factor (during the winter) or grade point averages (always). 
They will recognize that the National Football League’s “passer rating”—a 
statistic that condenses a quarterback’s performance into a single number— 
is a somewhat flawed and arbitrary measure of a quarterback’s game day 
performance. The same data (completion rate, average yards per pass 
attempt, percentage of touchdown passes per pass attempt, and interception 
rate) could be combined in a different way, such as giving greater or lesser 
weight to any of those inputs, to generate a different but equally credible 
measure of performance. Yet anyone who has watched football recognizes 
that it’s handy to have a single number that can be used to encapsulate a 
quarterback’s performance. 

Is the quarterback rating perfect? No. Statistics rarely offers a single 
“right” way of doing anything. Does it provide meaningful information in 
an easily accessible way? Absolutely. It’s a nice tool for making a quick 
comparison between the performances of two quarterbacks on a given day. I 
am a Chicago Bears fan. During the 2011 playoffs, the Bears played the 
Packers; the Packers won. There are a lot of ways I could describe that 
game, including pages and pages of analysis and raw data. But here is a 
more succinct analysis. Chicago Bears quarterback Jay Cutler had a passer 
rating of 31.8. In contrast, Green Bay quarterback Aaron Rodgers had a 
passer rating of 55.4. Similarly, we can compare Jay Cutler’s performance 
to that in a game earlier in the season against Green Bay, when he had a 
passer rating of 85.6. That tells you a lot of what you need to know in order 
to understand why the Bears beat the Packers earlier in the season but lost 
to them in the playoffs. 


That is a very helpful synopsis of what happened on the field. Does it 
simplify things? Yes, that is both the strength and the weakness of any 
descriptive statistic. One number tells you that Jay Cutler was outgunned by 
Aaron Rodgers in the Bears’ playoff loss. On the other hand, that number 
won’t tell you whether a quarterback had a bad break, such as throwing a 
perfect pass that was bobbled by the receiver and then intercepted, or
whether he “stepped up” on certain key plays (since every completion is 
weighted the same, whether it is a crucial third down or a meaningless play 
at the end of the game), or whether the defense was terrible. And so on. 

The curious thing is that the same people who are perfectly comfortable 
discussing statistics in the context of sports or the weather or grades will 
seize up with anxiety when a researcher starts to explain something like the 
Gini index, which is a standard tool in economics for measuring income 
inequality. I’ll explain what the Gini index is in a moment, but for now the 
most important thing to recognize is that the Gini index is just like the 
passer rating. It’s a handy tool for collapsing complex information into a 
single number. As such, it has the strengths of most descriptive statistics, 
namely that it provides an easy way to compare the income distribution in 
two countries, or in a single country at different points in time. 

The Gini index measures how evenly wealth (or income) is shared within 
a country on a scale from zero to one. The statistic can be calculated for 
wealth or for annual income, and it can be calculated at the individual level 
or at the household level. (All of these statistics will be highly correlated 
but not identical.) The Gini index, like the passer rating, has no intrinsic 
meaning; it’s a tool for comparison. A country in which every household 
had identical wealth would have a Gini index of zero. By contrast, a country 
in which a single household held the country’s entire wealth would have a 
Gini index of one. As you can probably surmise, the closer a country is to 
one, the more unequal its distribution of wealth. The United States has a 
Gini index of .45, according to the Central Intelligence Agency (a great 
collector of statistics, by the way). 1 So what? 
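
For readers who want to see the arithmetic behind those two extremes, here is a minimal Python sketch. It uses one common textbook formula for the Gini index on a list of household wealth figures; the function name `gini` and the toy numbers are illustrative assumptions, not the CIA's method or data.

```python
def gini(values):
    """Gini index of a list of non-negative wealth (or income) figures.

    Uses the sorted-rank formula G = 2*sum(i*x_i) / (n*sum(x)) - (n+1)/n,
    with x sorted in ascending order and ranks i running from 1 to n.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one household and some wealth")
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

equal = gini([50, 50, 50, 50])     # every household identical
lopsided = gini([0, 0, 0, 200])    # one household holds everything
print(equal, lopsided)
```

With only four households the lopsided case comes out at 0.75 rather than exactly one; under this formula the maximum for n households is (n - 1)/n, which approaches one as the number of households grows.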

Once that number is put into context, it can tell us a lot. For example, 
Sweden has a Gini index of .23. Canada’s is .32. China’s is .42. Brazil’s is 
.54. South Africa’s is .65.* As we look across those numbers, we get a sense 
of where the United States falls relative to the rest of the world when it 
comes to income inequality. We can also compare different points in time. 
The Gini index for the United States was .41 in 1997 and grew to .45 over 
the next decade. (The most recent CIA data are for 2007.) This tells us in an 
objective way that while the United States grew richer over that period of 
time, the distribution of wealth grew more unequal. Again, we can compare 
the changes in the Gini index across countries over roughly the same time 
period. Inequality in Canada was basically unchanged over the same 
stretch. Sweden has had significant economic growth over the past two 
decades, but the Gini index in Sweden actually fell from .25 in 1992 to .23 
in 2005, meaning that Sweden grew richer and more equal over that period. 

Is the Gini index the perfect measure of inequality? Absolutely not—just 
as the passer rating is not a perfect measure of quarterback performance. 
But it certainly gives us some valuable information on a socially significant 
phenomenon in a convenient format. 

We have also slowly backed our way into answering the question posed 
in the chapter title: What is the point? The point is that statistics helps us 
process data, which is really just a fancy name for information. Sometimes 
the data are trivial in the grand scheme of things, as with sports statistics. 
Sometimes they offer insight into the nature of human existence, as with the 
Gini index. 

But, as any good infomercial would point out, That’s not all! Hal Varian, 
chief economist at Google, told the New York Times that being a statistician 
will be “the sexy job” over the next decade. 2 I’ll be the first to concede that 
economists sometimes have a warped definition of “sexy.” Still, consider 
the following disparate questions: 

How can we catch schools that are cheating on their standardized tests? 

How does Netflix know what kind of movies you like? 

How can we figure out what substances or behaviors cause cancer, given 
that we cannot conduct cancer-causing experiments on humans? 

Does praying for surgical patients improve their outcomes? 

Is there really an economic benefit to getting a degree from a highly 
selective college or university? 

What is causing the rising incidence of autism? 

Statistics can help answer these questions (or, we hope, can soon). The 
world is producing more and more data, ever faster and faster. Yet, as the 
New York Times has noted, “Data is merely the raw material of 
knowledge.” 3 * Statistics is the most powerful tool we have for using 
information to some meaningful end, whether that is identifying underrated 
baseball players or paying teachers more fairly. Here is a quick tour of how 
statistics can bring meaning to raw data. 


Description and Comparison 

A bowling score is a descriptive statistic. So is a batting average. Most 
American sports fans over the age of five are already conversant in the field 
of descriptive statistics. We use numbers, in sports and everywhere else in 
life, to summarize information. How good a baseball player was Mickey 
Mantle? He was a career .298 hitter. To a baseball fan, that is a meaningful 
statement, which is remarkable when you think about it, because it 
encapsulates an eighteen-season career. 4 (There is, I suppose, something 
mildly depressing about having one’s lifework collapsed into a single 
number.) Of course, baseball fans have also come to recognize that 
descriptive statistics other than batting average may better encapsulate a 
player’s value on the field. 

We evaluate the academic performance of high school and college 
students by means of a grade point average, or GPA. A letter grade is 
assigned a point value; typically an A is worth 4 points, a B is worth 3, a C 
is worth 2, and so on. By graduation, when high school students are 
applying to college and college students are looking for jobs, the grade 
point average is a handy tool for assessing their academic potential. 
Someone who has a 3.7 GPA is clearly a stronger student than someone at 
the same school with a 2.5 GPA. That makes it a nice descriptive statistic. 
It’s easy to calculate, it’s easy to understand, and it’s easy to compare across 
students. 

But it’s not perfect. The GPA does not reflect the difficulty of the courses 
that different students may have taken. How can we compare a student with 
a 3.4 GPA in classes that appear to be relatively nonchallenging and a 
student with a 2.9 GPA who has taken calculus, physics, and other tough 
subjects? I went to a high school that attempted to solve this problem by 
giving extra weight to difficult classes, so that an A in an “honors” class 
was worth five points instead of the usual four. This caused its own 
problems. My mother was quick to recognize the distortion caused by this 
GPA “fix.” For a student taking a lot of honors classes (me), any A in a 
nonhonors course, such as gym or health education, would actually pull my 
GPA down, even though it is impossible to do better than an A in those 
classes. As a result, my parents forbade me to take driver’s education in 
high school, lest even a perfect performance diminish my chances of getting 
into a competitive college and going on to write popular books. Instead, 
they paid to send me to a private driving school, at nights over the summer. 
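
The arithmetic behind that distortion is easy to check. The sketch below assumes the weighting described above (an honors A worth five points, a regular A worth four) and a hypothetical transcript; `gpa` is just an illustrative helper.

```python
def gpa(points):
    """Grade point average: the simple mean of the point values earned."""
    return sum(points) / len(points)

HONORS_A = 5   # weighted A in an honors class
REGULAR_A = 4  # the best possible grade in a non-honors class

honors_only = [HONORS_A] * 4                 # four honors classes, all A's
with_drivers_ed = honors_only + [REGULAR_A]  # add a perfect grade in driver's ed

print(gpa(honors_only))
print(gpa(with_drivers_ed))
```

A perfect grade in the fifth class drags the average from 5.0 down to 4.8, which is exactly the incentive problem described above.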

Was that insane? Yes. But one theme of this book will be that an 
overreliance on any descriptive statistic can lead to misleading conclusions, 
or cause undesirable behavior. My original draft of that sentence used the 
phrase “oversimplified descriptive statistic,” but I struck the word 
“oversimplified” because it’s redundant. Descriptive statistics exist to 
simplify, which always implies some loss of nuance or detail. Anyone 
working with numbers needs to recognize as much. 


Inference 

How many homeless people live on the streets of Chicago? How often do 
married people have sex? These may seem like wildly different kinds of 
questions; in fact, they both can be answered (not perfectly) by the use of 
basic statistical tools. One key function of statistics is to use the data we 
have to make informed conjectures about larger questions for which we do 
not have full information. In short, we can use data from the “known world” 
to make informed inferences about the “unknown world.” 

Let’s begin with the homeless question. It is expensive and logistically 
difficult to count the homeless population in a large metropolitan area. Yet 
it is important to have a numerical estimate of this population for purposes 
of providing social services, earning eligibility for state and federal 
revenues, and gaining congressional representation. One important 
statistical practice is sampling, which is the process of gathering data for a 
small area, say, a handful of census tracts, and then using those data to 
make an informed judgment, or inference, about the homeless population 
for the city as a whole. Sampling requires far fewer resources than trying to 
count an entire population; done properly, it can be every bit as accurate. 

A political poll is one form of sampling. A research organization will 
attempt to contact a sample of households that are broadly representative of 
the larger population and ask them their views about a particular issue or 
candidate. This is obviously much cheaper and faster than trying to contact 
every household in an entire state or country. The polling and research firm 
Gallup reckons that a methodologically sound poll of 1,000 households will 
produce roughly the same results as a poll that attempted to contact every 
household in America. 
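
One way to see why a sample of 1,000 can stand in for a whole country is the standard margin-of-error formula for a proportion. The sketch below assumes a simple random sample and the usual 95 percent normal approximation; the 52 percent figure and the function names are made up for illustration, and this is not a description of Gallup's actual methodology.

```python
import math
import random

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a sample proportion p with n respondents."""
    return z * math.sqrt(p * (1 - p) / n)

def simulate_poll(true_share, n, seed=0):
    """Draw n independent respondents, each a 'yes' with probability true_share,
    and return the share of 'yes' answers in the sample."""
    rng = random.Random(seed)
    yes = sum(rng.random() < true_share for _ in range(n))
    return yes / n

moe = margin_of_error(0.5, 1000)       # worst case is p = 0.5
estimate = simulate_poll(0.52, 1000)   # hypothetical electorate: 52% in favor
print(f"margin of error: +/-{moe:.3f}")
print(f"poll estimate:   {estimate:.3f}")
```

Notice that the size of the full population never appears in the formula. For a large population, precision depends on how many people are sampled, not on how many exist, and the margin for 1,000 respondents works out to roughly three percentage points.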

That’s how we figured out how often Americans are having sex, with 
whom, and what kind. In the mid-1990s, the National Opinion Research 
Center at the University of Chicago carried out a remarkably ambitious 
study of American sexual behavior. The results were based on detailed 
surveys conducted in person with a large, representative sample of 
American adults. If you read on, Chapter 10 will tell you what they learned. 
How many other statistics books can promise you that? 


Assessing Risk and Other Probability-Related Events 

Casinos make money in the long run—always. That does not mean that they 
are making money at any given moment. When the bells and whistles go 
off, some high roller has just won thousands of dollars. The whole gambling 
industry is built on games of chance, meaning that the outcome of any 
particular roll of the dice or turn of the card is uncertain. At the same time, 
the underlying probabilities for the relevant events—drawing 21 at 
blackjack or spinning red in roulette—are known. When the underlying 
probabilities favor the casinos (as they always do), we can be increasingly 
certain that the “house” is going to come out ahead as the number of bets 
wagered gets larger and larger, even as those bells and whistles keep going 
off. 
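
The claim that the house grows ever more certain to come out ahead can be checked exactly. The sketch below assumes an American roulette wheel (38 pockets, 18 of them red) and a player who makes a series of equal even-money bets on red; the function name is an illustrative choice.

```python
from fractions import Fraction
from math import comb

RED, POCKETS = 18, 38  # American wheel: 18 red, 18 black, 2 green

def prob_house_ahead(n_bets):
    """Exact probability that the player has lost more bets than won
    after n_bets independent even-money bets on red."""
    losing_pockets = POCKETS - RED
    # Count the equally likely pocket sequences with k wins,
    # for every k where wins are outnumbered by losses (2k < n).
    favorable = sum(
        comb(n_bets, k) * RED**k * losing_pockets**(n_bets - k)
        for k in range(n_bets + 1)
        if 2 * k < n_bets
    )
    return Fraction(favorable, POCKETS**n_bets)

for n in (1, 100, 1000):
    print(n, round(float(prob_house_ahead(n)), 3))
```

On a single bet the house is ahead only a bit more than half the time (20 chances in 38). By 1,000 bets it is ahead in well over 90 percent of cases, even though individual players are still winning along the way.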

This turns out to be a powerful phenomenon in areas of life far beyond 
casinos. Many businesses must assess the risks associated with assorted 
adverse outcomes. They cannot make those risks go away entirely, just as a 
casino cannot guarantee that you won’t win every hand of blackjack that 
you play. However, any business facing uncertainty can manage these risks 
by engineering processes so that the probability of an adverse outcome, 
anything from an environmental catastrophe to a defective product, 
becomes acceptably low. Wall Street firms will often evaluate the risks 
posed to their portfolios under different scenarios, with each of those 
scenarios weighted based on its probability. The financial crisis of 2008 was 
precipitated in part by a series of market events that had been deemed 
extremely unlikely, as if every player in a casino drew blackjack all night. I 
will argue later in the book that these Wall Street models were flawed and 
that the data they used to assess the underlying risks were too limited, but 
the point here is that any model to deal with risk must have probability as 
its foundation. 

When individuals and firms cannot make unacceptable risks go away, 
they seek protection in other ways. The entire insurance industry is built 
upon charging customers to protect them against some adverse outcome, 
such as a car crash or a house fire. The insurance industry does not make 
money by eliminating these events; cars crash and houses burn every day. 
Sometimes cars even crash into houses, causing them to burn. Instead, the 
insurance industry makes money by charging premiums that are more than 
sufficient to pay for the expected payouts from car crashes and house fires. 
(The insurance company may also try to lower its expected payouts by 
encouraging safe driving, fences around swimming pools, installation of 
smoke detectors in every bedroom, and so on.) 
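
In code, the insurer's logic is a one-line expected-value calculation. All of the numbers below (the chance of a claim, the size of the claim, the premium, the number of policies) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def expected_payout(claim_probability, claim_amount):
    """Average payout per policy: probability of a claim times its cost."""
    return claim_probability * claim_amount

# Hypothetical policy: a 1-in-200 chance per year of a $40,000 loss.
payout = expected_payout(1 / 200, 40_000)
premium = 260  # hypothetical annual premium, set above the expected payout

policies = 10_000
expected_margin = policies * (premium - payout)
print(payout, expected_margin)
```

The insurer cannot know which of the 10,000 customers will file a claim, but as long as the premium exceeds the expected payout per policy, the average across many policies works in its favor.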

Probability can even be used to catch cheats in some situations. The firm 
Caveon Test Security specializes in what it describes as “data forensics” to 
find patterns that suggest cheating. 5 For example, the company (which was 
founded by a former test developer for the SAT) will flag exams at a school 
or test site on which the number of identical wrong answers is highly 
unlikely, usually a pattern that would happen by chance less than one time 
in a million. The mathematical logic stems from the fact that we cannot 
learn much when a large group of students all answer a question correctly. 
That’s what they are supposed to do; they could be cheating, or they could 
be smart. But when those same test takers get an answer wrong, they should 
not all consistently have the same wrong answer. If they do, it suggests that 
they are copying from one another (or sharing answers via text). The 
company also looks for exams in which a test taker does significantly better 
on hard questions than on easy questions (suggesting that he or she had 
answers in advance) and for exams on which the number of “wrong to 
right” erasures is significantly higher than the number of “right to wrong” 
erasures (suggesting that a teacher or administrator changed the answer 
sheets after the test). 

Of course, you can see the limitations of using probability. A large group 
of test takers might have the same wrong answers by coincidence; in fact, 
the more schools we evaluate, the more likely it is that we will observe such 
patterns just as a matter of chance. A statistical anomaly does not prove 
wrongdoing. Delma Kinney, a fifty-year-old Atlanta man, won $1 million in 
an instant lottery game in 2008 and then another $1 million in an instant 
game in 2011. The probability of that happening to the same person is 
somewhere in the range of 1 in 25 trillion. We cannot arrest Mr. Kinney for 
fraud on the basis of that calculation alone (though we might inquire 
whether he has any relatives who work for the state lottery). Probability is 
one weapon in an arsenal that requires good judgment. 
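
The point about evaluating more and more schools can be made concrete. The sketch below assumes each honest school independently has a one-in-a-million chance of showing a suspicious pattern (the threshold mentioned above); the school counts are hypothetical.

```python
def prob_at_least_one_false_flag(p_single, n_schools):
    """Chance that at least one of n honest, independent schools
    shows a pattern whose individual probability is p_single."""
    return 1 - (1 - p_single) ** n_schools

for n in (1, 1_000, 100_000, 1_000_000):
    print(f"{n:>9,} schools: {prob_at_least_one_false_flag(1e-6, n):.4f}")
```

A pattern that is nearly impossible at any one school becomes roughly a one-in-ten event across 100,000 schools and more likely than not across a million, which is why a flag should start an inquiry rather than end one.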


Identifying Important Relationships 
(Statistical Detective Work) 

Does smoking cigarettes cause cancer? We have an answer for that question 
—but the process of answering it was not nearly as straightforward as one 
might think. The scientific method dictates that if we are testing a scientific 
hypothesis, we should conduct a controlled experiment in which the 
variable of interest (e.g., smoking) is the only thing that differs between the 
experimental group and the control group. If we observe a marked 
difference in some outcome between the two groups (e.g., lung cancer), we 
can safely infer that the variable of interest is what caused that outcome. We 
cannot do that kind of experiment on humans. If our working hypothesis is 
that smoking causes cancer, it would be unethical to assign recent college 
graduates to two groups, smokers and nonsmokers, and then see who has 
cancer at the twentieth reunion. (We can conduct controlled experiments on 
humans when our hypothesis is that a new drug or treatment may improve 
their health; we cannot knowingly expose human subjects when we expect 
an adverse outcome.) 

Now, you might point out that we do not need to conduct an ethically 
dubious experiment to observe the effects of smoking. Couldn’t we just skip 
the whole fancy methodology and compare cancer rates at the twentieth 
reunion between those who have smoked since graduation and those who 
have not? 

No. Smokers and nonsmokers are likely to be different in ways other 
than their smoking behavior. For example, smokers may be more likely to 
have other habits, such as drinking heavily or eating badly, that cause 
adverse health outcomes. If the smokers are particularly unhealthy at the 
twentieth reunion, we would not know whether to attribute this outcome to 
smoking or to other unhealthy things that many smokers happen to do. We 
would also have a serious problem with the data on which we are basing 
our analysis. Smokers who have become seriously ill with cancer are less 
likely to attend the twentieth reunion. (The dead smokers definitely won’t 
show up.) As a result, any analysis of the health of the attendees at the 
twentieth reunion (related to smoking or anything else) will be seriously 
flawed by the fact that the healthiest members of the class are the most 
likely to show up. The further the class gets from graduation, say, a fortieth 
or a fiftieth reunion, the more serious this bias will be. 
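
A little arithmetic shows how much the missing classmates can distort the picture. Every number below is hypothetical: a class in which 20 percent have become seriously ill, healthy graduates attend with probability 0.6, and ill graduates attend with probability 0.2.

```python
def observed_illness_rate(true_rate, attend_if_ill, attend_if_healthy):
    """Share of reunion attendees who are ill, given who tends to show up."""
    ill_attendees = true_rate * attend_if_ill
    healthy_attendees = (1 - true_rate) * attend_if_healthy
    return ill_attendees / (ill_attendees + healthy_attendees)

true_rate = 0.20
seen_at_reunion = observed_illness_rate(true_rate, attend_if_ill=0.2,
                                        attend_if_healthy=0.6)
print(f"true rate: {true_rate:.2f}, rate among attendees: {seen_at_reunion:.3f}")
```

Under these assumptions, someone surveying the reunion would see an illness rate of under 8 percent in a class where the true figure is 20 percent.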

We cannot treat humans like laboratory rats. As a result, statistics is a lot 
like good detective work. The data yield clues and patterns that can 
ultimately lead to meaningful conclusions. You have probably watched one 
of those impressive police procedural shows like CSI: New York in which 
very attractive detectives and forensic experts pore over minute clues— 
DNA from a cigarette butt, teeth marks on an apple, a single fiber from a 
car floor mat—and then use the evidence to catch a violent criminal. The 
appeal of the show is that these experts do not have the conventional 
evidence used to find the bad guy, such as an eyewitness or a surveillance 
videotape. So they turn to scientific inference instead. Statistics does 
basically the same thing. The data present unorganized clues—the crime 
scene. Statistical analysis is the detective work that crafts the raw data into 
some meaningful conclusion. 


After Chapter 11, you will appreciate the television show I hope to pitch: 
CSI: Regression Analysis, which would be only a small departure from 
those other action-packed police procedurals. Regression analysis is the tool 
that enables researchers to isolate a relationship between two variables, 
such as smoking and cancer, while holding constant (or “controlling for”) 
the effects of other important variables, such as diet, exercise, weight, and 
so on. When you read in the newspaper that eating a bran muffin every day 
will reduce your chances of getting colon cancer, you need not fear that 
some unfortunate group of human experimental subjects has been force-fed 
bran muffins in the basement of a federal laboratory somewhere while the 
control group in the next building gets bacon and eggs. Instead, researchers 
will gather detailed information on thousands of people, including how 
frequently they eat bran muffins, and then use regression analysis to do two 
crucial things: (1) quantify the association observed between eating bran 
muffins and contracting colon cancer (e.g., a hypothetical finding that 
people who eat bran muffins have a 9 percent lower incidence of colon 
cancer, controlling for other factors that may affect the incidence of the 
disease); and (2) quantify the likelihood that the association between bran 
muffins and a lower rate of colon cancer observed in this study is merely a 
coincidence—a quirk in the data for this sample of people—rather than a 
meaningful insight about the relationship between diet and health. 
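
To make “controlling for” concrete, here is a toy sketch in plain Python of step (1) only. The eight data points are invented so that exercise alone drives the risk score while muffin eaters happen to exercise more; the variable names and numbers are assumptions for illustration, and step (2), the test for coincidence, is not shown.

```python
def centered_sums(x, y):
    """Sum of products of deviations from the mean (x with y)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y))

def simple_slope(x, y):
    """Slope from regressing y on x alone."""
    return centered_sums(x, y) / centered_sums(x, x)

def two_predictor_slopes(x1, x2, y):
    """Least-squares slopes for y = b0 + b1*x1 + b2*x2 (closed form)."""
    s11, s22 = centered_sums(x1, x1), centered_sums(x2, x2)
    s12 = centered_sums(x1, x2)
    s1y, s2y = centered_sums(x1, y), centered_sums(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

muffin   = [0, 0, 0, 0, 1, 1, 1, 1]    # 1 = eats a bran muffin daily
exercise = [0, 1, 2, 3, 2, 3, 4, 5]    # hours per week
risk     = [10 - e for e in exercise]  # invented: only exercise matters

naive = simple_slope(muffin, risk)
b_muffin, b_exercise = two_predictor_slopes(muffin, exercise, risk)
print(naive, b_muffin, b_exercise)
```

Looked at alone, muffins appear to cut the risk score by two points. Once exercise is held constant, the muffin coefficient drops to zero and the whole effect is attributed to exercise, which is how the data were built.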

Of course, CSI: Regression Analysis will star actors and actresses who 
are much better looking than the academics who typically pore over such 
data. These hotties (all of whom would have PhDs, despite being only 
twenty-three years old) would study large data sets and use the latest 
statistical tools to answer important social questions: What are the most 
effective tools for fighting violent crime? What individuals are most likely 
to become terrorists? Later in the book we will discuss the concept of a 
“statistically significant” finding, which means that the analysis has 
uncovered an association between two variables that is not likely to be the 
product of chance alone. For academic researchers, this kind of statistical 
finding is the “smoking gun.” On CSI: Regression Analysis, I envision a 
researcher working late at night in the computer lab because of her daytime 
commitment as a member of the U.S. Olympic beach volleyball team. 
When she gets the printout from her statistical analysis, she sees exactly 
what she has been looking for: a large and statistically significant 
relationship in her data set between some variable that she had hypothesized 
might be important and the onset of autism. She must share this 
breakthrough immediately! 

The researcher takes the printout and runs down the hall, slowed 
somewhat by the fact that she is wearing high heels and a relatively small, 
tight black skirt. She finds her male partner, who is inexplicably fit and tan 
for a guy who works fourteen hours a day in a basement computer lab, and 
shows him the results. He runs his fingers through his neatly trimmed 
goatee, grabs his Glock 9-mm pistol from the desk drawer, and slides it into 
the shoulder holster beneath his $5,000 Hugo Boss suit (also inexplicable 
given his starting academic salary of $38,000 a year). Together the 
regression analysis experts walk briskly to see their boss, a grizzled veteran 
who has overcome failed relationships and a drinking problem . . . 

Okay, you don’t have to buy into the television drama to appreciate the 
importance of this kind of statistical research. Just about every social 
challenge that we care about has been informed by the systematic analysis 
of large data sets. (In many cases, gathering the relevant data, which is 
expensive and time-consuming, plays a crucial role in this process as will 
be explained in Chapter 7.) I may have embellished my characters in CSI: 
Regression Analysis but not the kind of significant questions they could 
examine. There is an academic literature on terrorists and suicide bombers 
—a subject that would be difficult to study by means of human subjects (or 
lab rats for that matter). One such book, What Makes a Terrorist, was 
written by one of my graduate school statistics professors. The book draws 
its conclusions from data gathered on terrorist attacks around the world. A 
sample finding: Terrorists are not desperately poor, or poorly educated. The 
author, Princeton economist Alan Krueger, concludes, “Terrorists tend to be 
drawn from well-educated, middle-class or high-income families.” 7 

Why? Well, that exposes one of the limitations of regression analysis. We 
can isolate a strong association between two variables by using statistical 
analysis, but we cannot necessarily explain why that relationship exists, and 
in some cases, we cannot know for certain that the relationship is causal, 
meaning that a change in one variable is really causing a change in the 
other. In the case of terrorism, Professor Krueger hypothesizes that since 
terrorists are motivated by political goals, those who are most educated and 
affluent have the strongest incentive to change society. These individuals 
may also be particularly rankled by suppression of freedom, another factor 
associated with terrorism. In Krueger’s study, countries with high levels of 
political repression have more terrorist activity (holding other factors 
constant). 

This discussion leads me back to the question posed by the chapter title: 
What is the point? The point is not to do math, or to dazzle friends and 
colleagues with advanced statistical techniques. The point is to learn things 
that inform our lives. 


Lies, Damned Lies, and Statistics 

Even in the best of circumstances, statistical analysis rarely unveils “the 
truth.” We are usually building a circumstantial case based on imperfect 
data. As a result, there are numerous reasons that intellectually honest 
individuals may disagree about statistical results or their implications. At 
the most basic level, we may disagree on the question that is being 
answered. Sports enthusiasts will be arguing for all eternity over “the best 
baseball player ever” because there is no objective definition of “best.” 
Fancy descriptive statistics can inform this question, but they will never 
answer it definitively. As the next chapter will point out, more socially 
significant questions fall prey to the same basic challenge. What is 
happening to the economic health of the American middle class? That 
answer depends on how one defines both “middle class” and “economic 
health.” 

There are limits on the data we can gather and the kinds of experiments 
we can perform. Alan Krueger’s study of terrorists did not follow thousands 
of youth over multiple decades to observe which of them evolved into 
terrorists. It’s just not possible. Nor can we create two identical nations— 
except that one is highly repressive and the other is not—and then compare 
the number of suicide bombers that emerge in each. Even when we can 
conduct large, controlled experiments on human beings, they are neither 
easy nor cheap. Researchers did a large-scale study on whether or not 
prayer reduces postsurgical complications, which was one of the questions 
raised earlier in this chapter. That study cost $2.4 million. (For the results, 
you’ll have to wait until Chapter 13.) 

Secretary of Defense Donald Rumsfeld famously said, “You go to war 
with the army you have—not the army you might want or wish to have at a 
later time.” Whatever you may think of Rumsfeld (and the Iraq war that he 
was explaining), that aphorism applies to research, too. We conduct 
statistical analysis using the best data and methodologies and resources 
available. The approach is not like addition or long division, in which the 
correct technique yields the “right” answer and a computer is always more 
precise and less fallible than a human. Statistical analysis is more like good 
detective work (hence the commercial potential of CSI: Regression 
Analysis). Smart and honest people will often disagree about what the data 
are trying to tell us. 

But who says that everyone using statistics is smart or honest? As 
mentioned, this book began as an homage to How to Lie with Statistics, 
which was first published in 1954 and has sold over a million copies. The 
reality is that you can lie with statistics. Or you can make inadvertent 
errors. In either case, the mathematical precision attached to statistical 
analysis can dress up some serious nonsense. This book will walk through 
many of the most common statistical errors and misrepresentations (so that 
you can recognize them, not put them to use). 

So, to return to the chapter title, what is the point of learning statistics? 

To summarize huge quantities of data. 

To make better decisions. 

To answer important social questions. 

To recognize patterns that can refine how we do everything from selling 
diapers to catching criminals. 

To catch cheaters and prosecute criminals. 

To evaluate the effectiveness of policies, programs, drugs, medical 
procedures, and other innovations. 

And to spot the scoundrels who use these very same powerful tools for 
nefarious ends. 

If you can do all of that while looking great in a Hugo Boss suit or a 
short black skirt, then you might also be the next star of CSI: Regression 
Analysis. 



* The Gini index is sometimes multiplied by 100 to make it a whole number. In that case, the United 
States would have a Gini index of 45. 

* The word “data” has historically been considered plural (e.g., “The data are very encouraging”). 
The singular is “datum,” which would refer to a single data point, such as one person’s response to a 
single question on a poll. Using the word “data” as a plural noun is a quick way to signal to anyone 
who does serious research that you are conversant with statistics. That said, many authorities on 
grammar and many publications, such as the New York Times, now accept that “data” can be singular 
or plural, as the passage that I’ve quoted from the Times demonstrates. 

* This is a gross simplification of the fascinating and complex field of medical ethics. 


CHAPTER 2 


Descriptive Statistics 
Who was the best baseball player of all time? 


Let us ponder for a moment two seemingly unrelated questions: (1) What 
is happening to the economic health of America’s middle class? and (2) 
Who was the greatest baseball player of all time? 

The first question is profoundly important. It tends to be at the core of 
presidential campaigns and other social movements. The middle class is the 
heart of America, so the economic well-being of that group is a crucial 
indicator of the nation’s overall economic health. The second question is 
trivial (in the literal sense of the word), but baseball enthusiasts can argue 
about it endlessly. What the two questions have in common is that they can 
be used to illustrate the strengths and limitations of descriptive statistics, 
which are the numbers and calculations we use to summarize raw data. 

If I want to demonstrate that Derek Jeter is a great baseball player, I can 
sit you down and describe every at bat in every Major League game that 
he’s played. That would be raw data, and it would take a while to digest, 
given that Jeter has played seventeen seasons with the New York Yankees 
and taken 9,868 at bats. 

Or I can just tell you that at the end of the 2011 season Derek Jeter had a 
career batting average of .313. That is a descriptive statistic, or a “summary 
statistic.” 

The batting average is a gross simplification of Jeter’s seventeen seasons. 
It is easy to understand, elegant in its simplicity—and limited in what it can 
tell us. Baseball experts have a bevy of descriptive statistics that they 
consider to be more valuable than the batting average. I called Steve Moyer, 
president of Baseball Info Solutions (a firm that provides a lot of the raw 
data for the Moneyball types), to ask him, (1) What are the most important 
statistics for evaluating baseball talent? and (2) Who was the greatest player 
of all time? I’ll share his answer once we have more context. 


Meanwhile, let’s return to the less trivial subject, the economic health of 
the middle class. Ideally we would like to find the economic equivalent of a 
batting average, or something even better. We would like a simple but 
accurate measure of how the economic well-being of the typical American 
worker has been changing in recent years. Are the people we define as 
middle class getting richer, poorer, or just running in place? A reasonable 
answer—though by no means the “right” answer—would be to calculate the 
change in per capita income in the United States over the course of a 
generation, which is roughly thirty years. Per capita income is a simple 
average: total income divided by the size of the population. By that 
measure, average income in the United States climbed from $7,787 in 1980 
to $26,487 in 2010 (the latest year for which the government has data). 1 
Voila! Congratulations to us. 

There is just one problem. My quick calculation is technically correct 
and yet totally wrong in terms of the question I set out to answer. To begin 
with, the figures above are not adjusted for inflation. (A per capita income 
of $7,787 in 1980 is equal to about $19,600 when converted to 2010 
dollars.) That’s a relatively quick fix. The bigger problem is that the 
average income in America is not equal to the income of the average 
American. Let’s unpack that clever little phrase. 

Per capita income merely takes all of the income earned in the country 
and divides by the number of people, which tells us absolutely nothing 
about who is earning how much of that income—in 1980 or in 2010. As the 
Occupy Wall Street folks would point out, explosive growth in the incomes 
of the top 1 percent can raise per capita income significantly without 
putting any more money in the pockets of the other 99 percent. In other 
words, average income can go up without helping the average American. 
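
Both problems with the quick calculation can be worked through in a few lines. The first half of the sketch uses only the figures quoted above; the second half uses a hypothetical town of 100 earners, invented purely to show how an average can move while the typical earner stands still.

```python
from statistics import mean, median

# Figures quoted in the text.
income_1980_nominal = 7_787
income_1980_in_2010_dollars = 19_600   # "about $19,600" per the text
income_2010 = 26_487

nominal_growth = income_2010 / income_1980_nominal - 1
real_growth = income_2010 / income_1980_in_2010_dollars - 1
print(f"nominal growth: {nominal_growth:.0%}, real growth: {real_growth:.0%}")

# Hypothetical town of 100 earners: 99 earn $30,000, one earns $1,000,000.
before = [30_000] * 99 + [1_000_000]
after = [30_000] * 99 + [2_000_000]    # only the top earner's income doubles

print(mean(before), median(before))
print(mean(after), median(after))
```

Adjusting for inflation shrinks the apparent gain from about 240 percent to about 35 percent. In the invented town, the mean rises by $10,000 while the median does not budge from $30,000.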

As with the baseball statistic query, I have sought outside expertise on 
how we ought to measure the health of the American middle class. I asked 
two prominent labor economists, including President Obama’s top 
economic adviser, what descriptive statistics they would use to assess the 
economic well-being of a typical American. Yes, you will get that answer, 
too, once we’ve taken a quick tour of descriptive statistics to give it more 
meaning. 


From baseball to income, the most basic task when working with data is 
to summarize a great deal of information. There are some 330 million 
residents in the United States. A spreadsheet with the name and income 
history of every American would contain all the information we could ever 
want about the economic health of the country—yet it would also be so 
unwieldy as to tell us nothing at all. The irony is that more data can often 
present less clarity. So we simplify. We perform calculations that reduce a 
complex array of data into a handful of numbers that describe those data, 
just as we might encapsulate a complex, multifaceted Olympic gymnastics 
performance with one number: 9.8. 

The good news is that these descriptive statistics give us a manageable 
and meaningful summary of the underlying phenomenon. That’s what this 
chapter is about. The bad news is that any simplification invites abuse. 
Descriptive statistics can be like online dating profiles: technically accurate 
and yet pretty darn misleading. 

Suppose you are at work, idly surfing the Web when you stumble across a 
riveting day-by-day account of Kim Kardashian’s failed seventy-two-day 
marriage to professional basketball player Kris Humphries. You have 
finished reading about day seven of the marriage when your boss shows up 
with two enormous files of data. One file has warranty claim information 
for each of the 57,334 laser printers that your firm sold last year. (For each 
printer sold, the file documents the number of quality problems that were 
reported during the warranty period.) The other file has the same 
information for each of the 994,773 laser printers that your chief competitor 
sold during the same stretch. Your boss wants to know how your firm’s 
printers compare in terms of quality with the competition. 

Fortunately the computer you’ve been using to read about the Kardashian 
marriage has a basic statistics package, but where do you begin? Your 
instincts are probably correct: The first descriptive task is often to find 
some measure of the “middle” of a set of data, or what statisticians might 
describe as its “central tendency.” What is the typical quality experience for 
your printers compared with those of the competition? The most basic 
measure of the “middle” of a distribution is the mean, or average. In this 
case, we want to know the average number of quality problems per printer 
sold for your firm and for your competitor. You would simply tally the total 
number of quality problems reported for all printers during the warranty 
period and then divide by the total number of printers sold. (Remember, the 
same printer can have multiple problems while under warranty.) You would 
do that for each firm, creating an important descriptive statistic: the average 
number of quality problems per printer sold. 

Suppose it turns out that your competitor’s printers have an average of 
2.8 quality-related problems per printer during the warranty period 
compared with your firm’s average of 9.1 reported defects. That was easy. 
You’ve just taken information on a million printers sold by two different 
companies and distilled it to the essence of the problem: your printers break 
a lot. Clearly it’s time to send a short e-mail to your boss quantifying this 
quality gap and then get back to day eight of Kim Kardashian’s marriage. 

Or maybe not. I was deliberately vague earlier when I referred to the 
“middle” of a distribution. The mean, or average, turns out to have some 
problems in that regard, namely, that it is prone to distortion by “outliers,” 
which are observations that lie farther from the center. To get your mind 
around this concept, imagine that ten guys are sitting on bar stools in a 
middle-class drinking establishment in Seattle; each of these guys earns 
$35,000 a year, which makes the mean annual income for the group 
$35,000. Bill Gates walks into the bar with a talking parrot perched on his 
shoulder. (The parrot has nothing to do with the example, but it kind of 
spices things up.) Let’s assume for the sake of the example that Bill Gates 
has an annual income of $1 billion. When Bill sits down on the eleventh bar 
stool, the mean annual income for the bar patrons rises to about $91 million. 
Obviously none of the original ten drinkers is any richer (though it might be 
reasonable to expect Bill Gates to buy a round or two). If I were to describe 
the patrons of this bar as having an average annual income of $91 million, 
the statement would be both statistically correct and grossly misleading. 
This isn’t a bar where multimillionaires hang out; it’s a bar where a bunch 
of guys with relatively low incomes happen to be sitting next to Bill Gates 
and his talking parrot. The sensitivity of the mean to outliers is why we 
should not gauge the economic health of the American middle class by 
looking at per capita income. Because there has been explosive growth in 
incomes at the top end of the distribution—CEOs, hedge fund managers, 
and athletes like Derek Jeter—the average income in the United States 
could be heavily skewed by the megarich, making it look a lot like the bar 
stools with Bill Gates at the end. 
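
The bar stool arithmetic is easy to check. This short sketch uses Python's standard `statistics` module and the figures from the example, including the assumed $1 billion income for Bill Gates.

```python
from statistics import mean

regulars = [35_000] * 10                  # ten patrons earning $35,000 each
with_gates = regulars + [1_000_000_000]   # assumed $1 billion income, as in the example

print(mean(regulars))           # 35000
print(round(mean(with_gates)))  # 90940909, or about $91 million
```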

For this reason, we have another statistic that also signals the “middle” of 
a distribution, albeit differently: the median. The median is the point that 
divides a distribution in half, meaning that half of the observations lie above 
the median and half lie below. (If there is an even number of observations, 
the median is the midpoint between the two middle observations.) If we 
return to the bar stool example, the median annual income for the ten guys 
originally sitting in the bar is $35,000. When Bill Gates walks in with his 
parrot and perches on a stool, the median annual income for the eleven of 
them is still $35,000. If you literally envision lining up the bar patrons on 
stools in ascending order of their incomes, the income of the guy sitting on 
the sixth stool represents the median income for the group. If Warren 
Buffett comes in and sits down on the twelfth stool next to Bill Gates, the 
median still does not change. 
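
The same check works for the median. In this sketch the income assigned to Warren Buffett is a placeholder of my own choosing; any large number gives the same result.

```python
from statistics import median

regulars = [35_000] * 10
with_gates = regulars + [1_000_000_000]       # assumed income from the example
with_buffett = with_gates + [1_000_000_000]   # placeholder figure, not from the text

for group in (regulars, with_gates, with_buffett):
    # The median stays at 35,000 no matter how rich the newcomers are.
    print(len(group), median(group))
```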

For distributions without serious outliers, the median and the mean will 
be similar. I’ve included a hypothetical summary of the quality data for the 
competitor’s printers. In particular, I’ve laid out the data in what is known 
as a frequency distribution. The number of quality problems per printer is 
arrayed along the bottom; the height of each bar represents the percentage 
of printers sold with that number of quality problems. For example, 36 
percent of the competitor’s printers had two quality defects during the 
warranty period. Because the distribution includes all possible quality 
outcomes, including zero defects, the proportions must sum to 1 (or 100 
percent). 
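
A frequency distribution is also enough to compute the mean and the median. The sketch below is one possible set of shares: only the 36 percent at two defects comes from the text, and the remaining percentages are invented so that the total is 100 and the mean lands near the 2.8 mentioned earlier.

```python
# Share of printers (in percent) with each number of defects.
# Only the 36 percent at two defects is from the text; the rest is hypothetical.
percent_by_defects = {0: 5, 1: 13, 2: 36, 3: 20, 4: 11, 5: 7, 6: 4, 7: 2, 8: 1, 9: 1}

total = sum(percent_by_defects.values())  # must be 100 for a complete distribution

# Mean: each defect count weighted by its share of printers.
mean_defects = sum(k * pct for k, pct in percent_by_defects.items()) / total


def median_from_distribution(dist):
    """Smallest value whose cumulative share reaches half of the total."""
    half = sum(dist.values()) / 2
    running = 0
    for value in sorted(dist):
        running += dist[value]
        if running >= half:
            return value


print(total)                                         # 100
print(mean_defects)                                  # 2.79
print(median_from_distribution(percent_by_defects))  # 2
```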

Frequency Distribution of Quality Complaints for Competitor’s Printers 

Because the distribution is nearly symmetrical, the mean and median are 
relatively close to one another. The distribution is slightly skewed to the 
right by the small number of printers with many reported quality defects. 
These outliers move the mean slightly rightward but have no impact on the 
median. Suppose that just before you dash off the quality report to your 
boss you decide to calculate the median number of quality problems for 
your firm’s printers and the competition’s. With a few keystrokes, you get 
the result. The median number of quality complaints for the competitor’s 
printers is 2; the median number of quality complaints for your company’s 
printers is 1. 

Huh? Your firm’s median number of quality complaints per printer is 
actually lower than your competitor’s. Because the Kardashian marriage is 
getting monotonous, and because you are intrigued by this finding, you 
print a frequency distribution for your own quality problems. 

Frequency Distribution of Quality Complaints at Your Company 

What becomes clear is that your firm does not have a uniform quality 
problem; you have a “lemon” problem: a small number of printers have a 
huge number of quality complaints. These outliers inflate the mean but not 
the median. More important from a production standpoint, you do not need 
to retool the whole manufacturing process; you need only figure out where 
the egregiously low-quality printers are coming from and fix that.* 
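
Here is a rough illustration of how a handful of lemons can produce a mean of 9.1 alongside a median of 1. The 1,000 printers and their defect counts are invented; they are simply one data set that reproduces the two summary figures from the example.

```python
from statistics import mean, median

# Hypothetical sample of 1,000 printers:
# 450 with no defects, 460 with one defect, and 90 "lemons" with 96 defects each.
defects = [0] * 450 + [1] * 460 + [96] * 90

print(mean(defects))    # 9.1
print(median(defects))  # 1.0

# Setting the lemons aside shows how well the typical printer performs.
typical = [d for d in defects if d < 96]
print(round(mean(typical), 2))  # 0.51
```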

Neither the median nor the mean is hard to calculate; the key is 
determining which measure of the “middle” is more accurate in a particular 
situation (a phenomenon that is easily exploited). Meanwhile, the median 
has some useful relatives. As we’ve already discussed, the median divides a 
distribution in half. The distribution can be further divided into quarters, or 
quartiles. The first quartile consists of the bottom 25 percent of the 
observations; the second quartile consists of the next 25 percent of the 
observations; and so on. Or the distribution can be divided into deciles, 
each with 10 percent of the observations. (If your income is in the top decile 
of the American income distribution, you would be earning more than 90 
percent of your fellow workers.) We can go even further and divide the 
distribution into hundredths, or percentiles. Each percentile represents 1 
percent of the distribution, so that the 1st percentile represents the bottom 1 
percent of the distribution and the 99th percentile represents the top 1 
percent of the distribution. 
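
Python's standard `statistics.quantiles` function returns the cut points for any of these divisions. The data here (the numbers 1 through 100) are a stand-in of my own, and the function's default interpolation method is only one of several conventions for placing the cut points.

```python
from statistics import median, quantiles

# Hypothetical data: 100 observations with the values 1 through 100.
data = list(range(1, 101))

quartile_cuts = quantiles(data, n=4)      # 3 cut points divide the data into quarters
decile_cuts = quantiles(data, n=10)       # 9 cut points divide the data into tenths
percentile_cuts = quantiles(data, n=100)  # 99 cut points divide the data into hundredths

print(quartile_cuts)
print(len(decile_cuts), len(percentile_cuts))

# The middle quartile cut point is the median.
print(quartile_cuts[1] == median(data))
```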

The benefit of these kinds of descriptive statistics is that they describe 
where a particular observation lies compared with everyone else. If I tell 
you that your child scored in the 3rd percentile on a reading comprehension 
test, you should know immediately that the family should be logging more 
time at the library. You don’t need to know anything about the test itself, or 
the number of questions that your child got correct. The percentile score 
provides a ranking of your child’s score relative to that of all the other test 
takers. If the test was easy, then most test takers will have a high number of 
answers correct, but your child will have fewer correct than most of the 
others. If the test was extremely difficult, then all the test takers will have a 
low number of correct answers, but your child’s score will be lower still. 

Here is a good point to introduce some useful terminology. An “absolute” 
score, number, or figure has some intrinsic meaning. If I shoot 83 for 
eighteen holes of golf, that is an absolute figure. I may do that on a day that 
is 58 degrees, which is also an absolute figure. Absolute figures can usually 
be interpreted without any context or additional information. When I tell 
you that I shot 83, you don’t need to know what other golfers shot that day 
in order to evaluate my performance. (The exception might be if the 
conditions are particularly awful, or if the course is especially difficult or 
easy.) If I place ninth in the golf tournament, that is a relative statistic. A 
“relative” value or figure has meaning only in comparison to something 
else, or in some broader context, such as compared with the eight golfers 
who shot better than I did. Most standardized tests produce results that have 
meaning only as a relative statistic. If I tell you that a third grader in an 
Illinois elementary school scored 43 out of 60 on the mathematics portion 
of the Illinois State Achievement Test, that absolute score doesn’t have 
much meaning. But when I convert it to a percentile—meaning that I put 
that raw score into a distribution with the math scores for all other Illinois 
third graders—then it acquires a great deal of meaning. If 43 correct 
answers falls into the 83rd percentile, then this student is doing better than 
most of his peers statewide. If he’s in the 8th percentile, then he’s really 
struggling. In this case, the percentile (the relative score) is more 
meaningful than the number of correct answers (the absolute score). 
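
Converting a raw score to a percentile can be sketched in a few lines. The twenty scores below are made up, and the definition used (the share of test takers scoring strictly lower) is one common convention among several, so treat this as an illustration rather than the method any testing agency actually uses.

```python
def percentile_rank(score, all_scores):
    """Percentage of scores in all_scores that fall strictly below score."""
    below = sum(1 for s in all_scores if s < score)
    return 100 * below / len(all_scores)


# Hypothetical raw scores (out of 60) for twenty third graders.
scores = [22, 25, 28, 30, 31, 33, 35, 36, 38, 39,
          40, 41, 42, 43, 45, 47, 50, 52, 55, 58]

print(percentile_rank(43, scores))  # 65.0
print(percentile_rank(25, scores))  # 5.0
```

The raw score of 43 means little by itself; the 65 says that this student outscored 65 percent of this made-up group.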

Another statistic that can help us describe what might otherwise be a 
jumble of numbers is the standard deviation, which is a measure of how 
dispersed the data are from their mean. In other words, how spread out are 
the observations? Suppose I collected data on the weights of 250 people on 
an airplane headed for Boston, and I also collected the weights of a sample 
of 250 qualifiers for the Boston Marathon. Now assume that the mean 
weight for both groups is roughly the same, say 155 pounds. Anyone who 
has been squeezed into a row on a crowded flight, fighting for the armrest, 
knows that many people on a typical commercial flight weigh more than 
155 pounds. But you may recall from those same unpleasant, overcrowded 
flights that there were lots of crying babies and poorly behaved children, all 
of whom have enormous lung capacity but not much mass. When it comes 
to calculating the average weight on the flight, the heft of the 320-pound 
football players on either side of your middle seat is likely offset by the tiny 
screaming infant across the row and the six-year-old kicking the back of 
your seat from the row behind. 

On the basis of the descriptive tools introduced so far, the weights of the 
airline passengers and the marathoners are nearly identical. But they’re not. 
Yes, the weights of the two groups have roughly the same “middle,” but the 
airline passengers have far more dispersion around that midpoint, meaning 
that their weights are spread farther from the midpoint. My eight-year-old 
son might point out that the marathon runners look like they all weigh the 
same amount, while the airline passengers have some tiny people and some 
bizarrely large people. The weights of the airline passengers are “more 
spread out,” which is an important attribute when it comes to describing the 
weights of these two groups. The standard deviation is the descriptive 
statistic that allows us to assign a single number to this dispersion around 
the mean. The formulas for calculating the standard deviation and the 
variance (another common measure of dispersion from which the standard 
deviation is derived) are included in an appendix at the end of the chapter. 
For now, let’s think about why the measuring of dispersion matters. 
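
As a small illustration, here are two invented groups of five people with the same mean weight of 155 pounds. The individual weights are hypothetical. The sketch uses `statistics.pstdev`, which treats the list as the whole population and returns the square root of the average squared distance from the mean.

```python
from statistics import mean, pstdev

# Hypothetical weights in pounds; both groups average 155.
marathoners = [150, 152, 155, 158, 160]
passengers = [20, 45, 160, 230, 320]   # an infant, a child, and three adults

for name, weights in (("marathoners", marathoners), ("passengers", passengers)):
    print(name, mean(weights), round(pstdev(weights), 1))
```

The two means are identical, but the standard deviation for the passengers is many times larger than the one for the runners.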

Suppose you walk into the doctor’s office. You’ve been feeling fatigued 
ever since your promotion to head of North American printer quality. Your 
doctor draws blood, and a few days later her assistant leaves a message on 
your answering machine to inform you that your HCb2 count (a fictitious 
blood chemical) is 134. You rush to the Internet and discover that the mean 
HCb2 count for a person your age is 122 (and the median is about the 
same). Holy crap! If you’re like me, you would finally draft a will. You’d 
write tearful letters to your parents, spouse, children, and close friends. You 
might take up skydiving or try to write a novel very fast. You would send 
your boss a hastily composed e-mail comparing him to a certain part of the 
human anatomy—IN ALL CAPS. 

None of these things may be necessary (and the e-mail to your boss could 
turn out very badly). When you call the doctor’s office back to arrange for 
your hospice care, the physician’s assistant informs you that your count is 
within the normal range. But how could that be? “My count is 12 points 
higher than average!” you yell repeatedly into the receiver. 

“The standard deviation for the HCb2 count is 18,” the technician 
informs you curtly. 

What the heck does that mean? 

There is natural variation in the HCb2 count, as there is with most 
biological phenomena (e.g., height). While the mean count for the fake 
chemical might be 122, plenty of healthy people have counts that are higher 
or lower. The danger arises only when the HCb2 count gets excessively 
high or low. So how do we figure out what “excessively” means in this 
context? As we’ve already noted, the standard deviation is a measure of 
dispersion, meaning that it reflects how tightly the observations cluster 
around the mean. For many typical distributions of data, a high proportion 
of the observations lie within one standard deviation of the mean (meaning 
that they are in the range from one standard deviation below the mean to 
one standard deviation above the mean). To illustrate with a simple 
example, the mean height for American adult men is 5 feet 10 inches. The 
standard deviation is roughly 3 inches. A high proportion of adult men are 
between 5 feet 7 inches and 6 feet 1 inch. 

Or, to put it slightly differently, any man in this height range would not 
be considered abnormally short or tall. Which brings us back to your 
troubling HCb2 results. Yes, your count is 12 above the mean, but that’s 
less than one standard deviation, which is the blood chemical equivalent of 
being about 6 feet tall—not particularly unusual. Of course, far fewer 
observations lie two standard deviations from the mean, and fewer still lie 
three or four standard deviations away. (In the case of height, an American 
man who is three standard deviations above average in height would be 6 
feet 7 inches or taller.) 
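
The comparison between the blood test and height comes down to one division: how many standard deviations separate a value from the mean. A minimal sketch using the figures above (the function name is my own):

```python
def sds_from_mean(value, mean, sd):
    """Number of standard deviations a value lies above (+) or below (-) the mean."""
    return (value - mean) / sd


# HCb2 count of 134, with a mean of 122 and a standard deviation of 18.
z = sds_from_mean(134, mean=122, sd=18)
print(round(z, 2))  # 0.67

# The same distance on the height scale (mean 70 inches, standard deviation 3 inches).
print(round(70 + z * 3, 1))  # 72.0 inches, or 6 feet

# Three standard deviations above the mean height.
print(70 + 3 * 3)  # 79 inches, or 6 feet 7 inches
```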

Some distributions are more dispersed than others. Hence, the standard 
deviation of the weights of the 250 airline passengers will be higher than 
the standard deviation of the weights of the 250 marathon runners. A 
frequency distribution with the weights of the airline passengers would 
literally be fatter (more spread out) than a frequency distribution of the 
weights of the marathon runners. Once we know the mean and standard 
deviation for any collection of data, we have some serious intellectual 
traction. For example, suppose I tell you that the mean score on the SAT 
math test is 500 with a standard deviation of 100. As with height, the bulk 
of students taking the test will be within one standard deviation of the 
mean, or between 400 and 600. How many students do you think score 720 
or higher? Probably not very many, since that is more than two standard 
deviations above the mean. 

In fact, we can do even better than “not very many.” This is a good time 
to introduce one of the most important, helpful, and common distributions 
in statistics: the normal distribution. Data that are distributed normally are 
symmetrical around their mean in a bell shape that will look familiar to you. 

The normal distribution describes many common phenomena. Imagine a 
frequency distribution describing popcorn popping on a stove top. Some 
kernels start to pop early, maybe one or two pops per second; after ten or 
fifteen seconds, the kernels are exploding frenetically. Then gradually the 
number of kernels popping per second fades away at roughly the same rate 
at which the popping began. The heights of American men are distributed 
more or less normally, meaning that they are roughly symmetrical around 
the mean of 5 feet 10 inches. Each SAT test is specifically designed to 
produce a normal distribution of scores with mean 500 and standard 
deviation of 100. According to the Wall Street Journal, Americans even 
tend to park in a normal distribution at shopping malls; most cars park 
directly opposite the mall entrance—the “peak” of the normal curve—with 
“tails” of cars going off to the right and left of the entrance. 

The beauty of the normal distribution—its Michael Jordan power, 
finesse, and elegance—comes from the fact that we know by definition 
exactly what proportion of the observations in a normal distribution lie 
within one standard deviation of the mean (68.2 percent), within two 
standard deviations of the mean (95.4 percent), within three standard 
deviations (99.7 percent), and so on. This may sound like trivia. In fact, it is 
the foundation on which much of statistics is built. We will come back to 
this point in much greater depth later in the book. 
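
These proportions can be reproduced with `statistics.NormalDist` from Python's standard library. The sketch also returns to the earlier SAT question, under the assumption that math scores follow an exactly normal distribution with mean 500 and standard deviation 100.

```python
from statistics import NormalDist

standard = NormalDist()  # mean 0, standard deviation 1


def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return standard.cdf(k) - standard.cdf(-k)


for k in (1, 2, 3):
    print(k, round(100 * share_within(k), 1))

# Share of test takers scoring 720 or higher, assuming a perfectly normal curve.
sat = NormalDist(mu=500, sigma=100)
share_720_or_higher = 1 - sat.cdf(720)
print(round(100 * share_720_or_higher, 1))
```

The loop gives roughly 68.3, 95.4, and 99.7 percent. The first figure is usually quoted as 68.2 or 68.3 depending on rounding. Under the stated assumption, only about 1.4 percent of test takers score 720 or higher.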

The Normal Distribution 



The mean is the middle line which is often represented by the Greek 
letter μ. The standard deviation is often represented by the Greek letter σ. 
Each band represents one standard deviation. 

Descriptive statistics are often used to compare two figures or quantities. 
I’m one inch taller than my brother; today’s temperature is nine degrees 
above the historical average for this date; and so on. Those comparisons 
make sense because most of us recognize the scale of the units involved. 
One inch does not amount to much when it comes to a person’s height, so 
you can infer that my brother and I are roughly the same height. 
Conversely, nine degrees is a significant temperature deviation in just about 
any climate at any time of year, so nine degrees above average makes for a 
day that is much hotter than usual. But suppose that I told you that Granola 
Cereal A contains 31 milligrams more sodium than Granola Cereal B. 
Unless you know an awful lot about sodium (and the serving sizes for 
granola cereal), that statement is not going to be particularly informative. 
Or what if I told you that my cousin Al earned $53,000 less this year than 
last year? Should we be worried about Al? Or is he a hedge fund manager 
for whom $53,000 is a rounding error in his annual compensation? 

In both the sodium and the income examples, we’re missing context. The 
easiest way to give meaning to these relative comparisons is by using 
percentages. It would mean something if I told you that Granola Bar A has 
50 percent more sodium than Granola Bar B, or that Uncle Al’s income fell 
47 percent last year. Measuring change as a percentage gives us some sense 
of scale. 

You probably learned how to calculate percentages in fourth grade and 
will be tempted to skip the next few paragraphs. Fair enough. But first do 
one simple exercise for me. Assume that a department store is selling a 
dress for $100. The assistant manager marks down all merchandise by 25 
percent. But then that assistant manager is fired for hanging out in a bar 
with Bill Gates,* and the new assistant manager raises all prices by 25 
percent. What is the final price of the dress? If you said (or thought) $100, 
then you had better not skip any paragraphs. 

The final price of the dress is actually $93.75. This is not merely a fun 
parlor trick that will win you applause and adulation at cocktail parties. 
Percentages are useful—but also potentially confusing or even deceptive. 
The formula for calculating a percentage difference (or change) is the 
following: (new figure - original figure)/original figure. The numerator (the 
part on the top of the fraction) gives us the size of the change in absolute 
terms; the denominator (the bottom of the fraction) is what puts this change 
in context by comparing it with our starting point. At first, this seems 
straightforward, as when the assistant store manager cuts the price of the 
$100 dress by 25 percent. Twenty-five percent of the original $100 price is 
$25; that’s the discount, which takes the price down to $75. You can plug 
the numbers into the formula above and do some simple manipulation to get 
to the same place: ($100 - $75)/$100 = .25, or 25 percent. 

The dress is selling for $75 when the new assistant manager demands 
that the price be raised 25 percent. That’s where many of the people reading 
this paragraph probably made a mistake. The 25 percent markup is 
calculated as a percentage of the dress’s new reduced price, which is $75. 
The increase will be .25($75), or $18.75, which is how the final price ends 
up at $93.75 (and not $100). The point is that a percentage change always 
gives the value of some figure relative to something else. Therefore, we had 
better understand what that something else is. 
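
The dress example is a two-step calculation, and each step uses a different base. A short sketch, using the formula given above (the function name is my own; a negative result means a decrease):

```python
def percent_change(original, new):
    """(new figure - original figure) / original figure."""
    return (new - original) / original


original_price = 100.00
after_markdown = original_price * (1 - 0.25)   # 25 percent off the $100 price
after_markup = after_markdown * (1 + 0.25)     # 25 percent added to the $75 price

print(after_markdown)  # 75.0
print(after_markup)    # 93.75

print(percent_change(original_price, after_markdown))  # -0.25
print(percent_change(after_markdown, after_markup))    # 0.25
print(percent_change(original_price, after_markup))    # -0.0625
```

The two 25 percent changes do not cancel because the first is measured against $100 and the second against $75. Overall, the dress ends up 6.25 percent cheaper.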

I once invested some money in a company that my college roommate 
started. Since it was a private venture, there were no requirements as to 
what information had to be provided to shareholders. A number of years 
went by without any information on the fate of my investment; my former 
roommate was fairly tight-lipped on the subject. Finally, I received a letter 
in the mail informing me that the firm’s profits were 46 percent higher than 
the year before. There was no information on the size of those profits in 
absolute terms, meaning that I still had absolutely no idea how my 
investment was performing. Suppose that last year the firm earned 26 cents 
—essentially nothing. This year the firm earned 38 cents—also essentially 
nothing. Yet the company’s profits grew from 26 cents to 38 cents, which is 
technically a 46 percent increase. Obviously the shareholder letter would 
have been more of a downer if it pointed out that the firm’s cumulative 
profits over two years were less than the cost of a cup of Starbucks coffee. 

To be fair to my roommate, he eventually sold the company for hundreds 
of millions of dollars, earning me a 100 percent return on my investment. 
(Since you have no idea how much I invested, you also have no idea how 
much money I made—which reinforces my point here very nicely!) 

Let me make one additional distinction. Percentage change must not be 
confused with a change in percentage points. Rates are often expressed in 
percentages. The sales tax rate in Illinois is 6.75 percent. I pay my agent 15 
percent of my book royalties. These rates are levied against some quantity, 
such as income in the case of the income tax rate. Obviously the rates can 
go up or down; less intuitively, the changes in the rates can be described in 
vastly dissimilar ways. The best example of this was a recent change in the 
Illinois personal income tax, which was raised from 3 percent to 5 percent. 



There are two ways to express this tax change, both of which are 
technically accurate. The Democrats, who engineered this tax increase, 
pointed out (correctly) that the state income tax rate was increased by 2 
percentage points (from 3 percent to 5 percent). The Republicans pointed 
out (also correctly) that the state income tax had been raised by 67 percent. 
[This is a handy test of the formula from a few paragraphs back: (5 - 3)/3 = 
2/3, which rounds up to 67 percent.] 

The Democrats focused on the absolute change in the tax rate; 
Republicans focused on the percentage change in the tax burden. As noted, 
both descriptions are technically correct, though I would argue that the 
Republican description more accurately conveys the impact of the tax 
change, since what I’m going to have to pay to the government—the 
amount that I care about, as opposed to the way it is calculated—really has 
gone up by 67 percent. 
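
Both descriptions of the tax increase fall out of the same two numbers. In the sketch below, the $50,000 taxable income is a hypothetical figure added to show the effect on an actual tax bill.

```python
old_rate = 3.0  # percent
new_rate = 5.0  # percent

# Change in percentage points: a simple difference between the two rates.
point_change = new_rate - old_rate
print(point_change)  # 2.0

# Percentage change: the difference measured against the original rate.
relative_change = (new_rate - old_rate) / old_rate
print(round(100 * relative_change))  # 67

# Effect on a hypothetical taxpayer with $50,000 of taxable income.
income = 50_000
old_tax = income * old_rate / 100
new_tax = income * new_rate / 100
print(old_tax, new_tax)                           # 1500.0 2500.0
print(round(100 * (new_tax - old_tax) / old_tax)) # 67
```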

Many phenomena defy perfect description with a single statistic. Suppose 
quarterback Aaron Rodgers throws for 365 yards but no touchdowns. 
Meanwhile, Peyton Manning throws for a meager 127 yards but three 
touchdowns. Manning generated more points, but presumably Rodgers set 
up touchdowns by marching his team down the field and keeping the other 
team’s offense off the field. Who played better? In Chapter 1, I discussed 
the NFL passer rating, which is the league’s reasonable attempt to deal with 
this statistical challenge. The passer rating is an example of an index, which 
is a descriptive statistic made up of other descriptive statistics. Once these 
different measures of performance are consolidated into a single number, 
that statistic can be used to make comparisons, such as ranking 
quarterbacks on a particular day, or even over a whole career. If baseball 
had a similar index, then the question of the best player ever would be 
solved. Or would it? 

The advantage of any index is that it consolidates lots of complex 
information into a single number. We can then rank things that otherwise 
defy simple comparison—anything from quarterbacks to colleges to beauty 
pageant contestants. In the Miss America pageant, the overall winner is a 
combination of five separate competitions: personal interview, swimsuit, 
evening wear, talent, and onstage question. (Miss Congeniality is voted on 
separately by the participants themselves.) 



Alas, the disadvantage of any index is that it consolidates lots of complex 
information into a single number. There are countless ways to do that; each 
has the potential to produce a different outcome. Malcolm Gladwell makes 
this point brilliantly in a New Yorker piece critiquing our compelling need 
to rank things. 2 (He comes down particularly hard on the college rankings.) 
Gladwell offers the example of Car and Driver’s ranking of three sports 
cars: the Porsche Cayman, the Chevrolet Corvette, and the Lotus Evora. 
Using a formula that includes twenty-one different variables, Car and 
Driver ranked the Porsche number one. But Gladwell points out that 
“exterior styling” counts for only 4 percent of the total score in the Car and 
Driver formula, which seems ridiculously low for a sports car. If styling is 
given more weight in the overall ranking (25 percent), then the Lotus comes 
out on top. 

But wait. Gladwell also points out that the sticker price of the car gets 
relatively little weight in the Car and Driver formula. If value is weighted 
more heavily (so that the ranking is based equally on price, exterior styling, 
and vehicle characteristics), the Chevy Corvette is ranked number one. 

Any index is highly sensitive to the descriptive statistics that are cobbled 
together to build it, and to the weight given to each of those components. 
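
A toy version of this sensitivity is easy to build. Everything in the sketch below is invented: the three component scores per car, the three weighting schemes, and the collapse of the real formula's twenty-one variables into three. It is not Car and Driver's formula or Gladwell's calculation. It only shows that the same scores can crown three different winners.

```python
# Hypothetical component scores on a 0-100 scale (not real ratings).
scores = {
    "Porsche Cayman":     {"vehicle": 95, "styling": 80,  "value": 50},
    "Lotus Evora":        {"vehicle": 90, "styling": 100, "value": 40},
    "Chevrolet Corvette": {"vehicle": 87, "styling": 75,  "value": 95},
}

# Hypothetical weighting schemes; weights are relative and need not sum to 100.
weightings = {
    "styling at 4 percent":  {"vehicle": 90, "styling": 4,  "value": 6},
    "styling at 25 percent": {"vehicle": 69, "styling": 25, "value": 6},
    "equal weights":         {"vehicle": 1,  "styling": 1,  "value": 1},
}


def index_score(components, weights):
    """Weighted average of the component scores."""
    total_weight = sum(weights.values())
    return sum(components[k] * w for k, w in weights.items()) / total_weight


def winner(weights):
    """Car with the highest index under the given weights."""
    return max(scores, key=lambda car: index_score(scores[car], weights))


for label, weights in weightings.items():
    print(label, "->", winner(weights))
```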
As a result, indices range from useful but imperfect tools to complete 
charades. An example of the former is the United Nations Human 
Development Index, or HDI. The HDI was created as a measure of 
economic well-being that is broader than income alone. The HDI uses 
income as one of its components but also includes measures of life 
expectancy and educational attainment. The United States ranks eleventh in 
the world in terms of per capita economic output (behind several oil-rich 
nations like Qatar, Brunei, and Kuwait) but fourth in the world in human 
development. 3 It’s true that the HDI rankings would change slightly if the 
component parts of the index were reconfigured, but no reasonable change 
is going to make Zimbabwe zoom up the rankings past Norway. The HDI 
provides a handy and reasonably accurate snapshot of living standards 
around the globe. 

Descriptive statistics give us insight into phenomena that we care about. In 
that spirit, we can return to the questions posed at the beginning of the 
chapter. Who is the best baseball player of all time? More important for the 
purposes of this chapter, what descriptive statistics would be most helpful in 
answering that question? According to Steve Moyer, president of Baseball 
Info Solutions, the three most valuable statistics (other than age) for 
evaluating any player who is not a pitcher would be the following: 

1. On-base percentage (OBP), sometimes called the on-base average 
(OBA): Measures the proportion of the time that a player reaches base 
successfully, including walks (which are not counted in the batting 
average). 

2. Slugging percentage (SLG): Measures power hitting by calculating 
the total bases reached per at bat. A single counts as 1, a double is 2, a 
triple is 3, and a home run is 4. Thus, a batter who hit a single and a 
triple in five at bats would have a slugging percentage of (1 + 3)/5, or 
.800. 

3. At bats (AB): Puts the above in context. Any mope can have 
impressive statistics for a game or two. A superstar compiles 
impressive “numbers” over thousands of plate appearances. 
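
The slugging percentage described in item 2 can be sketched as a short function. The names used here are my own.

```python
# Bases credited for each kind of hit.
BASES = {"single": 1, "double": 2, "triple": 3, "home_run": 4}


def slugging_percentage(hits, at_bats):
    """Total bases reached divided by at bats."""
    return sum(BASES[hit] for hit in hits) / at_bats


# A single and a triple in five at bats, as in the example above.
print(slugging_percentage(["single", "triple"], at_bats=5))  # 0.8
```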

In Moyer’s view (without hesitation, I might add), the best baseball player 
of all time was Babe Ruth because of his unique ability to hit and to pitch. 
Babe Ruth still holds the Major League career record for slugging 
percentage at .690. 4 

What about the economic health of the American middle class? Again, I 
deferred to the experts. I e-mailed Jeff Grogger (a colleague of mine at the 
University of Chicago) and Alan Krueger (the same Princeton economist 
who studied terrorists and is now serving as chair of President Obama’s 
Council of Economic Advisers). Both gave variations on the same basic 
answer. To assess the economic health of America’s “middle class,” we 
should examine changes in the median wage (adjusted for inflation) over 
the last several decades. They also recommended examining changes to
wages at the 25th and 75th percentiles (which can reasonably be interpreted
as the lower and upper bounds for the middle class).

One more distinction is in order. When assessing economic health, we 
can examine income or wages. They are not the same thing. A wage is what 
we are paid for some fixed amount of labor, such as an hourly or weekly 
wage. Income is the sum of all payments from different sources. If workers
take a second job or work more hours, their income can go up without a
change in the wage. (For that matter, income can go up even if the wage is 
falling, provided a worker logs enough hours on the job.) However, if 
individuals have to work more in order to earn more, it’s hard to evaluate 
the overall effect on their well-being. The wage is a less ambiguous 
measure of how Americans are being compensated for the work they do; 
the higher the wage, the more workers take home for every hour on the job. 

Having said all that, here is a graph of American wages over the past
three decades. I’ve also added the 90th percentile to show how wages for
middle-class workers changed over this time frame compared with those of
workers at the top of the distribution.



Weekly Wages at Selected Percentiles


Source: “Changes in the Distribution of Workers’ Hourly Wages between 1979 and 2009,” 
Congressional Budget Office, February 16, 2011. The data for the chart can be found at 
http://www.cbo.gov/sites/default/files/cbofiles/ftpdocs/120xx/doc12051/02-16-wagedispersion.pdf.


A variety of conclusions can be drawn from these data. They do not 
present a single “right” answer with regard to the economic fortunes of the 
middle class. They do tell us that the typical worker, an American worker 
earning the median wage, has been “running in place” for nearly thirty 
years. Workers at the 90th percentile have done much, much better. 
Descriptive statistics help to frame the issue. What we do about it, if 
anything, is an ideological and political question. 
Data for the printer defects graphics

Formula for variance and standard deviation

Variance and standard deviation are the most common statistical 
mechanisms for measuring and describing the dispersion of a distribution. 
The variance, which is often represented by the symbol $\sigma^2$, is calculated by
determining how far the observations within a distribution lie from the 
mean. However, the twist is that the difference between each observation 
and the mean is squared; the sum of those squared terms is then divided by 
the number of observations. 

Specifically: 


For any set of $n$ observations $x_1, x_2, x_3, \ldots, x_n$ with mean $\mu$,

Variance $= \sigma^2 = \dfrac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + (x_3 - \mu)^2 + \cdots + (x_n - \mu)^2}{n}$

Because the difference between each term and the mean is squared, the 
formula for calculating variance puts particular weight on observations that 
lie far from the mean, or outliers, as the following table of student heights 
illustrates. 



















Height is in inches ($\mu = 70$ inches for both groups).
Distance from the mean = absolute value of $(x_n - \mu)$.*

Group 1      Height    Distance from the mean    $(x_n - \mu)^2$
Nick           74               4                      16
Elana          66               4                      16
Dinah          68               2                       4
Rebecca        69               1                       1
Ben            73               3                       9
Chani          70               0                       0
                           Total = 14              Total = 46

Variance = 46/6 = 7.7
Standard deviation = $\sqrt{7.7}$ = 2.8

Group 2      Height    Distance from the mean    $(x_n - \mu)^2$
Sahar          65               5                      25
Maggie         68               2                       4
Faisal         69               1                       1
Ted            70               0                       0
Jeff           71               1                       1
Narciso        75               5                      25
                           Total = 14              Total = 56

Variance = 56/6 = 9.3
Standard deviation = $\sqrt{9.3}$ = 3

* Absolute value is the distance between two figures, regardless of direction, so that it is always 
positive. In this case, it represents the number of inches between the height of the individual and the 
mean. 


Both groups of students have a mean height of 70 inches. The heights of 
students in both groups also differ from the mean by the same number of 
total inches: 14. By that measure of dispersion, the two distributions are 
identical. However, the variance for Group 2 is higher because of the 
weight given in the variance formula to values that lie particularly far from 
the mean—Sahar and Narciso in this case. 

Variance is rarely used as a descriptive statistic on its own. Instead, the 
variance is most useful as a step toward calculating the standard deviation 
of a distribution, which is a more intuitive tool as a descriptive statistic. 

The standard deviation for a set of observations is the square root of 
the variance: 

For any set of $n$ observations $x_1, x_2, x_3, \ldots, x_n$ with mean $\mu$,

standard deviation $= \sigma = \sqrt{\dfrac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + (x_3 - \mu)^2 + \cdots + (x_n - \mu)^2}{n}}$
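The variance and standard deviation formulas can be checked against the student heights table with a short Python sketch. The function name is an illustrative choice, and the code takes the mean of 70 inches stated in the table as given rather than recomputing it from the heights.

```python
import math


def dispersion(heights, mean):
    """Return (total absolute deviation, sum of squares, variance, std dev)."""
    abs_devs = [abs(h - mean) for h in heights]
    sq_devs = [(h - mean) ** 2 for h in heights]
    # Divide by the number of observations, as in the formula above.
    variance = sum(sq_devs) / len(heights)
    return sum(abs_devs), sum(sq_devs), variance, math.sqrt(variance)


STATED_MEAN = 70  # inches, as given in the table for both groups

group_1 = [74, 66, 68, 69, 73, 70]  # Nick, Elana, Dinah, Rebecca, Ben, Chani
group_2 = [65, 68, 69, 70, 71, 75]  # Sahar, Maggie, Faisal, Ted, Jeff, Narciso

for name, heights in (("Group 1", group_1), ("Group 2", group_2)):
    total_abs, total_sq, var, sd = dispersion(heights, STATED_MEAN)
    print(f"{name}: distance={total_abs}, squares={total_sq}, "
          f"variance={var:.1f}, sd={sd:.1f}")
```

Both groups show the same total distance from the mean (14 inches), but squaring gives the two 5-inch deviations in Group 2 extra weight, which is why its variance and standard deviation come out higher.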


* With twelve bar patrons, the median would be the midpoint between the income of the guy on the 
sixth stool and the income of the guy on the seventh stool. Since they both make $35,000, the median 
is $35,000. If one made $35,000 and the other made $36,000, the median for the whole group would 
be $35,500. 




























* Manufacturing update: It turns out that nearly all of the defective printers were being manufactured 
at a plant in Kentucky where workers had stripped parts off the assembly line in order to build a 
bourbon distillery. Both the perpetually drunk employees and the random missing pieces on the 
assembly line appear to have compromised the quality of the printers being produced there. 

* Remarkably, this person was one of the ten people with annual incomes of $35,000 who were 
sitting on bar stools when Bill Gates walked in with his parrot. Go figure! 


CHAPTER 3 


Deceptive Description 
“He’s got a great personality!” and other 
true but grossly misleading statements 


To anyone who has ever contemplated dating, the phrase “he’s got a great 
personality” usually sets off alarm bells, not because the description is 
necessarily wrong, but for what it may not reveal, such as the fact that the 
guy has a prison record or that his divorce is “not entirely final.” We don’t 
doubt that this guy has a great personality; we are wary that a true 
statement, the great personality, is being used to mask or obscure other 
information in a way that is seriously misleading (assuming that most of us 
would prefer not to date ex-felons who are still married). The statement is 
not a lie per se, meaning that it wouldn’t get you convicted of perjury, but it 
still could be so inaccurate as to be untruthful. 

And so it is with statistics. Although the field of statistics is rooted in 
mathematics, and mathematics is exact, the use of statistics to describe 
complex phenomena is not exact. That leaves plenty of room for shading 
the truth. Mark Twain famously remarked that there are three kinds of lies: 
lies, damned lies, and statistics.* As the last chapter explained, most 
phenomena that we care about can be described in multiple ways. Once 
there are multiple ways of describing the same thing (e.g., “he’s got a great 
personality” or “he was convicted of securities fraud”), the descriptive 
statistics that we choose to use (or not to use) will have a profound impact 
on the impression that we leave. Someone with nefarious motives can use 
perfectly good facts and figures to support entirely disputable or illegitimate 
conclusions. 

We ought to begin with the crucial distinction between “precision” and 
“accuracy.” These words are not interchangeable. Precision reflects the 
exactitude with which we can express something. In a description of the 
length of your commute, “41.6 miles” is more precise than “about 40
miles,” which is more precise than “a long f-ing way.” If you ask me
how far it is to the nearest gas station, and I tell you that it’s 1.265 miles to
the east, that’s a precise answer. Here is the problem: That answer may be 
entirely inaccurate if the gas station happens to be in the other direction. On 
the other hand, if I tell you, “Drive ten minutes or so until you see a hot dog 
stand. The gas station will be a couple hundred yards after that on the right. 
If you pass the Hooters, you’ve gone too far,” my answer is less precise 
than “1.265 miles to the east” but significantly better because I am sending 
you in the direction of the gas station. Accuracy is a measure of whether a 
figure is broadly consistent with the truth—hence the danger of confusing 
precision with accuracy. If an answer is accurate, then more precision is 
usually better. But no amount of precision can make up for inaccuracy. 

In fact, precision can mask inaccuracy by giving us a false sense of 
certainty, either inadvertently or quite deliberately. Joseph McCarthy, the 
Red-baiting senator from Wisconsin, reached the apogee of his reckless 
charges in 1950 when he alleged not only that the U.S. State Department 
was infiltrated with communists, but that he had a list of their names. 
During a speech in Wheeling, West Virginia, McCarthy waved in the air a 
piece of paper and declared, “I have here in my hand a list of 205—a list of 
names that were made known to the Secretary of State as being members of 
the Communist Party and who nevertheless are still working and shaping 
policy in the State Department.” 1 It turns out that the paper had no names 
on it at all, but the specificity of the charge gave it credibility, despite the 
fact that it was a bald-faced lie. 

I learned the important distinction between precision and accuracy in a 
less malicious context. For Christmas one year my wife bought me a golf 
range finder to calculate distances on the course from my golf ball to the 
hole. The device works with some kind of laser; I stand next to my ball in 
the fairway (or rough) and point the range finder at the flag on the green, at 
which point the device calculates the exact distance that I’m supposed to hit 
the ball. This is an improvement upon the standard yardage markers, which 
give distances only to the center of the green (and are therefore accurate but 
less precise). With my Christmas-gift range finder I was able to know that I 
was 147.2 yards from the hole. I expected the precision of this nifty 
technology to improve my golf game. Instead, it got appreciably worse. 



There were two problems. First, I used the stupid device for three months 
before I realized that it was set to meters rather than to yards; every 
seemingly precise calculation (147.2) was wrong. Second, I would 
sometimes inadvertently aim the laser beam at the trees behind the green, 
rather than at the flag marking the hole, so that my “perfect” shot would go 
exactly the distance it was supposed to go—right over the green into the 
forest. The lesson for me, which applies to all statistical analysis, is that 
even the most precise measurements or calculations should be checked 
against common sense. 

To take an example with more serious implications, many of the Wall 
Street risk management models prior to the 2008 financial crisis were quite 
precise. The concept of “value at risk” allowed firms to quantify with 
precision the amount of the firm’s capital that could be lost under different 
scenarios. The problem was that the supersophisticated models were the 
equivalent of setting my range finder to meters rather than to yards. The 
math was complex and arcane. The answers it produced were reassuringly 
precise. But the assumptions about what might happen to global markets 
that were embedded in the models were just plain wrong, making the 
conclusions wholly inaccurate in ways that destabilized not only Wall Street 
but the entire global economy. 

Even the most precise and accurate descriptive statistics can suffer from 
a more fundamental problem: a lack of clarity over what exactly we are 
trying to define, describe, or explain. Statistical arguments have much in 
common with bad marriages; the disputants often talk past one another. 
Consider an important economic question: How healthy is American 
manufacturing? One often hears that American manufacturing jobs are 
being lost in huge numbers to China, India, and other low-wage countries. 
One also hears that high-tech manufacturing still thrives in the United 
States and that America remains one of the world’s top exporters of 
manufactured goods. Which is it? This would appear to be a case in which 
sound analysis of good data could reconcile these competing narratives. Is 
U.S. manufacturing profitable and globally competitive, or is it shrinking in 
the face of intense foreign competition? 

Both. The British news magazine the Economist reconciled the two 
seemingly contradictory views of American manufacturing with the 
following graph. 
The seeming contradiction lies in how one defines the “health” of U.S.
manufacturing. In terms of output—the total value of goods produced and 
sold—the U.S. manufacturing sector grew steadily in the 2000s, took a big 
hit during the Great Recession, and has since bounced back robustly. This is 
consistent with data from the CIA’s World Factbook showing that the 
United States is the third-largest manufacturing exporter in the world, 
behind China and Germany. The United States remains a manufacturing 
powerhouse. 

But the graph in the Economist has a second line, which is manufacturing 
employment. The number of manufacturing jobs in the United States has 
fallen steadily; roughly six million manufacturing jobs were lost in the last
decade. Together, these two stories—rising manufacturing output and 
falling employment—tell the complete story. Manufacturing in the United 
States has grown steadily more productive, meaning that factories are 
producing more output with fewer workers. This is good from a global 
competitiveness standpoint, for it makes American products more 
competitive with manufactured goods from low-wage countries. (One way 
to compete with a firm that can pay workers $2 an hour is to create a 
manufacturing process so efficient that one worker earning $40 can do 
twenty times as much.) But there are a lot fewer manufacturing jobs, which 
is terrible news for the displaced workers who depended on those wages. 

Since this is a book about statistics and not manufacturing, let’s go back 
to the main point, which is that the “health” of U.S. manufacturing— 
something seemingly easy to quantify—depends on how one chooses to 
define health: output or employment? In this case (and many others), the
most complete story comes from including both figures, as the Economist
wisely chose to do in its graph.

Even when we agree on a single measure of success, say, student test 
scores, there is plenty of statistical wiggle room. See if you can reconcile 
the following hypothetical statements, both of which could be true: 

Politician A (the challenger): “Our schools are getting worse! Sixty 
percent of our schools had lower test scores this year than last year.” 

Politician B (the incumbent): “Our schools are getting better! Eighty 
percent of our students had higher test scores this year than last year.” 

Here’s a hint: The schools do not all necessarily have the same number of 
students. If you take another look at the seemingly contradictory statements, 
what you’ll see is that one politician is using schools as his unit of analysis 
(“Sixty percent of our schools . . .”), and the other is using students as the 
unit of analysis (“Eighty percent of our students . . .”). The unit of analysis 
is the entity being compared or described by the statistics—school 
performance by one of them and student performance by the other. It’s 
entirely possible for most of the students to be improving and most of the 
schools to be getting worse—if the students showing improvement happen 
to be in very big schools. To make this example more intuitive, let’s do the 
same exercise by using American states: 

Politician A (a populist): “Our economy is in the crapper! Thirty states 
had falling incomes last year.” 

Politician B (more of an elitist): “Our economy is showing appreciable 
gains: Seventy percent of Americans had rising incomes last year.” 

What I would infer from those statements is that the biggest states have 
the healthiest economies: New York, California, Texas, Illinois, and so on. 
The thirty states with falling average incomes are likely to be much smaller: 
Vermont, North Dakota, Rhode Island, and so on. Given the disparity in the 
size of the states, it’s entirely possible that the majority of states are doing 
worse while the majority of Americans are doing better. The key lesson is 
to pay attention to the unit of analysis. Who or what is being described, and 
is that different from the “who” or “what” being described by someone 
else? 
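The school version of this puzzle can be made concrete with a small Python sketch. The enrollment figures below are invented purely for illustration, and the sketch makes the simplifying assumption that every student's score moves in the same direction as his or her school's.

```python
# Hypothetical district: (students enrolled, did test scores rise this year?)
# All numbers are made up for illustration.
schools = [
    (200, False),   # three small schools with falling scores
    (200, False),
    (200, False),
    (1200, True),   # two very big schools with rising scores
    (1200, True),
]

# Unit of analysis = schools: each school counts once, regardless of size.
schools_worse = sum(1 for _, improved in schools if not improved) / len(schools)

# Unit of analysis = students: each school counts in proportion to enrollment.
total_students = sum(n for n, _ in schools)
students_better = sum(n for n, improved in schools if improved) / total_students

print(f"{schools_worse:.0%} of schools had lower scores")      # 60%
print(f"{students_better:.0%} of students had higher scores")  # 80%
```

Both politicians' numbers come out of the same data; the only thing that changes is whether a school or a student is counted as one observation.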

Although the examples above are hypothetical, here is a crucial statistical 
question that is not: Is globalization making income inequality around the 
planet better or worse? By one interpretation, globalization has merely
exacerbated existing income inequalities; richer countries in 1980 (as
measured by GDP per capita) tended to grow faster between 1980 and 2000 
than poorer countries. 2 The rich countries just got richer, suggesting that 
trade, outsourcing, foreign investment, and the other components of 
“globalization” are merely tools for the developed world to extend its 
economic hegemony. Down with globalization! Down with globalization! 

But hold on a moment. The same data can (and should) be interpreted 
entirely differently if one changes the unit of analysis. We don’t care about 
poor countries; we care about poor people. And a high proportion of the 
world’s poor people happen to live in China and India. Both countries are 
huge (with a population over a billion); each was relatively poor in 1980. 
Not only have China and India grown rapidly over the past several decades, 
but they have done so in large part because of their increased economic 
integration with the rest of the world. They are “rapid globalizers,” as the
Economist has described them. Given that our goal is to ameliorate human 
misery, it makes no sense to give China (population 1.3 billion) the same 
weight as Mauritius (population 1.3 million) when examining the effects of 
globalization on the poor. 

The unit of analysis should be people, not countries. What really 
happened between 1980 and 2000 is a lot like my fake school example 
above. The bulk of the world’s poor happened to live in two giant countries 
that grew extremely fast as they became more integrated into the global 
economy. The proper analysis yields an entirely different conclusion about 
the benefits of globalization for the world’s poor. As the Economist points 
out, “If you consider people, not countries, global inequality is falling 
rapidly.” 

The telecommunications companies AT&T and Verizon have recently 
engaged in an advertising battle that exploits this kind of ambiguity about 
what is being described. Both companies provide cellular phone service. 
One of the primary concerns of most cell phone users is the quality of the 
service in places where they are likely to make or receive phone calls. Thus, 
a logical point of comparison between the two firms is the size and quality 
of their networks. While consumers just want decent cell phone service in 
lots of places, both AT&T and Verizon have come up with different metrics 
for measuring the somewhat amorphous demand for “decent cell phone
service in lots of places.” Verizon launched an aggressive advertising
campaign touting the geographic coverage of its network; you may 
remember the maps of the United States that showed the large percentage of 
the country covered by the Verizon network compared with the relatively 
paltry geographic coverage of the AT&T network. The unit of analysis 
chosen by Verizon is geographic area covered—because the company has 
more of it. 

AT&T countered by launching a campaign that changed the unit of 
analysis. Its billboards advertised that “AT&T covers 97 percent of 
Americans.” Note the use of the word “Americans” rather than “America.” 
AT&T focused on the fact that most people don’t live in rural Montana or 
the Arizona desert. Since the population is not evenly distributed across the 
physical geography of the United States, the key to good cell service (the 
campaign argued implicitly) is having a network in place where callers 
actually live and work, not necessarily where they go camping. As someone 
who spends a fair bit of time in rural New Hampshire, however, my 
sympathies are with Verizon on this one. 

Our old friends the mean and the median can also be used for nefarious 
ends. As you should recall from the last chapter, both the median and the 
mean are measures of the “middle” of a distribution, or its “central 
tendency.” The mean is a simple average: the sum of the observations 
divided by the number of observations. (The mean of 3, 4, 5, 6, and 102 is 
24.) The median is the midpoint of the distribution; half of the observations 
lie above the median and half lie below. (The median of 3, 4, 5, 6, and 102 
is 5.) Now, the clever reader will see that there is a sizable difference 
between 24 and 5. If, for some reason, I would like to describe this group of 
numbers in a way that makes it look big, I will focus on the mean. If I want 
to make it look smaller, I will cite the median. 
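The two figures quoted above are easy to verify with Python's standard `statistics` module. This is only a quick check of the arithmetic, using the five numbers from the text.

```python
from statistics import mean, median

values = [3, 4, 5, 6, 102]

print(mean(values))    # 24: the outlier (102) pulls the average up
print(median(values))  # 5: the middle observation ignores how big 102 is

# Dropping the outlier shows how much of the mean it was responsible for.
without_outlier = values[:-1]
print(mean(without_outlier))    # 4.5
print(median(without_outlier))  # 4.5
```

Without the outlier the two measures agree; with it, whoever is describing the data gets to pick between 24 and 5.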

Now let’s look at how this plays out in real life. Consider the George W. 
Bush tax cuts, which were touted by the Bush administration as something 
good for most American families. While pushing the plan, the 
administration pointed out that 92 million Americans would receive an 
average tax reduction of over $1,000 ($1,083 to be precise). But was that 
summary of the tax cut accurate? According to the New York Times, “The 
data don’t lie, but some of them are mum.” 



Would 92 million Americans be getting a tax cut? Yes. 

Would most of those people be getting a tax cut of around $1,000? No. 
The median tax cut was less than $100. 

A relatively small number of extremely wealthy individuals were eligible 
for very large tax cuts; these big numbers skew the mean, making the 
average tax cut look bigger than what most Americans would likely receive. 
The median is not sensitive to outliers, and, in this case, is probably a more 
accurate description of how the tax cuts affected the typical household. 

Of course, the median can also do its share of dissembling because it is 
not sensitive to outliers. Suppose that you have a potentially fatal illness. 
The good news is that a new drug has been developed that might be 
effective. The drawback is that it’s extremely expensive and has many 
unpleasant side effects. “But does it work?” you ask. The doctor informs 
you that the new drug increases the median life expectancy among patients 
with your disease by two weeks. That is hardly encouraging news; the drug 
may not be worth the cost and unpleasantness. Your insurance company 
refuses to pay for the treatment; it has a pretty good case on the basis of the 
median life expectancy figures. 

Yet the median may be a horribly misleading statistic in this case. 
Suppose that many patients do not respond to the new treatment but that 
some large number of patients, say 30 or 40 percent, are cured entirely. This 
success would not show up in the median (though the mean life expectancy 
of those taking the drug would look very impressive). In this case, the 
outliers—those who take the drug and live for a long time—would be 
highly relevant to your decision. And it is not merely a hypothetical case. 
Evolutionary biologist Stephen Jay Gould was diagnosed with a form of 
cancer that had a median survival time of eight months; he died of a 
different and unrelated kind of cancer twenty years later. 3 Gould 
subsequently wrote a famous article called “The Median Isn’t the 
Message,” in which he argued that his scientific knowledge of statistics 
saved him from the erroneous conclusion that he would necessarily be dead 
in eight months. The definition of the median tells us that half the patients 
will live at least eight months—and possibly much, much longer than that. 
The mortality distribution is “right-skewed,” which is more than a 
technicality if you happen to have the disease. 4 
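A small Python sketch shows how a drug could cure a large share of patients while barely moving the median. The survival times below are entirely hypothetical, invented only to mirror the scenario described above; they are not data from any study.

```python
from statistics import mean, median

# Hypothetical survival times in months for ten untreated patients.
without_drug = [4, 5, 6, 7, 8, 8, 9, 10, 11, 12]

# A comparable hypothetical group on the drug: six patients do not respond,
# four are cured and live 240 months (twenty years).
with_drug = [4, 5, 6, 7, 8, 9, 240, 240, 240, 240]

median_gain = median(with_drug) - median(without_drug)  # in months
mean_gain = mean(with_drug) - mean(without_drug)        # in months

print(f"Median survival rises by {median_gain} months")
print(f"Mean survival rises by {mean_gain:.1f} months")
```

In this made-up sample the median rises by half a month, roughly the two weeks in the doctor's summary, while the mean rises by more than seven years because 40 percent of the patients were cured.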


In this example, the defining characteristic of the median—that it does 
not weight observations on the basis of how far they lie from the midpoint, 
only on whether they lie above or below—turns out to be its weakness. In 
contrast, the mean is affected by dispersion. From the standpoint of 
accuracy, the median versus mean question revolves around whether the 
outliers in a distribution distort what is being described or are instead an 
important part of the message. (Once again, judgment trumps math.) Of 
course, nothing says that you must choose the median or the mean. Any 
comprehensive statistical analysis would likely present both. When just the 
median or the mean appears, it may be for the sake of brevity—or it may be 
because someone is seeking to “persuade” with statistics. 

Those of a certain age may remember the following exchange (as I recollect 
it) between the characters played by Chevy Chase and Ted Knight in the 
movie Caddyshack. The two men meet in the locker room after both have 
just come off the golf course: 

ted knight: What did you shoot? 

chevy chase: Oh, I don’t keep score. 

ted knight: Then how do you compare yourself to other golfers? 

chevy chase: By height. 

I’m not going to try to explain why this is funny. I will say that a great 
many statistical shenanigans arise from “apples and oranges” comparisons. 
Suppose you are trying to compare the price of a hotel room in London with 
the price of a hotel room in Paris. You send your six-year-old to the 
computer to do some Internet research, since she is much faster and better 
at it than you are. Your child reports back that hotel rooms in Paris are more 
expensive, around 180 a night; a comparable room in London is 150 a 
night. 

You would likely explain to your child the difference between pounds 
and euros, and then send her back to the computer to find the exchange rate 
between the two currencies so that you could make a meaningful 
comparison. (This example is loosely rooted in truth; after I paid 100 rupees 
for a pot of tea in India, my daughter wanted to know why everything in 
India was so expensive.) Obviously the numbers on currency from different
countries mean nothing until we convert them into comparable units. What
is the exchange rate between the pound and the euro, or, in the case of 
India, between the dollar and the rupee? 

This seems like a painfully obvious lesson—yet one that is routinely 
ignored, particularly by politicians and Hollywood studios. These folks 
clearly recognize the difference between euros and pounds; instead, they 
overlook a more subtle example of apples and oranges: inflation. A dollar 
today is not the same as a dollar sixty years ago; it buys much less. Because 
of inflation, something that cost $1 in 1950 would cost $9.37 in 2011. As a 
result, any monetary comparison between 1950 and 2011 without adjusting 
for changes in the value of the dollar would be less accurate than comparing 
figures in euros and pounds— since the euro and the pound are closer to 
each other in value than a 1950 dollar is to a 2011 dollar. 

This is such an important phenomenon that economists have terms to 
denote whether figures have been adjusted for inflation or not. Nominal 
figures are not adjusted for inflation. A comparison of the nominal cost of a 
government program in 1970 to the nominal cost of the same program in 
2011 merely compares the size of the checks that the Treasury wrote in 
those two years—without any recognition that a dollar in 1970 bought more 
stuff than a dollar in 2011. If we spent $10 million on a program in 1970 to 
provide war veterans with housing assistance and $40 million on the same 
program in 2011, the federal commitment to that program has actually gone 
down. Yes, spending has gone up in nominal terms, but that does not reflect 
the changing value of the dollars being spent. One 1970 dollar is equal to 
$5.83 in 2011; the government would need to spend $58.3 million on 
veterans’ housing benefits in 2011 to provide support comparable to the $10 
million it was spending in 1970. 
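The veterans' housing comparison can be worked through in Python. The only real input is the conversion factor quoted above (one 1970 dollar equals $5.83 in 2011); the constant and variable names are illustrative.

```python
# Conversion factor quoted in the text: one 1970 dollar = $5.83 in 2011.
DOLLARS_2011_PER_1970_DOLLAR = 5.83


def to_2011_dollars(amount_in_1970_dollars):
    """Convert a nominal 1970 amount into real 2011 dollars."""
    return amount_in_1970_dollars * DOLLARS_2011_PER_1970_DOLLAR


spending_1970 = 10_000_000  # nominal dollars spent in 1970
spending_2011 = 40_000_000  # nominal dollars spent in 2011

# What the 1970 commitment is worth in 2011 dollars.
needed_in_2011 = to_2011_dollars(spending_1970)

# Real change in the program once both years are in the same units.
real_change = spending_2011 / needed_in_2011 - 1

print(f"Comparable 2011 spending: ${needed_in_2011:,.0f}")
print(f"Nominal change: {spending_2011 / spending_1970 - 1:+.0%}")
print(f"Real change: {real_change:+.0%}")
```

In nominal terms spending quadrupled; once both figures are expressed in 2011 dollars, the program shrank by roughly 31 percent.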

Real figures, on the other hand, are adjusted for inflation. The most 
commonly accepted methodology is to convert all of the figures into a 
single unit, such as 2011 dollars, to make an “apples and apples” 
comparison. Many websites, including that of the U.S. Bureau of Labor 
Statistics, have simple inflation calculators that will compare the value of a 
dollar at different points in time. For a real (yes, a pun) example of how
statistics can look different when adjusted for inflation, check out the
following graph of the U.S. federal minimum wage, which plots both the
nominal value of the minimum wage and its real purchasing power in 2010
dollars.
The federal minimum wage—the number posted on the bulletin board in
some remote corner of your office—is set by Congress. This wage, 
currently $7.25, is a nominal figure. Your boss does not have to ensure that 
$7.25 buys as much as it did two years ago; he just has to make sure that 
you get a minimum of $7.25 for every hour of work that you do. It’s all 
about the number on the check, not what that number can buy. 

Yet inflation erodes the purchasing power of the minimum wage over 
time (and every other nominal wage, which is why unions typically 
negotiate “cost of living adjustments”). If prices rise faster than Congress 
raises the minimum wage, the real value of that minimum hourly payment 
will fall. Supporters of a minimum wage should care about the real value of 
that wage, since the whole point of the law is to guarantee low-wage 
workers some minimum level of consumption for an hour of work, not to 
give them a check with a big number on it that buys less than it used to. (If 
that were the case, then we could just pay low-wage workers in rupees.) 

Hollywood studios may be the most egregiously oblivious to the 
distortions caused by inflation when comparing figures at different points in 
time—and deliberately so. What were the top five highest-grossing films 
(domestic) of all time as of 2011? 5 


































1. Avatar (2009) 

2. Titanic (1997) 

3. The Dark Knight (2008) 

4. Star Wars Episode IV (1977) 

5. Shrek 2 (2004) 

Now you may feel that list looks a little suspect. These were successful 
films—but Shrek 2? Was that really a greater commercial success than 
Gone with the Wind? The Godfather? Jaws? No, no, and no. Hollywood
likes to make each blockbuster look bigger and more successful than the 
last. One way to do that would be to quote box office receipts in Indian 
rupees, which would inspire headlines such as the following: “Harry Potter 
Breaks Box Office Record with Weekend Receipts of 1.3 Trillion!” But 
even the most dim-witted moviegoers would be suspicious of figures that 
are large only because they are quoted in a currency with relatively little 
purchasing power. Instead, Hollywood studios (and the journalists who 
report on them) merely use nominal figures, which makes recent movies 
look successful largely because ticket prices are higher now than they were 
ten, twenty, or fifty years ago. (When Gone with the Wind came out in 
1939, a ticket cost somewhere in the range of $.50.) The most accurate way 
to compare commercial success over time would be to adjust ticket receipts 
for inflation. Earning $100 million in 1939 is a lot more impressive than 
earning $500 million in 2011. So what are the top grossing films in the U.S. 
of all time, adjusted for inflation? 6

1. Gone with the Wind (1939) 

2. Star Wars Episode IV (1977) 

3. The Sound of Music (1965) 

4. E.T. (1982) 

5. The Ten Commandments (1956) 

In real terms, Avatar falls to number 14; Shrek 2 falls all the way to 31st. 
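
The adjustment itself is simple arithmetic. Here is a minimal Python sketch of converting a nominal amount into the dollars of a base year using a price index. The helper name `to_real` and both index values are hypothetical placeholders chosen for illustration; they are not actual CPI figures.

```python
def to_real(nominal, index_then, index_base):
    """Express a nominal amount in base-year dollars.

    index_then: price index in the year the money was earned
    index_base: price index in the year we are comparing to
    """
    return nominal * index_base / index_then


# Hypothetical index values (NOT real CPI data): prices in the base
# year are assumed to be 16 times their level in the early year.
INDEX_EARLY = 10.0
INDEX_BASE = 160.0

early_gross = to_real(100e6, INDEX_EARLY, INDEX_BASE)   # $100 million earned early
recent_gross = to_real(500e6, INDEX_BASE, INDEX_BASE)   # $500 million earned in the base year

print(early_gross, recent_gross, early_gross > recent_gross)
```

Under these assumed index values, the older $100 million is worth more in base-year dollars than the recent $500 million. With real price data the size of the gap would differ, but the calculation is the same.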

Even comparing apples and apples leaves plenty of room for 
shenanigans. As discussed in the last chapter, one important role of 
statistics is to describe changes in quantities over time. Are taxes going up? 
How many cheeseburgers are we selling compared with last year? By how
much have we reduced the arsenic in our drinking water? We often use
percentages to express these changes because they give us a sense of scale 
and context. We understand what it means to reduce the amount of arsenic 
in the drinking water by 22 percent, whereas few of us would know whether 
reducing arsenic by one microgram (the absolute reduction) would be a 
significant change or not. Percentages don’t lie—but they can exaggerate. 
One way to make growth look explosive is to use percentage change to 
describe some change relative to a very low starting point. I live in Cook 
County, Illinois. I was shocked one day to learn that the portion of my taxes 
supporting the Suburban Cook County Tuberculosis Sanitarium District was 
slated to rise by 527 percent! However, I called off my massive antitax rally 
(which was really still in the planning phase) when I learned that this 
change would cost me less than a good turkey sandwich. The Tuberculosis 
Sanitarium District deals with roughly a hundred cases a year; it is not a 
large or expensive organization. The Chicago Sun-Times pointed out that 
for the typical homeowner, the tax bill would go from $1.15 to $6. 7 
Researchers will sometimes qualify a growth figure by pointing out that it is 
“from a low base,” meaning that any increase is going to look large by 
comparison. 

Obviously the flip side is true. A small percentage of an enormous sum 
can be a big number. Suppose the secretary of defense reports that defense 
spending will grow only 4 percent this year. Great news! Not really, given 
that the Defense Department budget is nearly $700 billion. Four percent of 
$700 billion is $28 billion, which can buy a lot of turkey sandwiches. In 
fact, that seemingly paltry 4 percent increase in the defense budget is more 
than the entire NASA budget and about the same as the budgets of the 
Labor and Treasury Departments combined. 

In a similar vein, your kindhearted boss might point out that as a matter 
of fairness, every employee will be getting the same raise this year, 10 
percent. What a magnanimous gesture—except that if your boss makes $1 
million and you make $50,000, his raise will be $100,000 and yours will be 
$5,000. The statement “everyone will get the same 10 percent raise this 
year” just sounds so much better than “my raise will be twenty times bigger 
than yours.” Both are true in this case. 
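
The arithmetic behind these examples can be sketched in a few lines of Python. The function names are my own, and the $2-to-$10 fee is a hypothetical "low base" case; the defense budget and salary figures are the ones used in the text above.

```python
def pct_change(old, new):
    """Percentage change from old to new."""
    return (new - old) * 100 / old


def absolute_change(base, pct):
    """Absolute change implied by applying a percentage to a base."""
    return base * pct / 100


# Low base (hypothetical): a fee that goes from $2 to $10.
fee_growth = pct_change(2, 10)                  # 400 percent, yet only $8

# High base: a 4 percent increase on a $700 billion budget.
defense_increase = absolute_change(700e9, 4)    # $28 billion

# The "same" 10 percent raise on two different salaries.
boss_raise = absolute_change(1_000_000, 10)     # $100,000
your_raise = absolute_change(50_000, 10)        # $5,000

print(fee_growth, defense_increase, boss_raise, your_raise)
print(boss_raise / your_raise)                  # how many times bigger the boss's raise is
```

The percentage and the absolute change describe the same event; which one sounds dramatic depends entirely on the base.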


Any comparison of a quantity changing over time must have a start point 
and an end point. One can sometimes manipulate those points in ways that 
affect the message. I once had a professor who liked to speak about his 
“Republican slides” and his “Democratic slides.” He was referring to data 
on defense spending, and what he meant was that he could organize the 
same data in different ways in order to please either Democratic or 
Republican audiences. For his Republican audiences, he would offer the 
following slide with data on increases in defense spending under Ronald 
Reagan. Clearly Reagan helped restore our commitment to defense and 
security, which in turn helped to win the Cold War. No one can look at these 
numbers and not appreciate the steely determination of Ronald Reagan to 
face down the Soviets. 


Defense Spending in Billions, 1981-1988

For the Democrats, my former professor merely used the same (nominal)
data, but a longer time frame. For this group, he pointed out that Jimmy 
Carter deserves credit for beginning the defense buildup. As the following 
“Democratic” slide shows, the defense spending increases from 1977 to 
1980 show the same basic trend as the increases during the Reagan 
presidency. Thank goodness that Jimmy Carter—a graduate of Annapolis 
and a former naval officer—began the process of making America strong 
again! 


Defense Spending in Billions, 1977-1988

Source: http://www.usgovernmentspending.com/spend.php?span=usgs302&year=1988&view=l&expand=30&expandC=&units=b&fy=fyl2&local=s&state=US&pie=#usgs302.


While the main point of statistics is to present a meaningful picture of 
things we care about, in many cases we also hope to act on these numbers. 
NFL teams want a simple measure of quarterback quality so that they can 
find and draft talented players out of college. Firms measure the 
performance of their employees so that they can promote those who are 
valuable and fire those who are not. There is a common business aphorism: 
“You can’t manage what you can’t measure.” True. But you had better be 
darn sure that what you are measuring is really what you are trying to 
manage. 

Consider school quality. This is a crucial thing to measure, since we 
would like to reward and emulate “good” schools while sanctioning or 
fixing “bad” schools. (And within each school, we have the similar 
challenge of measuring teacher quality, for the same basic reason.) The 
most common measure of quality for both schools and teachers is test 
scores. If students are achieving impressive scores on a well-conceived 
standardized test, then presumably the teacher and school are doing a fine 
job. Conversely, bad test scores are a clear signal that lots of people should 
be fired, sooner rather than later. These statistics can take us a long way 
toward fixing our public education system, right? 

Wrong. Any evaluation of teachers or schools that is based solely on test 
scores will present a dangerously inaccurate picture. Students who walk 
through the front door of different schools have vastly different
backgrounds and abilities. We know, for example, that the education and
income of a student’s parents have a significant impact on achievement, 
regardless of what school he or she attends. The statistic that we’re missing 
in this case happens to be the only one that matters for our purposes: How 
much of a student’s performance, good or bad, can be attributed to what 
happens inside the school (or inside a particular classroom)? 

Students who live in affluent, highly educated communities are going to 
test well from the moment their parents drop them off at school on the first 
day of kindergarten. The flip side is also true. There are schools with 
extremely disadvantaged populations in which teachers may be doing a 
remarkable job but the student test scores will still be low—albeit not 
nearly as low as they would have been if the teachers had not been doing a 
good job. What we need is some measure of “value-added” at the school 
level, or even at the classroom level. We don’t want to know the absolute 
level of student achievement; we want to know how much that student 
achievement has been affected by the educational factors we are trying to 
evaluate. 

At first glance, this seems an easy task, as we can simply give students a 
pretest and a posttest. If we know student test scores when they enter a 
particular school or classroom, then we can measure their performance at 
the end and attribute the difference to whatever happened in that school or 
classroom. 

Alas, wrong again. Students with different abilities or backgrounds may 
also learn at different rates. Some students will grasp the material faster 
than others for reasons that have nothing to do with the quality of the 
teaching. So if students in Affluent School A and Poor School B both start 
algebra at the same time and level, the explanation for the fact that students 
at Affluent School A test better in algebra a year later may be that the 
teachers are better, or it may be that the students were capable of learning 
faster—or both. Researchers are working to develop statistical techniques 
that measure instructional quality in ways that account appropriately for 
different student backgrounds and abilities. In the meantime, our attempts to 
identify the “best” schools can be ridiculously misleading. 

Every fall, several Chicago newspapers and magazines publish a ranking 
of the “best” high schools in the region, usually on the basis of state test 
score data. Here is the part that is laugh-out-loud funny from a statistical
standpoint: Several of the high schools consistently at the top of the
rankings are selective enrollment schools, meaning that students must apply 
to get in, and only a small proportion of those students are accepted. One of 
the most important admissions criteria is standardized test scores. So let’s 
summarize: (1) these schools are being recognized as “excellent” for having 
students with high test scores; (2) to get into such a school, one must have 
high test scores. This is the logical equivalent of giving an award to the 
basketball team for doing such an excellent job of producing tall students. 

Even if you have a solid indicator of what you are trying to measure and 
manage, the challenges are not over. The good news is that “managing by 
statistics” can change the underlying behavior of the person or institution 
being managed for the better. If you can measure the proportion of defective 
products coming off an assembly line, and if those defects are a function of 
things happening at the plant, then some kind of bonus for workers that is 
tied to a reduction in defective products would presumably change behavior 
in the right kinds of ways. Each of us responds to incentives (even if it is 
just praise or a better parking spot). Statistics measure the outcomes that 
matter; incentives give us a reason to improve those outcomes. 

Or, in some cases, just to make the statistics look better. That’s the bad 
news. 

If school administrators are evaluated—and perhaps even compensated— 
on the basis of the high school graduation rate for students in a particular 
school district, they will focus their efforts on boosting the number of 
students who graduate. Of course, they may also devote some effort to 
improving the graduation rate, which is not necessarily the same thing. For 
example, students who leave school before graduation can be classified as 
“moving away” rather than dropping out. This is not merely a hypothetical 
example; it is a charge that was leveled against former secretary of 
education Rod Paige during his tenure as the Houston school 
superintendent. Paige was hired by President George W. Bush to be U.S. 
secretary of education because of his remarkable success in Houston in 
reducing the dropout rate and boosting test scores. 

If you’re keeping track of the little business aphorisms I keep tossing 
your way, here is another one: “It’s never a good day when 60 Minutes 
shows up at your door.” Dan Rather and the 60 Minutes II crew made a trip
to Houston and found that the manipulation of statistics was far more
impressive than the educational improvement. 8 High schools routinely
classified students who quit high school as transferring to another school, 
returning to their native country, or leaving to pursue a General Equivalency 
Diploma (GED)—none of which count as dropping out in the official 
statistics. Houston reported a citywide dropout rate of 1.5 percent in the 
year that was examined; 60 Minutes calculated that the true dropout rate 
was between 25 and 50 percent. 

The statistical chicanery with test scores was every bit as impressive. 
One way to improve test scores (in Houston or anywhere else) is to improve 
the quality of education so that students learn more and test better. This is a 
good thing. Another (less virtuous) way to improve test scores is to prevent 
the worst students from taking the test. If the scores of the
lowest-performing students are eliminated, the average test score for the school or
district will go up, even if all the rest of the students show no improvement 
at all. In Texas, the statewide achievement test is given in tenth grade. There 
was evidence that Houston schools were trying to keep the weakest students 
from reaching tenth grade. In one particularly egregious example, a student 
spent three years in ninth grade and then was promoted straight to eleventh 
grade—a deviously clever way of keeping a weak student from taking a 
tenth-grade benchmark exam without forcing him to drop out (which would 
have shown up on a different statistic).
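
To see how excluding the weakest test takers moves an average, consider a minimal Python sketch. The ten scores are hypothetical, invented purely for illustration; nothing here comes from Houston's actual data.

```python
from statistics import mean

# Hypothetical test scores for ten students (illustrative, not real data).
scores = [35, 42, 48, 60, 65, 70, 72, 78, 85, 95]

all_students = mean(scores)

# Keep the three lowest scorers away from the test; nobody else's score changes.
tested = sorted(scores)[3:]
without_lowest = mean(tested)

print(all_students, without_lowest)
```

The average of the tested group is higher than the average for all ten students, even though no individual student learned anything more.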

It’s not clear that Rod Paige was complicit in this statistical trickery 
during his tenure as Houston superintendent; however, he did implement a 
rigorous accountability program that gave cash bonuses to principals who 
met their dropout and test score goals and that fired or demoted principals 
who failed to meet their targets. Principals definitely responded to the 
incentives; that’s the larger lesson. But you had better be darn certain that 
the folks being evaluated can’t make themselves look better (statistically) in 
ways that are not consistent with the goal at hand. 

The state of New York learned this the hard way. The state introduced 
“scorecards” that evaluate the mortality rates for the patients of 
cardiologists performing coronary angioplasty, a common treatment for 
heart disease. 9 This seems like a perfectly reasonable and helpful use of 
descriptive statistics. The proportion of a cardiologist’s patients who die in
surgery is an important thing to know, and it makes sense for the
government to collect and promulgate such data since individual consumers 
would not otherwise have access to it. So is this a good policy? Yes, other 
than the fact that it probably ended up killing people. 

Cardiologists obviously care about their “scorecard.” However, the 
easiest way for a surgeon to improve his mortality rate is not by killing 
fewer people; presumably most doctors are already trying very hard to keep 
their patients alive. The easiest way for a doctor to improve his mortality 
rate is by refusing to operate on the sickest patients. According to a survey 
conducted by the School of Medicine and Dentistry at the University of 
Rochester, the scorecard, which ostensibly serves patients, can also work to 
their detriment: 83 percent of the cardiologists surveyed said that, because 
of the public mortality statistics, some patients who might benefit from 
angioplasty might not receive the procedure; 79 percent of the doctors said 
that some of their personal medical decisions had been influenced by the 
knowledge that mortality data are collected and made public. The sad 
paradox of this seemingly helpful descriptive statistic is that cardiologists 
responded rationally by withholding care from the patients who needed it 
most. 

A statistical index has all the potential pitfalls of any descriptive statistic 
—plus the distortions introduced by combining multiple indicators into a 
single number. By definition, any index is going to be sensitive to how it is 
constructed; it will be affected both by what measures go into the index and 
by how each of those measures is weighted. For example, why does the 
NFL passer rating not include any measure of third down completions? And 
for the Human Development Index, how should a country’s literacy rate be 
weighted in the index relative to per capita income? In the end, the 
important question is whether the simplicity and ease of use introduced by 
collapsing many indicators into a single number outweighs the inherent 
inaccuracy of the process. Sometimes that answer may be no, which brings 
us back (as promised) to the U.S. News & World Report (USNWR) college
rankings. 

The USNWR rankings use sixteen indicators to score and rank America’s 
colleges, universities, and professional schools. In 2010, for example, the 
ranking of national universities and liberal arts colleges used “student 
selectivity” as 15 percent of the index; student selectivity is in turn
calculated on the basis of a school’s acceptance rate, the proportion of the
entering students who were in the top 10 percent of their high school class, 
and the average SAT and ACT scores of entering students. The benefit of 
the USNWR rankings is that they provide lots of information about 
thousands of schools in a simple and accessible way. Even the critics 
concede that much of the information collected on America’s colleges and 
universities is valuable. Prospective students should know an institution’s 
graduation rate and the average class size. 

Of course, providing meaningful information is an enterprise entirely 
different from that of collapsing all of that information into a single ranking 
that purports to be authoritative. To critics, the rankings are sloppily 
constructed, misleading, and detrimental to the long-term interests of 
students. “One concern is simply about its being a list that claims to rank 
institutions in numerical order, which is a level of precision that those data 
just don’t support,” says Michael McPherson, the former president of 
Macalester College in Minnesota. 10 Why should alumni giving count for 5
percent of a school’s score? And if it’s important, why does it not count for 
ten percent? 
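
A small Python sketch shows how much the choice of weights can matter. This is not the USNWR methodology: the two schools, their indicator scores, and the two-indicator index are all hypothetical, invented to show that moving one weight from 5 percent to 10 percent can reorder a ranking.

```python
def composite(scores, weights):
    """Weighted sum of indicator scores; weights are assumed to sum to 1."""
    return sum(scores[name] * weights[name] for name in weights)


# Hypothetical indicator scores on a 0-100 scale (invented for illustration).
schools = {
    "School A": {"reputation": 83, "alumni_giving": 20},
    "School B": {"reputation": 80, "alumni_giving": 60},
}


def ranking(weights):
    """School names ordered from highest to lowest composite score."""
    return sorted(schools, key=lambda s: composite(schools[s], weights), reverse=True)


giving_at_5 = {"reputation": 0.95, "alumni_giving": 0.05}
giving_at_10 = {"reputation": 0.90, "alumni_giving": 0.10}

print(ranking(giving_at_5))    # School A comes out on top
print(ranking(giving_at_10))   # School B comes out on top
```

Neither school changed; only the weights did. Any index that collapses several indicators into one number carries this sensitivity.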

According to U.S. News & World Report, “Each indicator is assigned a 
weight (expressed as a percentage) based on our judgments about which 
measures of quality matter most.” 11 Judgment is one thing; arbitrariness is 
another. The most heavily weighted variable in the ranking of national 
universities and colleges is “academic reputation.” This reputation is 
determined on the basis of a “peer assessment survey” filled out by 
administrators at other colleges and universities and from a survey of high 
school guidance counselors. In his general critique of rankings, Malcolm 
Gladwell offers a scathing (though humorous) indictment of the peer 
assessment methodology. He cites a questionnaire sent out by a former 
chief justice of the Michigan Supreme Court to roughly one hundred 
lawyers asking them to rank ten law schools in order of quality. Penn 
State’s was one of the law schools on the list; the lawyers ranked it near the 
middle. At the time, Penn State did not have a law school. 12 

For all the data collected by USNWR, it’s not obvious that the rankings 
measure what prospective students ought to care about: How much learning 
is going on at any given institution? Football fans may quibble about the
composition of the passer index, but no one can deny that its component
parts—completions, yardage, touchdowns, and interceptions—are an 
important part of a quarterback’s overall performance. That is not 
necessarily the case with the USNWR criteria, most of which focus on 
inputs (e.g., what kind of students are admitted, how much faculty are paid, 
the percentage of faculty who are full-time) rather than educational outputs. 
Two notable exceptions are the freshman retention rate and the graduation 
rate, but even those indicators do not measure learning. As Michael 
McPherson points out, “We don’t really learn anything from U.S. News 
about whether the education they got during those four years actually 
improved their talents or enriched their knowledge.” 

All of this would still be a harmless exercise, but for the fact that it 
appears to encourage behavior that is not necessarily good for students or 
higher education. For example, one statistic used to calculate the rankings is 
financial resources per student; the problem is that there is no 
corresponding measure of how well that money is being spent. An 
institution that spends less money to better effect (and therefore can charge 
lower tuition) is punished in the ranking process. Colleges and universities 
also have an incentive to encourage large numbers of students to apply, 
including those with no realistic hope of getting in, because it makes the 
school appear more selective. This is a waste of resources for the schools 
soliciting bogus applications and for students who end up applying with no 
meaningful chance of being accepted. 

Since we are about to move on to a chapter on probability, I will bet that 
the U.S. News & World Report rankings are not going away anytime soon. 
As Leon Botstein, president of Bard College, has pointed out, “People love 
easy answers. What is the best place? Number 1.” 13

The overall lesson of this chapter is that statistical malfeasance has very 
little to do with bad math. If anything, impressive calculations can obscure 
nefarious motives. The fact that you’ve calculated the mean correctly will 
not alter the fact that the median is a more accurate indicator. Judgment and 
integrity turn out to be surprisingly important. A detailed knowledge of 
statistics does not deter wrongdoing any more than a detailed knowledge of 
the law averts criminal behavior. With both statistics and crime, the bad 
guys often know exactly what they’re doing! 


* Twain attributed this phrase to British prime minister Benjamin Disraeli, but there is no record of 
Disraeli’s ever saying or writing it. 

* Available at http://www.bls.gov/data/inflation_calculator.htm. 



CHAPTER 4 


Correlation 

How does Netflix know what movies I like? 


Netflix insists that I’ll like the film Bhutto, a documentary that offers an 
“in-depth and at times incendiary look at the life and tragic death of former 
Pakistani prime minister Benazir Bhutto.” I probably will like the film 
Bhutto. (I’ve added it to my queue.) The Netflix recommendations that I’ve 
watched in the past have been terrific. And when a film is recommended 
that I’ve already seen, it’s typically one I’ve really enjoyed. 

How does Netflix do that? Is there some massive team of interns at 
corporate headquarters who have used a combination of Google and 
interviews with my family and friends to determine that I might like a 
documentary about a former Pakistani prime minister? Of course not. 
Netflix has merely mastered some very sophisticated statistics. Netflix 
doesn’t know me. But it does know what films I’ve liked in the past 
(because I’ve rated them). Using that information, along with ratings from 
other customers and a powerful computer, Netflix can make shockingly 
accurate predictions about my tastes. 

I’ll come back to the specific Netflix algorithm for making these picks; 
for now, the important point is that it’s all based on correlation. Netflix 
recommends movies that are similar to other films that I’ve liked; it also 
recommends films that have been highly rated by other customers whose 
ratings are similar to mine. Bhutto was recommended because of my five-star
ratings for two other documentaries, Enron: The Smartest Guys in the
Room and Fog of War. 

Correlation measures the degree to which two phenomena are related to 
one another. For example, there is a correlation between summer 
temperatures and ice cream sales. When one goes up, so does the other. Two 
variables are positively correlated if a change in one is associated with a 
change in the other in the same direction, such as the relationship between
height and weight. Taller people weigh more (on average); shorter people
weigh less. A correlation is negative if a positive change in one variable is 
associated with a negative change in the other, such as the relationship 
between exercise and weight. 

The tricky thing about these kinds of associations is that not every 
observation fits the pattern. Sometimes short people weigh more than tall 
people. Sometimes people who don’t exercise are skinnier than people who 
exercise all the time. Still, there is a meaningful relationship between height 
and weight, and between exercise and weight. 

If we were to do a scatter plot of the heights and weights of a random 
sample of American adults, we would expect to see something like the 
following: 


[Figure: Scatter Plot for Height and Weight; horizontal axis: Height (inches)]


If we were to create a scatter plot of the association between exercise (as 
measured by minutes of intensive exercise per week) and weight, we would 
expect a negative correlation, with those who exercise more tending to 
weigh less. But a pattern consisting of dots scattered across the page is a 
somewhat unwieldy tool. (If Netflix tried to make film recommendations 
for me by plotting the ratings for thousands of films by millions of 
customers, the results would bury the headquarters in scatter plots.) Instead, 
the power of correlation as a statistical tool is that we can encapsulate an
association between two variables in a single descriptive statistic: the
correlation coefficient. 

The correlation coefficient has two fabulously attractive characteristics. 
First, for math reasons that have been relegated to the appendix, it is a 
single number ranging from -1 to 1. A correlation of 1, often described as 
perfect correlation, means that every change in one variable is associated 
with an equivalent change in the other variable in the same direction. 

A correlation of -1, or perfect negative correlation, means that every 
change in one variable is associated with an equivalent change in the other 
variable in the opposite direction. 

The closer the correlation is to 1 or -1, the stronger the association. A 
correlation of 0 (or close to it) means that the variables have no meaningful 
association with one another, such as the relationship between shoe size and 
SAT scores. 

The second attractive feature of the correlation coefficient is that it has 
no units attached to it. We can calculate the correlation between height and 
weight—even though height is measured in inches and weight is measured 
in pounds. We can even calculate the correlation between the number of 
televisions high school students have in their homes and their SAT scores, 
which I assure you will be positive. (More on that relationship in a 
moment.) The correlation coefficient does a seemingly miraculous thing: It 
collapses a complex mess of data measured in different units (like our 
scatter plots of height and weight) into a single, elegant descriptive statistic. 

How? 

As usual, I’ve put the most common formula for calculating the 
correlation coefficient in the appendix at the end of the chapter. This is not a 
statistic that you are going to be calculating by hand. (After you’ve entered 
the data, a basic software package like Microsoft Excel will calculate the 
correlation between two variables.) Still, the intuition is not that difficult. 
The formula for calculating the correlation coefficient does the following: 

1. Calculates the mean and standard deviation for both variables. If we 
stick with the height and weight example, we would then know the 
mean height for people in the sample, the mean weight for people in 
the sample, and the standard deviation for both height and weight.

2. Converts all the data so that each observation is represented by its 
distance (in standard deviations) from the mean. Stick with me; it’s 
not that complicated. Suppose that the mean height in the sample is 
66 inches (with a standard deviation of 5 inches) and that the mean 
weight is 177 pounds (with a standard deviation of 10 pounds). Now 
suppose that you are 72 inches tall and weigh 168 pounds. We can 
also say that your height is 1.2 standard deviations above the
mean in height [(72 - 66)/5] and 0.9 standard deviations below the
mean in weight, or -0.9 for purposes of the formula [(168 - 177)/10]. 
Yes, it’s unusual for someone to be above the mean in height and 
below the mean in weight, but since you’ve paid good money for this 
book, I figured I should at least make you tall and thin. Notice that 
your height and weight, formerly in inches and pounds, have been 
reduced to 1.2 and -0.9. This is what makes the units go away. 

3. Here I’ll wave my hands and let the computer do the work. The 
formula then calculates the relationship between height and weight 
across all the individuals in the sample as measured by standard units. 
When individuals in the sample are tall, say, 1.5 or 2 standard 
deviations above the mean, what do their weights tend to be as 
measured in standard deviations from the mean for weight? And
when individuals are near to the mean in terms of height, what are 
their weights as measured in standard units? 

If the distance from the mean for one variable tends to be broadly 
consistent with distance from the mean for the other variable (e.g., people 
who are far from the mean for height in either direction tend also to be far 
from the mean in the same direction for weight), then we would expect a 
strong positive correlation. 

If distance from the mean for one variable tends to correspond to a 
similar distance from the mean for the second variable in the other direction 
(e.g., people who are far above the mean in terms of exercise tend to be far 
below the mean in terms of weight), then we would expect a strong 
negative correlation. 

If two variables do not tend to deviate from the mean in any meaningful 
pattern (e.g., shoe size and exercise) then we would expect little or no 
correlation. 
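
For readers who would rather see the steps than take them on faith, here is a minimal Python sketch of the procedure described above. It uses one common form of the calculation, the average product of paired standard units, with the standard deviation computed over the whole sample. The five-person sample and the function names are hypothetical, invented for illustration; only the 72-inch, 168-pound example comes from step 2.

```python
from statistics import mean, pstdev


def standardize(values):
    """Convert each value to its distance from the mean in standard deviations."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def correlation(xs, ys):
    """Correlation coefficient as the average product of paired standard units."""
    zx, zy = standardize(xs), standardize(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)


# The worked example from step 2: units disappear once we standardize.
z_height = (72 - 66) / 5       # 1.2 standard deviations above the mean
z_weight = (168 - 177) / 10    # 0.9 standard deviations below the mean

# Hypothetical sample of five people (invented for illustration).
heights = [60, 64, 66, 68, 72]          # inches
weights = [115, 150, 160, 175, 200]     # pounds

r = correlation(heights, weights)
print(z_height, z_weight, round(r, 3))
```

In this invented sample the taller people are consistently heavier, so the result lands close to 1. Reversing the sign of one variable would give the same magnitude with a negative sign, which is the perfect-mirror case of a negative correlation.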



You suffered mightily in that section; we’ll get back to film rentals soon. 
Before we return to Netflix, however, let’s reflect on another aspect of life 
where correlation matters: the SAT. Yes, that SAT. The SAT Reasoning 
Test, formerly known as the Scholastic Aptitude Test, is a standardized 
exam made up of three sections: math, reading, and writing. You probably 
took the SAT, or will soon. You probably did not reflect deeply on why you 
had to take the SAT. The purpose of the test is to measure academic ability 
and predict college performance. Of course, one might reasonably ask 
(particularly those who don’t like standardized tests): Isn’t that what high 
school is for? Why is a four-hour test so important when college admissions 
officers have access to four years of high school grades? 

The answer to those questions is lurking back in Chapters 1 and 2. High 
school grades are an imperfect descriptive statistic. A student who gets 
mediocre grades while taking a tough schedule of math and science classes 
may have more academic ability and potential than a student at the same 
school with better grades in less challenging classes. Obviously there are 
even larger potential discrepancies across schools. According to the College 
Board, which produces and administers the SAT, the test was created to 
“democratize access to college for all students.” Fair enough. The SAT 
offers a standardized measure of ability that can be compared easily across 
all students applying to college. But is it a good measure of ability? If we 
want a metric that can be compared easily across students, we could also 
have all high school seniors run the 100 yard dash, which is cheaper and 
easier than administering the SAT. The problem, of course, is that 
performance in the 100 yard dash is uncorrelated with college performance. 
It’s easy to get the data; they just won’t tell us anything meaningful. 

So how well does the SAT fare in this regard? Sadly for future 
generations of high school students, the SAT does a reasonably good job of 
predicting first-year college grades. The College Board publishes the 
relevant correlations. On a scale of 0 (no correlation at all) to 1 (perfect 
correlation), the correlation between high school grade point average and 
first-year college grade point average is .56. (To put that in perspective, the 
correlation between height and weight for adult men in the United States is 
about .4.) The correlation between the SAT composite score (critical 
reading, math, and writing) and first-year college GPA is also .56. 1 That 
would seem to argue for ditching the SAT, as the test does not seem to do 
any better at predicting college performance than high school grades. In 
fact, the best predictor of all is a combination of SAT scores and high 
school GPA, which has a correlation of .64 with first-year college grades. 
Sorry about that. 

One crucial point in this general discussion is that correlation does not 
imply causation; a positive or negative association between two variables 
does not necessarily mean that a change in one of the variables is causing 
the change in the other. For example, I alluded earlier to a likely positive 
correlation between a student’s SAT scores and the number of televisions 
that his family owns. This does not mean that overeager parents can boost 
their children’s test scores by buying an extra five televisions for the house. 
Nor does it likely mean that watching lots of television is good for 
academic achievement. 

The most logical explanation for such a correlation would be that highly 
educated parents can afford a lot of televisions and tend to have children 
who test better than average. Both the televisions and the test scores are 
likely caused by a third variable, which is parental education. I can’t prove 
the correlation between TVs in the home and SAT scores. (The College 
Board does not provide such data.) However, I can prove that students in 
wealthy families have higher mean SAT scores than students in less wealthy 
families. According to the College Board, students with a family income 
over $200,000 have a mean SAT math score of 586, compared with a mean 
SAT math score of 460 for students with a family income of $20,000 or 
less. 2 Meanwhile, it’s also likely that families with incomes over $200,000 
have more televisions in their (multiple) homes than families with incomes 
of $20,000 or less. 

I began writing this chapter many days ago. Since then, I’ve had a chance to 
watch the documentary film Bhutto. Wow! This is a remarkable film about 
a remarkable family. The original footage, stretching all the way from the 
partition of India and Pakistan in 1947 to the assassination of Benazir 
Bhutto in 2007, is extraordinary. Bhutto’s voice is woven effectively 
throughout the film in the form of speeches and interviews. Anyway, I gave 
the film five stars, which is pretty much what Netflix predicted. 


At the most basic level, Netflix is exploiting the concept of correlation. 
First, I rate a set of films. Netflix compares my ratings with those of other 
customers to identify those whose ratings are highly correlated with mine. 
Those customers tend to like the films that I like. Once that is established, 
Netflix can recommend films that like-minded customers have rated highly 
but that I have not yet seen. 
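
As a toy illustration of that idea (emphatically not Netflix's actual system), the sketch below finds the customer whose ratings correlate most strongly with mine and suggests the films that customer rated highly that I have not yet seen. Every customer name, film title, and rating is invented, and the four-star cutoff for "rated highly" is an assumption.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation: the average product of the two series in standard units."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n * sd_x * sd_y)

# Hypothetical star ratings (1 to 5); all names and numbers are made up.
my_ratings = {"Film A": 5, "Film B": 1, "Film C": 4, "Film D": 2}
others = {
    "customer_1": {"Film A": 5, "Film B": 2, "Film C": 5, "Film D": 1, "Film E": 5, "Film F": 2},
    "customer_2": {"Film A": 1, "Film B": 5, "Film C": 2, "Film D": 4, "Film E": 1, "Film F": 5},
}

def taste_match(mine, theirs):
    """Correlation of ratings over the films both people have rated."""
    shared = [film for film in mine if film in theirs]
    return correlation([mine[f] for f in shared], [theirs[f] for f in shared])

# The customer whose ratings track mine most closely.
best = max(others, key=lambda name: taste_match(my_ratings, others[name]))

# Films that customer rated four stars or more and that I have not rated.
recommendations = [film for film, stars in others[best].items()
                   if film not in my_ratings and stars >= 4]
print(best, recommendations)
```

A real system would draw on many customers and far more data; the point here is only that "people like me" can be made precise with a correlation coefficient.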

That’s the “big picture.” The actual methodology is much more complex. 
In fact, Netflix launched a contest in 2006 in which members of the public 
were invited to design a mechanism that improved on existing Netflix 
recommendations by at least 10 percent (meaning that the system was 10 
percent more accurate in predicting how a customer would rate a film after 
seeing it). The winner would get $1,000,000. 

Every individual or team that registered for the contest was given 
“training data” consisting of more than 100 million ratings of 18,000 films 
by 480,000 Netflix customers. A separate set of 2.8 million ratings was 
“withheld,” meaning that Netflix knew how the customers rated these films 
but the contest participants did not. The competitors were judged on how 
well their algorithms predicted the actual customer reviews for these 
withheld films. Over three years, thousands of teams from over 180 
countries submitted proposals. There were two requirements for entry. First, 
the winner had to license the algorithm to Netflix. And second, the winner 
had to “describe to the world how you did it and why it works.” 3 

In 2009 Netflix announced a winner: a seven-person team made up of 
statisticians and computer scientists from the United States, Austria, 
Canada, and Israel. Alas, I cannot describe the winning system, even in an 
appendix. The paper explaining the system is ninety-two pages long.* I’m 
impressed by the quality of the Netflix recommendations. Still, the system 
is just a super fancy variation on what people have been doing since the 
dawn of film: find someone with similar tastes and ask for a 
recommendation. You tend to like what I like, and to dislike what I dislike, 
so what did you think of the new George Clooney film? 

That is the essence of correlation. 


APPENDIX TO CHAPTER 4 


To calculate the correlation coefficient between two sets of numbers, you 
would perform the following steps, each of which is illustrated by use of the 
data on heights and weights for 15 hypothetical students in the table below. 

1. Convert the height of each student to standard units: (height - 
mean)/standard deviation. 

2. Convert the weight of each student to standard units: (weight - 
mean)/standard deviation. 

3. Calculate the product for each student of (weight in standard units) x 
(height in standard units). You should see that this number will be 
largest in absolute value when a student’s height and weight are both 
relatively far from the mean. 

4. The correlation coefficient is the sum of the products calculated 
above divided by the number of observations (15 in this case). The 
correlation between height and weight for this group of students is 
.83. Given that the correlation coefficient can range from -1 to 1, this 
is a relatively high degree of positive correlation, as we would expect 
with height and weight. 


A          B        C        D                 E                 F
Student    Height   Weight   Height in         Weight in         (Weight in standard units) x
                             standard units    standard units    (Height in standard units)

Nick       74       193       1.21              0.99              1.19
Elana      66       133      -0.63             -0.67              0.42
Dinah      68       155      -0.17             -0.06              0.01
Rebecca    69       147       0.06             -0.29             -0.02
Ben        73       175       0.98              0.49              0.48
Cham       70       128       0.29             -0.81             -0.24
Sahar      60       100      -2.00             -1.59              3.18
Maggie     63       128      -1.32             -0.81              1.07
Faisal     67       170      -0.40              0.35             -0.14
Ted        70       182       0.29              0.68              0.20
Narciso    70       178       0.29              0.57              0.17
Katrina    70       118       0.29             -1.09             -0.32
CJ         75       227       1.44              1.93              2.77
Sophia     62       115      -1.54             -1.17              1.81
Will       74       211       1.21              1.49              1.80

Mean       68.73    157.33                                       Total = 12.39
Standard
deviation  4.36     36.12

Correlation coefficient = Total/n = 12.39/15 = 0.83

The formula for calculating the correlation coefficient requires a little 
detour with regard to notation. The figure $\Sigma$, known as the summation 
sign, is a handy character in statistics. It represents the summation of the 
quantity that comes after it. For example, if there is a set of observations 
$x_1$, $x_2$, $x_3$, and $x_4$, then $\Sigma(x_i)$ tells us that we should 
sum the four observations: $x_1 + x_2 + x_3 + x_4$. Thus, 
$\Sigma(x_i) = x_1 + x_2 + x_3 + x_4$. Our formula for the mean of a set of 
i observations could be represented as the following: mean = $\Sigma(x_i)/n$. 

We can make the formula even more adaptable by writing $\sum_{i=1}^{n}(x_i)$, 
which sums the quantity $x_1 + x_2 + x_3 + \ldots + x_n$, or, in other words, 
all the terms beginning with $x_1$ (because i = 1) up to $x_n$ (because 
i = n). Our formula for the mean of a set of n observations could be 
represented as the following: 

$$\text{mean} = \frac{\sum_{i=1}^{n}(x_i)}{n}$$


Given that general notation, the formula for calculating the correlation 
coefficient, r, for two variables x and y is the following: 

$$r = \frac{1}{n}\sum_{i=1}^{n}\frac{(x_i - \bar{x})}{\sigma_x}\cdot\frac{(y_i - \bar{y})}{\sigma_y}$$

where 

n = the number of observations; 
$\bar{x}$ is the mean for variable x; 
$\bar{y}$ is the mean for variable y; 
$\sigma_x$ is the standard deviation for variable x; 
$\sigma_y$ is the standard deviation for variable y. 

Any software program with statistical tools can also calculate 
the correlation coefficient between two variables. In the student height and 
weight example, using Microsoft Excel yields the same correlation between 
height and weight for the fifteen students as the hand calculation in the 
chart above: 0.83. 
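
For readers who would rather script the calculation than build a spreadsheet, here is a minimal Python sketch of the four steps described above, using the heights and weights from the table. The function names are my own, and the standard deviation divides by n (the population form), which is what reproduces the 4.36 and 36.12 shown in the table.

```python
from math import sqrt

# Heights and weights of the 15 students, in table order (Nick through Will).
heights = [74, 66, 68, 69, 73, 70, 60, 63, 67, 70, 70, 70, 75, 62, 74]
weights = [193, 133, 155, 147, 175, 128, 100, 128, 170, 182, 178, 118, 227, 115, 211]

def mean(values):
    return sum(values) / len(values)

def std_dev(values):
    """Population standard deviation (divide by n), matching the table."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))

def standard_units(values):
    """Steps 1 and 2: (value - mean) / standard deviation for each observation."""
    m, s = mean(values), std_dev(values)
    return [(v - m) / s for v in values]

# Step 3: the product of the two standardized values for each student.
products = [h * w for h, w in zip(standard_units(heights), standard_units(weights))]

# Step 4: the sum of the products divided by the number of observations.
r = sum(products) / len(products)
print(round(r, 2))  # about 0.83
```

Because the script carries full precision rather than two-decimal roundings, individual products can differ from the table in the last digit, but the total still comes to roughly 12.39.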


* You can read it at http://www.netflixprize.com/assets/GrandPrize2009_BPC_PragmaticTheory.pdf. 


CHAPTER 5 


Basic Probability 
Don’t buy the extended warranty on your $99 printer 


In 1981, the Joseph Schlitz Brewing Company spent $1.7 million for what 
appeared to be a shockingly bold and risky marketing campaign for its 
flagging brand, Schlitz. At halftime of the Super Bowl, in front of 100 
million people around the world, the company broadcast a live taste test 
pitting Schlitz Beer against a key competitor, Michelob. 1 Bolder yet, the 
company did not pick random beer drinkers to evaluate the two beers; it 
picked 100 Michelob drinkers. This was the culmination of a campaign that 
had run throughout the NFL playoffs. 2 There were five live television taste 
tests in all, each of which had 100 consumers of a competing brand 
(Budweiser, Miller, or Michelob) conduct a blind taste test between their 
supposed favorite beer and Schlitz. Each of the beer taste-offs was 
promoted aggressively, just like the playoff game during which it would be 
held (e.g., “Watch Schlitz v. Bud, Live during the AFC Playoffs”). 

The marketing message was clear: Even beer drinkers who think they 
like another brand will prefer Schlitz in a blind taste test. For the Super 
Bowl spot, Schlitz even hired a former NFL referee to oversee the test. 
Given the risky nature of conducting blind taste tests in front of huge 
audiences on live TV, one can assume that Schlitz produced a spectacularly 
delicious beer, right? 

Not necessarily. Schlitz needed only a mediocre beer and a solid grasp of 
statistics to know that this ploy—a term I do not use lightly, even when it 
comes to beer advertising—would almost certainly work out in its favor. 
Most beers in the Schlitz category taste about the same; ironically, that is 
exactly the fact that this advertising campaign exploited. Assume that the 
typical beer drinker off the street cannot tell Schlitz from Budweiser from 
Michelob from Miller. In that case, a blind taste test between any two of the 
beers is essentially a coin flip. On average, half the taste testers will pick 
Schlitz, and half will pick the beer it is “challenging.” This fact alone would 
probably not make a particularly effective advertising campaign. (“You 
can’t tell the difference, so you might as well drink Schlitz.”) And Schlitz 
absolutely, positively would not want to do this test among its own loyal 
customers; roughly half of these Schlitz drinkers would pick the competing 
beer. It looks bad when the beer drinkers supposedly most committed to 
your brand choose a competitor in a blind taste test—which is exactly what 
Schlitz was trying to do to its competitors. 

Schlitz did something cleverer. The genius of the campaign was 
conducting the taste test exclusively among beer drinkers who stated that 
they preferred a competing beer. If the blind taste test is really just a coin 
flip, then roughly half of the Budweiser or Miller or Michelob drinkers will 
end up picking Schlitz. That makes Schlitz look really good. Half of all Bud 
drinkers like Schlitz better! 

And it looks particularly good at halftime of the Super Bowl with a 
former NFL referee (in uniform) conducting the taste test. Still, it’s live 
television. Even if the statisticians at Schlitz had determined with loads of 
previous private trials that the typical Michelob drinker will pick Schlitz 50 
percent of the time, what if the 100 Michelob drinkers taking the test at 
halftime of the Super Bowl turn out to be quirky? Yes, the blind taste test is 
the equivalent of a coin toss, but what if most of the tasters chose Michelob 
just by chance? After all, if we lined up the same 100 guys and asked them 
to flip a coin, it’s entirely possible that they would flip 85 or 90 tails. That 
kind of bad luck in the taste test would be a disaster for the Schlitz brand 
(not to mention a waste of the $1.7 million for the live television coverage). 

Statistics to the rescue! If there were some kind of statistics superhero,* 
this is when he or she would have swooped into the Schlitz corporate 
headquarters and unveiled the details of what statisticians call a binomial 
experiment (also called a Bernoulli trial). The key characteristics of a 
binomial experiment are that we have a fixed number of trials (e.g., 100 
taste testers), each with two possible outcomes (Schlitz or Michelob), and 
the probability of “success” is the same in each trial. (I am assuming the 
probability of picking one beer or the other is 50 percent, and I am defining 
“success” as a tester picking Schlitz.) We also assume that all the “trials” 
are independent, meaning that one blind taste tester’s decision has no 
impact on any other tester’s decision. 

With only this information, a statistical superhero can calculate the 
probability of all the different outcomes for the 100 trials, such as 52 Schlitz 
and 48 Michelob or 31 Schlitz and 69 Michelob. Those of us who are not 
statistical superheroes can use a computer to do the same thing. The 
chances of all 100 taste testers picking Michelob were 1 in 
1,267,650,600,228,229,401,496,703,205,376. There was probably a bigger 
chance that all of the testers would be killed at halftime by an asteroid. 
More important, the same basic calculations can give us the cumulative 
probability for a range of outcomes, such as the chances that 40 or fewer 
testers pick Schlitz. These numbers would clearly have assuaged the fears 
of the Schlitz marketing folks. 

Let’s assume that Schlitz would have been pleased if at least 40 of the 
100 tasters picked Schlitz—an impressive number given that all of the men 
taking the live blind taste test had professed to be Michelob drinkers. An 
outcome at least that good was highly likely. If the taste test is really like a 
flip of the coin, then basic probability tells us that there was a 98 percent 
chance that at least 40 of the tasters would pick Schlitz, and an 86 percent 
chance that at least 45 of the tasters would.* In theory, this wasn’t a very 
risky gambit at all. 
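
The figures quoted above can be reproduced with a few lines of Python. This is a sketch under the same assumption the text makes, namely that each of the 100 tasters independently picks Schlitz with probability 0.5; the function names are my own.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def prob_at_least(k_min, n, p):
    """Cumulative probability of k_min or more successes."""
    return sum(binomial_pmf(k, n, p) for k in range(k_min, n + 1))

n, p = 100, 0.5  # 100 tasters, each a coin flip

# All 100 picking Michelob is one sequence out of 2^100 equally likely ones.
odds_all_michelob = 2**n

p_at_least_40 = prob_at_least(40, n, p)  # roughly 0.98
p_at_least_45 = prob_at_least(45, n, p)  # roughly 0.86

print(f"1 in {odds_all_michelob:,}")
print(f"{p_at_least_40:.2f} {p_at_least_45:.2f}")
```

Changing p to, say, 0.45 shows how quickly the comfortable margin erodes if Michelob drinkers really can taste the difference.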

So what happened to Schlitz? At halftime of the 1981 Super Bowl, 
exactly 50 percent of the Michelob drinkers chose Schlitz in the blind taste 
test. 

There are two important lessons here: probability is a remarkably 
powerful tool, and many leading beers in the 1980s were indistinguishable 
from one another. This chapter will focus primarily on the first lesson. 

Probability is the study of events and outcomes involving an element of 
uncertainty. Investing in the stock market involves uncertainty. So does 
flipping a coin, which may come up heads or tails. Flipping a coin four 
times in a row involves additional layers of uncertainty, because each of the 
four flips can result in a head or a tail. If you flip a coin four times in a row, 
I cannot know the outcome in advance with certainty (nor can you). Yet I 
can determine in advance that some outcomes (two heads, two tails) are 
more likely than others (four heads). As the folks at Schlitz reckoned, those 
kinds of probability-based insights can be extremely helpful. In fact, if you 
can understand why the probability of flipping four heads in a row with a 
fair coin is 1 in 16, you can (with some work) understand everything from 
how the insurance industry works to whether a pro football team should 
kick the extra point after a touchdown or go for a two-point conversion. 
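
One way to convince yourself of the 1-in-16 figure is to list every possible sequence of four flips and count. The sketch below does exactly that; it assumes a fair coin, so all sixteen sequences are equally likely, and the helper name is my own.

```python
from fractions import Fraction
from itertools import product

# Every possible sequence of four flips, e.g. ('H', 'T', 'T', 'H').
flips = list(product("HT", repeat=4))

def probability(event):
    """Share of the equally likely sequences for which event(sequence) is true."""
    favorable = sum(1 for outcome in flips if event(outcome))
    return Fraction(favorable, len(flips))

p_four_heads = probability(lambda o: o.count("H") == 4)  # 1/16
p_two_heads = probability(lambda o: o.count("H") == 2)   # 6/16, i.e. 3/8

print(len(flips), p_four_heads, p_two_heads)
```

Two heads and two tails can arrive in six different orders, while four heads can arrive in only one, which is why the former is six times as likely.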

Let’s start with the easy part: Many events have known probabilities. The 
probability of flipping heads with a fair coin is 1/2. The probability of rolling 
a one with a single die is 1/6. Other events have probabilities that can be 
inferred on the basis of past data. The probability of successfully kicking 
the extra point after touchdown in professional football is .94, meaning that 
kickers make, on average, 94 out of every 100 extra-point attempts. 
(Obviously this figure might vary slightly for different kickers, under 
different weather circumstances, and so on, but it’s not going to change 
radically.) Simply having and appreciating this kind of information can 
often clarify decision making and render risks explicit. For example, the 
Australian Transport Safety Bureau published a report quantifying the 
fatality risks for different modes of transport. Despite widespread fear of 
flying, the risks associated with commercial air travel are tiny. Australia 
hasn’t had a commercial air fatality since the 1960s, so the fatality rate per 
100 million kilometers traveled is essentially zero. The rate for drivers is .5 
fatalities per 100 million kilometers traveled. The really impressive number 
is for motorcycles—if you aspire to be an organ donor. The fatality rate is 
thirty-five times higher for motorcycles than for cars. 3 

In September of 2011, a 6.5-ton NASA satellite was plummeting to earth 
and was expected to break apart once it hit the earth’s atmosphere. What 
were the chances of being struck by the debris? Should I have kept the kids 
home from school? The rocket scientists at NASA estimated that the 
probability of any individual person’s being hit by a part of the falling 
satellite was 1 in 21 trillion. Yet the chances that anyone anywhere on earth 
might get hit were a more sobering 1 in 3,200.* In the end, the satellite did 
break apart on reentry, but scientists aren’t entirely certain where all the 
pieces ended up. 4 No one reported being hurt. Probabilities do not tell us 
what will happen for sure; they tell us what is likely to happen and what is 
less likely to happen. Sensible people can make use of these kinds of 
numbers in business and life. For example, when you hear on the radio that 
a satellite is plummeting to earth, you should not race home on your 
motorcycle to warn the family. 

When it comes to risk, our fears do not always track with what the 
numbers tell us we should be afraid of. One of the striking findings from 
Freakonomics, by Steve Levitt and Stephen Dubner, was that swimming 
pools in the backyard are far more dangerous than guns in the closet. Levitt 
and Dubner calculate that a child under ten is one hundred times more 
likely to die in a swimming pool than from a gun accident. 5 An intriguing 
paper by three Cornell researchers, Garrick Blalock, Vrinda Kadiyali, and 
Daniel Simon, found that thousands of Americans may have died since the 
September 11 attacks because they were afraid to fly. 6 We will never know 
the true risks associated with terrorism; we do know that driving is 
dangerous. When more Americans opted to drive rather than to fly after 
9/11, there were an estimated 344 additional traffic deaths per month in 
October, November, and December of 2001 (taking into account the 
average number of fatalities and other factors that typically contribute to 
road accidents, such as weather). This effect dissipated over time, 
presumably as the fear of terrorism diminished, but the authors of the study 
estimate that the September 11 attacks may have caused more than 2,000 
driving deaths. 

Probability can also sometimes tell us after the fact what likely happened 
and what likely did not happen—as in the case of DNA analysis. When the 
technicians on CSI: Miami find a trace of saliva on an apple core near a 
murder victim, that saliva does not have the murderer’s name on it, even 
when viewed under a powerful microscope by a very attractive technician. 
Instead, the saliva (or hair, or skin, or bone fragment) will contain a DNA 
segment. Each DNA segment in turn has regions, or loci, that can vary from 
individual to individual (except for identical twins, who share the same 
DNA). When the medical examiner reports that a DNA sample is a 
“match,” that’s only part of what the prosecution has to prove. Yes, the loci 
tested on the DNA sample from the crime scene must match the loci on the 
DNA sample taken from the suspect. However, the prosecutors must also 
prove that the match between the two DNA samples is not merely a 
coincidence. 


Humans share similarities in their DNA, just as we share other 
similarities: shoe size, height, eye color. (More than 99 percent of all DNA 
is identical among all humans.) If researchers have access to only a small 
sample of DNA on which only a few loci can be tested, it’s possible that 
thousands or even millions of individuals may share that genetic fragment. 
Therefore, the more loci that can be tested, and the more natural genetic 
variation there is in each of those loci, the more certain the match becomes. 
Or, to put it a bit differently, the less likely it becomes that the DNA sample 
will match more than one person. 7 

To get your mind around this, imagine that your “DNA number” consists 
of your phone number attached to your Social Security number. This 
nineteen-digit sequence uniquely identifies you. Consider each digit a 
“locus” with ten possibilities: 0, 1, 2, 3, and so on. Now suppose that crime 
scene investigators find the remnant of a “DNA number” at a crime scene: 
_ _ 4 5 9 _ 4 _ 0 _ 9 8 1 7 _. This happens to match exactly with your 
“DNA number.” Are you guilty? 

You should see three things. First, anything less than a full match of the 
entire genome leaves some room for uncertainty. Second, the more “loci” 
that can be tested, the less uncertainty remains. And third, context matters. 
This match would be extremely compelling if you also happened to be 
caught speeding away from the crime scene with the victim’s credit cards in 
your pocket. 

When researchers have unlimited time and resources, the typical process 
involves testing thirteen different loci. The chances that two people share 
the same DNA profile across all thirteen loci are extremely low. When 
DNA was used to identify the remains found in the World Trade Center 
after September 11, samples found at the scene were matched to samples 
provided by family members of the victims. The probability required to 
establish positive identification was one in a billion, meaning that the 
probability that the discovered remains belonged to someone other than the 
identified victim had to be judged as one in one billion or less. Later in the 
search, this standard was relaxed, as there were fewer unidentified victims 
with whom the remains could be confused. 

When resources are limited, or the available DNA sample is too small or 
too contaminated for thirteen loci to be tested, things get more interesting 
and controversial. The Los Angeles Times ran a series in 2008 examining 
the use of DNA as criminal evidence. 8 In particular, the Times questioned 
whether the probabilities typically used by law enforcement understate the 
likelihood of coincidental matches. (Since no one knows the DNA profile 
of the entire population, the probabilities presented in court by the FBI and 
other law enforcement entities are estimates.) The intellectual pushback was 
instigated when a crime lab analyst in Arizona running tests with the state’s 
DNA database discovered two unrelated felons whose DNA matched at 
nine loci; according to the FBI, the chances of a nine-loci match between 
two unrelated persons are 1 in 113 billion. Subsequent searches of other 
DNA databases turned up more than a thousand human pairs with genetic 
matches at nine loci or more. I’ll leave this issue for law enforcement and 
defense lawyers to work out. For now, the lesson is that the dazzling science 
of DNA analysis is only as good as the probabilities used to support it. 

Often it is extremely valuable to know the likelihood of multiple events’ 
happening. What is the probability that the electricity goes out and the 
generator doesn’t work? The probability of two independent events’ both 
happening is the product of their respective probabilities. In other words, 
the probability of Event A happening and Event B happening is the 
probability of Event A multiplied by the probability of Event B. An 
example makes it much more intuitive. If the probability of flipping heads 
with a fair coin is 1/2, then the probability of flipping heads twice in a row is 
1/2 x 1/2, or 1/4. The probability of flipping three heads in a row is 1/8, the 
probability of four heads in a row is 1/16, and so on. (You should see that 
the probability of throwing four tails in a row is also 1/16.) This explains 
why the system administrator at your school or office is constantly on your 
case to improve the “quality” of your password. If you have a six-digit 
password using only numerical digits, we can calculate the number of 
possible passwords: 10 x 10 x 10 x 10 x 10 x 10, which equals 10⁶, or 
1,000,000. That sounds like a lot of possibilities, but a computer could blow 
through all 1,000,000 possible combinations in a fraction of a second. 

So let’s suppose that your system administrator harangues you long 
enough that you include letters in your password. At that point, each of the 
6 digits now has 36 combinations: 26 letters and 10 digits. The number of 
possible passwords grows to 36 x 36 x 36 x 36 x 36 x 36, or 36⁶, which is 
over two billion. If your administrator demands eight digits and urges you 
to use symbols like #, % and !, as the University of Chicago does, the 
number of potential passwords climbs to 46⁸, or just over 20 trillion. 
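
The password counts follow the same multiplication rule, and a short sketch makes the growth easy to see. The guessing speed below is a hypothetical figure picked purely for illustration; it is not a measurement of any real attacker or system.

```python
def count_passwords(alphabet_size, length):
    """Number of distinct passwords: one independent choice per position."""
    return alphabet_size ** length

digits_only = count_passwords(10, 6)         # six numerical digits
letters_and_digits = count_passwords(36, 6)  # 26 letters + 10 digits
with_symbols = count_passwords(46, 8)        # 46 possible characters, eight positions

# Hypothetical attacker speed, chosen only for illustration.
GUESSES_PER_SECOND = 1_000_000_000

for label, total in [("digits only", digits_only),
                     ("letters and digits", letters_and_digits),
                     ("with symbols", with_symbols)]:
    print(f"{label}: {total:,} passwords, "
          f"{total / GUESSES_PER_SECOND:,.3f} seconds to try them all")
```

At that assumed speed, the all-digit password falls in a thousandth of a second, while the eight-character password with symbols takes hours. Both the longer length and the larger character set contribute to the difference.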

There is one crucial distinction here. This formula is applicable only if 
the events are independent, meaning that the outcome of one has no effect 
on the outcome of another. For example, the probability that you throw 
heads on the first flip does not change the likelihood of your throwing heads 
on the second flip. On the other hand, the probability that it rains today is 
not independent of whether it rained yesterday, since storm fronts can last 
for days. Similarly, the probability of crashing your car today and crashing 
your car next year are not independent. Whatever caused you to crash this 
year might also cause you to crash next year; you might be prone to drunk 
driving, drag racing, texting while driving, or just driving badly. (This is 
why your auto insurance rates go up after an accident; it is not simply that 
the company wants to recover the money that it has paid out for the claim; 
rather, it now has new information about your probability of crashing in the 
future, which—after you’ve driven the car through your garage door—has 
gone up.) 

Suppose you are interested in the probability that one event happens or 
another event happens: outcome A or outcome B (this time assuming that they 
are mutually exclusive, meaning that both cannot happen at once). In this 
case, the probability of getting A or B consists of the sum of their 
individual probabilities: the probability of A plus the probability of B. 
For example, the likelihood of throwing a 1, 2, or 3 with a single die is 
the sum of their individual probabilities: 1/6 + 1/6 + 1/6 = 3/6 = 1/2. 
This should make intuitive sense. There are six possible outcomes for the 
roll of a die. The numbers 1, 2, and 3 collectively make up half of those 
possible outcomes. Therefore you have a 50 percent chance of rolling a 1, 
2, or 3. If you are playing craps in Las Vegas, the chance of rolling a 7 or 11 
in a single throw is the number of combinations that sum to 7 or 11 divided 
by the total number of combinations that can be thrown with two dice, or 
8/36. 
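
The craps figure can be checked by brute force: list all 36 equally likely pairs from two fair dice and count the ones that sum to 7 or 11. The sketch below does so, with a helper name of my own choosing.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (first die, second die) pairs.
rolls = list(product(range(1, 7), repeat=2))

def probability(event):
    """Share of the 36 pairs for which event(pair) is true."""
    return Fraction(sum(1 for r in rolls if event(r)), len(rolls))

p_seven = probability(lambda r: sum(r) == 7)      # 6 of 36 pairs
p_eleven = probability(lambda r: sum(r) == 11)    # 2 of 36 pairs
p_seven_or_eleven = probability(lambda r: sum(r) in (7, 11))

# A single throw cannot total both 7 and 11, so the probabilities simply add.
print(p_seven, p_eleven, p_seven_or_eleven)  # Fraction reduces 8/36 to 2/9
```

The addition works here only because the two outcomes cannot occur on the same throw; for events that can overlap, adding the individual probabilities would count the overlap twice.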


Probability also enables us to calculate what might be the most useful 
tool in all of managerial decision making, particularly finance: expected 
value. The expected value takes basic probability one step further. The 
expected value or payoff from some event, say purchasing a lottery ticket, is 
the sum of all the different outcomes, each weighted by its probability and 
payoff. As usual, an example makes this clearer. Suppose you are invited to 
play a game in which you roll a single die. The payoff to this game is $1 if 
you roll a 1; $2 if you roll a 2; $3 if you roll a 3; and so on. What is the 
expected value for a single roll of the die? Each possible outcome has a 1/6 
probability, so the expected value is: 

1/6 ($1) + 1/6 ($2) + 1/6 ($3) + 1/6 ($4) + 1/6 ($5) + 1/6 ($6) = 21/6, or $3.50. 

At first glance, the expected value of $3.50 might appear to be a 
relatively useless figure. After all, you can’t actually earn $3.50 with a 
single roll of the die (since your payoff has to be a whole number). In fact, 
the expected value turns out to be extremely powerful because it can tell 
you whether a particular event is “fair,” given its price and expected 
outcome. Suppose you have the chance to play the above game for $3 a 
throw. Does it make sense to play? Yes, because the expected value of the 
outcome ($3.50) is higher than the cost of playing ($3.00). This does not 
guarantee that you will make money by playing once, but it does help 
clarify which risks are worth taking and which are not. 
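
The same arithmetic in a few lines of Python, as a minimal sketch of the die game just described (the variable names are my own):

```python
from fractions import Fraction

# Payoff in dollars for each face of the die: roll a 3, win $3.
payoffs = {face: face for face in range(1, 7)}
p_each = Fraction(1, 6)  # a fair die: every face is equally likely

# Expected value: each payoff weighted by its probability, then summed.
expected_value = sum(p_each * payoff for payoff in payoffs.values())

cost_to_play = 3
expected_profit = expected_value - cost_to_play

print(float(expected_value), float(expected_profit))  # 3.5 0.5
```

An expected profit of fifty cents per throw is what makes the game worth playing at $3; at a price of $4 the sign flips and the game becomes a losing proposition on average.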

We can take this hypothetical example and apply it to professional 
football. As noted earlier, after a touchdown, teams have a choice between 
kicking an extra point and attempting a two-point conversion. The former 
involves kicking the ball through the goalposts from the three yard line; the 
latter involves running or passing it into the end zone from the three yard 
line, which is significantly more difficult. Teams can choose the easy option 
and get one point, or they can choose the harder option and get two points. 
What to do? 

Statisticians may not play football or date cheerleaders, but they can 
provide statistical guidance for football coaches. 9 As pointed out earlier, the 
probability of making the kick after a touchdown is .94. This means that the 
expected value of a point-after attempt is also .94, since it equals the payoff 
(1 point) multiplied by the probability of success (.94). No team ever scores 
.94 points, but this figure is helpful in quantifying the value of attempting 
this option after a touchdown relative to the alternative, which is the 
two-point conversion. The expected value of “going for two” is much lower: 
.74. Yes, the payoff is higher (2 points), but the success rate is dramatically 
lower (.37). Obviously if there is one second left in the game and a team is 
behind by two points after scoring a touchdown, it has no choice but to go 
for a two-point conversion. But if a team’s goal is to maximize points 
scored over time, then kicking the extra point is the strategy that will do 
that. 
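
The comparison can be laid out as a small calculation using the success rates quoted above (.94 for the kick and .37 for the two-point try). The break-even line is my own addition: it shows how often a two-point attempt would need to succeed to match the kick.

```python
def expected_points(points, success_rate):
    """Expected value of an attempt: payoff multiplied by probability of success."""
    return points * success_rate

extra_point = expected_points(1, 0.94)
two_point = expected_points(2, 0.37)

# Success rate at which going for two would equal the kick's expected value.
break_even_rate = extra_point / 2

print(f"{extra_point:.2f} {two_point:.2f} {break_even_rate:.2f}")  # 0.94 0.74 0.47
```

Under these numbers, a team would need to convert about 47 percent of its two-point attempts before going for two became the better long-run strategy.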

The same basic analysis can illustrate why you should never buy a lottery 
ticket. In Illinois, the probabilities associated with the various possible 
payoffs for the game are printed on the back of each ticket. I purchased a $1 
instant ticket. (Note to self: Is this tax deductible?) On the back—in tiny, 
tiny print—are the chances of winning different cash prizes, or a free new 
ticket: 1 in 10 (free ticket); 1 in 15 ($2); 1 in 42.86 ($4); 1 in 75 ($5); and so 
on up to the 1 in 40,000 chance of winning $1,000. I calculated the 
expected payout for my instant ticket by adding up each possible cash prize 
weighted by its probability.* It turns out that my $1 lottery ticket has an 
expected payout of roughly $.56, making it an absolutely miserable way to 
spend $1. As luck would have it, I won $2. 
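For readers who want to check the arithmetic, here is a rough sketch in Python. The prize table is copied from the footnote to this passage. Treating the free ticket as worth one tenth of the cash value follows the footnote's shortcut and is an approximation, not an official lottery formula.

```python
# (odds denominator, cash prize in dollars), as listed in the footnote.
PRIZES = [
    (15, 2), (42.86, 4), (75, 5), (200, 10), (300, 25),
    (1589.40, 50), (8000, 100), (16000, 200), (48000, 500), (40000, 1000),
]
FREE_TICKET_ODDS = 10  # 1 in 10 chance of winning another ticket

# Each cash prize weighted by its probability (1 / odds).
cash_ev = sum(prize / odds for odds, prize in PRIZES)

# Footnote's shortcut: the free ticket is worth the cash value of one ticket.
total_ev = cash_ev + cash_ev / FREE_TICKET_ODDS

print(f"cash prizes only: ${cash_ev:.2f}")
print(f"with free ticket: ${total_ev:.2f}")
```

Because the footnote rounds each term to the nearest cent before adding, the unrounded totals here come out about a cent higher than the footnote's figures. Either way, the ticket returns far less than the $1 it costs.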

My $2 prize notwithstanding, buying the ticket was a stupid thing to do. 
This is one of the crucial lessons of probability. Good decisions—as 
measured by the underlying probabilities—can turn out badly. And bad 
decisions—like spending $1 on the Illinois lottery—can still turn out well, 
at least in the short run. But probability triumphs in the end. An important 
theorem known as the law of large numbers tells us that as the number of 
trials increases, the average of the outcomes will get closer and closer to its 
expected value. Yes, I won $2 playing the lotto today. And I might win $2 
again tomorrow. But if I buy thousands of $1 lottery tickets, each with an 
expected payout of $.56, then it becomes a near mathematical certainty that 
I will lose money. By the time I’ve spent $1 million on tickets, I’m going to 
end up with something strikingly close to $560,000. 
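The law of large numbers is easy to watch in a simulation. The sketch below is illustrative only: it uses the cash prizes from the lottery footnote, ignores the free-ticket prize, and assumes each ticket wins at most one prize. Under those assumptions the expected cash payout is a little over 50 cents per ticket. The function name and the seed are arbitrary.

```python
import random

# (odds denominator, cash prize in dollars), from the lottery footnote.
PRIZES = [
    (15, 2), (42.86, 4), (75, 5), (200, 10), (300, 25),
    (1589.40, 50), (8000, 100), (16000, 200), (48000, 500), (40000, 1000),
]


def average_cash_payout(n_tickets, seed=0):
    """Average cash won per ticket over n_tickets simulated tickets."""
    rng = random.Random(seed)
    amounts = [prize for _, prize in PRIZES] + [0]
    weights = [1 / odds for odds, _ in PRIZES]
    weights.append(1 - sum(weights))  # probability of winning no cash
    draws = rng.choices(amounts, weights=weights, k=n_tickets)
    return sum(draws) / n_tickets


for n in (100, 10_000, 1_000_000):
    print(f"{n:>9,} tickets: average payout ${average_cash_payout(n):.3f}")
```

With a few hundred tickets the average can land almost anywhere. With a million it should sit very close to the expected value, which is the whole point.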

The law of large numbers explains why casinos always make money in 
the long run. The probabilities associated with all casino games favor the 
house (assuming that the casino can successfully prevent blackjack players 
from counting cards). If enough bets are wagered over a long enough time, 
the casino will be certain to win more than it loses. The law of large 
numbers also demonstrates why Schlitz was much better off doing 100 
blind taste tests at halftime of the Super Bowl rather than just 10. Check out 
the “probability density functions” for a Schlitz type of test with 10, 100, 
and 1,000 trials. (Although it sounds fancy, a probability density function 
merely plots the assorted outcomes along the x-axis and the expected 
probability of each outcome on the y-axis; the weighted probabilities—each 
outcome multiplied by its expected frequency—will add up to 1.) Again 
I’m assuming that the taste test is just like a coin flip and each tester has a 
.5 probability of choosing Schlitz. As you can see below, the expected 
outcome converges around 50 percent of tasters’ choosing Schlitz as the 
number of tasters gets larger. At the same time, the probability of getting an 
outcome that deviates sharply from 50 percent falls sharply as the number 
of trials gets large. 


[Figure: probability distributions for the share of tasters choosing Schlitz, in three panels: 10 Trials, 100 Trials, 1,000 Trials] 



I stipulated earlier that Schlitz executives would be happy if 40 percent 
or more of the Michelob drinkers chose Schlitz in the blind taste test. The 
figures below reflect the probability of getting that outcome as the number 
of tasters gets larger: 

10 blind taste testers: .83 
100 blind taste testers: .98 
1,000 blind taste testers: .9999999999 
1,000,000 blind taste testers: 1 
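These figures can be reproduced with the binomial distribution. The sketch below assumes, as the text does, that each taster is a fair coin flip (probability .5 of choosing Schlitz). It uses exact integer arithmetic rather than an online calculator, and the function name is an illustrative choice.

```python
from math import comb


def prob_at_least_share(n_tasters, min_percent=40):
    """P(at least min_percent% of n_tasters pick Schlitz), each with p = .5."""
    # Smallest whole number of tasters that reaches the target share.
    min_count = -(-min_percent * n_tasters // 100)  # integer ceiling
    favorable = sum(comb(n_tasters, k) for k in range(min_count, n_tasters + 1))
    return favorable / 2 ** n_tasters  # all 2**n outcomes are equally likely


for n in (10, 100, 1000):
    print(f"{n:>5} tasters: {prob_at_least_share(n):.10f}")
```

The million-taster case is left out because the exact sum gets slow at that size. By then the probability is indistinguishable from 1.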

By now the intuition is obvious behind the chapter subtitle, “Don’t buy 
the extended warranty on your $99 printer.” Okay, maybe that’s not so 
obvious. Let me back up. The entire insurance industry is built on 
probability. (A warranty is just a form of insurance.) When you insure 
anything, you are contracting to receive some specified payoff in the event 
of a clearly defined contingency. For example, your auto insurance will 
replace your car in the event that it gets stolen or crushed by a tree. In 
exchange for this guarantee, you agree to pay some fixed amount of money 
for the period in which you are insured. The key idea is that in exchange for 
a regular and predictable payment, you have transferred to the insurance 
company the risk of having your car stolen, crushed, or even totaled by your 
own bad driving. 

Why are these companies willing to assume such risks? Because they 
will earn large profits in the long run if they price their premiums correctly. 
Obviously some cars insured by Allstate will get stolen. Others will get 
totaled when their owners drive over a fire hydrant, as happened to my high 
school girlfriend. (She also had to replace the fire hydrant, which is far 
more expensive than you might think.) But most cars insured by Allstate or 
any other company will be just fine. To make money, the insurance 
company need only collect more in premiums than it pays out in claims. 
And to do that, the firm must have a solid grasp of what is known in 
industry jargon as the “expected loss” on every policy. This is exactly the 
same concept as expected value, only with an insurance twist. If your car is 
insured for $40,000, and the chances of its getting stolen in any given year 
are 1 in 1,000, then the annual expected loss on your car is $40. The annual 
premium for the theft portion of the coverage needs to be more than $40. At 
that point, the insurance company becomes just like the casino or the 
Illinois lottery. Yes, there will be payouts, but over the long run what comes 
in will be more than what goes out. 

As a consumer, you should recognize that insurance will not save you 
money in the long run. What it will do is prevent some unacceptably high 
loss, such as replacing a $40,000 car that was stolen or a $350,000 house 
that burned down. Buying insurance is a “bad bet” from a statistical 
standpoint since you will pay the insurance company, on average, more than 
you get back. Yet it can still be a sensible tool for protecting against 
outcomes that would otherwise wreck your life. Ironically, someone as rich 
as Warren Buffett can save money by not purchasing car insurance, 
homeowner’s insurance, or even health insurance because he can afford 
whatever bad things might happen to him. 

Which finally brings us back to your $99 printer! We’ll assume that 
you’ve just picked out the perfect new laser printer at Best Buy or some 
other retailer.* When you reach the checkout counter, the sales assistant will 
offer you a series of extended warranty options. For another $25 or $50, 
Best Buy will fix or replace the printer should it break in the next year or 
two. On the basis of your understanding of probability, insurance, and basic 
economics, you should immediately be able to surmise all of the following: 
(1) Best Buy is a for-profit business that seeks to maximize profits. (2) The 
sales assistant is eager for you to buy the extended warranty. (3) From 
numbers 1 and 2, we can infer that the cost of the warranty to you is greater 
than the expected cost of fixing or replacing the printer for Best Buy. If this 
were not the case, Best Buy would not be so aggressive in trying to sell it to 
you. (4) If your $99 printer breaks and you have to pay out of pocket to fix 
or replace it, this will not meaningfully change your life. 

On average, you’ll pay more for the extended warranty than you would 
to repair the printer. The broader lesson—and one of the core lessons of 
personal finance—is that you should always insure yourself against any 
adverse contingency that you cannot comfortably afford to withstand. You 
should skip buying insurance on everything else. 

Expected value can also help us untangle complex decisions that involve 
many contingencies at different points in time. Suppose a friend of yours 
has asked you to invest $1 million in a research venture examining a new 
cure for male pattern baldness. You would probably ask what the likelihood 
of success will be; you’ll get a complicated answer. This is a research 
project, so there is only a 30 percent chance that the team will discover a 
cure that works. If the team does not find a cure, you will get $250,000 of 
your investment back, as those funds will have been reserved for taking the 
drug to market (testing, marketing, etc.). Even if the researchers are 
successful, there is only a 60 percent chance that the U.S. Food and Drug 
Administration will approve the new miracle baldness cure as safe for use 
on humans. Even then, if the drug is safe and effective, there is a 10 percent 
chance that a competitor will come to market with a better drug at about the 
same time, wiping out any potential profits. If everything goes well—the 
drug is safe, effective, and unchallenged by competitors—then the best 
estimate on the return on your investment is $25 million. 

Should you make the investment? 

This seems like a muddle of information. The potential payday is huge— 
25 times your initial investment—but there are so many potential pitfalls. A 



decision tree can help organize this kind of information and—if the 
probabilities associated with each outcome are correct—give you a 
probabilistic assessment of what you ought to do. The decision tree maps 
out each source of uncertainty and the probabilities associated with all 
possible outcomes. The end of the tree gives us all the possible payoffs and 
the probability of each. If we weight each payoff by its likelihood, and sum 
all the possibilities, we will get the expected value of this investment 
opportunity. As usual, the best way to understand this is to take a look. 

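The Investment Decision 

$1 million investment 

    Cure discovered (.3), FDA approval (.6), no competitor (.9): $25 million 
        (.3)(.6)(.9)($25 million) = (.162)($25 million) = $4,050,000 

    Cure discovered (.3), FDA approval (.6), competitor wins (.1): $0 
        (.3)(.6)(.1)($0) = (.018)($0) = $0 

    Cure discovered (.3), no FDA approval (.4): $0 
        (.3)(.4)($0) = (.12)($0) = $0 

    No cure (.7): $250,000 
        (.7)($250,000) = $175,000 

Expected payoff = $4,050,000 + $0 + $0 + $175,000 
                = $4,225,000 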


This particular opportunity has an attractive expected value. The 
weighted payoff is $4.225 million. Still, this investment may not be the 
wisest thing to do with the college tuition money that you’ve squirreled 
away for your children. The decision tree lets you know that your expected 
payoff is far higher than what you are being asked to invest. On the other 
hand, the most likely outcome, meaning the one that will happen most 
often, is that the company will not discover a cure for baldness and you will 
get only $250,000 back. Your appetite for this investment might depend on 
your risk profile. The law of large numbers suggests that an investment 
firm, or a rich individual like Warren Buffett, should seek out hundreds of 
opportunities like this with uncertain outcomes but attractive expected 
returns. Some will work; many won’t. On average, these investors will 
make a lot of money, just like an insurance company or a casino. If the 
expected payoff is in your favor, more trials are always better. 
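The decision tree for the baldness venture can also be written out in a few lines of Python. This is a sketch using the probabilities and payoffs stated in the text; the branch labels are my own shorthand. Fractions are used so that the arithmetic is exact.

```python
from fractions import Fraction as F

# (branch label, probability of reaching the end of that branch, payoff in dollars)
outcomes = [
    ("cure, approved, no competitor",   F(3, 10) * F(6, 10) * F(9, 10), 25_000_000),
    ("cure, approved, competitor wins", F(3, 10) * F(6, 10) * F(1, 10), 0),
    ("cure, not approved",              F(3, 10) * F(4, 10),            0),
    ("no cure",                         F(7, 10),                       250_000),
]

# The branches must cover every possibility exactly once.
assert sum(p for _, p, _ in outcomes) == 1

# Weight each payoff by its likelihood and add them up.
expected_payoff = sum(p * payoff for _, p, payoff in outcomes)
most_likely = max(outcomes, key=lambda branch: branch[1])

print(f"expected payoff: ${int(expected_payoff):,}")
print(f"most likely outcome: {most_likely[0]} ({float(most_likely[1]):.0%})")
```

The two printed lines capture the tension in the decision: the expected payoff is well above the $1 million stake, yet the single most likely result is getting back only $250,000.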

The same basic process can be used to explain a seemingly 
counterintuitive phenomenon. Sometimes it does not make sense to screen 
the entire population for a rare but serious disease, such as HIV/AIDS. 
Suppose we can test for some rare disease with a high degree of accuracy. 
For the sake of example, let’s assume the disease affects 1 of every 100,000 
adults and the test is 99.99 percent accurate. The test never generates a 
false negative (meaning that it never misses someone who has the disease); 
however, roughly 1 in 10,000 tests conducted on a healthy person will 
generate a false positive, meaning that the person tests positive but does not 
actually have the disease. The striking outcome here is that despite the 
impressive accuracy of the test, most of the people who test positive will not 
have the disease. This will generate enormous anxiety among those who 
falsely test positive; it can also waste finite health care resources on follow¬ 
up tests and treatment. 

If we test the entire American adult population, or roughly 175 million 
people, the decision tree looks like the following: 


Widespread Screening for a Rare Disease 

175 million adult Americans 

    Have disease (.00001) 
        Test positive: 1,750 have disease and test positive 
        Test negative: 0 have disease and test negative 

    Not have disease (.99999) 
        Test positive (.0001): 17,500 do not have disease and test positive 
        Test negative (.9999): 174,980,750 do not have disease and test negative 

People with disease / Those told they have the disease 
    = 1,750 / (1,750 + 17,500) = 1,750 / 19,250 = .09 = 9% 


Only 1,750 adults have the disease. They all test positive. Over 174 
million adults do not have the disease. Of this healthy group who are tested, 
.9999 get the correct result that they do not have the disease. Only .0001 
get a false positive. But .0001 of 174 million is still a big number. In fact, 
17,500 people will, on average, get false positives. 

Let’s look at what that means. A total of 19,250 people are notified that 
they have the disease; only 9 percent of them are actually sick! And that’s 
with a test that has a very low rate of false positives. Without going too far 
off topic, this should give you some insight into why cost containment in 
health care sometimes involves less screening of healthy people for 
diseases, not more. In the case of a disease like HIV/AIDS, public health 
officials will often recommend that the resources available be used to 
screen the populations at highest risk, such as gay men or intravenous drug 
users. 
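The screening arithmetic is worth running yourself. The sketch below uses the numbers assumed in the example (175 million adults, a prevalence of 1 in 100,000, no false negatives, and a false positive rate of 1 in 10,000). The variable names are illustrative.

```python
from fractions import Fraction

population = 175_000_000
prevalence = Fraction(1, 100_000)          # share of adults with the disease
false_positive_rate = Fraction(1, 10_000)  # healthy people who test positive
false_negative_rate = Fraction(0)          # the test never misses a sick person

sick = population * prevalence
healthy = population - sick

true_positives = sick * (1 - false_negative_rate)
false_positives = healthy * false_positive_rate

# Of everyone told they have the disease, what share actually does?
share_actually_sick = true_positives / (true_positives + false_positives)

print(f"people with the disease:  {int(sick):,}")
print(f"false positives:          {round(false_positives):,}")
print(f"positives who are sick:   {float(share_actually_sick):.1%}")
```

Try raising the prevalence to 1 in 1,000, roughly what screening a high-risk group does, and the share of positives who are truly sick climbs sharply.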



Sometimes probability helps us by flagging suspicious patterns. Chapter 1 
introduced the problem of institutionalized cheating on standardized tests 
and one of the firms that roots it out, Caveon Test Security. The Securities 
and Exchange Commission (SEC), the government agency responsible for 
enforcing federal laws related to securities trading, uses a similar 
methodology for catching inside traders. (Inside trading involves illegally 
using private information, such as a law firm’s knowledge of an impending 
corporate acquisition, to trade stock or other securities in the affected 
companies.) The SEC uses powerful computers to scrutinize hundreds of 
millions of stock trades and look for suspicious activity, such as a big 
purchase of shares in a company just before a takeover is announced, or the 
dumping of shares just before a company announces disappointing 
earnings. 10 The SEC will also investigate investment managers with 
unusually high returns over long periods of time. (Both economic theory 
and historical data suggest that it is extremely difficult for a single investor 
to get above-average returns year after year.) Of course, smart investors are 
always trying to anticipate good and bad news and to devise perfectly legal 
strategies that consistently beat the market. Being a good investor does not 
necessarily make one a criminal. How does a computer tell the difference? I 
called the enforcement division of the SEC several times to ask what 
particular patterns are most likely to signal criminal activity. They still have 
not called me back. 

In the 2002 film Minority Report, Tom Cruise plays a “pre-crime” detective 
who is part of a bureau that uses technology to predict crimes before they’re 
committed. 

Well folks, that’s not science fiction anymore. In 2011, the New York 
Times ran the following headline: “Sending the Police before There’s a 
Crime.” 11 The story described how detectives were dispatched to a parking 
garage in downtown Santa Cruz by a computer program that predicted that 
there was a high likelihood of burglaries from cars at that location on that 
day. Police subsequently arrested two women peering into car windows. 
One had outstanding arrest warrants; the other was carrying illegal drugs. 

The Santa Cruz system was designed by two mathematicians, an 
anthropologist, and a criminologist. The Chicago Police Department has 
created an entire predictive analytics unit, in part because gang activity, the 
source of much of the city’s violence, follows certain patterns. The book 
Data Mining and Predictive Analysis: Intelligence Gathering and Crime 
Analysis, a guide to statistics for law enforcement, begins enthusiastically, 
“It is now possible to predict the future when it comes to crime, such as 
identifying crime trends, anticipating hotspots in the community, refining 
resource deployment decisions, and ensuring the greatest protection for 
citizens in the most efficient manner.” (Look, I read these kinds of things so 
that you don’t have to.) 

“Predictive policing” is part of a broader movement called predictive 
analytics. Crime will always involve an element of uncertainty, as will 
determining who is going to crash his car or default on her mortgage. 
Probability helps us navigate those risks. And information refines our 
understanding of the relevant probabilities. Businesses facing uncertainty 
have always sought to quantify their risks. Lenders request things like 
income verification and a credit score. Yet these blunt credit instruments are 
starting to feel like the prediction equivalent of a caveman’s stone tools. 
The confluence of huge amounts of digital data and cheap computing power 
has generated fascinating insights into human behavior. Insurance officials 
correctly describe their business as the “transfer of risk”—and so they had 
better understand the risks being transferred to them. Companies like 
Allstate are in the business of knowing things that might otherwise seem 
like random trivia: 12 

• Twenty to twenty-four-year-old drivers are the most likely to be 
involved in a fatal crash. 

• The most commonly stolen car in Illinois is the Honda Civic (as 
opposed to full-size Chevrolet pickups in Alabama). 

• Texting while driving causes crashes, but state laws banning the 
practice do not seem to stop drivers from doing it. In fact, such laws 
might even make things worse by prompting drivers to hide their 
phones and therefore take their eyes off the road while texting. 

The credit card companies are at the forefront of this kind of analysis, 
both because they are privy to so much data on our spending habits and 
because their business model depends so heavily on finding customers who 
are just barely a good credit risk. (The customers who are the best credit 
risks tend to be money losers because they pay their bills in full each 
month; the customers who carry large balances at high interest rates are the 
ones who generate fat profits—as long as they don’t default.) One of the 
most intriguing studies of who is likely to pay a bill and who is likely to 
walk away was generated by J. P. Martin, “a math-loving executive” at 
Canadian Tire, a large retailer that sells a wide range of automotive 
products and other retail goods. 13 When Martin analyzed the data—every 
transaction using a Canadian Tire credit card from the prior year—he 
discovered that what customers purchased was a remarkably precise 
predictor of their subsequent payment behavior when used in conjunction 
with traditional tools like income and credit history. 

A New York Times Magazine article entitled “What Does Your Credit 
Card Company Know about You?” described some of Martin’s most 
intriguing findings: “People who bought cheap, generic automotive oil were 
much more likely to miss a credit-card payment than someone who got the 
expensive, name-brand stuff. People who bought carbon-monoxide 
monitors for their homes or those little felt pads that stop chair legs from 
scratching the floor almost never missed payments. Anyone who purchased 
a chrome-skull car accessory or a ‘Mega Thruster Exhaust System’ was 
pretty likely to miss paying his bill eventually.” 

Probability gives us tools for dealing with life’s uncertainties. You shouldn’t 
play the lottery. You should invest in the stock market if you have a long 
investment horizon (because stocks typically have the best long-term 
returns). You should buy insurance for some things, but not others. 
Probability can even help you maximize your winnings on game shows (as 
the next chapter will show). 

That said (or written), probability is not deterministic. No, you shouldn’t 
buy a lottery ticket—but you still might win money if you do. And yes, 
probability can help us catch cheaters and criminals—but when used 
inappropriately it can also send innocent people to jail. That’s why we have 
Chapter 6. 


* I have in mind “Six Sigma Man.” The lowercase Greek letter sigma, σ, represents the standard 
deviation. Six Sigma Man is six standard deviations above the norm in terms of statistical ability, 
strength, and intelligence. 


† For all of these calculations, I’ve used a handy online binomial calculator, at 
http://stattrek.com/Tables/Binomial.aspx. 

* NASA also pointed out that even falling space debris is government property. Apparently it is 
illegal to keep a satellite souvenir, even if it lands in your backyard. 

† The Levitt and Dubner calculations are as follows. Each year roughly 550 children under ten drown 
and 175 children under ten die from gun accidents. The rates they compare are 1 drowning for every 
11,000 residential pools compared with 1 gun death per “million-plus” guns. For adolescents, I 
suspect the numbers may change sharply, both because they are better able to swim and because they 
are more likely to cause a tragedy if they stumble upon a loaded gun. However, I have not checked 
the data on this point. 

* There are 6 ways to throw a 7 with two dice: (1,6); (2,5); (3,4); (6,1); (5,2); and (4,3). There are 
only 2 ways to throw an 11: (5,6) and (6,5). 

Meanwhile, there are 36 total possible throws with two dice: (1,1); (1,2); (1,3); (1,4); (1,5); (1,6). 
And (2,1); (2,2); (2,3); (2,4); (2,5); (2,6). And (3,1); (3,2); (3,3); (3,4); (3,5); (3,6). And (4,1); (4,2); 
(4,3); (4,4); (4,5); (4,6). And (5,1); (5,2); (5,3); (5,4); (5,5); (5,6). And, finally, (6,1); (6,2); (6,3); 
(6,4); (6,5); and (6,6). 

Thus, the chance of throwing a 7 or 11 is the number of possible ways of throwing either of those 
two numbers divided by the total number of possible throws with two dice, which is 8/36. 
Incidentally, much of the earlier research on probability was done by gamblers to determine exactly 
this kind of thing. 

* The full expected value for the Illinois Dugout Doubler $1 ticket (rounded to the nearest cent) is as 
follows: 1/15 ($2) + 1/42.86 ($4) + 1/75 ($5) + 1/200 ($10) + 1/300 ($25) + 1/1,589.40 ($50) + 
1/8000 ($100) + 1/16,000 ($200) + 1/48,000 ($500) + 1/40,000 ($1,000) = $.13 + $.09 + $.07 + $.05 
+ $.08 + $.03 + $.01 + $.01 + $.01 + $.03 = $.51. However, there is also a 1/10 chance of getting a 
free ticket, which has an expected payout of $.51, so the overall expected payout is $.51 + .1 ($.51) = 
$.51+ $.05 = $.56. 

* Earlier in the book I used an example that involved drunken employees producing defective laser 
printers. You will need to forget that example here and assume that the company has fixed its quality 
problems. 

* Since I’ve admonished you to be a stickler about descriptive statistics, I feel compelled to point out 
that the most commonly stolen car is not necessarily the kind of car that is most likely to be stolen. A 
high number of Honda Civics are reported stolen because there are a lot of them on the road; the 
chances that any individual Honda Civic is stolen (which is what car insurance companies care 
about) might be quite low. In contrast, even if 99 percent of all Ferraris are stolen, Ferrari would not 
make the “most commonly stolen” list, because there are not that many of them to steal. 


CHAPTER 5½ 


The Monty Hall Problem 


The “Monty Hall problem” is a famous probability-related conundrum 
faced by participants on the game show Let’s Make a Deal, which 
premiered in the United States in 1963 and is still running in some markets 
around the world. (I remember watching the show whenever I was home 
sick from elementary school.) The program’s gift to statisticians was 
described in the introduction. At the end of each day’s show a contestant 
was invited to stand with host Monty Hall facing three big doors: Door no. 
1, Door no. 2, and Door no. 3. Monty explained to the contestant that there 
was a highly desirable prize behind one of the doors and a goat behind the 
other two doors. The player chose one of the three doors and would get as a 
prize whatever was behind it. (I don’t know if the participants actually got 
to keep the goat; for our purposes, assume that most players preferred the 
new car.) 

The initial probability of winning was straightforward. There were two 
goats and one car. As the participant stood facing the doors with Monty, he 
or she had a 1 in 3 chance of choosing the door that would be opened to 
reveal the car. But as noted earlier, Let’s Make a Deal had a twist, which is 
why the show and its host have been immortalized in the probability 
literature. After the contestant chose a door, Monty would open one of the 
two doors that the contestant had not picked, always revealing a goat. At 
that point, Monty would ask the contestant if he would like to change his 
pick—to switch from the closed door that he had picked originally to the 
other remaining closed door. 

For the sake of example, assume that the contestant has originally chosen 
Door no. 1. Monty would then open Door no. 3; a live goat would be 
standing there on a stage. Two doors would still be closed, nos. 1 and 2. If 
the valuable prize was behind no. 1, the contestant would win; if it was 
behind no. 2, he would lose. That’s when Monty would turn to the player 
and ask whether he would like to change his mind and switch doors, from 
no. 1 to no. 2 in this case. Remember, both doors are still closed. The only 
new information the contestant has received is that a goat showed up behind 
one of the doors that he did not pick. 

Should he switch? 

Yes. The contestant has a 1/3 chance of winning if he sticks with his 
initial choice and a 2/3 chance of winning if he switches. If you don’t 
believe me, read on. 

I’ll concede that this answer seems entirely unintuitive at first. It would 
appear that the contestant has a one-third chance of winning no matter what 
he does. There are three closed doors. At the beginning, each door has a one 
in three chance of holding the valuable prize. How could it matter whether 
he switches from one closed door to another? 

The answer lies in the fact that Monty Hall knows what is behind each 
door. If the contestant picks Door no. 1 and there is a car behind it, then 
Monty can open either no. 2 or no. 3 to display a goat. 

If the contestant picks Door no. 1 and the car is behind no. 2, then Monty 
opens no. 3. 

If the contestant picks Door no. 1 and the car is behind no. 3, then Monty 
opens no. 2. 

By switching after a door is opened, the contestant gets the benefit of 
choosing two doors rather than one. I will try to persuade you in three 
different ways that this analysis is correct. 

The first is empirical. In 2008, New York Times columnist John Tierney 
wrote about the Monty Hall phenomenon. The Times then constructed an 
interactive feature that allows you to play the game yourself, including the 
decision to switch or not. (There are even little goats and cars that pop out 
from behind the doors.) The game keeps track of your success when you 
switch doors after making your initial decision compared with when you do 
not. Try it yourself.* I paid one of my children to play the game 100 times, 
switching each time. I paid her brother to play the game 100 times without 
switching. The switcher won 72 times; the nonswitcher won 33 times. Both 
received two dollars for their efforts. 
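If you have no children to bribe, a computer will play the game for free. The sketch below is a simple simulation under the standard rules described above: Monty knows where the car is and always opens a goat door that the contestant did not pick. The function names, the seed, and the number of games are arbitrary choices.

```python
import random


def play(switch, rng):
    """Play one game; return True if the contestant wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # Monty opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the one remaining closed door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, n_games=100_000, seed=1):
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(n_games)) / n_games


print(f"always switch: {win_rate(True):.3f}")
print(f"never switch:  {win_rate(False):.3f}")
```

Over many games the switcher should win close to two-thirds of the time and the nonswitcher close to one-third, in line with the results from the 100-game experiment.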

The data from episodes of Let’s Make a Deal suggest the same thing. 
According to Leonard Mlodinow, author of The Drunkard’s Walk, those 
contestants who switched their choice won about twice as often as those 
who did not. 2 

My second explanation gets at the intuition. Let’s suppose the rules were 
modified slightly. Assume that the contestant begins by picking one of the 
three doors: no. 1, no. 2, or no. 3, just as the game is ordinarily played. But 
then, before any door is opened to reveal a goat, Monty says, “Would you 
like to give up your choice in exchange for both of the other doors that you 
did not choose?” So if you picked Door no. 1, you could ditch that door in 
exchange for what is behind no. 2 and no. 3. If you picked no. 3, you could 
switch to no. 1 and no. 2. And so on. 

That would not be a particularly hard decision. Obviously you should 
give up one door in exchange for two, as it increases your chances of 
winning from 1/3 to 2/3. Here is the intriguing part: That is exactly what 
Monty Hall allows you to do in the real game after he reveals the goat. The 
fundamental insight is that if you were to choose two doors, one of them 
would always have a goat behind it anyway. When he opens a door to 
reveal a goat before asking if you’d like to switch, he’s doing you a huge 
favor! He’s saying (in effect), “There is a two-thirds chance that the car is 
behind one of the doors you didn’t choose, and look, it’s not that one!” 

Think of it this way. Suppose you picked Door no. 1. Monty then offers 
you the option to take Doors 2 and 3 instead. You take the offer, giving up 
one door and getting two, meaning that you can reasonably expect to win 
the car 2/3 of the time. At that point, what if Monty were to open Door no. 
3—one of your doors—to reveal a goat? Should you feel less certain about 
your decision? Of course not. If the car were behind no. 3, he would have 
opened no. 2! He’s shown you nothing. 

When the game is played normally, Monty is really giving you a choice 
between the door you originally picked and the other two doors, only one of 
which could possibly have a car behind it. When he opens a door to reveal a 
goat, he’s merely doing you the courtesy of showing you which of the other 
two doors does not have the car. You have the same probability of winning 
in both of the following scenarios: 

1. Choosing Door no. 1, then agreeing to switch to Door no. 2 and Door 
no. 3 before any door is opened. 


2. Choosing Door no. 1, then agreeing to switch to Door no. 2 after 
Monty reveals a goat behind Door no. 3 (or choosing no. 3 after he 
reveals a goat behind no. 2). 

In both cases, switching gives you the benefit of two doors instead of one, 
and you can therefore double your chances of winning, from 1/3 to 2/3. 

My third explanation is a more extreme version of the same basic intuition. 
Assume that Monty Hall offers you a choice from among 100 doors rather 
than just three. After you pick your door, say, no. 47, he opens 98 other 
doors with goats behind them. Now there are only two doors that remain 
closed, no. 47 (your original choice) and one other, say, no. 61. Should you 
switch? 

Of course you should. There is a 99 percent chance that the car was 
behind one of the doors that you did not originally choose. Monty did you 
the favor of opening 98 of those doors that you did not choose, all of which 
he knew did not have the car behind them. There is only a 1 in 100 chance 
that your original pick was correct (no. 47). There is a 99 in 100 chance that 
your original pick was not correct. And if your original pick was not 
correct, then the car is sitting behind the other door, no. 61. If you want to 
win 99 times out of 100, you should switch to no. 61. 
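The same reasoning works for any number of doors, and it can be checked by brute force. The sketch below assumes the rule used in this example: after you pick, Monty opens every other door except one and never reveals the car. It then counts every combination of car location and first pick. The function name is illustrative.

```python
from fractions import Fraction


def win_probabilities(n_doors):
    """Return (P(win if you stay), P(win if you switch)) for n_doors doors."""
    stay_wins = switch_wins = 0
    for car in range(n_doors):
        for pick in range(n_doors):
            if pick == car:
                # First pick was right, so staying wins and switching loses.
                stay_wins += 1
            else:
                # Monty must leave the car's door closed, so the one other
                # closed door hides the car and switching wins.
                switch_wins += 1
    total = n_doors * n_doors  # every (car, pick) pair is equally likely
    return Fraction(stay_wins, total), Fraction(switch_wins, total)


for n in (3, 100):
    stay, switch = win_probabilities(n)
    print(f"{n:>3} doors: stay {stay}, switch {switch}")
```

With three doors the count gives 1/3 versus 2/3; with 100 doors it gives 1/100 versus 99/100.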

In short, if you ever find yourself as a contestant on Let’s Make a Deal, you 
should definitely switch doors when Monty Hall (or his replacement) gives 
you the option. The more broadly applicable lesson is that your gut instinct 
on probability can sometimes steer you astray. 


* You can play the game at http://www.nytimes.com/2008/04/08/science/08monty.html?_r=2&oref=slogin&oref=slogin. 


CHAPTER 6 


Problems with Probability 
How overconfident math geeks nearly 
destroyed the global financial system 


Statistics cannot be any smarter than the people who use them. And in 
some cases, they can make smart people do dumb things. One of the most 
irresponsible uses of statistics in recent memory involved the mechanism 
for gauging risk on Wall Street prior to the 2008 financial crisis. At that 
time, firms throughout the financial industry used a common barometer of 
risk, the Value at Risk model, or VaR. In theory, VaR combined the 
elegance of an indicator (collapsing lots of information into a single 
number) with the power of probability (attaching an expected gain or loss to 
each of the firm’s assets or trading positions). The model assumed that there 
is a range of possible outcomes for every one of the firm’s investments. For 
example, if the firm owns General Electric stock, the value of those shares 
can go up or down. When the VaR is being calculated for some short period 
of time, say, one week, the most likely outcome is that the shares will have 
roughly the same value at the end of that stretch as they had at the 
beginning. There is a smaller chance that the shares may rise or fall by 10 
percent. And an even smaller chance that they may rise or fall 25 percent, 
and so on. 

On the basis of past data for market movements, the firm’s quantitative 
experts (often called “quants” in the industry and “rich nerds” everywhere 
else) could assign a dollar figure, say $13 million, that represented the 
maximum that the firm could lose on that position over the time period 
being examined, with 99 percent probability. In other words, 99 times out of 
100 the firm would not lose more than $13 million on a particular trading 
position; 1 time out of 100, it would. 
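
To make the "99 times out of 100" figure concrete, here is a minimal sketch of one way such a number can be produced: take a history of returns and read off the loss that was exceeded only 1 percent of the time. This is an assumption for illustration, not a description of any firm's actual model. The function name `historical_var` and the simulated return history are hypothetical.

```python
import random

def historical_var(returns, position_value, confidence=0.99):
    """Loss level that past returns exceeded only about (1 - confidence) of the time."""
    # Convert returns to dollar losses (a positive number means money lost).
    losses = sorted(-r * position_value for r in returns)
    # Simple empirical quantile, no interpolation.
    idx = min(len(losses) - 1, int(confidence * len(losses)))
    return losses[idx]

if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical history: 1,000 weekly returns, mean 0, standard deviation 2 percent.
    past = [rng.gauss(0.0, 0.02) for _ in range(1000)]
    var_99 = historical_var(past, position_value=100_000_000)
    print(f"99% one-week VaR on a $100 million position: ${var_99:,.0f}")
```

Note what the function does not return: any information about how large the losses are in the remaining 1 percent of cases.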

Remember that last part, because it will soon become important. 


Prior to the financial crisis of 2008, firms trusted the VaR model to 
quantify their overall risk. If a single trader had 923 different open positions 
(investments that could move up or down in value), each of those 
investments could be evaluated as described above for the General Electric 
stock; from there, the trader’s total portfolio risk could be calculated. The 
formula even took into account the correlations among different positions. 
For example, if two investments had expected returns that were negatively 
correlated, a loss in one would likely have been offset by a gain in the other, 
making the two investments together less risky than either one separately. 
Overall, the head of the trading desk would know that bond trader Bob 
Smith has a 24-hour VaR (the value at risk over the next 24 hours) of $19 
million, again with 99 percent probability. The most that Bob Smith could 
lose over the next 24 hours would be $19 million, 99 times out of 100. 

Then, even better, the aggregate risk for the firm could be calculated at 
any point in time by taking the same basic process one step further. The 
underlying mathematical mechanics are obviously fabulously complicated, 
as firms had a dizzying array of investments in different currencies, with 
different amounts of leverage (the amount of money that was borrowed to 
make the investment), trading in markets with different degrees of liquidity, 
and so on. Despite all that, the firm’s managers ostensibly had a precise 
measure of the magnitude of the risk that the firm had taken on at any 
moment in time. As former New York Times business writer Joe Nocera has 
explained, “VaR’s great appeal, and its great selling point to people who do 
not happen to be quants, is that it expresses risk as a single number, a dollar 
figure, no less.” 1 At J. P. Morgan, where the VaR model was developed and
refined, the daily VaR calculation was known as the “4:15 report” because it 
would be on the desks of top executives every afternoon at 4:15, just after 
the American financial markets had closed for the day. 

Presumably this was a good thing, as more information is generally 
better, particularly when it comes to risk. After all, probability is a powerful 
tool. Isn’t this just the same kind of calculation that the Schlitz executives 
did before spending a lot of money on blind taste tests at halftime of the 
Super Bowl? 

Not necessarily. VaR has been called “potentially catastrophic,” “a 
fraud,” and many other things not fit for a family book about statistics like
this one. In particular, the model has been blamed for the onset and severity
of the financial crisis. The primary critique of VaR is that the underlying 
risks associated with financial markets are not as predictable as a coin flip 
or even a blind taste test between two beers. The false precision embedded 
in the models created a false sense of security. The VaR was like a faulty 
speedometer, which is arguably worse than no speedometer at all. If you 
place too much faith in the broken speedometer, you will be oblivious to 
other signs that your speed is unsafe. In contrast, if there is no speedometer 
at all, you have no choice but to look around for clues as to how fast you 
are really going. 

By around 2005, with the VaR dropping on desks at 4:15 every weekday,
Wall Street was driving pretty darn fast. Unfortunately, there were two huge 
problems with the risk profiles encapsulated by the VaR models. First, the 
underlying probabilities on which the models were built were based on past 
market movements; however, in financial markets (unlike beer tasting), the 
future does not necessarily look like the past. There was no intellectual 
justification for assuming that the market movements from 1980 to 2005 
were the best predictor of market movements after 2005. In some ways, this 
failure of imagination resembles the military’s periodic mistaken 
assumption that the next war will look like the last one. In the 1990s and 
early 2000s, commercial banks were using lending models for home 
mortgages that assigned zero probability to large declines in housing 
prices. 2 Housing prices had never before fallen as far and as fast as they did 
beginning in 2007. But that’s what happened. Former Federal Reserve 
chairman Alan Greenspan explained to a congressional committee after the 
fact, “The whole intellectual edifice, however, collapsed in the summer of 
[2007] because the data input into the risk management models generally 
covered only the past two decades, a period of euphoria. Had instead the 
models been fitted more appropriately to historic periods of stress, capital 
requirements would have been much higher and the financial world would 
be in far better shape, in my judgment.” 3 

Second, even if the underlying data could accurately predict future risk, 
the 99 percent assurance offered by the VaR model was dangerously 
useless, because it’s the 1 percent that is going to really mess you up. Hedge 
fund manager David Einhorn explained, “This is like an air bag that works
all the time, except when you have a car accident.” If a firm has a Value at
Risk of $500 million, that can be interpreted to mean that the firm has a 99 
percent chance of losing no more than $500 million over the time period 
specified. Well, hello, that also means that the firm has a 1 percent chance 
of losing more than $500 million—much, much more under some 
circumstances. In fact, the models had nothing to say about how bad that 1 
percent scenario might turn out to be. Very little attention was devoted to 
the “tail risk,” the small risk (named for the tail of the distribution) of some 
catastrophic outcome. (If you drive home from a bar with a blood alcohol 
level of .15, there is probably less than a 1 percent chance that you will 
crash and die; that does not make it a sensible thing to do.) Many firms 
compounded this error by making unrealistic assumptions about their 
preparedness for rare events. Former treasury secretary Hank Paulson has 
explained that many firms assumed they could raise cash in a pinch by 
selling assets. 4 But during a crisis, every other firm needs cash, too, so all
are trying to sell the same kinds of assets. It’s the risk management 
equivalent of saying, “I don’t need to stock up on water because if there is a 
natural disaster, I’ll just go to the supermarket and buy some.” Of course,
after an asteroid hits your town, fifty thousand other people are also trying 
to buy water; by the time you get to the supermarket, the windows are 
broken and the shelves are empty. 

The fact that you’ve never contemplated that your town might be 
flattened by a massive asteroid was exactly the problem with VaR. Here is 
New York Times columnist Joe Nocera again, summarizing thoughts of 
Nassim Nicholas Taleb, author of The Black Swan: The Impact of the Highly
Improbable and a scathing critic of VaR: “The greatest risks are never the 
ones you can see and measure, but the ones you can’t see and therefore can 
never measure. The ones that seem so far outside the boundary of normal 
probability that you can’t imagine they could happen in your lifetime—even 
though, of course, they do happen, more often than you care to realize.” 

In some ways, the VaR debacle is the opposite of the Schlitz example in 
Chapter 5. Schlitz was operating with a known probability distribution. 
Whatever data the company had on the likelihood of blind taste testers’ 
choosing Schlitz was a good estimate of how similar testers would behave 
live at halftime. Schlitz even managed its downside by performing the
whole test on men who said they liked the other beers better. Even if no
more than twenty-five Michelob drinkers chose Schlitz (an almost 
impossibly low outcome), Schlitz could still claim that one in four beer 
drinkers ought to consider switching. Perhaps most important, this was all 
just beer, not the global financial system. The Wall Street quants made three 
fundamental errors. First, they confused precision with accuracy. The VaR 
models were just like my golf range finder when it was set to meters instead 
of yards: exact and wrong. The false precision led Wall Street executives to 
believe that they had risk on a leash when in fact they did not. Second, the 
estimates of the underlying probabilities were wrong. As Alan Greenspan 
pointed out in testimony quoted earlier in the chapter, the relatively tranquil 
and prosperous decades before 2005 should not have been used to create 
probability distributions for what might happen in the markets in the 
ensuing decades. This is the equivalent of walking into a casino and 
thinking that you will win at roulette 62 percent of the time because that’s 
what happened last time you went gambling. It would be a long, expensive 
evening. Third, firms neglected their “tail risk.” The VaR models predicted 
what would happen 99 times out of 100. That’s the way probability works 
(as the second half of the book will emphasize repeatedly). Unlikely things 
happen. In fact, over a long enough period of time, they are not even that 
unlikely. People get hit by lightning all the time. My mother has had three 
holes in one. 

The statistical hubris at commercial banks and on Wall Street ultimately 
contributed to the most severe global financial contraction since the Great 
Depression. The crisis that began in 2008 destroyed trillions of dollars in 
wealth in the United States, drove unemployment over 10 percent, created 
waves of home foreclosures and business failures, and saddled governments 
around the world with huge debts as they struggled to contain the economic 
damage. This is a sadly ironic outcome, given that sophisticated tools like 
VaR were designed to mitigate risk. 

Probability offers a powerful and useful set of tools—many of which can be 
employed correctly to understand the world or incorrectly to wreak havoc 
on it. In sticking with the “statistics as a powerful weapon” metaphor that 
I’ve used throughout the book, I will paraphrase the gun rights lobby: 
Probability doesn’t make mistakes; people using probability make mistakes. 



The balance of this chapter will catalog some of the most common 
probability-related errors, misunderstandings, and ethical dilemmas. 

Assuming events are independent when they are not. The probability of 
flipping heads with a fair coin is 1/2. The probability of flipping two heads
in a row is (1/2)², or 1/4, since the likelihood of two independent events’ both
happening is the product of their individual probabilities. Now that you are 
armed with this powerful knowledge, let’s assume that you have been 
promoted to head of risk management at a major airline. Your assistant 
informs you that the probability of a jet engine’s failing for any reason 
during a transatlantic flight is 1 in 100,000. Given the number of 
transatlantic flights, this is not an acceptable risk. Fortunately each jet 
making such a trip has at least two engines. Your assistant has calculated 
that the risk of both engines’ shutting down over the Atlantic must be 
(1/100,000)², or 1 in 10 billion, which is a reasonable safety risk. This
would be a good time to tell your assistant to use up his vacation days 
before he is fired. The two engine failures are not independent events. If a 
plane flies through a flock of geese while taking off, both engines are likely 
to be compromised in a similar way. The same would be true of many other 
factors that affect the performance of a jet engine, from weather to improper 
maintenance. If one engine fails, the probability that the second engine fails 
is going to be significantly higher than 1 in 100,000. 
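
The assistant's arithmetic and its flaw can be laid out side by side. The general rule is that the probability of both events is the probability of the first times the probability of the second given the first. The sketch below uses that rule; the 1-in-10 conditional probability is a made-up number chosen purely for illustration, not a real engineering estimate.

```python
from fractions import Fraction

def both_fail(p_first, p_second_given_first):
    """P(A and B) = P(A) * P(B | A); true whether or not the events are independent."""
    return p_first * p_second_given_first

p_engine = Fraction(1, 100_000)

# The assistant's assumption: the second engine is unaffected by the first.
independent = both_fail(p_engine, p_engine)

# Hypothetical illustration only: suppose a shared cause (geese, weather,
# maintenance) makes the second engine fail 1 time in 10 once the first has.
dependent = both_fail(p_engine, Fraction(1, 10))

print("independent: 1 in", 1 / independent)
print("dependent:   1 in", 1 / dependent)
print("underestimate factor:", dependent / independent)
```

Under that invented assumption, the true risk would be 1 in 1 million rather than 1 in 10 billion, which is ten thousand times larger than the assistant's figure.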

Does this seem obvious? It was not obvious throughout the 1990s as 
British prosecutors committed a grave miscarriage of justice because of an 
improper use of probability. As with the hypothetical jet engine example, 
the statistical mistake was in assuming that several events were independent 
(as in flipping a coin) rather than dependent (when a certain outcome makes 
a similar outcome more likely in the future). This mistake was real, 
however, and innocent people were sent to jail as a result. 

The mistake arose in the context of sudden infant death syndrome 
(SIDS), a phenomenon in which a perfectly healthy infant dies in his or her 
crib. (The Brits refer to SIDS as a “cot death.”) SIDS was a medical 
mystery that attracted more attention as infant deaths from other causes 
became less common. Because these infant deaths were so mysterious and 
poorly understood, they bred suspicion. Sometimes that suspicion was 
warranted. SIDS was used on occasion to cover up parental negligence or
abuse; a postmortem exam cannot necessarily distinguish natural deaths
from those in which foul play is involved. British prosecutors and courts 
became convinced that one way to separate foul play from natural deaths 
would be to focus on families in which there were multiple cot deaths. Sir 
Roy Meadow, a prominent British pediatrician, was a frequent expert 
witness on this point. As the British news magazine the Economist explains, 
“What became known as Meadow’s Law—the idea that one infant death is 
a tragedy, two are suspicious and three are murder—is based on the notion 
that if an event is rare, two or more instances of it in the same family are so 
improbable that they are unlikely to be the result of chance.” 5 Sir Meadow 
explained to juries that the chance that a family could have two infants die 
suddenly of natural causes was an extraordinary 1 in 73 million. He 
explained the calculation: Since the incidence of a cot death is rare, 1 in 
8,500, the chance of having two cot deaths in the same family would be 
(1/8,500)², which is roughly 1 in 73 million. This reeks of foul play. That’s
what juries decided, sending many parents to prison on the basis of this 
testimony on the statistics of cot deaths (often without any corroborating 
medical evidence of abuse or neglect). In some cases, infants were taken 
away from their parents at birth because of the unexplained death of a 
sibling. 

The Economist explained how a misunderstanding of statistical 
independence became a flaw in the Meadow testimony: 

There is an obvious flaw in this reasoning, as the Royal Statistical 
Society, protective of its derided subject, has pointed out. The 
probability calculation works fine, so long as it is certain that cot 
deaths are entirely random and not linked by some unknown factor. 
But with something as mysterious as cot deaths, it is quite possible 
that there is a link—something genetic, for instance, which would 
make a family that had suffered one cot death more, not less, likely to 
suffer another. And since those women were convicted, scientists have 
been suggesting that there may be just such a link. 

In 2004, the British government announced that it would review 258 trials 
in which parents had been convicted of murdering their infant children. 


Not understanding when events ARE independent. A different kind of 
mistake occurs when events that are independent are not treated as such. If 
you find yourself in a casino (a place, statistically speaking, that you should 
not go to), you will see people looking longingly at the dice or cards and 
declaring that they are “due.” If the roulette ball has landed on black five 
times in a row, then clearly now it must turn up red. No, no, no! The 
probability of the ball’s landing on a red number remains unchanged: 18/38.
The belief otherwise is sometimes called “the gambler’s fallacy.” In fact, if 
you flip a fair coin 1,000,000 times and get 1,000,000 heads in a row, the 
probability of getting tails on the next flip is still 1/2. The very definition of
statistical independence between two events is that the outcome of one has 
no effect on the outcome of the other. Even if you don’t find the statistics 
persuasive, you might ask yourself about the physics: How can flipping a 
series of tails in a row make it more likely that the coin will turn up heads 
on the next flip? 
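
If the argument still feels wrong, a simulation can settle it. The sketch below spins an American roulette wheel (18 red, 18 black, and 2 green pockets) many times and looks only at the spins that come right after five blacks in a row. The function name and the pocket numbering are arbitrary choices for this example.

```python
import random

def red_rate_after_black_streak(spins=1_000_000, streak=5, seed=0):
    """Share of reds among spins that immediately follow `streak` blacks in a row."""
    rng = random.Random(seed)
    run = 0       # length of the current run of blacks
    chances = 0   # spins made right after a long-enough run of blacks
    reds = 0      # how many of those spins came up red
    for _ in range(spins):
        pocket = rng.randrange(38)     # American wheel: 38 pockets
        is_red = pocket < 18           # pockets 0-17 stand in for the 18 reds
        is_black = 18 <= pocket < 36   # pockets 18-35 for the blacks; 36-37 are green
        if run >= streak:
            chances += 1
            reds += is_red
        run = run + 1 if is_black else 0
    return reds / chances

if __name__ == "__main__":
    print("red after five blacks:", round(red_rate_after_black_streak(), 3))
    print("red on any spin:      ", round(18 / 38, 3))
```

The two printed numbers should be close to each other: red is never "due."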

Even in sports, the notion of streaks may be illusory. One of the most 
famous and interesting probability-related academic papers refutes the 
common notion that basketball players periodically develop a streak of 
good shooting during a game, or “a hot hand.” Certainly most sports fans 
would tell you that a player who makes a shot is more likely to hit the next 
shot than a player who has just missed. Not according to research by 
Thomas Gilovich, Robert Vallone, and Amos Tversky, who tested the hot 
hand in three different ways. 6 First, they analyzed shooting data for the 
Philadelphia 76ers home games during the 1980-81 season. (At the time, 
similar data were not available for other teams in the NBA.) They found 
“no evidence for a positive correlation between the outcomes of successive 
shots.” Second, they did the same thing for free throw data for the Boston 
Celtics, which produced the same result. And last, they did a controlled 
experiment with members of the Cornell men’s and women’s basketball 
teams. The players hit an average of 48 percent of their field goals after 
hitting their last shot and 47 percent after missing. For fourteen of twenty-six
players, the correlation between making one shot and then making the
next was negative. Only one player showed a significant positive 
correlation between one shot and the next. 


That’s not what most basketball fans will tell you. For example, 91 
percent of basketball fans surveyed at Stanford and Cornell by the authors 
of the paper agreed with the statement that a player has a better chance of 
making his next shot after making his last two or three shots than he does 
after missing his last two or three shots. The significance of the “hot hand” 
paper lies in the difference between the perception and the empirical reality. 
The authors note, “People’s intuitive conceptions of randomness depart 
systematically from the laws of chance.” We see patterns where none may 
really exist. 

Like cancer clusters. 

Clusters happen. You’ve probably read the story in the newspaper, or 
perhaps seen the news exposé: Some statistically unlikely number of people
in a particular area have contracted a rare form of cancer. It must be the 
water, or the local power plant, or the cell phone tower. Of course, any one 
of those things might really be causing adverse health outcomes. (Later 
chapters will explore how statistics can identify such causal relationships.) 
But this cluster of cases may also be the product of pure chance, even when 
the number of cases appears highly improbable. Yes, the probability that 
five people in the same school or church or workplace will contract the 
same rare form of leukemia may be one in a million, but there are millions
of schools and churches and workplaces. It’s not highly improbable that 
five people might get the same rare form of leukemia in one of those places. 
We just aren’t thinking about all the schools and churches and workplaces 
where this hasn’t happened. To use a different variation on the same basic 
example, the chance of winning the lotto may be 1 in 20 million, but none
of us is surprised when someone wins, because millions of tickets have been 
sold. (Despite my general aversion to lotteries, I do admire the Illinois 
slogan: “Someone’s gonna Lotto, might as well be you.”) 
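
The arithmetic behind "clusters happen" is short. As a rough sketch, assume each school, church, or workplace independently has the same one-in-a-million chance of a cluster (a simplification made only for this example). Then the chance that a cluster turns up somewhere is one minus the chance that it turns up nowhere.

```python
def chance_somewhere(p_single, places):
    """Chance that an event of probability p_single happens in at least one
    of `places` settings, assuming the settings are independent."""
    return 1 - (1 - p_single) ** places

if __name__ == "__main__":
    # The numbers of places are hypothetical round figures.
    for places in (1, 100_000, 1_000_000, 5_000_000):
        print(f"{places:>9,} places: {chance_somewhere(1e-6, places):.3f}")
```

With a million places, the chance of at least one cluster is roughly 63 percent; with five million, it is better than 99 percent.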

Here is an exercise that I do with my students to make the same basic 
point. The larger the class, the better it works. I ask everyone in the class to 
take out a coin and stand up. We all flip the coin; anyone who flips heads 
must sit down. Assuming we start with 100 students, roughly 50 will sit 
down after the first flip. Then we do it again, after which 25 or so are still 
standing. And so on. More often than not, there will be a student standing at 
the end who has flipped five or six tails in a row. At that point, I ask the
student questions like “How did you do it?” and “What are the best training
exercises for flipping so many tails in a row?” or “Is there a special diet that 
helped you pull off this impressive accomplishment?” These questions elicit 
laughter because the class has just watched the whole process unfold; they 
know that the student who flipped six tails in a row has no special coin-flipping
talent. He or she just happened to be the one who ended up with a
lot of tails. When we see an anomalous event like that out of context, 
however, we assume that something besides randomness must be 
responsible. 
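
The classroom exercise is easy to replay on a computer. The sketch below assumes a class of 100 and fair coins; the function names are invented for this example. It estimates how often the last student standing has flipped at least a given number of tails in a row, and compares that with the exact answer, 1 - (1 - (1/2)^k)^100.

```python
import random

def last_standing_streak(class_size, rng):
    """Run the exercise once; return the tails streak of the last student(s) standing."""
    standing = class_size
    rounds = 0
    while True:
        # Every standing student flips: tails stays up, heads sits down.
        still_up = sum(rng.random() < 0.5 for _ in range(standing))
        if still_up == 0:
            return rounds
        standing = still_up
        rounds += 1

def share_with_streak(min_tails, class_size=100, trials=10_000, seed=0):
    """Share of simulated classes in which someone flips at least min_tails tails in a row."""
    rng = random.Random(seed)
    hits = sum(last_standing_streak(class_size, rng) >= min_tails for _ in range(trials))
    return hits / trials

if __name__ == "__main__":
    for k in (5, 6):
        exact = 1 - (1 - 0.5 ** k) ** 100
        print(k, "tails: simulated", round(share_with_streak(k), 3), "exact", round(exact, 3))
```

Both numbers come out well above one-half, which is why the exercise works "more often than not."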

The prosecutor's fallacy. Suppose you hear testimony in court to the 
following effect: (1) a DNA sample found at the scene of a crime matches a 
sample taken from the defendant; and (2) there is only one chance in a 
million that the sample recovered at the scene of the crime would match 
anyone’s besides the defendant. (For the sake of this example, you can 
assume that the prosecution’s probabilities are correct.) On the basis of that 
evidence, would you vote to convict? 

I sure hope not. 

The prosecutor’s fallacy occurs when the context surrounding statistical 
evidence is neglected. Here are two scenarios, each of which could explain 
the DNA evidence being used to prosecute the defendant. 

Defendant 1: This defendant, a spurned lover of the victim, was arrested
three blocks from the crime scene carrying the murder weapon. After he 
was arrested, the court compelled him to offer a DNA sample, which 
matched a sample taken from a hair found at the scene of the crime. 

Defendant 2: This defendant was convicted of a similar crime in a 
different state several years ago. As a result of that conviction, his DNA 
was included in a national DNA database of over a million violent felons. 
The DNA sample taken from the hair found at the scene of the crime was 
run through that database and matched to this individual, who has no known 
association with the victim. 

As noted above, in both cases the prosecutor can rightfully say that the 
DNA sample taken from the crime scene matches the defendant’s and that 
there is only a one in a million chance that it would match with anyone 
else’s. But in the case of Defendant 2, there is a darn good chance that he 
could be that random someone else, the one in a million guy whose DNA
just happens to be similar to the real killer’s by chance, because the
chances of finding a coincidental one in a million match are relatively high
if you run the sample through a database with samples from a million
people.
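
One way to put numbers on the difference between the two defendants is Bayes' rule, which the discussion above does not invoke by name; it is brought in here only as an illustration. The sketch assumes the true source of the hair always matches, and the two "prior" probabilities are hypothetical values picked to show the contrast, not estimates from any real case.

```python
from fractions import Fraction

def expected_false_matches(match_prob, database_size):
    """Average number of coincidental matches when searching a whole database."""
    return match_prob * database_size

def posterior_source(prior, false_match_prob):
    """P(defendant left the hair | DNA matches), assuming the true source always matches."""
    return prior / (prior + (1 - prior) * false_match_prob)

one_in_a_million = Fraction(1, 1_000_000)

# Hypothetical priors, chosen only to illustrate the contrast.
defendant_1 = Fraction(1, 2)           # strong independent evidence before the DNA test
defendant_2 = Fraction(1, 1_000_001)   # one of roughly a million candidates, no other link

print("expected coincidental matches in the database:",
      expected_false_matches(one_in_a_million, 1_000_000))
print("Defendant 1:", float(posterior_source(defendant_1, one_in_a_million)))
print("Defendant 2:", float(posterior_source(defendant_2, one_in_a_million)))
```

Under these assumptions the same DNA match leaves Defendant 1 almost certainly the source of the hair, while for Defendant 2 it is no better than a coin flip.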

Reversion to the mean (or regression to the mean). Perhaps you’ve heard 
of the Sports Illustrated jinx, whereby individual athletes or teams featured 
on the cover of Sports Illustrated subsequently see their performance fall 
off. One explanation is that being on the cover of the magazine has some 
adverse effect on subsequent performance. The more statistically sound 
explanation is that teams and athletes appear on its cover after some 
anomalously good stretch (such as a twenty-game winning streak) and that 
their subsequent performance merely reverts back to what is normal, or the 
mean. This is the phenomenon known as reversion to the mean. Probability 
tells us that any outlier—an observation that is particularly far from the 
mean in one direction or the other—is likely to be followed by outcomes 
that are more consistent with the long-term average. 

Reversion to the mean can explain why the Chicago Cubs always seem 
to pay huge salaries for free agents who subsequently disappoint fans like 
me. Players are able to negotiate huge salaries with the Cubs after an 
exceptional season or two. Putting on a Cubs uniform does not necessarily 
make these players worse (though I would not necessarily rule that out); 
rather, the Cubs pay big bucks for these superstars at the end of some 
exceptional stretch—an outlier year or two—after which their performance 
for the Cubs reverts to something closer to normal. 

The same phenomenon can explain why students who do much better 
than they normally do on some kind of test will, on average, do slightly 
worse on a retest, and students who have done worse than usual will tend to 
do slightly better when retested. One way to think about this mean reversion 
is that performance—both mental and physical—consists of some 
underlying talent-related effort plus an element of luck, good or bad. 
(Statisticians would call this random error.) In any case, those individuals 
who perform far above the mean for some stretch are likely to have had 
luck on their side; those who perform far below the mean are likely to have 
had bad luck. (In the case of an exam, think about students guessing right or 
wrong; in the case of a baseball player, think about a hit that can either go
foul or land one foot fair for a triple.) When a spell of very good luck or
very bad luck ends—as it inevitably will—the resulting performance will be 
closer to the mean. 
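
The "skill plus luck" picture can be simulated directly. In the sketch below, each student's score is a stable skill level plus fresh luck on every test. The score scale, the spreads, and the function name `retest_gap` are all assumptions made up for this illustration.

```python
import random
from statistics import mean

def retest_gap(num_students=10_000, top_share=0.10, seed=0):
    """Return (top group's first-test mean, same group's retest mean, class retest mean)."""
    rng = random.Random(seed)
    # Hypothetical model: score = stable skill + independent luck on each test.
    skill = [rng.gauss(70, 10) for _ in range(num_students)]
    test1 = [s + rng.gauss(0, 10) for s in skill]
    test2 = [s + rng.gauss(0, 10) for s in skill]
    # Pick the students who scored highest on the first test.
    ranked = sorted(range(num_students), key=lambda i: test1[i], reverse=True)
    top = ranked[: int(top_share * num_students)]
    return mean(test1[i] for i in top), mean(test2[i] for i in top), mean(test2)

if __name__ == "__main__":
    first, second, everyone = retest_gap()
    print("top group, first test:", round(first, 1))
    print("top group, retest:    ", round(second, 1))
    print("whole class, retest:  ", round(everyone, 1))
```

The top group still beats the class average on the retest, because skill is real, but it falls back toward the mean because its good luck does not repeat.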

Imagine that I am trying to assemble a superstar coin-flipping team 
(under the erroneous impression that talent matters when it comes to coin 
flipping). After I observe a student flipping six tails in a row, I offer him a 
ten-year, $50 million contract. Needless to say, I’m going to be
disappointed when this student flips only 50 percent tails over those ten 
years. 

At first glance, reversion to the mean may appear to be at odds with the 
“gambler’s fallacy.” After the student throws six tails in a row, is he “due” 
to throw heads or not? The probability that he throws heads on the next flip 
is the same as it always is: 1/2. The fact that he has thrown lots of tails in a
row does not make heads more likely on the next flip. Each flip is an 
independent event. However, we can expect the results of the ensuing flips 
to be consistent with what probability predicts, which is half heads and half 
tails, rather than what it has been in the past, which is all tails. It’s a virtual 
certainty that someone who has flipped all tails will begin throwing more 
heads in the ensuing 10, 20, or 100 flips. And the more flips, the more 
closely the outcome will resemble the 50-50 mean outcome that the law of 
large numbers predicts. (Or, alternatively, we should start looking for 
evidence of fraud.) 
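
The reconciliation comes down to dilution, not compensation. A small worked calculation shows it: start with six tails, add more fair flips, and compute the expected overall share of tails. The function name is invented for this example.

```python
from fractions import Fraction

def expected_tail_share(streak, more_flips):
    """Expected share of tails after `streak` tails followed by `more_flips` fair flips."""
    expected_tails = streak + Fraction(more_flips, 2)   # future flips average half tails
    return expected_tails / (streak + more_flips)

if __name__ == "__main__":
    for n in (0, 10, 100, 1000):
        print(f"{n:>4} more flips: {float(expected_tail_share(6, n)):.3f}")
```

The share drifts from 100 percent toward 50 percent even though no individual flip ever favors heads; the early streak is simply swamped by the later flips.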

As a curious side note, researchers have also documented a Businessweek 
phenomenon. When CEOs receive high-profile awards, including being 
named one of Businessweek’s “Best Managers,” their companies 
subsequently underperform over the next three years as measured by both 
accounting profits and stock price. However, unlike the Sports Illustrated 
effect, this effect appears to be more than reversion to the mean. According 
to Ulrike Malmendier and Geoffrey Tate, economists at the University of 
California at Berkeley and UCLA, respectively, when CEOs achieve 
“superstar” status, they get distracted by their new prominence. They write 
their memoirs. They are invited to sit on outside boards. They begin 
searching for trophy spouses. (The authors propose only the first two 
explanations, but I find the last one plausible as well.) Malmendier and Tate 
write, “Our results suggest that media-induced superstar culture leads to
behavioral distortions beyond mere mean reversion.” In other words, when
a CEO appears on the cover of Businessweek, sell the stock. 

Statistical discrimination. When is it okay to act on the basis of what 
probability tells us is likely to happen, and when is it not okay? In 2003, 
Anna Diamantopoulou, the European commissioner for employment and 
social affairs, proposed a directive declaring that insurance companies may 
not charge different rates to men and women, because it violates the 
European Union’s principle of equal treatment. To insurers, however,
gender-based premiums aren’t discrimination; they’re just statistics. Men 
typically pay more for auto insurance because they crash more. Women pay 
more for annuities (a financial product that pays a fixed monthly or yearly 
sum until death) because they live longer. Obviously many women crash 
more than many men, and many men live longer than many women. But, as 
explained in the last chapter, insurance companies don’t care about that. 
They care only about what happens on average, because if they get that 
right, the firm will make money. The interesting thing about the European 
Commission policy banning gender-based insurance premiums, which is 
being implemented in 2012, is that the authorities are not pretending that 
gender is unrelated to the risks being insured; they are simply declaring that 
disparate rates based on sex are unacceptable. 

At first, that feels like an annoying nod to political correctness. Upon 
reflection, I’m not so sure. Remember all that impressive stuff about
preventing crimes before they happen? Probability can lead us to some 
intriguing but distressing places in this regard. How should we react when 
our probability-based models tell us that methamphetamine smugglers from 
Mexico are most likely to be Hispanic men aged between eighteen and 
thirty and driving red pickup trucks between 9:00 p.m. and midnight when 
we also know that the vast majority of Hispanic men who fit that profile are 
not smuggling methamphetamine? Yep, I used the profiling word, because 
that’s the less glamorous description of the predictive analytics that I 
described so glowingly in the last chapter, or at least one potential aspect of 
it. 

Probability tells us what is more likely and what is less likely. Yes, that is 
just basic statistics—the tools described over the last few chapters. But it is 
also statistics with social implications. If we want to catch violent criminals
and terrorists and drug smugglers and other individuals with the potential to
do enormous harm, then we ought to use every tool at our disposal. 
Probability can be one of those tools. It would be naive to think that gender, 
age, race, ethnicity, religion, and country of origin collectively tell us 
nothing about anything related to law enforcement. 

But what we can or should do with that kind of information (assuming it 
has some predictive value) is a philosophical and legal question, not a 
statistical one. We’re getting more and more information every day about 
more and more things. Is it okay to discriminate if the data tell us that we’ll 
be right far more often than we’re wrong? (This is the origin of the term 
“statistical discrimination,” or “rational discrimination.”) The same kind of 
analysis that can be used to determine that people who buy birdseed are less 
likely to default on their credit cards (yes, that’s really true) can be applied 
everywhere else in life. How much of that is acceptable? If we can build a 
model that identifies drug smugglers correctly 80 out of 100 times, what 
happens to the poor souls in the 20 percent? Our model is going to 
harass them over and over and over again. 

The broader point here is that our ability to analyze data has grown far 
more sophisticated than our thinking about what we ought to do with the 
results. You can agree or disagree with the European Commission decision 
to ban gender-based insurance premiums, but I promise you it will not be 
the last tricky decision of that sort. We like to think of numbers as “cold, 
hard facts.” If we do the calculations right, then we must have the right 
answer. The more interesting and dangerous reality is that we can 
sometimes do the calculations correctly and end up blundering in a 
dangerous direction. We can blow up the financial system or harass a 
twenty-two-year-old white guy standing on a particular street corner at a 
particular time of day, because, according to our statistical model, he is 
almost certainly there to buy drugs. For all the elegance and precision of 
probability, there is no substitute for thinking about what calculations we 
are doing and why we are doing them. 


* SIDS is still a medical mystery, though many of the risk factors have been identified. For example, 
infant deaths can be reduced sharply by putting babies to sleep on their backs. 

* The policy change was ultimately precipitated by a 2011 ruling by the Court of Justice of the 
European Union that different premiums for men and women constitute sex discrimination. 


CHAPTER 7 


The Importance of Data 
“Garbage in, garbage out” 


In the spring of 2012, researchers published a striking finding in the 
esteemed journal Science. According to this cutting-edge research, when 
male fruit flies are spurned repeatedly by female fruit flies, they drown their 
sorrows in alcohol. The New York Times described the study in a front page 
article: “They were young males on the make, and they struck out not once, 
not twice, but a dozen times with a group of attractive females hovering 
nearby. So they did what so many men do after being repeatedly rejected: 
they got drunk, using alcohol as a balm for unfulfilled desire.” 1 

This research advances our understanding of the brain’s reward system, 
which in turn can help us find new strategies for dealing with drug and 
alcohol dependence. A substance abuse expert described reading the study 
as “looking back in time, to see the very origins of the reward circuit that 
drives fundamental behaviors like sex, eating and sleeping.” 

Since I am not an expert in this field, I had two slightly different 
reactions upon reading about spurned fruit flies. First, it made me nostalgic 
for college. Second, my inner researcher got to wondering how fruit flies 
get drunk. Is there a miniature fruit fly bar, with assorted fruit-based liquors 
and an empathetic fruit fly bartender? Is country western music playing in 
the background? Do fruit flies even like country western music? 

It turns out that the design of the experiment was devilishly simple. One 
group of male fruit flies was allowed to mate freely with virgin females. 
Another group of males was released among female fruit flies that had 
already mated and were therefore indifferent to the males’ amorous 
overtures. Both sets of male fruit flies were then offered feeding straws that 
offered a choice between standard fruit fly fare, yeast and sugar, and the 
“hard stuff”: yeast, sugar, and 15 percent alcohol. The males who had spent 
days trying to mate with indifferent females were significantly more likely 
to hit the booze. 

The levity notwithstanding, these results have important implications for 
humans. They suggest a connection between stress, chemical responses in 
the brain, and an appetite for alcohol. However, the results are not a triumph 
of statistics. They are a triumph of data, which made relatively basic 
statistical analysis possible. The genius of this study was figuring out a way 
to create a group of sexually satiated male fruit flies and a group of sexually 
frustrated male fruit flies—and then to find a way to compare their drinking 
habits. Once the researchers did that, the number crunching wasn’t any 
more complicated than that of a typical high school science fair project. 

Data are to statistics what a good offensive line is to a star quarterback. 
In front of every star quarterback is a good group of blockers. They usually 
don’t get much credit. But without them, you won’t ever see a star 
quarterback. Most statistics books assume that you are using good data, just 
as a cookbook assumes that you are not buying rancid meat and rotten 
vegetables. But even the finest recipe isn’t going to salvage a meal that 
begins with spoiled ingredients. So it is with statistics; no amount of fancy 
analysis can make up for fundamentally flawed data. Hence the expression 
“garbage in, garbage out.” Data deserve respect, just like offensive linemen. 

We generally ask our data to do one of three things. First, we may demand a 
data sample that is representative of some larger group or population. If we 
are trying to gauge voters’ attitudes toward a particular political candidate, 
we will need to interview a sample of prospective voters who are 
representative of all voters in the relevant political jurisdiction. (And 
remember, we don’t want a sample that is representative of everyone living 
in that jurisdiction; we want a sample of those who are likely to vote.) One 
of the most powerful findings in statistics, which will be explained in 
greater depth over the next two chapters, is that inferences made from 
reasonably large, properly drawn samples can be every bit as accurate as 
attempting to elicit the same information from the entire population. 

The easiest way to gather a representative sample of a larger population 
is to select some subset of that population randomly. (Shockingly, this is 
known as a simple random sample.) The key to this methodology is that 
each observation in the relevant population must have an equal chance of 
being included in the sample. If you plan to survey a random sample of 100 
adults in a neighborhood with 4,328 adult residents, your methodology has 
to ensure that each of those 4,328 residents has the same probability of 
ending up as one of the 100 adults who are surveyed. Statistics books 
almost always illustrate this point by drawing colored marbles out of an urn. 
(In fact, it’s about the only place where one sees the word “urn” used with 
any regularity.) If there are 60,000 blue marbles and 40,000 red marbles in a 
giant urn, then the most likely composition of a sample of 100 marbles 
drawn randomly from the urn would be 60 blue marbles and 40 red 
marbles. If we did this more than once, there would obviously be deviations 
from sample to sample—some might have 62 blue marbles and 38 red 
marbles, or 58 blue and 42 red. But the chances of drawing any random 
sample that deviates hugely from the composition of marbles in the urn are 
very, very low. 
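
For readers who would rather simulate than imagine, the urn can be sketched in a few lines of Python. This is only an illustration: the 60,000/40,000 split and the sample size of 100 come from the text, while the number of repeated samples (2,000), the random seed, and the function names are my own assumptions.

```python
import random
from collections import Counter

N_BLUE, N_RED = 60_000, 40_000          # urn composition from the text
URN = ["blue"] * N_BLUE + ["red"] * N_RED

def blue_in_sample(rng, sample_size=100):
    """Draw a simple random sample (without replacement) and count the blue marbles."""
    return rng.sample(URN, sample_size).count("blue")

def simulate(num_samples=2_000, seed=0):
    """Repeat the draw many times; num_samples and seed are arbitrary choices."""
    rng = random.Random(seed)
    return Counter(blue_in_sample(rng) for _ in range(num_samples))

counts = simulate()
total = sum(counts.values())
near = sum(c for blue, c in counts.items() if 50 <= blue <= 70)

print(f"share of samples with 50 to 70 blue marbles: {near / total:.3f}")
print("most common blue counts:", counts.most_common(3))
```

Running it should show the blue counts clustering around 60, with samples that stray outside the 50 to 70 range turning up only rarely.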

Now, admittedly, there are some practical challenges here. Most 
populations we care about tend to be more complicated than an urn full of 
marbles. How, exactly, would one select a random sample of the American 
adult population to be included in a telephone poll? Even a seemingly 
elegant solution like a telephone random dialer has potential flaws. Some 
individuals (particularly low-income persons) may not have a telephone. 
Others (particularly high-income persons) may be more prone to screen 
calls and choose not to answer. Chapter 10 will outline some of the 
strategies that polling firms use to surmount these kinds of sampling 
challenges (most of which got even more complicated with the advent of 
cell phones). The key idea is that a properly drawn sample will look like the 
population from which it is drawn. In terms of intuition, you can envision 
sampling a pot of soup with a single spoonful. If you’ve stirred your soup 
adequately, a single spoonful can tell you how the whole pot tastes. 

A statistics text will include far more detail on sampling methods. 
Polling firms and market research companies spend their days figuring out 
how to get good representative data from various populations in the most 
cost-effective way. For now, you should appreciate several important 
things: (1) A representative sample is a fabulously important thing, for it 
opens the door to some of the most powerful tools that statistics has to offer. 
(2) Getting a good sample is harder than it looks. (3) Many of the most 
egregious statistical assertions are caused by good statistical methods 
applied to bad samples, not the opposite. (4) Size matters, and bigger is 
better. The details will be explained in the coming chapters, but it should be 
intuitive that a larger sample will help to smooth away any freak variation. 
(A bowl of soup will be an even better test than a spoonful.) One crucial 
caveat is that a bigger sample will not make up for errors in its composition, 
or “bias.” A bad sample is a bad sample. No supercomputer or fancy 
formula is going to rescue the validity of your national presidential poll if 
the respondents are drawn only from a telephone survey of Washington, 
D.C., residents. The residents of Washington, D.C., don’t vote like the rest 
of America; calling 100,000 D.C. residents rather than 1,000 is not going to 
fix that fundamental problem with your poll. In fact, a large, biased sample 
is arguably worse than a small, biased sample because it will give a false 
sense of confidence regarding the results. 
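
The false confidence is easy to see in a rough simulation. The sketch below is built on invented numbers: I assume a candidate with 50 percent support nationally and 90 percent support in Washington, D.C., and I use the usual textbook approximation for a 95 percent margin of error. None of those figures comes from a real poll.

```python
import math
import random

TRUE_NATIONAL_SUPPORT = 0.50   # hypothetical national figure
DC_SUPPORT = 0.90              # hypothetical figure for D.C. residents only

def poll(support, n, rng):
    """Survey n people drawn from a group with the given support; return (estimate, margin)."""
    hits = sum(rng.random() < support for _ in range(n))
    p_hat = hits / n
    margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)   # approximate 95% margin of error
    return p_hat, margin

rng = random.Random(1)
results = {n: poll(DC_SUPPORT, n, rng) for n in (1_000, 100_000)}

for n, (p_hat, margin) in results.items():
    miss = abs(p_hat - TRUE_NATIONAL_SUPPORT)
    print(f"n={n:>7,}: estimate {p_hat:.3f} +/- {margin:.4f}, off from national by {miss:.3f}")
```

The larger poll reports a much tighter margin of error, yet it misses the national figure by just as much as the smaller one. The extra 99,000 phone calls bought precision about the wrong population.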

The second thing we often ask of data is that they provide some source of 
comparison. Is a new medicine more effective than the current treatment? 
Are ex-convicts who receive job training less likely to return to prison than 
ex-convicts who do not receive such training? Do students who attend 
charter schools perform better than similar students who attend regular 
public schools? 

In these cases, the goal is to find two groups of subjects who are broadly 
similar except for the application of whatever “treatment” we care about. In 
a social science context, the word “treatment” is broad enough to 
encompass anything from being a sexually frustrated fruit fly to receiving 
an income tax rebate. As with any other application of the scientific 
method, we are trying to isolate the impact of one specific intervention or 
attribute. This was the genius of the fruit fly experiment. The researchers 
figured out a way to create a control group (the males who mated) and a 
“treatment” group (the males who were shot down); the subsequent 
difference in their drinking behaviors can then be attributed to whether they 
were sexually spurned or not. 

In the physical and biological sciences, creating treatment and control 
groups is relatively straightforward. Chemists can make small variations 
from test tube to test tube and then study the difference in outcomes. 
Biologists can do the same thing with their petri dishes. Even most animal 
testing is simpler than trying to get fruit flies to drink alcohol. We can have 
one group of rats exercise regularly on a treadmill and then compare their 
mental acuity in a maze with the performance of another group of rats that 
didn’t exercise. But when humans become involved, things grow more 
complicated. Sound statistical analysis often requires a treatment and a 
control group, yet we cannot force people to do the things that we make 
laboratory rats do. (And many people do not like making even the lab rats 
do these things.) Do repeated concussions cause serious neurological 
problems later in life? This is a really important question. The future of 
football (and perhaps other sports) hangs on the answer. Yet it is a question 
that cannot be answered with experiments on humans. So unless and until 
we can teach fruit flies to wear helmets and run the spread offense, we have 
to find other ways to study the long-term impact of head trauma. 

One recurring research challenge with human subjects is creating 
treatment and control groups that differ only in that one group is getting the 
treatment and the other is not. For this reason, the “gold standard” of 
research is randomization, a process by which human subjects (or schools, 
or hospitals, or whatever we’re studying) are randomly assigned to either 
the treatment or the control group. We do not assume that all the 
experimental subjects are identical. Instead, probability becomes our friend 
(once again), and we assume that randomization will evenly divide all 
relevant characteristics between the two groups—both the characteristics 
we can observe, like race or income, but also confounding characteristics 
that we cannot measure or had not considered, such as perseverance or 
faith. 
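
Here is a small sketch of why randomization earns that trust. The 10,000 subjects, their incomes, and the unobserved "perseverance" score are all made up for illustration; the only point is that a coin-flip style assignment tends to balance both kinds of traits without anyone having to measure them.

```python
import random
import statistics

rng = random.Random(2)

# Hypothetical subjects: income is observed, perseverance is never seen by the researcher.
subjects = [
    {"income": rng.gauss(50_000, 15_000), "perseverance": rng.gauss(0, 1)}
    for _ in range(10_000)
]

# Random assignment: shuffle, then split down the middle.
rng.shuffle(subjects)
half = len(subjects) // 2
treatment, control = subjects[:half], subjects[half:]

def mean_of(group, key):
    return statistics.fmean(s[key] for s in group)

gaps = {
    key: mean_of(treatment, key) - mean_of(control, key)
    for key in ("income", "perseverance")
}

for key, gap in gaps.items():
    print(f"treatment minus control, average {key}: {gap:.3f}")
```

Both gaps should come out tiny relative to the spread of the underlying traits, including the one the researcher never recorded. With smaller groups the balance would be rougher, which is one more reason sample size matters.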

The third reason we collect data is, to quote my teenage daughter, “Just 
because.” We sometimes have no specific idea what we will do with the 
information—but we suspect it will come in handy at some point. This is 
similar to a crime scene detective who demands that all possible evidence 
be captured so that it can be sorted later for clues. Some of this evidence 
will prove useful, some will not. If we knew exactly what would be useful, 
we probably would not need to be doing the investigation in the first place. 

You probably know that smoking and obesity are risk factors for heart 
disease. You probably don’t know that a long-running study of the residents 
of Framingham, Massachusetts, helped to clarify those relationships. 
Framingham is a suburban town of some 67,000 people about twenty miles 
west of Boston. To nonresearchers, it is best known as a suburb of Boston 
with reasonably priced housing and convenient access to the impressive and 
upscale Natick Mall. To researchers, Framingham is best known as the 
home of the Framingham Heart Study, one of the most successful and 
influential longitudinal studies in the history of modern science. 

A longitudinal study collects information on a large group of subjects at 
many different points in time, such as once every two years. The same 
participants may be interviewed periodically for ten, twenty, or even fifty 
years after they enter the study, creating a remarkably rich trove of 
information. In the case of the Framingham study, researchers gathered 
information on 5,209 adult residents of Framingham in 1948: height, 
weight, blood pressure, educational background, family structure, diet, 
smoking behavior, drug use, and so on. Most important, researchers have 
gathered follow-up data from the same participants ever since (and also 
data on their offspring, to examine genetic factors related to heart disease). 
The Framingham data have been used to produce over two thousand 
academic articles since 1950, including nearly a thousand between 2000 
and 2009. 

These studies have produced findings crucial to our understanding of 
cardiovascular disease, many of which we now take for granted: cigarette 
smoking increases the risk of heart disease (1960); physical activity reduces 
the risk of heart disease and obesity increases it (1967); high blood pressure 
increases the risk of stroke (1970); high levels of HDL cholesterol 
(henceforth known as the “good cholesterol”) reduce the risk of death 
(1988); individuals with parents and siblings who have cardiovascular 
disease are at significantly higher risk of the same (2004 and 2005). 

Longitudinal data sets are the research equivalent of a Ferrari. The data 
are particularly valuable when it comes to exploring causal relationships 
that may take years or decades to unfold. For example, the Perry Preschool 
Study began in the late 1960s with a group of 123 African American three- 
and four-year-olds from poor families. The participating children were 
randomly assigned into a group that received an intensive preschool 
program and a comparison group that did not. Researchers then measured 
various outcomes for both groups for the next forty years. The results make 
a compelling case for the benefits of early childhood education. The 
students who received the intensive preschool experience had higher IQs at 
age five. They were more likely to graduate from high school. They had 
higher earnings at age forty. In contrast, the participants who did not receive 
the preschool program were significantly more likely to have been arrested 
five or more times by age forty. 

Not surprisingly, we can’t always have the Ferrari. The research 
equivalent of a Toyota is a cross-sectional data set, which is a collection of 
data gathered at a single point in time. For example, if epidemiologists are 
searching for the cause of a new disease (or an outbreak of an old one), they 
may gather data from all those afflicted in hopes of finding a pattern that 
leads to the source. What have they eaten? Where have they traveled? What 
else do they have in common? Researchers may also gather data from 
individuals who are not afflicted by the disease to highlight contrasts 
between the two groups. 

In fact, all of this exciting cross-sectional data talk reminds me of the 
week before my wedding, when I became part of a data set. I was working 
in Kathmandu, Nepal, when I tested positive for a poorly understood 
stomach illness called “blue-green algae,” which had been found in only 
two places in the world. Researchers had isolated the pathogen that caused 
the disease, but they were not yet sure what kind of organism it was, as it 
had never been identified before. When I called home to inform my fiancée 
about my diagnosis, I acknowledged that there was some bad news. The 
disease had no known means of transmission, no known cure, and could 
cause extreme fatigue and other unpleasant side effects for anywhere from a 
few days to many months. With the wedding only one week away, yes, this 
could be a problem. Would I have complete control of my digestive system 
as I walked down the aisle? Maybe. 

But then I really tried to focus on the good news. First, “blue-green 
algae” was thought to be nonfatal. And second, experts in tropical diseases 
from as far away as Bangkok had taken a personal interest in my case. How 
cool is that? (Also, I did a terrific job of repeatedly steering the discussion 
back to the wedding planning: “Enough about my incurable disease. Tell me 
more about the flowers.”) 

I spent my final hours in Kathmandu filling out a thirty-page survey 
describing every aspect of my life: Where did I eat? What did I eat? How 
did I cook? Did I go swimming? Where and how often? Everyone else who 
had been diagnosed with the disease was doing the same thing. Eventually 
the pathogen was identified as a water-borne form of cyanobacteria. (These 
bacteria are blue, and they are the only kind of bacteria that get their energy 
from photosynthesis; hence the original description of the disease as “blue- 
green algae.”) The illness was found to respond to treatment with traditional 
antibiotics, but, curiously, not to some of the newer ones. All of these 
discoveries were too late to help me, but I was lucky enough to recover 
quickly anyway. I had near-perfect control of my digestive system by 
wedding day. 

Behind every important study there are good data that made the analysis 
possible. And behind every bad study . . . well, read on. People often speak 
about “lying with statistics.” I would argue that some of the most egregious 
statistical mistakes involve lying with data; the statistical analysis is fine, 
but the data on which the calculations are performed are bogus or 
inappropriate. Here are some common examples of “garbage in, garbage 
out.” 

Selection bias. Pauline Kael, the longtime film critic for The New Yorker, is 
alleged to have said after Richard Nixon’s election as president, “Nixon 
couldn’t have won. I don’t know anyone who voted for him.” The quotation 
is most likely apocryphal, but it’s a lovely example of how a lousy sample 
(one’s group of liberal friends) can offer a misleading snapshot of a larger 
population (voters from across America). And it introduces the question one 
should always ask: How have we chosen the sample or samples that we are 
evaluating? If each member of the relevant population does not have an 
equal chance of ending up in the sample, we are going to have a problem 
with whatever results emerge from that sample. One ritual of presidential 
politics is the Iowa straw poll, in which Republican candidates descend on 
Ames, Iowa, in August of the year before a presidential election to woo 
participants, each of whom pays $30 to cast a vote in the poll. The Iowa 
straw poll does not tell us that much about the future of Republican 
candidates. (The poll has predicted only three of the last five Republican 
nominees.) Why? Because Iowans who pay $30 to vote in the straw poll are 
different from other Iowa Republicans; and Iowa Republicans are different 
from Republican voters in the rest of the country. 



Selection bias can be introduced in many other ways. A survey of 
consumers in an airport is going to be biased by the fact that people who fly 
are likely to be wealthier than the general public; a survey at a rest stop on 
Interstate 90 may have the opposite problem. Both surveys are likely to be 
biased by the fact that people who are willing to answer a survey in a public 
place are different from people who would prefer not to be bothered. If you 
ask 100 people in a public place to complete a short survey, and 60 are 
willing to answer your questions, those 60 are likely to be different in 
significant ways from the 40 who walked by without making eye contact. 

One of the most famous statistical blunders of all time, the notorious 
Literary Digest poll of 1936, was caused by a biased sample. In that year, 
Kansas governor Alf Landon, a Republican, was running for president 
against incumbent Franklin Roosevelt, a Democrat. Literary Digest, an 
influential weekly news magazine at the time, mailed a poll to its 
subscribers and to automobile and telephone owners whose addresses could 
be culled from public records. All told, the Literary Digest poll included 10 
million prospective voters, which is an astronomically large sample. As 
polls with good samples get larger, they get better, since the margin of error 
shrinks. As polls with bad samples get larger, the pile of garbage just gets 
bigger and smellier. Literary Digest predicted that Landon would beat 
Roosevelt with 57 percent of the popular vote. In fact, Roosevelt won in a 
landslide, with 60 percent of the popular vote and forty-six of forty-eight 
states in the electoral college. The Literary Digest sample was “garbage in”: 
the magazine’s subscribers were wealthier than average Americans, and 
therefore more likely to vote Republican, as were households with 
telephones and cars in 1936. 2 

We can end up with the same basic problem when we compare outcomes 
between a treatment and a control group if the mechanism for sorting 
individuals into one group or the other is not random. Consider a recent 
finding in the medical literature on the side effects of treatment for prostate 
cancer. There are three common treatments for prostate cancer: surgical 
removal of the prostate; radiation therapy; or brachytherapy (which 
involves implanting radioactive “seeds” near the cancer). 3 Impotence is a 
common side effect of prostate cancer treatment, so researchers have 
documented the sexual function of men who receive each of the three 
treatments. A study of 1,000 men found that two years after treatment, 35 
percent of the men in the surgery group were able to have sexual 
intercourse, compared with 37 percent in the radiation group and 43 percent 
in the brachytherapy group. 

Can one look at these data and assume that brachytherapy is least likely 
to damage a man’s sexual function? No, no, no. The authors of the study 
explicitly warn that we cannot conclude that brachytherapy is better at 
preserving sexual function, since the men who receive this treatment are 
generally younger and fitter than men who receive the other treatments. The 
purpose of the study was merely to document the degree of sexual side 
effects across all types of treatment. 

A related source of bias, known as self-selection bias, will arise 
whenever individuals volunteer to be in a treatment group. For example, 
prisoners who volunteer for a drug treatment group are different from other 
prisoners because they have volunteered to be in a drug treatment program. 
If the participants in this program are more likely to stay out of prison after 
release than other prisoners, that’s great—but it tells us absolutely nothing 
about the value of the drug treatment program. These former inmates may 
have changed their lives because the program helped them kick drugs. Or 
they may have changed their lives because of other factors that also 
happened to make them more likely to volunteer for a drug treatment 
program (such as having a really strong desire not to go back to prison). We 
cannot separate the causal impact of one (the drug treatment program) from 
the other (being the kind of person who volunteers for a drug treatment 
program). 
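
To make the problem concrete, consider a toy simulation in which the drug treatment program does nothing at all. Everything here is invented for illustration: each prisoner gets a "motivation" score between 0 and 1, more motivated prisoners are more likely both to volunteer and to stay out of prison, and the probabilities are simply numbers I picked.

```python
import random

rng = random.Random(3)
volunteers, others = [], []

for _ in range(20_000):                      # hypothetical prisoners
    motivation = rng.random()                # unobserved desire not to go back
    # Staying out depends only on motivation; the program has zero effect here.
    stays_out = rng.random() < 0.2 + 0.6 * motivation
    # More motivated prisoners are more likely to volunteer for the program.
    if rng.random() < motivation:
        volunteers.append(stays_out)
    else:
        others.append(stays_out)

rate_volunteers = sum(volunteers) / len(volunteers)
rate_others = sum(others) / len(others)

print(f"stayed out of prison, volunteers: {rate_volunteers:.2f}")
print(f"stayed out of prison, everyone else: {rate_others:.2f}")
```

The volunteers look far more successful even though the program was worthless by construction. A naive comparison of the two groups would credit the program with a difference that was baked in before anyone enrolled.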

Publication bias. Positive findings are more likely to be published than 
negative findings, which can skew the results that we see. Suppose you 
have just conducted a rigorous, longitudinal study in which you find 
conclusively that playing video games does not prevent colon cancer. 
You’ve followed a representative sample of 100,000 Americans for twenty 
years; those participants who spend hours playing video games have 
roughly the same incidence of colon cancer as the participants who do not 
play video games at all. We’ll assume your methodology is impeccable. 
Which prestigious medical journal is going to publish your results? 



None, for two reasons. First, there is no strong scientific reason to 
believe that playing video games has any impact on colon cancer, so it is 
not obvious why you were doing this study. Second, and more relevant 
here, the fact that something does not prevent cancer is not a particularly 
interesting finding. After all, most things don’t prevent cancer. Negative 
findings are not especially sexy, in medicine or elsewhere. 

The net effect is to distort the research that we see, or do not see. 
Suppose that one of your graduate school classmates has conducted a 
different longitudinal study. She finds that people who spend a lot of time 
playing video games do have a lower incidence of colon cancer. Now that is 
interesting! That is exactly the kind of finding that would catch the attention 
of a medical journal, the popular press, bloggers, and video game makers 
(who would slap labels on their products extolling the health 
benefits). It wouldn’t be long before Tiger Moms all over the country 
were “protecting” their children from cancer by snatching books out of their 
hands and forcing them to play video games instead. 

Of course, one important recurring idea in statistics is that unusual things 
happen every once in a while, just as a matter of chance. If you conduct 100 
studies, one of them is likely to turn up results that are pure nonsense—like 
a statistical association between playing video games and a lower incidence 
of colon cancer. Here is the problem: The 99 studies that find no link 
between video games and colon cancer will not get published, because they 
are not very interesting. The one study that does find a statistical link will 
make it into print and get loads of follow-on attention. The source of the 
bias stems not from the studies themselves but from the skewed information 
that actually reaches the public. Someone reading the scientific literature on 
video games and cancer would find only a single study, and that single 
study will suggest that playing video games can prevent cancer. In fact, 99 
studies out of 100 would have found no such link. 
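
A quick simulation shows how chance alone can stock the journals. In this sketch there is no true effect by construction: gamers and non-gamers get cancer at the same rate. The group size, the 5 percent cancer rate, and the rule that only results passing the conventional 5 percent significance cutoff get published are all assumptions of mine. With that cutoff you would expect a handful of false alarms per 100 studies rather than exactly one.

```python
import math
import random

N_STUDIES = 100        # from the text
GROUP_SIZE = 1_000     # assumption: participants per group in each study
CANCER_RATE = 0.05     # assumption: identical true rate for gamers and non-gamers
Z_CUTOFF = 1.96        # conventional 5 percent significance threshold

def run_study(rng):
    """Return the z statistic comparing the two groups when no real effect exists."""
    gamers = sum(rng.random() < CANCER_RATE for _ in range(GROUP_SIZE))
    others = sum(rng.random() < CANCER_RATE for _ in range(GROUP_SIZE))
    pooled = (gamers + others) / (2 * GROUP_SIZE)
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / GROUP_SIZE)
    if std_err == 0:
        return 0.0
    return (gamers - others) / GROUP_SIZE / std_err

rng = random.Random(4)
z_scores = [run_study(rng) for _ in range(N_STUDIES)]

# Only the "interesting" studies (a link in either direction) make it into print.
published = [z for z in z_scores if abs(z) > Z_CUTOFF]
unpublished = [z for z in z_scores if abs(z) <= Z_CUTOFF]

print(f"studies run: {len(z_scores)}")
print(f"studies finding a 'link' and published: {len(published)}")
print(f"studies finding nothing and left in a drawer: {len(unpublished)}")
```

Anyone reading only the published pile would see unanimous evidence of a link. The drawer full of null results tells the real story.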

Yes, my example is absurd—but the problem is real and serious. Here is 
the first sentence of a New York Times article on the publication bias 
surrounding drugs for treating depression: “The makers of antidepressants 
like Prozac and Paxil never published the results of about a third of the drug 
trials that they conducted to win government approval, misleading doctors 
and consumers about the drugs’ true effectiveness.” 4 It turns out that 94 
percent of studies with positive findings on the effectiveness of these drugs 
were published, while only 14 percent of the studies with nonpositive 
results were published. For patients dealing with depression, this is a big 
deal. When all the studies are included, the antidepressants are better than a 
placebo by only “a modest margin.” 

To combat this problem, medical journals now typically require that any 
study be registered at the beginning of the project if it is to be eligible for 
publication later on. This gives the editors some evidence on the ratio of 
positive to nonpositive findings. If 100 studies are registered that propose to 
examine the effect of skateboarding on heart disease, and only one is 
ultimately submitted for publication with positive findings, the editors can 
infer that the other studies had nonpositive findings (or they can at least 
investigate this possibility). 

Recall bias. Memory is a fascinating thing—though not always a great 
source of good data. We have a natural human impulse to understand the 
present as a logical consequence of things that happened in the past—cause 
and effect. The problem is that our memories turn out to be “systematically 
fragile” when we are trying to explain some particularly good or bad 
outcome in the present. Consider a study looking at the relationship 
between diet and cancer. In 1993, a Harvard researcher compiled a data set 
comprising a group of women with breast cancer and an age-matched group 
of women who had not been diagnosed with cancer. Women in both groups 
were asked about their dietary habits earlier in life. The study produced 
clear results: The women with breast cancer were significantly more likely 
to have had diets that were high in fat when they were younger. 

Ah, but this wasn’t actually a study of how diet affects the likelihood of 
getting cancer. This was a study of how getting cancer affects a woman’s 
memory of her diet earlier in life. All of the women in the study had 
completed a dietary survey years earlier, before any of them had been 
diagnosed with cancer. The striking finding was that women with breast 
cancer recalled a diet that was much higher in fat than what they actually 
consumed; the women with no cancer did not. The New York Times 
Magazine described the insidious nature of this recall bias: 



The diagnosis of breast cancer had not just changed a woman’s present 
and the future; it had altered her past. Women with breast cancer had 
(unconsciously) decided that a higher-fat diet was a likely 
predisposition for their disease and (unconsciously) recalled a high-fat 
diet. It was a pattern poignantly familiar to anyone who knows the 
history of this stigmatized illness: these women, like thousands of 
women before them, had searched their own memories for a cause and 
then summoned that cause into memory. 5 

Recall bias is one reason that longitudinal studies are often preferred to 
cross-sectional studies. In a longitudinal study the data are collected 
contemporaneously. At age five, a participant can be asked about his 
attitudes toward school. Then, thirteen years later, we can revisit that same 
participant and determine whether he has dropped out of high school. In a 
cross-sectional study, in which all the data are collected at one point in time, 
we must ask an eighteen-year-old high school dropout how he or she felt 
about school at age five, which is inherently less reliable. 

Survivorship bias. Suppose a high school principal reports that test scores 
for a particular cohort of students have risen steadily for four years. The
sophomore scores for this class were better than their freshman scores. The 
scores from junior year were better still, and the senior year scores were 
best of all. We’ll stipulate that there is no cheating going on, and not even 
any creative use of descriptive statistics. Every year this cohort of students 
has done better than it did the preceding year, by every possible measure: 
mean, median, percentage of students at grade level, and so on. 

Would you (a) nominate this school leader for “principal of the year” or 
(b) demand more data? 

I say “b.” I smell survivorship bias, which occurs when some or many of 
the observations are falling out of the sample, changing the composition of 
the observations that are left and therefore affecting the results of any 
analysis. Let’s suppose that our principal is truly awful. The students in his 
school are learning nothing; each year half of them drop out. Well, that 
could do very nice things for the school’s test scores—without any 
individual student testing better. If we make the reasonable assumption that 
the worst students (with the lowest test scores) are the most likely to drop
out, then the average test scores of those students left behind will go up
steadily as more and more students drop out. (If you have a room of people 
with varying heights, forcing the short people to leave will raise the average 
height in the room, but it doesn’t make anyone taller.) 
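
A quick simulation makes the point concrete. This is only a sketch under made-up assumptions: a hypothetical cohort of 400 students whose individual scores never change, with the lowest-scoring half leaving each year. The numbers (400 students, scores centered on 70) are invented for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical cohort: 400 students whose individual scores never change.
scores = [random.gauss(70, 10) for _ in range(400)]

yearly_means = []
for year in ["freshman", "sophomore", "junior", "senior"]:
    yearly_means.append(statistics.mean(scores))
    print(f"{year:>9}: {len(scores):3d} students, mean score {yearly_means[-1]:.1f}")
    # The lowest-scoring half drops out; nobody who stays learns anything.
    scores = sorted(scores)[len(scores) // 2:]
```

The cohort's mean rises every year even though no student's score ever moves, which is exactly the pattern our awful principal would be happy to report.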

The mutual fund industry has aggressively (and insidiously) seized on 
survivorship bias to make its returns look better to investors than they really 
are. Mutual funds typically gauge their performance against a key 
benchmark for stocks, the Standard & Poor’s 500, which is an index of 500 
leading public companies in America.* If the S&P 500 is up 5.3 percent for 
the year, a mutual fund is said to beat the index if it performs better than 
that, or trail the index if it does worse. One cheap and easy option for 
investors who don’t want to pay a mutual fund manager is to buy an S&P 
500 Index Fund, which is a mutual fund that simply buys shares in all 500 
stocks in the index. Mutual fund managers like to believe that they are 
savvy investors, capable of using their knowledge to pick stocks that will 
perform better than a simple index fund. In fact, it turns out to be relatively 
hard to beat the S&P 500 for any consistent stretch of time. (The S&P 500 
is essentially an average of all large stocks being traded, so just as a matter 
of math we would expect roughly half the actively managed mutual funds 
to outperform the S&P 500 in a given year and half to underperform.) Of 
course, it doesn’t look very good to lose to a mindless index that simply 
buys 500 stocks and holds them. No analysis. No fancy macro forecasting. 
And, much to the delight of investors, no high management fees. 

What is a traditional mutual fund company to do? Bogus data to the 
rescue! Here is how they can “beat the market” without beating the market. 
A large mutual fund company will open many new actively managed funds
(meaning that experts are picking the stocks, often with a particular focus or 
strategy). For the sake of example, let’s assume that a mutual fund company 
opens twenty new funds, each of which has roughly a 50 percent chance of 
beating the S&P 500 in a given year. (This assumption is consistent with 
long-term data.) Now, basic probability suggests that only ten of the firm’s 
new funds will beat the S&P 500 the first year; five funds will beat it two 
years in a row; and two or three will beat it three years in a row. 
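
The arithmetic behind those numbers can be sketched in a few lines. The only inputs are the twenty funds and the 50 percent chance from the example; the sketch also assumes that each year's result is independent of the year before, which the example implies but does not state.

```python
funds = 20      # new funds opened, from the example
p_beat = 0.5    # assumed chance of beating the index in any single year

# Expected number of funds still on a winning streak after 1, 2, and 3 years,
# assuming each year is an independent coin flip.
expected = {years: funds * p_beat ** years for years in (1, 2, 3)}

for years, count in expected.items():
    print(f"beat the index {years} year(s) in a row: about {count:g} funds")
```

The result for three years is 2.5, which is where "two or three" comes from.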

Here comes the clever part. At that point, the new mutual funds with 
unimpressive returns relative to the S&P 500 are quietly closed. (Their
assets are folded into other existing funds.) The company can then heavily
advertise the two or three new funds that have “consistently outperformed 
the S&P 500”—even if that performance is the stock-picking equivalent of 
flipping three heads in a row. The subsequent performance of these funds is 
likely to revert to the mean, albeit after investors have piled in. The number 
of mutual funds or investment gurus who have consistently beaten the S&P 
500 over a long period is shockingly small.* 

Healthy user bias. People who take vitamins regularly are likely to be
healthy, because they are the kind of people who take vitamins regularly!
Whether the vitamins have any impact is a separate issue. Consider the 
following thought experiment. Suppose public health officials promulgate a 
theory that all new parents should put their children to bed only in purple 
pajamas, because that helps stimulate brain development. Twenty years 
later, longitudinal research confirms that having worn purple pajamas as a 
child does have an overwhelmingly large positive association with success 
in life. We find, for example, that 98 percent of entering Harvard freshmen 
wore purple pajamas as children (and many still do) compared with only 3 
percent of inmates in the Massachusetts state prison system. 

Of course, the purple pajamas do not matter; but having the kind of 
parents who put their children in purple pajamas does matter. Even when 
we try to control for factors like parental education, we are still going to be 
left with unobservable differences between those parents who obsess about 
putting their children in purple pajamas and those who don’t. As New York 
Times health writer Gary Taubes explains, “At its simplest, the problem is 
that people who faithfully engage in activities that are good for them— 
taking a drug as prescribed, for instance, or eating what they believe is a 
healthy diet—are fundamentally different from those who don’t.” 6 This 
effect can potentially confound any study trying to evaluate the real effect 
of activities perceived to be healthful, such as exercising regularly or eating 
kale. We think we are comparing the health effects of two diets: kale versus 
no kale. In fact, if the treatment and control groups are not randomly 
assigned, we are comparing two diets that are being eaten by two different 
kinds of people. We have a treatment group that is different from the control 
group in two respects, rather than just one. 



If statistics is detective work, then the data are the clues. My wife spent a 
year teaching high school students in rural New Hampshire. One of her 
students was arrested for breaking into a hardware store and stealing some 
tools. The police were able to crack the case because (1) it had just snowed 
and there were tracks in the snow leading from the hardware store to the 
student’s home; and (2) the stolen tools were found inside. Good clues help. 

Like good data. But first you have to get good data, and that is a lot 
harder than it seems. 


* At the time, the disease had a mean duration of forty-three days with a standard deviation of 
twenty-four days. 

* The S&P 500 is a nice example of what an index can and should do. The index is made up of the 
share prices of the 500 leading U.S. companies, each weighted by its market value (so that bigger 
companies have more weight in the index than smaller companies). The index is a simple and 
accurate gauge of what is happening to the share prices of the largest American companies at any 
given time. 

* For a very nice discussion of why you should probably buy index funds rather than trying to beat 
the market, read A Random Walk Down Wall Street, by my former professor Burton Malkiel. 


CHAPTER 8 


The Central Limit Theorem 
The Lebron James of statistics 


At times, statistics seems almost like magic. We are able to draw sweeping 
and powerful conclusions from relatively little data. Somehow we can gain 
meaningful insight into a presidential election by calling a mere one 
thousand American voters. We can test a hundred chicken breasts for 
salmonella at a poultry processing plant and conclude from that sample 
alone that the entire plant is safe or unsafe. Where does this extraordinary 
power to generalize come from? 

Much of it comes from the central limit theorem, which is the Lebron 
James of statistics—if Lebron were also a supermodel, a Harvard professor, 
and the winner of the Nobel Peace Prize. The central limit theorem is the 
“power source” for many of the statistical activities that involve using a 
sample to make inferences about a large population (like a poll, or a test for 
salmonella). These kinds of inferences may seem mystical; in fact, they are 
just a combination of two tools that we’ve already explored: probability and 
proper sampling. Before plunging into the mechanics of the central limit 
theorem (which aren’t all that tricky), here is an example to give you the 
general intuition. 

Suppose you live in a city that is hosting a marathon. Runners from all 
over the world will be competing, which means that many of them do not 
speak English. The logistics of the race require that runners check in on the 
morning of the race, after which they are randomly assigned to buses to 
take them to the starting line. Unfortunately one of the buses gets lost on the 
way to the race. (Okay, you’re going to have to assume that no one has a 
cell phone and that the driver does not have a GPS navigation device; 
unless you want to do a lot of unpleasant math right now, just go with it.) 
As a civic leader in this city, you join the search team. 


As luck would have it, you stumble upon a broken-down bus near your 
home with a large group of unhappy international passengers, none of 
whom speaks English. This must be the missing bus! You’re going to be a 
hero! Except you have one lingering doubt ... the passengers on this bus 
are, well, very large. Based on a quick glance, you reckon that the average 
weight for this group of passengers has got to be over 220 pounds. There is 
no way that a random group of marathon runners could all be this heavy. 
You radio your message to search headquarters: “I think it’s the wrong bus. 
Keep looking.” 

Further analysis confirms your initial impression. When a translator 
arrives, you discover that this disabled bus was headed to the International 
Festival of Sausage, which is also being hosted by your city on the same 
weekend. (For the sake of verisimilitude, it is entirely possible that sausage 
festival participants might also be wearing sweatpants.)

Congratulations. If you can grasp how someone who takes a quick look 
at the weights of passengers on a bus can infer that they are probably not on 
their way to the starting line of a marathon, then you now understand the 
basic idea of the central limit theorem. The rest is just fleshing out the 
details. And if you understand the central limit theorem, most forms of 
statistical inference will seem relatively intuitive. 

The core principle underlying the central limit theorem is that a large, 
properly drawn sample will resemble the population from which it is drawn. 
Obviously there will be variation from sample to sample (e.g., each bus 
headed to the start of the marathon will have a slightly different mix of 
passengers), but the probability that any sample will deviate massively from 
the underlying population is very low. This logic is what enabled your snap 
judgment when you boarded the broken-down bus and saw the average 
girth of the passengers on board. Lots of big people run marathons; there
are likely to be hundreds of people who weigh over 200 pounds in any 
given race. But the majority of marathon runners are relatively thin. Thus, 
the likelihood that so many of the largest runners were randomly assigned 
to the same bus is very, very low. You could conclude with a reasonable 
degree of confidence that this was not the missing marathon bus. Yes, you 
could have been wrong, but probability tells us that most of the time you 
would have been right. 



That’s the basic intuition behind the central limit theorem. When we add 
some statistical bells and whistles, we can quantify the likelihood that you 
will be right or wrong. For example, we might calculate that in a marathon 
field of 10,000 runners with a mean weight of 155 pounds, there is less than 
a 1 in 100 chance that a random sample of 60 of those runners (our lost bus) 
would have a mean weight of 220 pounds or more. For now, let’s stick with 
the intuition; there will be plenty of time for calculations later. The central 
limit theorem enables us to make the following inferences, all of which will 
be explored in greater depth in the next chapter. 

1. If we have detailed information about some population, then we can 
make powerful inferences about any properly drawn sample from that 
population. For example, assume that a school principal has detailed 
information on the standardized test scores for all the students in his 
school (mean, standard deviation, etc.). That is the relevant 
population. Now assume that a bureaucrat from the school district 
will be arriving next week to give a similar standardized test to 100 
randomly selected students. The performance of those 100 students, 
the sample, will be used to evaluate the performance of the school 
overall. 

How much confidence can the principal have that the performance 
of those randomly chosen 100 students will accurately reflect how the 
entire student body has been performing on similar standardized tests? 
Quite a bit. According to the central limit theorem, the average test 
score for the random sample of 100 students will not typically deviate 
sharply from the average test score for the whole school. 

2. If we have detailed information about a properly drawn sample (mean 
and standard deviation), we can make strikingly accurate inferences 
about the population from which that sample was drawn. This is 
essentially working in the opposite direction from the example above, 
putting ourselves in the shoes of the school district bureaucrat who is 
evaluating various schools in the district. Unlike the school principal, 
this bureaucrat does not have (or does not trust) the standardized test 
score data that the principal has for all the students in a particular 
school, which is the relevant population. Instead, he will be
administering a similar test of his own to a random sample of 100
students in each school. 

Can this administrator be reasonably certain that the overall 
performance of any given school can be evaluated fairly based on the 
test scores of a sample of just 100 students from that school? Yes. The 
central limit theorem tells us that a large sample will not typically 
deviate sharply from its underlying population—which means that the 
sample results (scores for the 100 randomly chosen students) are a 
good proxy for the results of the population overall (the student body 
at a particular school). Of course, this is how polling works. A 
methodologically sound poll of 1,200 Americans can tell us a great 
deal about how the entire country is thinking. 

Think about it: if no. 1 above is true, no. 2 must also be true—and 
vice versa. If a sample usually looks like the population from which 
it’s drawn, it must also be true that a population will usually look like a 
sample drawn from that population. (If children typically look like 
their parents, parents must also typically look like their children.) 

3. If we have data describing a particular sample, and data on a 
particular population, we can infer whether or not that sample is 
consistent with a sample that is likely to be drawn from that 
population. This is the missing-bus example described at the 
beginning of the chapter. We know the mean weight (more or less) for 
the participants in the marathon. And we know the mean weight 
(more or less) for the passengers on the broken-down bus. The central 
limit theorem enables us to calculate the probability that a particular 
sample (the rotund people on the bus) was drawn from a given 
population (the marathon field). If that probability is low, then we can 
conclude with a high degree of confidence that the sample was not 
drawn from the population in question (e.g., the people on this bus 
really don’t look like a group of marathon runners headed to the 
starting line). 

4. Last, if we know the underlying characteristics of two samples, we 
can infer whether or not both samples were likely drawn from the 
same population. Let us return to our (increasingly absurd) bus 
example. We now know that a marathon is going on in the city, as 
well as the International Festival of Sausage. Assume that both groups
have thousands of participants, and that both groups are operating
buses, all loaded with random samples of either marathon runners or 
sausage enthusiasts. Further assume that two buses collide. (I already 
conceded that the example is absurd, so just read on.) In your capacity 
as a civic leader, you arrive on the scene and are tasked with 
determining whether or not both buses were headed to the same event 
(sausage festival or marathon). Miraculously, no one on either bus 
speaks English, but paramedics provide you with detailed information 
on the weights of all the passengers on each bus. 

From that alone, you can infer whether the two buses were likely 
headed to the same event, or to different events. Again, think about 
this intuitively. Suppose that the average weight of the passengers on 
one bus is 157 pounds, with a standard deviation of 11 pounds 
(meaning that a high proportion of the passengers weigh between 146 
pounds and 168 pounds). Now suppose that the passengers on the 
second bus have a mean weight of 211 pounds with a standard 
deviation of 21 pounds (meaning that a high proportion of the 
passengers weigh between 190 pounds and 232 pounds). Forget 
statistical formulas for a moment, and just use logic: Does it seem 
likely that the passengers on those two buses were randomly drawn 
from the same population? 

No. It seems far more likely that one bus is full of marathon runners 
and the other bus is full of sausage enthusiasts. In addition to the 
difference in average weight between the two buses, you can also see 
that the variation in weights between the two buses is very large 
compared with the variation in weights within each bus. The folks who 
weigh one standard deviation above the mean on the “skinny” bus are 
168 pounds, which is less than the folks who are one standard 
deviation below the mean on the “other” bus (190 pounds). This is a 
telltale sign (both statistically and logically) that the two samples 
likely came from different populations. 
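
The two-bus comparison is simple enough to check by hand, but here it is as a short sketch. The means and standard deviations are the ones given in the example; the helper name one_sd_range is made up for illustration.

```python
def one_sd_range(mean, sd):
    """Return the (low, high) weights one standard deviation around the mean."""
    return (mean - sd, mean + sd)

skinny_bus = one_sd_range(157, 11)   # (146, 168)
other_bus = one_sd_range(211, 21)    # (190, 232)

# Distance between the top of the lighter range and the bottom of the heavier one.
gap = other_bus[0] - skinny_bus[1]
ranges_overlap = skinny_bus[1] >= other_bus[0]

print(f"skinny bus: {skinny_bus}, other bus: {other_bus}")
print(f"gap between the ranges: {gap} pounds, overlap: {ranges_overlap}")
```

The one-standard-deviation ranges do not even touch; there are 22 pounds of daylight between them. That is the informal version of the comparison a formal test would make.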

If all of this makes intuitive sense, then you are 93.2 percent of the way 
to understanding the central limit theorem. We need to go one step further 
to put some technical heft behind the intuition. Obviously when you stuck 
your head inside the broken-down bus and saw a group of large people in
sweatpants, you had a “hunch” that they weren’t marathoners. The central
limit theorem allows us to go beyond that hunch and assign a degree of 
confidence to your conclusion. 

For example, some basic calculations will enable me to conclude that 99 
times out of 100 the mean weight of any randomly selected bus of 
marathoners will be within nine pounds of the mean weight of the entire 
marathon field. That’s what gives statistical heft to my hunch when I 
stumble across the broken-down bus. These passengers have a mean weight 
that is twenty-one pounds higher than the mean weight for the marathon 
field, something that should only occur by chance less than 1 time in 100. 
As a result, I can reject the hypothesis that this is a missing marathon bus 
with 99 percent confidence—meaning I should expect my inference to be 
correct 99 times out of 100. 

And yes, probability suggests that on average I’ll be wrong 1 time in 100. 

This kind of analysis all stems from the central limit theorem, which, from a 
statistical standpoint, has Lebron James-like power and elegance. 
According to the central limit theorem, the sample means for any 
population will be distributed roughly as a normal distribution around the 
population mean. Hang on for a moment as we unpack that statement. 

1. Suppose we have a population, like our marathon field, and we are 
interested in the weights of its members. Any sample of runners, such 
as each bus of sixty runners, will have a mean. 

2. If we take repeated samples, such as picking random groups of sixty 
runners from the field over and over, then each of those samples will 
have its own mean weight. These are the sample means. 

3. Most of the sample means will be very close to the population mean. 
Some will be a little higher. Some will be a little lower. Just as a 
matter of chance, a very few will be significantly higher than the 
population mean, and a very few will be significantly lower. 

Cue the music, because this is where everything comes together in a 
powerful crescendo ...

4. The central limit theorem tells us that the sample means will be 
distributed roughly as a normal distribution around the population
mean. The normal distribution, as you may remember from Chapter 2,
is the bell-shaped distribution (e.g., adult men’s heights) in which 68 
percent of the observations lie within one standard deviation of the 
mean, 95 percent lie within two standard deviations, and so on. 

5. All of this will be true no matter what the distribution of the 
underlying population looks like. The population from which the 
samples are being drawn does not have to have a normal distribution 
in order for the sample means to be distributed normally. 
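
The five steps above can be tried out in a small simulation. This is a sketch, not real data: the population is a made-up, heavily right-skewed one (an exponential distribution with a mean of 50), and the sample size of sixty simply echoes the buses. The skewness helper is a hypothetical function written for this example; a value near zero means a symmetric distribution.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical right-skewed population: most values small, a few very large.
population = [random.expovariate(1 / 50) for _ in range(100_000)]

def skewness(values):
    """Average cubed z-score: near 0 for symmetric data, positive for a long right tail."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - m) / sd) ** 3 for v in values) / len(values)

# Draw 2,000 random samples of 60 and record each sample's mean.
sample_means = [statistics.fmean(random.sample(population, 60)) for _ in range(2_000)]

pop_skew = skewness(population)
means_skew = skewness(sample_means)

print(f"population mean:       {statistics.fmean(population):.1f}")
print(f"mean of sample means:  {statistics.fmean(sample_means):.1f}")
print(f"skewness of population:   {pop_skew:.2f}")
print(f"skewness of sample means: {means_skew:.2f}")
```

The sample means cluster around the population mean, and their distribution is far less lopsided than the population they came from. With samples of sixty some skew remains; it fades further as the samples get larger.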

Let’s think about some real data, say, the household income distribution 
in the United States. Household income is not distributed normally in 
America; instead, it tends to be skewed to the right. No household can earn 
less than $0 in a given year, so that must be the lower bound for the 
distribution. Meanwhile, a small group of households can earn staggeringly 
large annual incomes—hundreds of millions or even billions of dollars in 
some cases. As a result, we would expect the distribution of household 
incomes to have a long right tail—something like this: 



The median household income in the United States is roughly $51,900; 
the mean household income is $70,900. 1 (People like Bill Gates pull the 
mean household income to the right, just as he did when he walked in to the 
bar in Chapter 2.) Now suppose we take a random sample of 1,000 U.S. 
households and gather information on annual household income. On the 
basis of the information above, and the central limit theorem, what can we 
infer about this sample? 







Quite a lot, it turns out. First of all, our best guess for what the mean of 
any sample will be is the mean of the population from which it’s drawn. 
The whole point of a representative sample is that it looks like the 
underlying population. A properly drawn sample will, on average, look like 
America. There will be hedge fund managers and homeless people and 
police officers and everyone else—all roughly in proportion to their 
frequency in the population. Therefore, we would expect the mean 
household income for a representative sample of 1,000 American 
households to be about $70,900. Will it be exactly that? No. But it shouldn’t 
be wildly different either. 

If we took multiple samples of 1,000 households, we would expect the 
different sample means to cluster around the population mean, $70,900. We 
would expect some means to be higher, and some to be lower. Might we get 
a sample of 1,000 households with a mean household income of $427,000? 
Sure, that’s possible—but highly unlikely. (Remember, our sampling 
methodology is sound; we are not conducting a survey in the parking lot of 
the Greenwich Country Club.) It’s also highly unlikely that a proper sample 
of 1,000 American households would have a mean income of $8,000. 

That’s all just basic logic. The central limit theorem enables us to go one 
step further by describing the expected distribution of those different 
sample means as they cluster around the population mean. Specifically, the 
sample means will form a normal distribution around the population mean, 
which in this case is $70,900. Remember, the shape of the underlying 
population doesn’t matter. The household income distribution in the United 
States is plenty skewed, but the distribution of the sample means will not be 
skewed. If we were to take 100 different samples, each with 1,000 
households, and plotted the frequency of our results, we would expect those 
sample means to form the familiar “bell-shaped” distribution around 
$70,900. 

The larger the number of samples, the more closely the distribution will 
approximate the normal distribution. And the larger the size of each sample, 
the tighter that distribution will be. To test this result, let’s do a fun 
experiment with real data on the weights of real Americans. The University 
of Michigan conducts a longitudinal study called Americans’ Changing 
Lives, which consists of detailed observations on several thousand 
American adults, including their weights. The weight distribution is skewed
slightly right, because it’s biologically easier to be 100 pounds overweight
than it is to be 100 pounds underweight. The mean weight for all adults in
the study is 162 pounds. 

Using basic statistical software, we can direct the computer to take a 
random sample of 100 individuals from the Changing Lives data. In fact, 
we can do this over and over again to see how the results fit with what the 
central limit theorem would predict. Here is a graph of the distribution of 
100 sample means (rounded to the nearest pound) randomly generated from 
the Changing Lives data. 

100 Sample Means, n = 100 



The larger the sample size and the more samples taken, the more closely 
the distribution of sample means will approximate the normal curve. (As a 
rule of thumb, the sample size must be at least 30 for the central limit 
theorem to hold true.) This makes sense. A larger sample is less likely to be 
affected by random variation. A sample of 2 can be highly skewed by 1 
particularly large or small person. In contrast, a sample of 500 will not be 
unduly affected by a few particularly large or small people. 
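
Here is that last point as arithmetic. The weights are hypothetical: everyone weighs 162 pounds except for one person who weighs 350.

```python
import statistics

typical, outlier = 162, 350  # pounds; made-up values for illustration

small_sample = [typical] * 1 + [outlier]     # sample of 2
large_sample = [typical] * 499 + [outlier]   # sample of 500

small_mean = statistics.mean(small_sample)
large_mean = statistics.mean(large_sample)

print(f"mean of the sample of 2:   {small_mean}")
print(f"mean of the sample of 500: {large_mean:.1f}")
```

One large person drags the mean of the tiny sample up by 94 pounds; in the sample of 500 the same person moves the mean by less than half a pound.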

We are now very close to making all of our statistical dreams come true! 
The sample means are distributed roughly as a normal curve, as described 
above. The power of a normal distribution derives from the fact that we 
know roughly what proportion of observations will lie within one standard
deviation above or below the mean (68 percent); what proportion of 
observations will lie within two standard deviations above or below the 
mean (95 percent); and so on. This is powerful stuff. 

Earlier in this chapter, I pointed out that we could infer intuitively that a 
busload of passengers with a mean weight twenty-five pounds higher than
the mean weight for the whole marathon field was probably not the lost bus
of runners. To quantify that intuition—to be able to say that this inference 
will be correct 95 percent of the time, or 99 percent, or 99.9 percent—we 
need just one more technical concept: the standard error. 

The standard error measures the dispersion of the sample means. How 
tightly do we expect the sample means to cluster around the population 
mean? There is some potential confusion here, as we have now introduced 
two different measures of dispersion: the standard deviation and the 
standard error. Here is what you need to remember to keep them straight: 

1. The standard deviation measures dispersion in the underlying 
population. In this case, it might measure the dispersion of the 
weights of all the participants in the Framingham Heart Study, or the 
dispersion around the mean for the entire marathon field. 

2. The standard error measures the dispersion of the sample means. If 
we draw repeated samples of 100 participants from the Framingham 
Heart Study, what will the dispersion of those sample means look 
like? 

3. Here is what ties the two concepts together: The standard error is the 
standard deviation of the sample means! Isn’t that kind of cool? 

A large standard error means that the sample means are spread out 
widely around the population mean; a small standard error means that they 
are clustered relatively tightly. Here are three real examples from the 
Changing Lives data. 
The second distribution, which has a larger sample size, is more tightly
clustered around the mean than the first distribution. The larger sample size 
makes it less likely that a sample mean will deviate sharply from the 
population mean. The final set of sample means is drawn only from a subset 
of the population, women in the study. Since the weights of women in the 
data set are less diffuse than the weights of all persons in the population, it 
stands to reason that the weights of samples drawn just from the women 
would be less dispersed than samples drawn from the whole Changing 
Lives population. (These samples are also clustered around a slightly 
different population mean, since the mean weight for all females in the 
Changing Lives study is different from the mean weight for the entire 
population in the study.) 

The pattern that you saw above holds true in general. Sample means will 
cluster more tightly around the population mean as the size of each sample 
gets larger (e.g., our sample means were more tightly clustered when we 
took samples of 100 rather than 30). And the sample means will cluster less 
tightly around the population mean when the underlying population is more
spread out (e.g., our sample means for the entire Changing Lives population
were more dispersed than the sample means for just the females in the 
study). 

If you’ve followed the logic this far, then the formula for the standard 
error follows naturally: 

SE = s/√n, where s is the standard deviation of the population from which
the sample is drawn, and n is the size of the sample. Keep your head about 
you! Don’t let the appearance of letters mess up the basic intuition. The 
standard error will be large when the standard deviation of the underlying 
distribution is large. A large sample drawn from a highly dispersed 
population is also likely to be highly dispersed; a large sample from a 
population clustered tightly around the mean is also likely to be clustered 
tightly around the mean. If we are still looking at weight, we would expect 
the standard error for a sample drawn from the entire Changing Lives 
population to be larger than the standard error for a sample drawn only from 
the men in their twenties. This is why the standard deviation (s) is in the 
numerator. 

Similarly, we would expect the standard error to get smaller as the 
sample size gets larger, since large samples are less prone to distortion by 
extreme outliers. This is why the sample size (n) is in the denominator. (The 
reason we take the square root of n will be left for a more advanced text; the 
basic relationship is what’s important here.) 
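
One way to convince yourself that the formula works is to compare it with brute force. The sketch below does not use the Changing Lives data; it builds a synthetic stand-in population with a mean of 162 and a standard deviation of 36, draws many samples, and compares the spread of the sample means with s/√n. The function name empirical_se and the number of draws are choices made for this illustration.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable

# Synthetic stand-in population (NOT the real Changing Lives data).
population = [random.gauss(162, 36) for _ in range(50_000)]
s = statistics.pstdev(population)

def empirical_se(n, draws=2_000):
    """Standard deviation of the means of `draws` random samples of size n."""
    means = [statistics.fmean(random.sample(population, n)) for _ in range(draws)]
    return statistics.stdev(means)

# For each sample size: (formula value, simulated value).
results = {n: (s / math.sqrt(n), empirical_se(n)) for n in (30, 100)}

for n, (formula, simulated) in results.items():
    print(f"n = {n:3d}: s/sqrt(n) = {formula:.2f}, simulated = {simulated:.2f}")
```

The two columns should agree closely, and both shrink as the sample size grows from 30 to 100.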

In the case of the Changing Lives data, we actually know the standard 
deviation of the population; often that is not the case. For large samples, we
can assume that the standard deviation of the sample is reasonably close to 
the standard deviation of the population. 

Finally, we have arrived at the payoff for all of this. Because the sample
means are distributed normally (thanks to the central limit theorem), we can 
harness the power of the normal curve. We expect that roughly 68 percent 
of all sample means will lie within one standard error of the population 
mean; 95 percent of the sample means will lie within two standard errors of 
the population mean; and 99.7 percent of the sample means will lie within 
three standard errors of the population mean. 
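
Those three percentages are properties of the normal curve itself, and they can be recovered from the error function in Python's standard library. The sketch below assumes an exactly normal distribution; the helper name `share_within` is made up for illustration.

```python
import math

def share_within(k):
    """Share of a normal distribution within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, f"{share_within(k):.1%}")
```

The loop prints roughly 68.3, 95.4, and 99.7 percent, which the text rounds to 68, 95, and 99.7.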


[Figure: Frequency Distribution of Sample Means]

So let’s return to a variation on our lost-bus example, only now we can 
substitute numbers for intuition. (The example itself will remain absurd; the 
next chapter will have plenty of less absurd, real-world examples.) Suppose 
that the Changing Lives study has invited all of the individuals in the study 
to meet in Boston for a weekend of data gathering and revelry. The 
participants are loaded randomly onto buses and ferried among the 
buildings at the testing facility where they are weighed, measured, poked, 
prodded, and so on. Shockingly, one bus goes missing, a fact that is 
broadcast on the local news. At around that time, you are driving back from 
the Festival of Sausage when you see a crashed bus on the side of the road. 
Apparently the bus swerved to miss a wild fox crossing the road, and all of 
the passengers are unconscious but not seriously hurt. (I need them to be 
uncommunicative for the example to work, but I don’t want their injuries to 
be too disturbing.) Paramedics on the scene inform you that the mean 
weight of the 62 passengers on the bus is 194 pounds. Also, the fox that the 
bus swerved to avoid was clipped slightly and appears to have a broken 
hind leg. 

Fortunately you know the mean weight and standard deviation for the 
entire Changing Lives population, you have a working knowledge of the 
central limit theorem, and you know how to administer first aid to a wild 
fox. The mean weight for the Changing Lives participants is 162; the 
standard deviation is 36. From that information, we can calculate the 
standard error for a 62-person sample (the number of unconscious 
passengers on the bus): $s/\sqrt{n} = 36/\sqrt{62} = 36/7.9$, or 4.6. 

The difference between the sample mean (194 pounds) and the 
population mean (162 pounds) is 32 pounds, or well more than three 
standard errors. We know from the central limit theorem that 99.7 percent 
of all sample means will lie within three standard errors of the population 
mean. That makes it extremely unlikely that this bus represents a random 
group of Changing Lives participants. In your duty as a civic leader, you 
call the study officials to tell them that this is probably not their missing 
bus, only now you can offer statistical evidence, rather than just “a hunch.” 
You report to the Changing Lives folks that you can reject the possibility 
that this is the missing bus at the 99.7 percent confidence level. And since 
you are talking to researchers, they actually understand what you are talking 
about. 
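
For readers who want to retrace the arithmetic, here is a short Python sketch of the crashed-bus calculation. All four numbers come from the example; the variable names are only illustrative.

```python
import math

population_mean = 162   # pounds, Changing Lives participants
population_sd = 36      # pounds
n = 62                  # passengers on the crashed bus
sample_mean = 194       # pounds, reported by the paramedics

se = population_sd / math.sqrt(n)
gap_in_se = (sample_mean - population_mean) / se

print(round(se, 1))         # 4.6
print(round(gap_in_se, 1))  # 7.0
```

The 32-pound gap works out to about seven standard errors, comfortably beyond the three-standard-error mark.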

Your analysis is further confirmed when paramedics conduct blood tests 
on the bus passengers and discover that the mean cholesterol level for the 
busload of passengers is five standard errors above the mean cholesterol 
level for the Changing Lives study participants. That suggests, correctly it 
later turns out, that the unconscious passengers are involved with the 
Festival of Sausage. 

[There is a happy ending. When the bus passengers regained 
consciousness, Changing Lives study officials offered them counseling on 
the dangers of a diet high in saturated fats, causing many of them to adopt 
more heart-healthy eating habits. Meanwhile, the fox was nurtured back to 
health at a local wildlife preserve and was eventually released back into the 
wild.]* 

I’ve tried to stick with the basics in this chapter. You should note that for 
the central limit theorem to apply, the sample sizes need to be relatively 
large (over 30 as a rule of thumb). We also need a relatively large sample if 
we are going to assume that the standard deviation of the sample is roughly 
the same as the standard deviation of the population from which it is drawn. 
There are plenty of statistical fixes that can be applied when these 
conditions are not met—but that is all frosting on the cake (and maybe even 
sprinkles on the frosting on the cake). The “big picture” here is simple and 
massively powerful: 

1. If you draw large, random samples from any population, the means of 
those samples will be distributed normally around the population 
mean (regardless of what the distribution of the underlying population 
looks like). 

2. Most sample means will lie reasonably close to the population mean; 
the standard error is what defines “reasonably close.” 

3. The central limit theorem tells us the probability that a sample mean 
will lie within a certain distance of the population mean. It is 
relatively unlikely that a sample mean will lie more than two standard 
errors from the population mean, and extremely unlikely that it will 
lie three or more standard errors from the population mean. 

4. The less likely it is that an outcome has been observed by chance, the 
more confident we can be in surmising that some other factor is in 
play. 
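
The first three points can be watched in action with a small simulation. The sketch below is an illustration under stated assumptions, not part of the Changing Lives data: the population is a made-up, heavily skewed one (exponential, with mean 10 and standard deviation 10), and the sample size, number of samples, and random seed are arbitrary.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical skewed population: exponential with mean 10 and sd 10.
POP_MEAN = 10.0
POP_SD = 10.0
SAMPLE_SIZE = 100
NUM_SAMPLES = 2000

def sample_mean():
    """Draw one random sample and return its mean."""
    draws = [rng.expovariate(1 / POP_MEAN) for _ in range(SAMPLE_SIZE)]
    return sum(draws) / SAMPLE_SIZE

means = [sample_mean() for _ in range(NUM_SAMPLES)]
se = POP_SD / SAMPLE_SIZE ** 0.5  # standard error predicted by the formula

within_two = sum(abs(m - POP_MEAN) <= 2 * se for m in means) / NUM_SAMPLES

print(round(statistics.mean(means), 2))   # should land close to 10
print(round(statistics.stdev(means), 2))  # should land close to se (1.0)
print(round(within_two, 3))               # should land close to 0.95
```

Even though the underlying population looks nothing like a bell curve, the sample means cluster around the population mean, their spread matches the standard error, and about 95 percent of them fall within two standard errors.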

That’s pretty much what statistical inference is about. The central limit 
theorem is what makes most of it possible. And until LeBron James wins as 
many NBA championships as Michael Jordan (six), the central limit 
theorem will be far more impressive than he is. 


* Note the clever use of false precision here. 

* When the standard deviation for the population is calculated from a smaller sample, the formula is 
tweaked slightly: $SE = s/\sqrt{n-1}$. This helps to account for the fact that the dispersion in a small sample 
may understate the dispersion of the full population. This is not highly relevant to the bigger points in 
this chapter. 

* My University of Chicago colleague Jim Sallee makes a very important critique of the missing-bus 
examples. He points out that very few buses ever go missing. So if we happen to be looking for a 
missing bus, any bus that turns up lost or crashed is likely to be that bus, regardless of the weight of 
the passengers on the bus. He’s right. (Think about it: if you lose your child in a supermarket, and the 
store manager tells you that there happens to be a lost child standing near register six, you would 
conclude immediately that it’s probably your child.) We’re therefore going to have to add one more 
element of absurdity to these examples and pretend that buses go missing all the time. 



CHAPTER 9 


Inference 
Why my statistics professor 
thought I might have cheated 


In the spring of my senior year of college, I took a statistics class. I wasn’t 
particularly enamored of statistics or of most math-based disciplines at that 
time, but I had promised my dad that I would take the course if I could 
leave school for ten days to go on a family trip to the Soviet Union. So, I 
basically took stats in exchange for a trip to the USSR. This turned out to be 
a great deal, both because I liked statistics more than I thought I would and 
because I got to visit the USSR in the spring of 1988. Who knew that the 
country wouldn’t be around in its communist form for much longer? 

This story is actually relevant to the chapter; the point is that I wasn’t as 
devoted to my statistics course during the term as I might have been. 
Among other responsibilities, I was also writing a senior honors thesis that 
was due about halfway through the term. We had regular quizzes in the 
statistics course, many of which I ignored or failed. I studied a little for the 
midterm and did passably well—literally. But a few weeks before the end of 
the term, two things happened. First, I finished my thesis, giving me 
copious amounts of new free time. And second, I realized that statistics 
wasn’t nearly as difficult as I had been making it out to be. I began studying 
the stats book and doing the work from earlier in the course. I earned an A 
on the final exam. 

That’s when my statistics professor, whose name I’ve long since 
forgotten, called me into his office. I don’t remember exactly what he said, 
but it was something along the lines of “You really did much better on the 
final than you did on the midterm.” This was not a congratulatory visit 
during which I was recognized for finally doing serious work in the class. 
There was an implicit accusation (though not an explicit one) in his 
summons; the expectation was that I would explain why I did so much 
better on the final exam than the midterm. In short, this guy suspected that I 
might have cheated. Now that I’ve taught for many years, I’m more 
sympathetic to his line of thinking. In nearly every course I’ve taught, there 
is a striking degree of correlation between a student’s performance on the 
midterm and on the final. It is highly unusual for a student to score below 
average on the midterm and then near the top of the class on the final. 

I explained that I had finished my thesis and gotten serious about the 
class (by doing things like reading the assigned textbook chapters and doing 
the homework). He seemed content with this explanation, and I left, still 
somewhat unsettled by the implicit accusation. 

Believe it or not, this anecdote embodies much of what you need to know 
about statistical inference, including both its strengths and its potential 
weaknesses. Statistics cannot prove anything with certainty. Instead, the 
power of statistical inference derives from observing some pattern or 
outcome and then using probability to determine the most likely 
explanation for that outcome. Suppose a strange gambler arrives in town 
and offers you a wager: He wins $1,000 if he rolls a six with a single die; 
you win $500 if he rolls anything else—a pretty good bet from your 
standpoint. He then proceeds to roll ten sixes in a row, taking $10,000 from 
you. 

One possible explanation is that he was lucky. An alternative explanation 
is that he cheated somehow. The probability of rolling ten sixes in a row 
with a fair die is roughly 1 in 60 million. You can’t prove that he cheated, 
but you ought at least to inspect the die. 
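
The "1 in 60 million" figure is easy to verify. A minimal sketch, assuming a fair six-sided die and independent rolls:

```python
from fractions import Fraction

p_six = Fraction(1, 6)       # chance of a six on one roll of a fair die
p_ten_sixes = p_six ** 10    # ten independent rolls in a row

print(p_ten_sixes)           # 1/60466176
print(float(p_ten_sixes))
```

Using exact fractions avoids rounding and shows the denominator directly: a bit more than 60 million.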

Of course, the most likely explanation is not always the right 
explanation. Extremely rare things happen. Linda Cooper is a South 
Carolina woman who has been struck by lightning four times. 1 (The Federal 
Emergency Management Agency estimates the probability of getting 
hit by lightning just once as 1 in 600,000.) Linda Cooper’s insurance 
company cannot deny her coverage simply because her injuries are 
statistically improbable. To return to my undergraduate statistics exam, the 
professor had reasonable cause to be suspicious. He saw a pattern that was 
highly unlikely; this is exactly how investigators spot cheating on 
standardized exams and how the SEC catches insider trading. But an 
unlikely pattern is just an unlikely pattern unless it is corroborated by 
additional evidence. Later in the chapter we will discuss errors that can 
arise when probability steers us wrong. 

For now, we should appreciate that statistical inference uses data to 
address important questions. Is a new drug effective in treating heart 
disease? Do cell phones cause brain cancer? Please note that I’m not 
claiming that statistics can answer these kinds of questions unequivocally; 
instead, inference tells us what is likely, and what is unlikely. Researchers 
cannot prove that a new drug is effective in treating heart disease, even 
when they have data from a carefully controlled clinical trial. After all, it is 
entirely possible that there will be random variation in the outcomes of 
patients in the treatment and control groups that is unrelated to the new 
drug. If 53 out of 100 patients taking the new heart disease medication 
showed marked improvement compared with 49 patients out of 100 
receiving a placebo, we would not immediately conclude that the new 
medication is effective. This is an outcome that can easily be explained by 
chance variation between the two groups rather than by the new drug. 

But suppose instead that 91 out of 100 patients receiving the new drug 
show marked improvement, compared with 49 out of 100 patients in the 
control group. It is still possible that this impressive result is unrelated to 
the new drug; the patients in the treatment group may be particularly lucky 
or resilient. But that is now a much less likely explanation. In the formal 
language of statistical inference, researchers would likely conclude the 
following: (1) If the experimental drug has no effect, we would rarely see 
this amount of variation in outcomes between those who are receiving the 
drug and those who are taking the placebo. (2) It is therefore highly 
improbable that the drug has no positive effect. (3) The alternative—and 
more likely—explanation for the pattern of data observed is that the 
experimental drug has a positive effect. 
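
To put rough numbers on the two drug scenarios, one common approach is a two-proportion z-test. The text does not say which test researchers would use, so treat the sketch below as one reasonable choice under the large-sample assumption; the function name `two_proportion_z` is hypothetical.

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the gap between two proportions."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Under the null hypothesis both groups share one underlying rate.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# 53 of 100 versus 49 of 100, then 91 of 100 versus 49 of 100.
for improved in (53, 91):
    z, p = two_proportion_z(improved, 100, 49, 100)
    print(improved, round(z, 2), p)
```

With 53 improvers the gap is about half a standard error and would arise by chance more often than not; with 91 improvers it is more than six standard errors, and the chance explanation becomes vanishingly unlikely.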

Statistical inference is the process by which the data speak to us, 
enabling us to draw meaningful conclusions. This is the payoff! The point 
of statistics is not to do myriad rigorous mathematical calculations; the 
point is to gain insight into meaningful social phenomena. Statistical 
inference is really just the marriage of two concepts that we’ve already 
discussed: data and probability (with a little help from the central limit 
theorem). I have taken one major methodological shortcut in this chapter. 
All of the examples will assume that we are working with large, properly 
drawn samples. This assumption means that the central limit theorem 
applies, and that the mean and standard deviation for any sample will be 
roughly the same as the mean and standard deviation for the population 
from which it is drawn. Both of these things make our calculations easier. 

Statistical inference is not dependent on this simplifying assumption, but 
the assorted methodological fixes for dealing with small samples or 
imperfect data often get in the way of understanding the big picture. The 
purpose here is to introduce the power of statistical inference and to explain 
how it works. Once you get that, it’s easy enough to layer on complexity. 

One of the most common tools in statistical inference is hypothesis testing. 
Actually, I’ve already introduced this concept—just without the fancy 
terminology. As noted above, statistics alone cannot prove anything; 
instead, we use statistical inference to accept or reject explanations on the 
basis of their relative likelihood. To be more precise, any statistical 
inference begins with an implicit or explicit null hypothesis. This is our 
starting assumption, which will be rejected or not on the basis of subsequent 
statistical analysis. If we reject the null hypothesis, then we typically accept 
some alternative hypothesis that is more consistent with the data observed. 
For example, in a court of law the starting assumption, or null hypothesis, is 
that the defendant is innocent. The job of the prosecution is to persuade the 
judge or jury to reject that assumption and accept the alternative hypothesis, 
which is that the defendant is guilty. As a matter of logic, the alternative 
hypothesis is a conclusion that must be true if we can reject the null 
hypothesis. Consider some examples. 

Null hypothesis: This new experimental drug is no more effective at 
preventing malaria than a placebo. 

Alternative hypothesis: This new experimental drug can help to prevent 
malaria. 

The data: One group is randomly chosen to receive the new experimental 
drug, and a control group receives a placebo. At the end of some period of 
time, the group receiving the experimental drug has far fewer cases of 
malaria than the control group. This would be an extremely unlikely 
outcome if the experimental drug had no medical impact. As a result, we 
reject the null hypothesis that the new drug has no impact (beyond that of a 
placebo), and we accept the logical alternative, which is our alternative 
hypothesis: This new experimental drug can help to prevent malaria. 

This methodological approach is strange enough that we should do one 
more example. Again, note that the null hypothesis and alternative 
hypothesis are logical complements. If one is true, the other is not true. Or, 
if we reject one statement, we must accept the other. 

Null hypothesis: Substance abuse treatment for prisoners does not reduce 
their rearrest rate after leaving prison. 

Alternative hypothesis: Substance abuse treatment for prisoners will 
make them less likely to be rearrested after they are released. 

The (hypothetical) data: Prisoners were randomly assigned into two 
groups; the “treatment” group received substance abuse treatment and the 
control group did not. (This is one of those cool occasions when the 
treatment group actually gets treatment!) At the end of five years, both 
groups have similar rearrest rates. In this case, we cannot reject the null 
hypothesis. The data have given us no reason to discard our beginning 
assumption that substance abuse treatment is not an effective tool for 
keeping ex-offenders from returning to prison. 

It may seem counterintuitive, but researchers often create a null 
hypothesis in hopes of being able to reject it. In both of the examples above, 
a research “success” (finding a new malaria drug or reducing recidivism) 
involved rejecting the null hypothesis. The data made that possible in only 
one of the cases (the malaria drug). 

In a courtroom, the threshold for rejecting the presumption of innocence is 
the qualitative assessment that the defendant is “guilty beyond a reasonable 
doubt.” The judge or jury is left to define what exactly that means. Statistics 
harnesses the same basic idea, but “guilty beyond a reasonable doubt” is 
defined quantitatively instead. Researchers typically ask: If the null 
hypothesis is true, how likely is it that we would observe this pattern of data 
by chance? To use a familiar example, medical researchers might ask: If this 
experimental drug has no effect on heart disease (our null hypothesis), how 
likely is it that 91 out of 100 patients getting the drug would show 
improvement compared with only 49 out of 100 patients getting a placebo? 
If the data suggest that the null hypothesis is extremely unlikely—as in this 
medical example—then we must reject it and accept the alternative 
hypothesis (that the drug is effective in treating heart disease). 

In that vein, let us revisit the Atlanta standardized-test cheating scandal 
alluded to at several points in the book. The Atlanta test score results were 
first flagged because of a high number of “wrong-to-right” erasures. 
Obviously students taking standardized exams erase answers all the time. 
And some groups of students may be particularly lucky in their changes, 
without any cheating necessarily being involved. For that reason, the null 
hypothesis is that the standardized test scores for any particular school 
district are legitimate and that any irregular patterns of erasures are merely 
a product of chance. We certainly do not want to be punishing students or 
administrators because an unusually high proportion of students happened 
to make sensible changes to their answer sheets in the final minutes of an 
important state exam. 

But “unusually high” does not begin to describe what was happening in 
Atlanta. Some classrooms had answer sheets on which the number of 
wrong-to-right erasures was twenty to fifty standard deviations above the 
state norm. (To put this in perspective, remember that most observations in 
a distribution typically fall within two standard deviations of the mean.) So 
how likely was it that Atlanta students happened to erase massive numbers 
of wrong answers and replace them with correct answers just as a matter of 
chance? The official who analyzed the data described the probability of the 
Atlanta pattern occurring without cheating as roughly equal to the chance of 
having 70,000 people show up for a football game at the Georgia Dome 
who all happen to be over seven feet tall. 2 Could it happen? Yes. Is it 
likely? Not so much. 

Georgia officials still could not convict anybody of wrongdoing, just as 
my professor could not (and should not) have had me thrown out of school 
because my final exam grade in statistics was out of sync with my midterm 
grade. Atlanta officials could not prove that cheating was going on. They 
could, however, reject the null hypothesis that the results were legitimate. 
And they could do so with a “high degree of confidence,” meaning that the 
observed pattern was nearly impossible among normal test takers. They 
therefore explicitly accepted the alternative hypothesis, which is that 
something fishy was going on. (I suspect they used more official-sounding 
language.) Subsequent investigation did in fact uncover the “smoking 
erasers.” There were reports of teachers changing answers, giving out 
answers, allowing low-scoring children to copy from high-scoring children, 
and even pointing to answers while standing over students’ desks. The most 
egregious cheating involved a group of teachers who held a weekend pizza 
party during which they went through exam sheets and changed students’ 
answers. 

In the Atlanta example, we could reject the null hypothesis of “no 
cheating” because the pattern of test results was so wildly improbable in the 
absence of foul play. But how implausible does the null hypothesis have to 
be before we can reject it and invite some alternative explanation? 

One of the most common thresholds that researchers use for rejecting a 
null hypothesis is 5 percent, which is often written in decimal form: .05. 
This probability is known as a significance level, and it represents the upper 
bound for the likelihood of observing some pattern of data if the null 
hypothesis were true. Stick with me for a moment, because it’s not really 
that complicated. 

Let’s think about a significance level of .05. We can reject a null 
hypothesis at the .05 level if there is less than a 5 percent chance of getting 
an outcome at least as extreme as what we’ve observed if the null 
hypothesis were true. A simple example can make this much clearer. I hate 
to do this to you, but assume once again that you’ve been put on 
missing-bus duty (in part because of your valiant efforts in the last chapter). Only 
now you are working full-time for the researchers at the Changing Lives 
study, and they have given you some excellent data to help inform your 
work. Each bus operated by the organizers of the study has roughly 60 
passengers, so we can treat the passengers on any bus as a random sample 
drawn from the entire Changing Lives population. You are awakened early 
one morning by the news that a bus in the Boston area has been hijacked by 
a pro-obesity terrorist group. Your job is to drop from a helicopter onto the 
roof of the moving bus, sneak inside through the emergency exit, and then 
stealthily determine whether the passengers are Changing Lives 
participants, solely on the basis of their weights. (Seriously, this is no more 
implausible than most action-adventure plots, and it’s a lot more 
educational.) 


As the helicopter takes off from the commando base, you are given a 
machine gun, several grenades, a watch that also functions as a 
high-resolution video camera, and the data that we calculated in the last chapter 
on the mean weight and standard error for samples drawn from the 
Changing Lives participants. Any random sample of 60 participants will 
have an expected mean weight of 162 pounds and standard deviation of 36 
pounds, since that is the mean and standard deviation for all participants in 
the study (the population). With those data, we can calculate the standard 
error for the sample mean: $s/\sqrt{n} = 36/\sqrt{60} = 36/7.75 = 4.6$. At mission control, the 
following distribution is scanned onto the inside of your right retina, so that 
you can refer to it after penetrating the moving bus and secretly weighing 
all the passengers inside. 

[Figure: Distribution of Sample Means; horizontal axis: mean weight for the sample]


As the distribution above shows, we would expect roughly 95 percent of 
all 60-person samples drawn from the Changing Lives participants to have a 
mean weight within two standard errors of the population mean, or roughly 
between 153 pounds and 171 pounds. Conversely, only 5 times out of 100 
would a sample of 60 persons randomly drawn from the Changing Lives 
participants have a mean weight that is greater than 171 pounds or less than 
153 pounds. (You are conducting what is known as a “two-tailed” 
hypothesis test; the difference between this and a “one-tailed” test will be 
covered in an appendix at the end of the chapter.) Your handlers on the 
counterterrorism task force have decided that .05 is the significance level 
for your mission. If the mean weight of the 60 passengers on the hijacked 
bus is above 171 or below 153, then you will reject the null hypothesis that 
the bus contains Changing Lives participants, accept the alternative 
hypothesis that the bus contains 60 people headed somewhere else, and 
await further orders. 

You successfully drop into the moving bus and secretly weigh all the 
passengers. The mean weight for this 60-person sample is 136 pounds, 
which falls more than two standard errors below the mean. (Another 
important clue is that all of the passengers are children wearing “Glendale 
Hockey Camp” T-shirts.) 
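
The mission rule can be written out as a short Python sketch. The numbers come from the example; the function name `reject_null` and the 165-pound comparison value are illustrative, and the cutoff is the two-standard-error rule of thumb used in the text.

```python
import math

POP_MEAN = 162   # pounds
POP_SD = 36      # pounds
N = 60           # passengers per bus

se = POP_SD / math.sqrt(N)
lower = POP_MEAN - 2 * se
upper = POP_MEAN + 2 * se

def reject_null(sample_mean):
    """True if the sample mean lies outside two standard errors of the mean."""
    return sample_mean < lower or sample_mean > upper

print(round(lower), round(upper))   # 153 171
print(reject_null(165))             # False: consistent with a Changing Lives bus
print(reject_null(136))             # True: the hijacked bus
```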

Per your mission instructions, you can reject the null hypothesis that this 
bus contains a random sample of 60 Changing Lives study participants at 
the .05 significance level. This means (1) the mean weight on the bus falls 
into a range that we would expect to observe only 5 times in 100 if the null 
hypothesis were true and this were really a bus full of Changing Lives 
passengers; (2) you can reject the null hypothesis at the .05 significance 
level; and (3) on average, 95 times out of 100 you will have correctly 
rejected the null hypothesis, and 5 times out of 100 you will be wrong, 
meaning that you have concluded that this is not a bus of Changing Lives 
participants, when in fact it is. This sample of Changing Lives folks just 
happens to have a mean weight that is particularly high or low relative to 
the mean for the study participants overall. 
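
Point (1) can be checked by simulation: load many buses with genuine Changing Lives participants and count how often the two-standard-error rule flags one anyway. The sketch below assumes, purely for illustration, that individual weights are normally distributed; the trial count and random seed are arbitrary.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

POP_MEAN, POP_SD, N = 162, 36, 60
se = POP_SD / N ** 0.5
TRIALS = 5000

def bus_mean():
    """Mean weight of one bus of N randomly chosen (simulated) participants."""
    return sum(rng.gauss(POP_MEAN, POP_SD) for _ in range(N)) / N

# Count buses flagged by the rule even though the null hypothesis is true.
false_rejections = sum(abs(bus_mean() - POP_MEAN) > 2 * se for _ in range(TRIALS))
rate = false_rejections / TRIALS
print(rate)
```

The rate should come out a little under 5 percent, because exactly two standard errors covers about 95.4 percent of a normal curve.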

The mission is not quite over. Your handler at mission control (played by 
Angelina Jolie in the film version of this example) asks you to calculate a 
p-value for your result. The p-value is the specific probability of getting a 
result at least as extreme as the one you’ve observed if the null hypothesis 
is true. The mean weight for the passengers on this bus is 136, which is 5.7 
standard errors below the mean for the Changing Lives study participants. 
The probability of getting a result at least that extreme if this really were a 
sample of Changing Lives participants is less than .0001. (In a research 
document, this would be reported as p<.0001.) With your mission complete, 
you leap from the moving bus and land safely in the passenger seat of a 
convertible driving in an adjacent lane. 
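
Here is a minimal sketch of that p-value calculation, assuming the sample means follow a normal curve and using a two-tailed test. Note that it uses the unrounded standard error, so the distance comes out at about 5.6 standard errors rather than the 5.7 obtained with the rounded figure of 4.6.

```python
import math

POP_MEAN, POP_SD, N = 162, 36, 60
sample_mean = 136

se = POP_SD / math.sqrt(N)
z = (sample_mean - POP_MEAN) / se
p_value = math.erfc(abs(z) / math.sqrt(2))   # two-tailed probability

print(round(z, 1))
print(p_value)
print(p_value < 0.0001)   # True
```

Either way the conclusion is the same: the probability is far below .0001.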

[This story has a happy ending as well. Once the pro-obesity terrorists 
learn more about your city’s International Festival of Sausage, they agree to 
abandon violence and work peacefully to promote obesity by expanding 
and promoting sausage festivals around the world.] 

If the .05 significance level seems somewhat arbitrary, that’s because it is. 
There is no single standardized statistical threshold for rejecting a null 
hypothesis. Both .01 and .1 are also reasonably common thresholds for 
doing the kind of analysis described above. 

Obviously rejecting the null hypothesis at the .01 level (meaning that 
there is less than a 1 in 100 chance of observing a result in this range if the 
null hypothesis were true) carries more statistical heft than rejecting the null 
hypothesis at the .1 level (meaning that there is less than a 1 in 10 chance of 
observing this result if the null hypothesis were true). The pros and cons of 
different significance levels will be discussed later in the chapter. For now, 
the important point is that when we can reject a null hypothesis at some 
reasonable significance level, the results are said to be “statistically 
significant.” 

Here is what that means in real life. When you read in the newspaper that 
people who eat twenty bran muffins a day have lower rates of colon cancer 
than people who don’t eat prodigious amounts of bran, the underlying 
academic research probably looked something like this: (1) In some large 
data set, researchers determined that individuals who ate at least twenty 
bran muffins a day had a lower incidence of colon cancer than individuals 
who did not report eating much bran. (2) The researchers’ null hypothesis 
was that eating bran muffins has no impact on colon cancer. (3) The 
disparity in colon cancer outcomes between those who ate lots of bran 
muffins and those who didn’t could not easily be explained by chance 
alone. More specifically, if eating bran muffins has no true association with 
colon cancer, the probability of getting such a wide gap in cancer incidence 
between bran eaters and non-bran eaters by chance alone is lower than 
some threshold, such as .05. (This threshold should be established by the 
researchers before they do their statistical analysis to avoid choosing a 
threshold after the fact that is convenient for making the results look 
significant.) (4) The academic paper probably contains a conclusion that 
says something along these lines: “We find a statistically significant 
association between daily consumption of twenty or more bran muffins and 
a reduced incidence of colon cancer. These results are significant at the .05 
level.” 

When I subsequently read about that study in the Chicago Sun-Times 
while eating my breakfast of bacon and eggs, the headline is probably more 
direct and interesting: “20 Bran Muffins a Day Help Keep Colon Cancer 
Away.” However, that newspaper headline, while much more interesting to 
read than the academic paper, may also be introducing a serious inaccuracy. 
The study does not actually claim that eating bran muffins lowers an 
individual’s risk of getting colon cancer; it merely shows a negative 
correlation between the consumption of bran muffins and the incidence of 
colon cancer in one large data set. This statistical association is not 
sufficient to prove that the bran muffins cause the improved health 
outcome. After all, the kind of people who eat bran muffins (particularly 
twenty a day!) may do lots of other things that lower their cancer risk, such 
as eating less red meat, exercising regularly, getting screened for cancer, 
and so on. (This is the “healthy user bias” from Chapter 7.) Is it the bran 
muffins at work here, or is it other behaviors or personal attributes that 
happen to be shared by people who eat a lot of bran muffins? This 
distinction between correlation and causation is crucial to the proper 
interpretation of statistical results. We will revisit the idea that “correlation 
does not equal causation” later in the book. 

I should also point out that statistical significance says nothing about the 
size of the association. People who eat lots of bran muffins may have a 
lower incidence of colon cancer—but how much lower? The difference in 
colon cancer rates for bran muffin eaters and non-bran muffin eaters may 
be trivial; the finding of statistical significance means only that the 
observed effect, however tiny, is not likely to be a coincidence. Suppose 
you stumble across a well-designed study that has found a statistically 
significant positive relationship between eating a banana before the SAT 
and achieving a higher score on the math portion of the test. One of the first 
questions you want to ask is: How big is this effect? It could easily be .9 
points; on a test with a mean score of 500, that is not a life-changing figure. 
In Chapter 11, we will return to this crucial distinction between size and 
significance when it comes to interpreting statistical results. 
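The gap between size and significance can be sketched in a few lines of Python. This is an illustration only: the standard deviation of 100 points and the sample sizes are assumptions, not figures from any study, and the helper name `two_sided_p` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(diff, sd, n_per_group):
    """Two-sided p-value for a difference in means between two equal-sized groups.

    Uses the normal approximation; sd is the assumed standard deviation
    of individual scores in each group.
    """
    se = sd * sqrt(2 / n_per_group)  # standard error of the difference in means
    z = diff / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Assumed numbers: a 0.9-point gain on a test whose scores have SD 100.
for n in (1_000, 100_000, 1_000_000):
    print(n, two_sided_p(0.9, 100, n))
```

The effect stays at 0.9 points throughout; only the sample size changes. With enough test takers the same tiny difference crosses the .05 threshold, which is why a significant result still needs the follow-up question about magnitude.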

Meanwhile, a finding that there is “no statistically significant 
association” between two variables means that any relationship between the
two variables can reasonably be explained by chance alone. The New York
Times recently ran an exposé on technology companies peddling software
that they claim improves student performance, when the data suggest 
otherwise. 3 According to the article, Carnegie Mellon University sells a 
software program called Cognitive Tutor with this bold claim: 
“Revolutionary Math Curricula. Revolutionary Results.” Yet an assessment 
of Cognitive Tutor conducted by the U.S. Department of Education 
concluded that the product had “no discernible effects” on the test scores of 
high school students. (The Times suggested that the appropriate marketing 
campaign should be “Undistinguished Math Curricula. Unproven Results.”) 
In fact, a study of ten software products designed to teach skills such as 
math or reading found that nine of them “did not have statistically 
significant effects on test scores.” In other words, federal researchers cannot 
rule out mere chance as the cause of any variation in the performance of 
students who use these software products and students who do not. 

Let me pause here to remind you why all of this matters. An article in the 
Wall Street Journal in May of 2011 carried the headline “Link in Autism, 
Brain Size.” This is an important breakthrough, as the causes of autism 
spectrum disorder remain elusive. The first sentence of the Wall Street 
Journal story, which summarized a paper published in the Archives of 
General Psychiatry, reports, “Children with autism have larger brains than 
children without the disorder, and the growth appears to occur before age 2, 
according to a new study released on Monday.” 4 On the basis of brain 
imaging conducted on 59 children with autism and 38 children without 
autism, researchers at the University of North Carolina reported that 
children with autism have brains that are up to 10 percent larger than those 
of children of the same age without autism. 

Here is the relevant medical question: Is there a physiological difference 
in the brains of young children who have autism spectrum disorder? If so, 
this insight might lead to a better understanding of what causes the disorder 
and how it can be treated or prevented. 

And here is the relevant statistical question: Can researchers make 
sweeping inferences about autism spectrum disorder in general that are 
based on a study of a seemingly small group of children with autism (59) 
and an even smaller control group (38)—a mere 97 subjects in all? The
answer is yes. The researchers concluded that the probability of observing
the differences in total brain size that they found in their two samples would 
be a mere 2 in 1,000 (p = .002) if there is in fact no real difference in brain 
size between children with and without autism spectrum disorder in the 
overall population. 

I tracked down the original study in the Archives of General Psychiatry. 5
The methods used by these researchers are no more sophisticated than the 
concepts we’ve covered so far. I will give you a quick tour of the 
underpinnings of this socially and statistically significant result. First, you 
should recognize that each group of children, the 59 with autism and the 38 
without autism, constitutes a reasonably large sample drawn from their 
respective populations—all children with and without autism spectrum 
disorder. The samples are large enough that the central limit theorem will apply. If
you’ve already tried to block the last chapter out of your mind, I will 
remind you of what the central limit theorem tells us: (1) the sample means 
for any population will be distributed roughly as a normal distribution 
around the true population mean; (2) we would expect the sample mean and 
the sample standard deviation to be roughly equal to the mean and standard 
deviation for the population from which it is drawn; and (3) roughly 68 
percent of sample means will lie within one standard error of the population 
mean, roughly 95 percent will lie within two standard errors of the 
population mean, and so on. 

In less technical language, this all means that any sample should look a 
lot like the population from which it is drawn; while every sample will be 
different, it would be relatively rare for the mean of a properly drawn 
sample to deviate by a huge amount from the mean for the relevant 
underlying population. Similarly, we would also expect two samples drawn 
from the same population to look a lot like each other. Or, to think about the 
situation somewhat differently, if we have two samples that have extremely 
dissimilar means, the most likely explanation is that they came from 
different populations. 

Here is a quick intuitive example. Suppose your null hypothesis is that 
male professional basketball players have the same mean height as the rest 
of the adult male population. You randomly select a sample of 50 
professional basketball players and a sample of 50 men who do not play
professional basketball. Suppose the mean height of your basketball sample
is 6 feet 7 inches, and the mean height of the non-basketball players is 5 
feet 10 inches (a 9-inch difference). What is the probability of observing 
such a large difference in mean height between the two samples if in fact 
there is no difference in average height between professional basketball 
players and all other men in the overall population? The nontechnical 
answer: very, very, very low. 

The autism research paper has the same basic methodology. The paper 
compares several measures of brain size between the samples of children. 
(The brain measurements were done with magnetic resonance imaging at 
age two, and again between ages four and five.) I’ll focus on just one 
measurement, the total brain volume. The researchers’ null hypothesis was 
presumably that there are no anatomical differences in the brains of children 
with and without autism. The alternative hypothesis is that the brains of 
children with autism spectrum disorder are fundamentally different. Such a 
finding would still leave lots of questions, but it would point to a direction 
for further inquiry. 

In this study, the children with autism spectrum disorder had a mean 
brain volume of 1310.4 cubic centimeters; the children in the control group 
had a mean brain volume of 1238.8 cubic centimeters. Thus, the difference 
in average brain volume between the two groups is 71.6 cubic centimeters. 
How likely would this result be if in fact there were no difference in 
average brain size in the general population between children who have 
autism spectrum disorder and children who do not? 

You may recall from the last chapter that we can create a standard error 
for each of our samples: s/√n, where s is the standard deviation of the
sample and n is the number of observations. The research paper gives us 
these figures. The standard error for the total brain volume of the 59 
children in the autism spectrum disorder sample is 13 cubic centimeters; the 
standard error for the total brain volume of the 38 children in the control 
group is 18 cubic centimeters. You will recall that the central limit theorem 
tells us that for 95 samples out of 100, the sample mean is going to lie 
within two standard errors of the true population mean, in one direction or 
the other. 


As a result, we can infer from our sample that 95 times out of 100 the 
interval of 1310.4 cubic centimeters ± 26 (which is two standard errors) will 
contain the average brain volume for all children with autism spectrum 
disorder. This expression is called a confidence interval. We can say with 95 
percent confidence that the range 1284.4 to 1336.4 cubic centimeters 
contains the average total brain volume for children in the general 
population with autism spectrum disorder. 

Using the same methodology, we can say with 95 percent confidence that 
the interval of 1238.8 ± 36, or between 1202.8 and 1274.8 cubic 
centimeters, will include the average brain volume for children in the 
general population who do not have autism spectrum disorder. 
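The two intervals above can be reproduced with a short Python sketch. The means and standard errors are the ones quoted in the chapter; the function names are hypothetical, and the multiplier of 2 follows the chapter's rounding rather than the more precise 1.96.

```python
def confidence_interval(mean, standard_error, multiplier=2):
    """Return (lower, upper) bounds: mean plus or minus multiplier standard errors."""
    half_width = multiplier * standard_error
    return (mean - half_width, mean + half_width)


def intervals_overlap(a, b):
    """True if the two (lower, upper) intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


autism = confidence_interval(1310.4, 13)   # children with autism spectrum disorder
control = confidence_interval(1238.8, 18)  # control group

print([round(v, 1) for v in autism])
print([round(v, 1) for v in control])
print(intervals_overlap(autism, control))
```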

Yes, there are a lot of numbers here. Perhaps you’ve just hurled the book 
across the room.* If not, or if you then went and retrieved the book, what 
you should notice is that our confidence intervals do not overlap. The lower 
bound of our 95 percent confidence interval for the average brain size of 
children with autism in the general population (1284.4 cubic centimeters) is 
still higher than the upper bound for the 95 percent confidence interval for 
the average brain size for young children in the population without autism 
(1274.8 cubic centimeters), as the following diagram illustrates. 

[Figure: the 95% confidence interval for the non-autism general population lies entirely below the 95% confidence interval for children with autism spectrum disorder.]

This is the first clue that there may be an underlying anatomical
difference in the brains of young children with autism spectrum disorder. 
Still, it’s just a clue. All of these inferences are based on data from fewer 
than 100 children. Maybe we just have wacky samples. 

One final statistical procedure can bring all this to fruition. If statistics 
were an Olympic event like figure skating, this would be the last program, 
after which elated fans throw bouquets of flowers onto the ice. We can 
calculate the exact probability of observing a difference of means at least 
this large (1310.4 cubic centimeters versus 1238.8 cubic centimeters) if 
there is really no difference in brain size between children with autism
spectrum disorder and all other children in the general population. We can
find a p-value for the observed difference in means.

Lest you hurl the book across the room again, I have put the formula in 
an appendix. The intuition is quite straightforward. If we draw two large 
samples from the same population, we would expect them to have very 
similar means. In fact, our best guess is that they will have identical means. 
For example, if I were to select 100 players from the NBA and they had an 
average height of 6 feet 7 inches, then I would expect another random 
sample of 100 players from the NBA to have a mean height close to 6 feet 7 
inches. Okay, maybe the two samples would be an inch or 2 apart. But it’s 
less likely that the means of the two samples will be 4 inches apart—and 
even less likely that there will be a difference of 6 or 8 inches. It turns out 
that we can calculate a standard error for the difference between two sample 
means; this standard error gives us a measure of the dispersion we can 
expect, on average, when we subtract one sample mean from the other. (As 
noted earlier, the formula is in the chapter appendix.) The important thing is 
that we can use this standard error to calculate how likely a difference
this large would be if the two samples came from the same population.
Here is how it works:

1. If two samples are drawn from the same population, our best guess 
for the difference between their means is zero. 

2. The central limit theorem tells us that in repeated samples, the 
difference between the two means will be distributed roughly as a 
normal distribution. (Okay, have you come to love the central limit 
theorem yet or not?) 

3. If the two samples really have come from the same population, then 
in roughly 68 cases out of 100, the difference between the two sample 
means will be within one standard error of zero. And in roughly 95 
cases out of 100, the difference between the two sample means will 
be within two standard errors of zero. And in 99.7 cases out of 100, 
the difference will be within three standard errors of zero—which 
turns out to be what motivates the conclusion in the autism research 
paper that we started with. 
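The 68, 95, and 99.7 figures in the steps above come straight from the normal distribution, and can be checked with Python's standard library. This is a sketch; `share_within` is a hypothetical helper name.

```python
from statistics import NormalDist


def share_within(k):
    """Share of a normal distribution lying within k standard errors of its mean."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(share_within(k), 3))
```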

As noted earlier, the difference in the mean brain size between the 
sample of children with autism spectrum disorder and the control group is
71.6 cubic centimeters. The standard error on that difference is 22.7,
meaning that the difference in means between the two samples is more than 
three standard errors from zero; we would expect an outcome this extreme 
(or more so) only 2 times in 1,000 if these samples are drawn from an 
identical population. 

In the paper published in the Archives of General Psychiatry, the authors 
report a p-value of .002, as I mentioned earlier. Now you know where it 
came from! 
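For readers who want to see the arithmetic, here is a minimal Python sketch that turns the chapter's two numbers into a p-value. It assumes a two-sided test and the normal approximation; the paper's own procedure may differ in detail.

```python
from statistics import NormalDist

difference = 1310.4 - 1238.8  # difference in mean brain volume, cubic centimeters
standard_error = 22.7         # standard error of that difference, as reported

# How many standard errors the observed difference lies from zero.
z = difference / standard_error

# Probability of a difference at least this extreme in either direction
# if the two samples really came from the same population.
p_two_sided = 2 * (1 - NormalDist().cdf(z))

print(round(z, 2), round(p_two_sided, 3))
```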

For all the wonders of statistical inference, there are some significant 
pitfalls. They derive from the example that introduced the chapter: my 
suspicious statistics professor. The powerful process of statistical inference 
is based on probability, not on some kind of cosmic certainty. We don’t 
want to be sending people to jail just for doing the equivalent of drawing 
two royal flushes in a row; it can happen, even if someone is not cheating. 
As a result, we have a fundamental dilemma when it comes to any kind of 
hypothesis testing. 

This statistical reality came to a head in 2011 when the Journal of 
Personality and Social Psychology prepared to publish an academic paper 
that, on the surface, seemed like thousands of other academic papers. 6 A 
Cornell professor explicitly proposed a null hypothesis, conducted an 
experiment to test his null hypothesis, and then rejected the null hypothesis 
at the .05 significance level on the basis of the experimental results. The result
was uproar, in scientific circles as well as mainstream media outlets like the 
New York Times. 

Suffice it to say that articles in the Journal of Personality and Social 
Psychology don’t usually attract big headlines. What exactly made this 
study so controversial? The researcher in question was testing humans’ 
capacity to exercise extrasensory perception, or ESP. The null hypothesis 
was that ESP does not exist; the alternative hypothesis was that humans do 
have extrasensory powers. To study this question, the researcher recruited a 
large sample of participants to examine two “curtains” posted on a 
computer screen. A software program randomly put an erotic photo behind 
one curtain or the other. In repeated trials, study participants were able to 
pick the curtain with the erotic photo behind it 53 percent of the time, 
whereas probability says they would be right only 50 percent of the time. 


Because of the large sample size, the researcher was able to reject the null 
hypothesis that extrasensory perception does not exist and accept instead 
the alternative hypothesis that extrasensory perception can enable 
individuals to sense future events. The decision to publish the paper was 
widely criticized on the grounds that a single statistically significant event 
can easily be a product of chance, especially when there is no other 
evidence corroborating or even explaining the finding. The New York Times 
summarized the critiques: “Claims that defy almost every law of science are 
by definition extraordinary and thus require extraordinary evidence. 
Neglecting to take this into account—as conventional social science 
analyses do—makes many findings look far more significant than they 
really are.” 
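A rough sketch shows why sample size matters so much here. The trial counts below are hypothetical (the chapter does not give the number of trials), and the calculation uses a simple normal approximation for a proportion, which is an assumption about method rather than a description of the paper's analysis.

```python
from math import sqrt
from statistics import NormalDist


def p_value_for_hit_rate(hit_rate, trials, chance=0.5):
    """Two-sided p-value for an observed hit rate against a chance rate."""
    se = sqrt(chance * (1 - chance) / trials)  # standard error under the null
    z = (hit_rate - chance) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# The same 53 percent hit rate at three hypothetical numbers of trials.
for trials in (100, 2_000, 10_000):
    print(trials, p_value_for_hit_rate(0.53, trials))
```

The hit rate never changes, yet the p-value shrinks as the number of trials grows. A small p-value says the result is unlikely under pure chance; it does not say which explanation, ESP or something more mundane, accounts for it.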

One answer to this kind of nonsense would appear to be a more rigorous 
threshold for defining statistical significance, such as .001.* But that creates
problems of its own. Choosing an appropriate significance level involves an 
inherent trade-off. 

If our burden of proof for rejecting the null hypothesis is too low (e.g., 
.1), we are going to find ourselves periodically rejecting the null hypothesis 
when in fact it is true (as I suspect was the case with the ESP study). In 
statistical parlance, this is known as a Type I error. Consider the example of 
an American courtroom, where the null hypothesis is that a defendant is not 
guilty and the threshold for rejecting that null hypothesis is “guilty beyond 
a reasonable doubt.” Suppose we were to relax that threshold to something 
like “a strong hunch that the guy did it.” This is going to ensure that more 
criminals go to jail—and also more innocent people. In a statistical context, 
this is the equivalent of having a relatively low significance level, such as
.1.

Well, 1 in 10 is not exactly wildly improbable. Consider this challenge in 
the context of approving a new cancer drug. For every ten drugs that we 
approve with this relatively low burden of statistical proof, one of them 
does not actually work and showed promising results in the trial just by 
chance. (Or, in the courtroom example, for every ten defendants that we 
find guilty, one of them was actually innocent.) A Type I error involves 
wrongly rejecting a null hypothesis. Though the terminology is somewhat 
counterintuitive, this is also known as a “false positive.” Here is one way to
reconcile the jargon. When you go to the doctor to get tested for some
disease, the null hypothesis is that you do not have that disease. If the lab 
results can be used to reject the null hypothesis, then you are said to test 
positive. And if you test positive but are not really sick, then it’s a false 
positive. 

In any case, the lower our statistical burden for rejecting the null 
hypothesis, the more likely it is to happen. Obviously we would prefer not 
to approve ineffective cancer drugs, or send innocent defendants to prison. 

But there is a tension here. The higher the threshold for rejecting the null 
hypothesis, the more likely it is that we will fail to reject a null hypothesis 
that ought to be rejected. If we require five eyewitnesses in order to convict 
every criminal defendant, then a lot of guilty defendants are wrongly going 
to be set free. (Of course, fewer innocents will go to prison.) If we adopt a 
.001 significance level in the clinical trials for all new cancer drugs, then we 
will indeed minimize the approval of ineffective drugs. (There is only a 1 in 
1,000 chance of wrongly rejecting the null hypothesis that the drug is no 
more effective than a placebo.) Yet now we introduce the risk of not 
approving many effective drugs because we have set the bar for approval so 
high. This is known as a Type II error, or false negative. 
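The trade-off can be made concrete with a small Python sketch. It assumes a one-tailed test and a drug whose true benefit is 2.5 standard errors above zero; that effect size is invented for illustration, and `type_ii_rate` is a hypothetical helper name.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def type_ii_rate(alpha, true_effect_in_se):
    """Chance of failing to reject the null when the effect is real.

    alpha is the significance level (the Type I error rate we accept);
    true_effect_in_se is the real effect, measured in standard errors.
    """
    cutoff = STANDARD_NORMAL.inv_cdf(1 - alpha)  # rejection threshold
    power = 1 - STANDARD_NORMAL.cdf(cutoff - true_effect_in_se)
    return 1 - power


for alpha in (0.1, 0.05, 0.001):
    print(alpha, round(type_ii_rate(alpha, 2.5), 3))
```

Under these assumptions, tightening the significance level from .1 to .001 lowers the chance of approving a useless drug but sharply raises the chance of turning away one that works.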

Which kind of error is worse? That depends on the circumstances. The 
most important point is that you recognize the trade-off. There is no 
statistical “free lunch.” Consider these nonstatistical situations, all of which 
involve a trade-off between Type I and Type II errors. 

1. Spam filters. The null hypothesis is that any particular e-mail 
message is not spam. Your spam filter looks for clues that can be used 
to reject that null hypothesis for any particular e-mail, such as huge 
distribution lists or phrases like “penis enlargement.” A Type I error 
would involve screening out an e-mail message that is not actually 
spam (a false positive). A Type II error would involve letting spam 
through the filter into your inbox (a false negative). Given the costs of 
missing an important e-mail relative to the costs of getting the 
occasional message about herbal vitamins, most people would 
probably err on the side of allowing Type II errors. An optimally 
designed spam filter should require a relatively high degree of
certainty before rejecting the null hypothesis that an incoming e-mail
is legitimate and blocking it. 

2. Screening for cancer. We have numerous tests for the early detection 
of cancer, such as mammograms (breast cancer), the PSA test 
(prostate cancer), and even full-body MRI scans for anything else that 
might look suspicious. The null hypothesis for anyone undergoing 
this kind of screening is that no cancer is present. The screening is 
used to reject this null hypothesis if the results are suspicious. The 
assumption has always been that a Type I error (a false positive that 
turns out to be nothing) is far preferable to a Type II error (a false 
negative that misses a cancer diagnosis). Historically, cancer 
screening has been the opposite of the spam example. Doctors and 
patients are willing to tolerate a fair number of Type I errors (false 
positives) in order to avoid the possibility of a Type II error (missing 
a cancer diagnosis). More recently, health policy experts have begun 
to challenge this view because of the high costs and serious side 
effects associated with false positives. 

3. Capturing terrorists. Neither a Type I nor a Type II error is acceptable 
in this situation, which is why society continues to debate about the 
appropriate balance between fighting terrorism and protecting civil 
liberties. The null hypothesis is that an individual is not a terrorist. As 
in a regular criminal context, we do not want to commit a Type I error 
and send innocent people to Guantanamo Bay. Yet in a world with 
weapons of mass destruction, letting even a single terrorist go free (a 
Type II error) can be literally catastrophic. This is why—whether you 
approve of it or not—the United States is holding suspected terrorists 
at Guantanamo Bay on the basis of less evidence than might be 
required to convict them in a regular criminal court. 

Statistical inference is not magic, nor is it infallible, but it is an 
extraordinary tool for making sense of the world. We can gain great insight 
into many life phenomena just by determining the most likely explanation. 
Most of us do this all the time (e.g., “I think that college student passed out 
on the floor surrounded by beer cans has had too much to drink” rather than 
“I think that college student passed out on the floor surrounded by beer cans 
has been poisoned by terrorists”). 



Statistical inference merely formalizes the process. 

APPENDIX TO CHAPTER 9 

Calculating the standard error for a difference of means 

Formula for comparing two means 

$$
\frac{\bar{x} - \bar{y}}{\sqrt{\dfrac{s_x^2}{n_x} + \dfrac{s_y^2}{n_y}}}
$$

The numerator yields the size of the difference in means. The denominator
yields the standard error for a difference in mean between two samples.

where $\bar{x}$ = mean for sample x
$\bar{y}$ = mean for sample y
$s_x$ = standard deviation for sample x
$s_y$ = standard deviation for sample y
$n_x$ = number of observations in sample x
$n_y$ = number of observations in sample y

Our null hypothesis is that the two population means are the same. The formula
above calculates the observed difference in means relative to the size of the 
standard error for the difference in means. Again, we lean heavily on the 
normal distribution. If the underlying population means are truly the same, 
then we would expect the difference in sample means to be less than one 
standard error about 68 percent of the time; less than two standard errors 
about 95 percent of the time; and so on. 

In the autism example from the chapter, the difference in the mean 
between the two samples was 71.6 cubic centimeters with a standard error 
of 22.7. The ratio of that observed difference to its standard error is
3.15, meaning that the two samples have means that are more than 3
standard errors apart. As noted in
the chapter, the probability of getting samples with such different means if 
the underlying populations have the same mean is very, very low. 
Specifically, the probability of observing a difference of means that is 3.15 
standard errors or larger is .002. 
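The formula can be written as a pair of small Python functions. This is a sketch, not the paper's own computation. The chapter quotes standard errors (13 and 18) rather than standard deviations, so the example backs out each standard deviation as the standard error times the square root of the sample size; that step is an assumption made here for illustration.

```python
from math import sqrt


def se_of_difference(s_x, n_x, s_y, n_y):
    """Standard error for a difference in means between two samples."""
    return sqrt(s_x**2 / n_x + s_y**2 / n_y)


def difference_in_standard_errors(mean_x, mean_y, s_x, n_x, s_y, n_y):
    """Observed difference in means divided by its standard error."""
    return (mean_x - mean_y) / se_of_difference(s_x, n_x, s_y, n_y)


# Back out the sample standard deviations from the quoted standard errors.
s_autism = 13 * sqrt(59)
s_control = 18 * sqrt(38)

se = se_of_difference(s_autism, 59, s_control, 38)
ratio = difference_in_standard_errors(1310.4, 1238.8, s_autism, 59, s_control, 38)
print(round(se, 1), round(ratio, 2))
```

With these rounded inputs the standard error comes out near 22.2 rather than the 22.7 reported, presumably because the figures quoted in the chapter are rounded. Either way the difference is more than three standard errors from zero.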


[Figure: distribution of the difference in sample means]

One- and Two-Tailed Hypothesis Testing 

This chapter introduced the idea of using samples to test whether male 
professional basketball players are the same height as the general 
population. I finessed one detail. Our null hypothesis is that male basketball 
players have the same mean height as men in the general population. What I 
glossed over is that we have two possible alternative hypotheses. 

One alternative hypothesis is that male professional basketball players 
have a different mean height than the overall male population; they could be 
taller than other men in the population, or shorter. This was the approach 
that you took when you dropped into the hijacked bus and weighed the 
passengers to determine whether they were participants in the Changing 
Lives study. You could reject the null hypothesis that the bus participants 
were part of the study if the passengers’ mean weight was significantly 
higher than the overall mean for Changing Lives participants or if it was 
significantly lower (as turned out to be the case). Our second alternative 
hypothesis is that male professional basketball players are taller on average 
than other men in the population. In this case, the background knowledge 
that we bring to this question tells us that basketball players cannot possibly 
be shorter than the general population. The distinction between these two 
alternative hypotheses will determine whether we do a one-tailed 
hypothesis test or a two-tailed hypothesis test. 

In both cases, let’s assume that we are going to do a significance test at 
the .05 level. We will reject our null hypothesis if we observe a difference 
in heights between the two samples that would occur 5 times in 100 or less 
if all these guys really are the same height. So far, so good. 







Here is where things get a little more nuanced. When our alternative
hypothesis is that basketball players are taller than other men, we are going 
to do a one-tailed hypothesis test. We will measure the difference in mean 
height between our sample of male basketball players and our sample of 
regular men. We know that if our null hypothesis is true, then we will 
observe a difference that is 1.64 standard errors or greater only 5 times in 
100. We reject our null hypothesis if our result falls in this range, as the 
following diagram shows. 

[Figure: difference in sample means, measured in standard errors; the rejection range lies in the upper tail, at 1.64 standard errors or greater]



Now let’s revisit the other alternative hypothesis—that male basketball 
players could be taller or shorter than the general population. Our general 
approach is the same. Again, we will reject our null hypothesis that 
basketball players are the same height as the general population if we get a 
result that would occur 5 times in 100 or less if there really is no difference 
in heights. The difference, however, is that we must now entertain the 
possibility that basketball players are shorter than the general population. 
We will therefore reject our null hypothesis if our sample of male basketball 
players has a mean height that is significantly higher or lower than the 
mean height for our sample of normal men. This requires a two-tailed 
hypothesis test. The cutoff points for rejecting our null hypothesis will be 
different because we must now account for the possibility of a large 
difference in sample means in both directions: positive or negative. More
specifically, the range in which we will reject our null hypothesis has been
split between the two tails. We will still reject our null hypothesis if we get 
an outcome that would occur 5 percent of the time or less if basketball 
players are the same height as the general population; only now we have 
two different ways that we can end up rejecting the null hypothesis. 

We will reject our null hypothesis if the mean height for the sample of 
male basketball players is so much larger than the mean for the normal men 
that we would observe such an outcome only 2.5 times in 100 if basketball 
players are really the same height as everyone else. 

And we will reject our null hypothesis if the mean height for the sample 
of male basketball players is so much smaller than the mean for the normal 
men that we would observe such an outcome only 2.5 times in 100 if 
basketball players are really the same height as everyone else. 

Together, these two contingencies add up to 5 percent, as the diagram 
below illustrates. 


[Figure: difference in sample means, measured in standard errors; the rejection range is split between the two tails]


Judgment should inform whether a one- or a two-tailed hypothesis is 
more appropriate for the analysis being conducted. 
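The cutoff points for the two kinds of test can be computed directly. The following Python sketch assumes the .05 significance level used above and a normal distribution; the function name `rejects_null` is hypothetical.

```python
from statistics import NormalDist

ALPHA = 0.05
STANDARD_NORMAL = NormalDist()

# One-tailed: the whole 5 percent rejection range sits in the upper tail.
one_tailed_cutoff = STANDARD_NORMAL.inv_cdf(1 - ALPHA)

# Two-tailed: 2.5 percent sits in each tail, so the cutoff moves outward.
two_tailed_cutoff = STANDARD_NORMAL.inv_cdf(1 - ALPHA / 2)


def rejects_null(difference_in_se, two_tailed):
    """Decide whether a difference (in standard errors) rejects the null."""
    if two_tailed:
        return abs(difference_in_se) >= two_tailed_cutoff
    return difference_in_se >= one_tailed_cutoff


print(round(one_tailed_cutoff, 2), round(two_tailed_cutoff, 2))
print(rejects_null(1.8, two_tailed=False), rejects_null(1.8, two_tailed=True))
```

A difference of 1.8 standard errors rejects the null in the one-tailed test but not in the two-tailed test, which is why the choice between them should be made before looking at the data.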


* As a matter of semantics, we have not proved the null hypothesis to be true (that substance abuse 
treatment has no effect). It may turn out to be extremely effective for another group of prisoners. Or 
perhaps many more of the prisoners in this treatment group would have been rearrested if they had 
not received the treatment. In any case, on the basis of the data collected, we have merely failed to
reject our null hypothesis. There is a similar distinction between “failing to reject” a null hypothesis
and accepting that null hypothesis. Just because one study could not disprove that substance abuse 
treatment has no effect (yes, a double negative) does not mean that one must accept that substance 
abuse treatment is useless. There is a meaningful statistical distinction here. That said, research is 
often designed to inform policy, and prison officials, who have to decide where to allocate resources, 
might reasonably accept the position that substance treatment is ineffective until they are persuaded 
otherwise. Here, as in so many other areas of statistics, judgment matters. 

* This example is inspired by real events. Obviously many details have been changed for national 
security reasons. I can neither confirm nor deny my own involvement. 

* To be precise, 95 percent of all sample means will lie within 1.96 standard errors above or below 
the population mean. 

* There are two possible alternative hypotheses. One is that male professional basketball players are 
taller than the overall male population. The other is merely that male professional basketball players 
have a different mean height than the overall male population (leaving open the possibility that male 
basketball players may actually be shorter than other men). This distinction has a small impact when 
one performs significance tests and calculates p-values. It is explained in more advanced texts and is 
not important to our general discussion here. 

* I will admit that I did once tear a statistics book in half out of frustration. 

* Another answer is to attempt to replicate the results in additional studies. 


CHAPTER 10 


Polling 

How we know that 64 percent of 
Americans support the death penalty 
(with a sampling error ± 3 percent) 


In late 2011, the New York Times ran a front-page story reporting that “a
deep sense of anxiety and doubt about the future hangs over the nation.” 1
The story delved into the psyche of America, offering insights into public 
opinion on topics ranging from the performance of the Obama 
administration to the distribution of wealth. Here is a snapshot of what 
Americans had to say in the fall of 2011: 

• A shocking 89 percent of Americans said that they distrust 
government to do the right thing, the highest level of distrust ever 
recorded. 

• Two-thirds of the public said that wealth should be more evenly 
distributed in the country. 

• Forty-three percent of Americans said that they generally agreed with 
the views of the Occupy Wall Street movement, an amorphous protest 
movement that began near Wall Street in New York and was 
spreading to other cities around the country.* A slightly higher 
percentage, 46 percent, said that the views of the people involved in 
the Occupy Wall Street movement “generally reflect the views of 
most Americans.” 

• Forty-six percent of Americans approved of Barack Obama’s handling 
of his job as president—and an identical 46 percent disapproved of 
his job performance. 

• A mere 9 percent of the public approved of the way Congress was 
handling its job. 


• Even though the presidential primaries would begin in only two 
months, roughly 80 percent of Republican primary voters said “it was 
still too early to tell whom they will support.” 

These are fascinating figures that provided meaningful insight into 
American opinions one year in advance of a presidential race. Still, one 
might reasonably ask, How do we know all this? How can we draw such
sweeping conclusions about the attitudes of hundreds of millions of adults? 
And how do we know whether these sweeping conclusions are accurate? 

The answer, of course, is that we conduct polls. Or in the example above, 
the New York Times and CBS News can do a poll. (The fact that two 
competing news organizations would collaborate on a project like this is the 
first clue that conducting a methodologically sound national poll is not 
cheap.) I have no doubt that you are familiar with polling results. It may be 
less obvious that the methodology of polling is just one more form of 
statistical inference. A poll (or survey) is an inference about the opinions of 
some population that is based on the views expressed by some sample 
drawn from that population. 

The power of polling stems from the same source as our previous 
sampling examples: the central limit theorem. If we take a large, 
representative sample of American voters (or any other group), we can 
reasonably assume that our sample will look a lot like the population from 
which it is drawn. If exactly half of American adults disapprove of gay 
marriage, then our best guess about the attitudes of a representative sample 
of 1,000 Americans is that about half of them will disapprove of gay 
marriage. 

Conversely—and more important from the standpoint of polling—if we 
have a representative sample of 1,000 Americans who feel a certain way, 
such as the 46 percent who disapprove of President Obama’s job 
performance, then we can infer from that sample that the general population 
is likely to feel the same way. In fact, we can calculate the probability that 
our sample results will deviate wildly from the true attitudes of the 
population. When you read that a poll has a “margin of error” of ± 3 
percent, this is really just the same kind of 95 percent confidence interval 
that we calculated in the last chapter. Our “95 percent confidence” means 
that if we conducted 100 different polls on samples drawn from the same
population, we would expect the answers we get from our sample in 95 of
those polls to be within 3 percentage points in one direction or the other of
the population’s true sentiment. In the context of the job approval question
in the New York Times/CBS poll, we can be 95 percent confident that the
true proportion of all Americans who disapprove of President Obama’s job
performance lies in the range of 46 percent ± 3 percent, or between 43 percent and
49 percent. If you read the small print on the New York Times/CBS poll (as I
urge you to do), that’s pretty much what it says: “In theory, in 19 cases out 
of 20, overall results based on such samples will differ by no more than 3 
percentage points in either direction from what would have been obtained 
by seeking to interview all American adults.” 
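The “19 cases out of 20” claim can be checked with a small simulation. The sketch below is only an illustration under assumed numbers: it supposes the true disapproval rate is exactly 46 percent, draws 2,000 imaginary polls of 1,000 respondents each, and counts how often the poll result lands within 3 percentage points of the truth. The function name and the seed are made up for this example.

```python
import random

def simulate_coverage(true_p, n, margin, polls, seed=0):
    """Fraction of simulated polls whose sample proportion lands within
    `margin` of the true proportion."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(polls):
        # Each respondent independently holds the view with probability true_p.
        yes = sum(rng.random() < true_p for _ in range(n))
        if abs(yes / n - true_p) <= margin:
            hits += 1
    return hits / polls

coverage = simulate_coverage(true_p=0.46, n=1000, margin=0.03, polls=2000)
print(f"{coverage:.1%} of simulated polls fell within 3 points of the truth")
```

The printed share should come out close to 95 percent, though the exact figure depends on the seed and on the assumed sample size.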

One fundamental difference between a poll and other forms of sampling is 
that the sample statistic we care about will be not a mean (e.g., 187 pounds) 
but rather a percentage or proportion (e.g., 47 percent of voters, or .47). In 
other respects, the process is identical. When we have a large, 
representative sample (the poll), we would expect the proportion of 
respondents who feel a certain way in the sample (e.g., the 9 percent who 
think Congress is doing a good job) to be roughly equal to the proportion of 
all Americans who feel that way. This is no different from assuming that the 
mean weight for a sample of 1,000 American men should be roughly equal 
to the mean weight for all American men. Still, we expect some variation in 
the percentage who approve of Congress from sample to sample, just as we 
would expect some variation in mean weight as we took different random 
samples of 1,000 men. If the New York Times and CBS had conducted a 
second poll—asking the same questions to a new sample of 1,000 U.S. 
adults—it is highly unlikely that the results of the second poll would have 
been identical to the results of the first. On the other hand, we should not 
expect the answers from our second sample to diverge widely from the 
answers given by the first. (To return to a metaphor used earlier, if you taste 
a spoonful of soup, stir the pot, and then taste again, the two spoonfuls are 
going to taste similar.) The standard error is what tells us how much 
dispersion we can expect in our results from sample to sample, which in 
this case means poll to poll. 

The formula for calculating a standard error for a percentage or 
proportion is slightly different from the formula introduced earlier; the
intuition is exactly the same. For any properly drawn random sample, the
standard error is equal to $\sqrt{p(1 - p)/n}$, where p is the proportion of respondents
expressing a particular view, (1 - p) is the proportion of respondents 
expressing a different view, and n is the total number of respondents in the 
sample. You should see that the standard error will fall as the sample size 
gets larger, since n is in the denominator. The standard error also tends to be 
smaller when p and (1 - p) are far apart. For example, the standard error 
will be smaller for a poll in which 95 percent of respondents express a 
certain view than for a poll in which opinions tend to split 50-50. This is 
just math, since (.05)(.95) = .0475, while (.5)(.5) = .25; a smaller number in
the numerator of the formula leads to a smaller standard error. 
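For readers who like to see the formula at work, here is a minimal sketch of the standard error of a proportion in Python. The function name is just a label chosen for this example; the sample sizes and proportions are illustrative.

```python
import math

def standard_error(p, n):
    """Standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

# Larger samples shrink the standard error (p held at .5).
for n in (500, 1000, 2000):
    print(n, round(standard_error(0.5, n), 4))

# A lopsided 95-5 split has a smaller standard error than a 50-50 split
# when the sample size is the same (n = 1,000 here).
print(round(standard_error(0.95, 1000), 4), round(standard_error(0.50, 1000), 4))
```

Quadrupling the sample size cuts the standard error in half, because n sits under the square root.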

As an example, assume that a simple “exit poll” of 500 representative 
voters on election day finds that 53 percent voted for the Republican 
candidate; 45 percent of voters voted for the Democrat; and 2 percent 
supported a third-party candidate. If we use the Republican candidate as our 
proportion of interest, the standard error for this exit poll would be 

$\sqrt{(.53)(1 - .53)/500} = \sqrt{(.53)(.47)/500} \approx \sqrt{.25/500} = \sqrt{.0005} = .02236$.

For simplicity, we’ll round the standard error for this exit poll to .02. So 
far, that’s just a number. Let’s work through why that number matters. 
Assume the polls have just closed, and you work for a television network 
that is keen to declare a winner in the race before the full results are 
available. You are now the official network data cruncher (having read two- 
thirds of this book), and your producer wants to know whether it is possible 
to “call the race” on the basis of this exit poll. 

You explain that the answer depends on how confident the network 
people would like to be in the announcement—or, more specifically, what 
risk they are willing to take that they will get it wrong. Remember, the 
standard error gives us a sense of how often we can expect our sample 
proportion (the exit poll) to lie reasonably close to the true population 
proportion (the election outcome). We know that roughly 68 percent of the 
time we can expect the sample proportion—the 53 percent of voters who 
said they voted for the Republican in this case—to be within one standard 
error of the true final tally. As a result, you tell your producer “with 68 
percent confidence” that your sample, which shows the Republican getting 
53 percent of the vote ± 2 percent, or between 51 and 55 percent, has 
captured the Republican candidate’s true tally. Meanwhile, the same exit
poll shows that the Democratic candidate has received 45 percent of the
vote. If we assume that the vote tally for the Democratic candidate has the 
same standard error (a simplification that I’ll explain in a minute), we can 
say with 68 percent confidence that the exit poll sample, which shows the 
Democrat with 45 percent of the vote ± 2 percent, or between 43 and 47 
percent, contains the Democrat’s true tally. According to this calculation, 
the Republican is the winner. 

The graphics department rushes to do a fancy three-dimensional image 
that you can flash on the screen for your viewers: 

Republican 53% 

Democrat 45% 

Independent 2% 

(Margin of Error 2%) 

At first, your producer is impressed and excited, in large part because the 
above graphic is 3-D, multicolored, and able to spin around on the screen. 
However, when you explain that roughly 68 times out of 100 your exit poll 
results will be within one standard error of the true election outcome, your 
producer, who has twice been sent by the courts to anger management 
programs, points out the obvious math—32 times out of 100 your exit poll 
will not be within one standard error of the true election outcome. Then 
what? 

You explain that there are two possibilities: (1) the Republican candidate 
could have received even more votes than your poll predicted, in which case 
you still will have called the election correctly. Or (2) there is a reasonably 
high probability that the Democratic candidate has received far more votes 
than your poll has reported, in which case your fancy 3-D, multicolored, 
spinning graphic will have reported the wrong winner. 

Your producer hurls a coffee mug across the room and uses several 
phrases that violate her probation. She screams, “How can we be [deleted] 
sure that we have the right [deleted] result?” 

Ever the statistics guru, you point out that you cannot be certain of any 
result until all of the votes are counted. However, you can offer a 95 percent 
confidence interval instead. In this case, your spinning, 3-D, multicolored 
graphic will be wrong, on average, only 5 times out of 100. 



Your producer lights a cigarette and seems to relax. You decide not to 
mention the ban on smoking in the workplace, as that turned out 
disastrously last time. However, you do share some bad news. The only 
way the station can be more confident of its polling results is by broadening 
the “margin of error.” And when you do that, there is no longer a clear 
winner in the election. You show your boss the new fancy graphic: 

Republican 53% 

Democrat 45% 

Independent 2% 

(Margin of Error 4%) 

We know from the central limit theorem that roughly 95 percent of
sample proportions will lie within two standard errors of the true
population proportion (two standard errors is 4 percentage points in this case). Therefore, if we want to
be more confident of our polling results, we have to be less ambitious in 
what we are predicting. As the above graphic illustrates (without the 3-D 
and color), at the 95 percent confidence level, the television station can 
announce that the Republican candidate has earned 53 percent of the vote ± 
4 percent, or between 49 and 57 percent of the votes cast. Meanwhile, the 
Democratic candidate has earned 45 percent ± 4 percent, or between 41 and 
49 percent of the votes cast. 
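The trade-off between confidence and precision can be laid out in a few lines of code. This is a sketch, not a forecasting tool: it works in whole percentage points, uses the rounded 2-point standard error from the 500-voter exit poll for both candidates, and the helper names are invented for the illustration.

```python
def interval(share, se, z):
    """Return (low, high) for share +/- z standard errors."""
    return (share - z * se, share + z * se)

def overlaps(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

SE_POINTS = 2  # rounded standard error, in percentage points

for z, label in ((1, "68 percent"), (2, "95 percent")):
    rep = interval(53, SE_POINTS, z)
    dem = interval(45, SE_POINTS, z)
    verdict = "intervals overlap" if overlaps(rep, dem) else "no overlap"
    print(f"{label} confidence: Republican {rep}, Democrat {dem}: {verdict}")
```

At one standard error the two ranges stay apart; at two standard errors they touch at 49 percent.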

And, yes, now you have a new problem. At the 95 percent confidence 
level, you cannot reject the possibility that the two candidates may be tied 
with 49 percent of the vote each. This is an inevitable trade-off; the only 
way to become more certain that your polling results will be consistent with 
the election outcome without new data is to become more timid in your 
prediction. Think about a nonstatistical context. Suppose you tell a friend 
that you are “pretty sure” that Thomas Jefferson was the third or fourth 
president. How can you become more confident of your historical 
knowledge? By being less specific. You are “absolutely positive” that 
Thomas Jefferson was one of the first five presidents. 



Your producer tells you to order a pizza and prepare to stay at work all 
night. At that point, statistical good fortune shines upon you. The results of 
a second exit poll come across your desk with a sample of 2,000 voters. 
These results show the following: Republican (52 percent); Democrat (45 
percent); Independent (3 percent). Your producer is now thoroughly 
exasperated, since this poll suggests that the gap between the candidates has 
narrowed, making it even harder for you to call the race in a timely manner. 
But wait! You point out (heroically) that the sample size (2,000) is four 
times as large as the sample in the first poll. As a result, the standard error 
will shrink significantly. The new standard error for the Republican 
candidate is $\sqrt{.52(.48)/2{,}000}$, which is about .01.

If your producer is still comfortable with a 95 percent confidence level, 
you can declare the Republican candidate the winner. With your new .01 
standard error, the 95 percent confidence intervals for the candidates are the 
following: Republican: 52 ± 2, or between 50 and 54 percent of the votes 
cast; Democrat: 45 ± 2, or between 43 and 47 percent of the votes cast. 
There is no longer any overlap between the two confidence intervals. You 
can predict on air that the Republican candidate is the winner; more than 95 
times out of 100 you will be correct. 

But this case is even better than that. The central limit theorem tells us 
that 99.7 percent of the time a sample proportion will be within three 
standard errors of the true population proportion. In this election example, 
our 99.7 percent confidence intervals for the two candidates are the 
following: Republican, 52 ± 3 percent, or between 49 and 55 percent; 
Democrat, 45 ± 3 percent, or between 42 and 48 percent. If you report that
the Republican candidate has won, there is only a tiny chance that you and 
your producer will be fired, thanks to your new 2,000-voter sample. 
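Here is a rough check of the second exit poll that skips the rounding. The text rounds the standard error to .01; the sketch below instead gives each candidate his or her own unrounded standard error, which is a slightly stricter test. The helper names are assumptions made for this example.

```python
import math

def standard_error(p, n):
    """Standard error of a sample proportion."""
    return math.sqrt(p * (1 - p) / n)

def interval(p, n, z):
    """p +/- z standard errors, using p's own standard error."""
    se = standard_error(p, n)
    return (p - z * se, p + z * se)

N = 2000  # respondents in the second exit poll

for z, label in ((2, "95%"), (3, "99.7%")):
    rep_low, rep_high = interval(0.52, N, z)
    dem_low, dem_high = interval(0.45, N, z)
    gap = rep_low - dem_high  # positive means the intervals do not overlap
    print(f"{label}: Republican {rep_low:.3f}-{rep_high:.3f}, "
          f"Democrat {dem_low:.3f}-{dem_high:.3f}, gap {gap:+.3f}")
```

Even with the unrounded standard errors, the bottom of the Republican’s range stays above the top of the Democrat’s range at both confidence levels.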

You should see that a bigger sample makes for a shrinking standard error, 
which is how large national polls can end up with shockingly accurate 
results. On the other hand, smaller samples obviously make for larger 
standard errors and therefore a larger confidence interval (or “margin of 
sampling error,” to use the polling lingo). The fine print in the New York 
Times/CBS poll points out that the margin of error for the questions about
the Republican primary is 5 percentage points, compared with 3 percentage
points for other questions in the poll. Only self-described Republican
primary and caucus voters were asked these questions, so the sample size
for this subgroup of questions fell to 455 (compared with 1,650 adults for 
the balance of the poll). 

As usual, I’ve simplified lots of things in this chapter. You might have
recognized that in my election example above, the Republican and 
Democratic candidates should each have their own standard error. Think 
again about the formula: $SE = \sqrt{p(1 - p)/n}$. The size of the sample, n, is the same
for both candidates, but p and (1 - p) will be slightly different. In the 
second exit poll (with the 2,000-voter sample), the standard error for the 
Republican is $\sqrt{.52(.48)/2{,}000} = .01117$; for the Democrat, $SE = \sqrt{.45(.55)/2{,}000} = .01112$.
Of course, for all intents and purposes, those two numbers are the same. For 
that reason, I have adopted a common convention, which is to take the 
higher standard error of the two and use that for all of the candidates. If 
anything, this introduces a little extra caution into our confidence intervals. 

Many national polls that ask multiple questions will go one step further. 
In the case of the New York Times/CBS poll, the standard error should
technically be different for each question, depending on the response. For 
example, the standard error for the finding that 9 percent of the public 
approves of the way Congress is handling its job should be lower than the 
standard error for the question finding that 46 percent of the public 
approves of the way President Obama has handled his job, since .09 ×
(.91) is less than .46 × (.54): .0819 versus .2484. (The intuition behind
this formula is explained in a chapter appendix.) 

Since it would be both confusing and inconvenient to have a different 
standard error for each question, polls of this nature will typically assume 
that the sample proportion for each question is .5 (or 50 percent)— 
generating the largest possible standard error for any given sample size— 
and then adopt that standard error to calculate the margin of sampling error 
for the entire poll.* 
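The worst-case convention is easy to sketch. The code below assumes p = .5 and uses two standard errors for the 95 percent level, as this chapter does; the function name is made up for the example, and the sample sizes are the ones reported for the New York Times/CBS poll.

```python
import math

def worst_case_margin(n, z=2):
    """Margin of sampling error at z standard errors, assuming p = .5,
    the proportion that makes p * (1 - p) as large as it can be."""
    return z * math.sqrt(0.5 * 0.5 / n)

for label, n in (("all adults", 1650), ("Republican primary voters", 455)):
    print(f"{label} (n = {n}): +/- {worst_case_margin(n):.1%}")
```

The results land a little below the 3 and 5 points that the poll reports. The published figures may reflect rounding or adjustments that this simple formula leaves out; that is a guess, not something the fine print states.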

When done properly, polls are uncanny instruments. According to Frank 
Newport, editor in chief of the Gallup Organization, a poll of 1,000 people 
can offer meaningful and accurate insights into the attitudes of the entire 
country. Statistically speaking, he’s right. But to get those meaningful and 
accurate results, we have to conduct a proper poll and then interpret the 
results correctly, both of which are much easier said than done. Bad polling
results do not typically stem from bad math when calculating the standard
errors. Bad polling results typically stem from a biased sample, or bad 
questions, or both. The mantra “garbage in, garbage out” applies in spades 
when it comes to sampling public opinion. Below are the key 
methodological questions one ought to ask when conducting a poll, or when 
reviewing the work of others. 

Is this an accurate sample of the population whose opinions we are trying 
to measure? Many common data-related challenges were discussed in 
Chapter 7. Nonetheless, I will point out once again the danger of selection 
bias, particularly self-selection. Any poll that depends on individuals who 
select into the sample, such as a radio call-in show or a voluntary Internet 
survey, will capture only the views of those who make the effort to voice 
their opinions. These are likely to be the people who feel particularly 
strongly about an issue, or those who happen to have a lot of free time on 
their hands. Neither of these groups is likely to be representative of the 
public at large. I once appeared as a guest on a call-in radio show. One of 
the callers to the program declared emphatically on air that my views were 
“so wrong” that he had pulled his car off the highway and found a pay 
phone in order to call the show and register his dissent. I’d like to think that 
the listeners who did not pull their cars off the highway to call the show felt 
differently. 

Any method of gathering opinion that systematically excludes some 
segment of the population is also prone to bias. For example, mobile phones 
have introduced a host of new methodological complexities. Professional 
polling organizations go to great lengths to poll a representative sample of 
the relevant population. The New York Times/CBS poll was based on
telephone interviews conducted over six days with 1,650 adults, 1,475 of 
whom said they were registered to vote. 

I can only guess at the rest of the methodology, but most professional 
polls use some variation on the following techniques. To ensure that the 
adults who pick up the phone are representative of the population, the 
process starts with probability—a variation on picking marbles out of an 
urn. A computer randomly selects a set of landline telephone exchanges. 
(An exchange is an area code plus the first three digits of a phone number.) 
By randomly choosing from the 69,000 residential exchanges in the
country, each in proportion to its share of all telephone numbers, the survey
is likely to get a generally representative geographic distribution of the 
population. As the small print explains, “The exchanges were chosen so as 
to ensure that each region of the country was represented in proportion to its 
share of all telephone numbers.” For each exchange selected, the computer 
added four random digits. As a result, both listed and unlisted numbers will 
end up on the final list of households to be called. The survey also included 
a “random dialing of cell phone numbers.” 
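The exchange-plus-four-digits procedure can be sketched in code. Everything specific here is hypothetical: the three exchanges, their shares, and the function name are invented for illustration, and a real polling firm’s procedure surely has more moving parts.

```python
import random

# Hypothetical exchanges (area code plus first three digits), each mapped to
# its assumed share of all telephone numbers. A real list would be far longer.
EXCHANGES = {"312-555": 0.5, "701-555": 0.2, "603-555": 0.3}

def draw_numbers(exchanges, count, seed=None):
    """Pick exchanges in proportion to their share, then add four random digits."""
    rng = random.Random(seed)
    prefixes = list(exchanges)
    weights = list(exchanges.values())
    numbers = []
    for prefix in rng.choices(prefixes, weights=weights, k=count):
        # Four random digits can reach listed and unlisted numbers alike.
        suffix = f"{rng.randrange(10000):04d}"
        numbers.append(f"{prefix}-{suffix}")
    return numbers

sample = draw_numbers(EXCHANGES, count=5, seed=1)
print(sample)
```

Because the last four digits are random rather than taken from a directory, an unlisted number is as likely to be dialed as a listed one in the same exchange.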

For each number dialed, one adult is designated to be the respondent by a 
“random procedure,” such as asking for the youngest adult who is currently 
at home. This process has been refined to produce a sample of respondents 
that resembles the adult population in terms of age and gender. Most 
important, the interviewer will attempt to make multiple calls at different 
times of day and evening in order to reach each selected phone number. 
These repeated attempts—as many as ten or twelve calls to the same 
number—are an important part of getting an unbiased sample. Obviously it 
would be cheaper and easier to make random calls to different numbers 
until a sufficiently large sample of adults have picked up the phone and 
answered the relevant questions. However, such a sample would be biased 
toward people who are likely to be home and to answer the phone: the 
unemployed, the elderly, and so on. That’s just fine as long as you’re 
willing to qualify your poll results in the following way: President Obama’s 
approval rating stands at 46 percent among the unemployed, old people, and 
others who are eager to answer random phone calls. 

One indicator of a poll’s validity is the response rate: What proportion of 
respondents who were chosen to be contacted ultimately completed the poll 
or survey? A low response rate can be a warning sign for potential sampling 
bias. The more people there are who opt not to answer the poll, or who just 
can’t be reached, the greater the possibility that this large group is different 
in some material way from those who did answer the questions. Pollsters 
can test for “nonresponse bias” by analyzing available data on the 
respondents whom they were not able to contact. Do they live in a 
particular area? Are they refusing to answer for a particular reason? Are 
they more likely to be from a particular racial, ethnic, or income group? 
This kind of analysis can determine whether or not a low response rate will 
affect the results of the poll. 



Have the questions been posed in a way that elicits accurate information 
on the topic of interest? Soliciting public opinion requires more nuance 
than measuring test scores or putting respondents on a scale to determine 
their weight. Survey results can be extremely sensitive to the way a 
question is asked. Let’s take a seemingly simple example: What proportion 
of Americans support capital punishment? As the chapter title suggests, a 
solid and consistent majority of Americans approve of the death penalty. 
According to Gallup, in every year since 2002 over 60 percent of 
Americans have said they favor the death penalty for a person convicted of 
murder. The percentage of Americans supporting capital punishment has 
fluctuated in a relatively narrow range from a high of 70 percent in 2003 to 
a low of 64 percent at several different points. The polling data are clear: 
Americans support the death penalty by a wide margin. 

Or not. American support for the death penalty plummets when life 
imprisonment without parole is offered as an alternative. A 2006 Gallup 
poll found that only 47 percent of Americans judged the death penalty as 
the appropriate penalty for murder, as opposed to 48 percent who preferred 
life imprisonment. 2 That’s not just a statistical factoid to amuse guests at a
dinner party; it means that there is no longer majority support for capital 
punishment when life in prison without parole is a credible alternative. 
When we solicit public opinion, the phrasing of the question and the choice 
of language can matter enormously. 

Politicians will often exploit this phenomenon by using polls and focus 
groups to test “words that work.” For example, voters are more inclined to 
support “tax relief” than “tax cuts,” even though the two phrases describe 
the same thing. Similarly, voters are less concerned about “climate change” 
than they are about “global warming,” even though global warming is a 
form of climate change. Obviously politicians are trying to manipulate 
voters’ responses by choosing nonneutral words. If pollsters are to be 
considered honest brokers generating legitimate results, they must guard 
against language that is prone to affect the accuracy of the information 
collected. Similarly, if answers are to be compared over time—such as how 
consumers feel about the economy today compared with how they felt a 
year ago—then the questions eliciting that information over time must be 
the same, or very similar. 


Polling organizations like Gallup will often conduct “split sample 
testing,” in which variations of a question are tested on different samples to 
gauge how small changes in wording affect respondents’ answers. To 
experts like Gallup’s Frank Newport, the answers to every question present 
meaningful data, even when those answers may appear to be inconsistent. 3 
The fact that American attitudes toward capital punishment change 
dramatically when life without parole is offered as an option tells us 
something important. The key point, says Newport, is to view any polling 
result in context. No single question or poll can capture the full depth of 
public opinion on a complex issue. 

Are respondents telling the truth? Polling is like Internet dating: There is a 
little wiggle room in the veracity of information provided. We know that 
people shade the truth, particularly when the questions asked are 
embarrassing or sensitive. Respondents may overstate their income, or 
inflate the number of times they have sex in a typical month. They may not 
admit that they do not vote. They may hesitate to express views that are 
unpopular or socially unacceptable. For all these reasons, even the most 
carefully designed poll is dependent on the integrity of the respondents’ 
answers. 

Election polls depend crucially on sorting those who will vote on 
Election Day from those who will not. (If we are trying to gauge the likely 
winner of an election, we do not care about the opinions of anyone who is 
not going to vote.) Individuals often say they are going to vote because they 
think that is what pollsters want to hear. Studies that have compared self- 
reported voting behavior to election records consistently find that one- 
quarter to one-third of respondents say they voted when in fact they did 
not. 4 One way to minimize this potential bias is to ask whether a respondent 
voted in the last election, or in the last several elections. Respondents who 
have voted consistently in the past are most likely to vote in the future. 
Similarly, if there are concerns that respondents may be hesitant to express 
a socially unacceptable answer, such as a negative view of a racial or ethnic 
group, the question may be phrased in a more subtle way, such as by asking 
“if people you know” hold such an opinion. 

One of the most sensitive surveys of all time was a study conducted by 
the National Opinion Research Center (NORC) at the University of
Chicago called “The Social Organization of Sexuality: Sexual Practices in
the United States,” which quickly became known as the “Sex Study.” 5 The 
formal description of the study included phrases like “the organization of 
the behaviors constituting sexual transactions” and “sexual partnering and 
behavior across the lifecourse.” (I’m not even sure what a “lifecourse” is; 
spell-check says it’s not a word.) I’m oversimplifying when I write that the 
survey sought to document who is doing what to whom—and how often. 
The purpose of the study, which was published in 1995, was not merely to 
enlighten us all about the sexual behavior of our neighbors (though that was 
part of it) but also to gauge how sexual behavior in the United States was 
likely to affect the spread of HIV/AIDS. 

If Americans are hesitant to admit when they don’t vote, you can imagine 
how keen they are to describe their sexual behavior, particularly when it 
may involve illicit activity, infidelity, or just really weird stuff. The Sex 
Study methodology was impressive. The research was based on ninety- 
minute interviews of 3,342 adults chosen to be representative of the U.S. 
adult population. Nearly 80 percent of the selected respondents completed 
the survey, leading the authors to conclude that the findings are an accurate 
reporting of America’s sexual behavior (or at least what we were doing in 
1995). 

Since you’ve suffered through a chapter on polling methodology, and 
now nearly an entire book on statistics, you are entitled to a glimpse at what 
they found (none of which is particularly shocking). As one reviewer noted, 
“There is much less sexual behavior going on than we might think.” 6 

• People generally have sex with others like themselves. Ninety percent 
of couples shared the same race, religion, social class, and general age 
group. 

• The typical respondent was engaging in sexual activity “a few times a 
month,” though there was wide variation. The number of sexual 
partners since age eighteen ranged from zero to over 1,000. 

• Roughly 5 percent of men and 4 percent of women reported some 
sexual activity with a same-gender partner. 

• Eighty percent of respondents had either one sexual partner in the 
previous year or none at all. 


• Respondents with one sexual partner were happier than those with 
none or with multiple partners. 7 

• A quarter of the married men and 10 percent of the married women 
reported having extramarital sexual activity. 

• Most people are doing it the old-fashioned way: vaginal intercourse 
was the most appealing sexual activity for men and women. 

One review of the Sex Study made a simple but potent critique: The
conclusion that the survey accurately represents the sexual practices of
adults in the United States “assumes that respondents to the NORC survey
both mirrored the population from which they were drawn and gave 
accurate answers.” 8 That sentence could also be the takeaway for this entire 
chapter. At first glance, the most suspicious thing about polling is that the 
opinions of so few can tell us about the opinions of so many. But that’s the 
easy part. One of the most basic statistical principles is that a proper sample 
will look like the population from which it is drawn. The real challenge of 
polling is twofold: finding and reaching that proper sample; and eliciting 
information from that representative group in a way that accurately reflects 
what its members believe. 

APPENDIX TO CHAPTER 10 

Why is the standard error larger when 
p (and 1-p) are close to 50 percent? 

Here is the intuition for why the standard error is highest when the 
proportion answering a particular way (p) is near 50 percent (which, just as 
a matter of math, means that 1-p will also be close to 50 percent). Let’s 
imagine that you are conducting two polls in North Dakota. The first poll is 
designed to measure the mix of Republicans and Democrats in the state. 
Assume that the true political mix in the North Dakota population is evenly 
split 50-50 but that your poll finds 60 percent Republicans and 40 percent 
Democrats. Your results are off by 10 percentage points, which is a large 
margin. Yet, you have generated this large error without making an 
unimaginably large data-collecting mistake. You have overcounted the 
Republicans relative to their true incidence in the population by 20 percent
[(60 - 50)/50]. And in so doing, you have also undercounted the Democrats
by 20 percent [(40 - 50)/50]. That could happen, even with a decent polling 
methodology. 

Your second poll is designed to measure the fraction of Native 
Americans in the North Dakota population. Assume that the true proportion 
of Native Americans in North Dakota is 10 percent while non-Native 
Americans make up 90 percent of the state population. Now let’s discuss 
how bad your data collecting would have to be in order to produce a poll 
with a sampling error of 10 percentage points. This could happen in two 
ways. First, you could find that 0 percent of the population is Native 
American and 100 percent is non-Native American. Or you could find that 
20 percent of the population is Native American and 80 percent is non- 
Native American. In one case you have missed all of the Native Americans; 
and in the other, you have found double their true incidence in the 
population. These are really bad sampling mistakes. In both cases, your 
estimate is off by 100 percent: either [(0 - 10)/10] or [(20 - 10)/10]. And if 
you missed just 20 percent of the Native Americans—the same degree of 
error that you had in the Republican-Democrat poll—your results would 
find 8 percent Native Americans and 92 percent non-Native Americans, 
which is only 2 percentage points from the true split in the population. 

When p and 1 - p are close to 50 percent, relatively small sampling 
errors are magnified into large absolute errors in the outcome of the poll. 

When either p or 1 - p is closer to zero, the opposite is true. Even 
relatively large sampling errors produce small absolute errors in the 
outcome of the poll. 

The same 20 percent sampling error distorted the outcome of the 
Democrat-Republican poll by 10 percentage points while distorting the 
Native American poll by only 2 percentage points. Since the standard error 
in a poll is measured in absolute terms (e.g., ± 5 percent), the formula 
recognizes that this error is likely to be largest when p and 1 - p are close to 
50 percent. 
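
The same intuition can be checked numerically against the usual standard error formula for a sample proportion, sqrt(p(1 - p)/n). The short Python sketch below is only an illustration: the sample size of 1,000 and the function name are assumptions made for this example, not figures from the polls discussed above.

```python
import math

def poll_standard_error(p, n):
    """Standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)

n = 1000  # hypothetical sample size, chosen only for illustration

# The standard error peaks at p = 0.50 and shrinks as p moves toward 0 or 1.
for p in (0.10, 0.30, 0.50, 0.70, 0.90):
    print(f"p = {p:.2f}  standard error = {poll_standard_error(p, n):.4f}")

# The same 20 percent relative miss, expressed in percentage points.
for true_share in (50, 10):
    absolute_miss = 0.20 * true_share
    print(f"true share {true_share}%: a 20% relative miss is "
          f"{absolute_miss:.0f} percentage points")
```

The second loop reproduces the arithmetic in the two North Dakota polls: the same relative miss costs 10 percentage points when the true share is 50 percent but only 2 when it is 10 percent.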


* According to its website, “Occupy Wall Street is a people-powered movement that began on 
September 17, 2011, in Liberty Square in Manhattan’s Financial District, and has spread to over 100 
cities in the United States and actions in over 1,500 cities globally. Occupy Wall Street is fighting 
back against the corrosive power of major banks and multinational corporations over the democratic 
process, and the role of Wall Street in creating an economic collapse that has caused the greatest 
recession in generations. The movement is inspired by popular uprisings in Egypt and Tunisia, and 
aims to expose how the richest 1% of people are writing the rules of an unfair global economy that is 
foreclosing on our future.” 

* We would expect the Republican candidate’s true vote tally to be outside of the confidence interval 
of the poll roughly 5 percent of the time. In those cases, his true vote tally would be less than 50 
percent or greater than 54 percent. However, if he gets more than 54 percent of the vote, your station 
has not made an error in declaring him the winner. (You’ve only understated the margin of his 
victory.) As a result, the probability that your poll will cause you to mistakenly declare the 
Republican candidate the winner is only 2.5 percent. 

* The formula for calculating the standard error of a poll that I have introduced here assumes that the 
poll is conducted on a random sample of the population. Sophisticated polling organizations may 
deviate from this sampling method, in which case the formula for calculating the standard error will 
also change slightly. The basic methodology remains the same, however. 


CHAPTER 11 


Regression Analysis 
The miracle elixir 


Can stress on the job kill you? Yes. There is compelling evidence that 
rigors on the job can lead to premature death, particularly of heart disease. 
But it’s not the kind of stress you are probably imagining. CEOs, who must 
routinely make massively important decisions that determine the fate of 
their companies, are at significantly less risk than their secretaries, who 
dutifully answer the phone and perform other tasks as instructed. How can 
that possibly make sense? It turns out that the most dangerous kind of job 
stress stems from having “low control” over one’s responsibilities. Several 
studies of thousands of British civil servants (the Whitehall studies) have 
found that workers who have little control over their jobs—meaning they 
have minimal say over what tasks are performed or how those tasks are 
carried out—have a significantly higher mortality rate than other workers in 
the civil service with more decision-making authority. According to this 
research, it is not the stress associated with major responsibilities that will 
kill you; it is the stress associated with being told what to do while having 
little say in how or when it gets done. 

This is not a chapter about job stress, heart disease, or British civil 
servants. The relevant question regarding the Whitehall studies (and others 
like them) is how researchers can possibly come to such a conclusion. 
Clearly this cannot be a randomized experiment. We cannot arbitrarily 
assign human beings to different jobs, force them to work in those jobs for 
many years, and then measure who dies at the highest rate. (Ethical 
concerns aside, we would presumably wreak havoc on the British civil 
service by randomly distributing jobs.) Instead, researchers have collected 
detailed longitudinal data on thousands of individuals in the British civil 
service; these data can be analyzed to identify meaningful associations, 
such as the connection between “low control” jobs and coronary heart 
disease. 

A simple association is not enough to conclude that certain kinds of jobs 
are bad for your health. If we merely observe that low-ranking workers in 
the British civil service hierarchy have higher rates of heart disease, our 
results would be confounded by other factors. For example, we would 
expect low-level workers to have less education than senior officials in the 
bureaucracy. They may be more likely to smoke (perhaps because of their 
job frustration). They may have had less healthy childhoods, which 
diminished their job prospects. Or their lower pay may limit their access to 
health care. And so on. The point is that any study simply comparing health 
outcomes across a large group of British workers—or across any other large 
group—will not really tell us much. Other sources of variation in the data 
are likely to obscure the relationship that we care about. Is “low job 
control” really causing heart disease? Or is it some combination of other 
factors that happen to be shared by people with low job control, in which 
case we may be completely missing the real public health threat? 

Regression analysis is the statistical tool that helps us deal with this 
challenge. Specifically, regression analysis allows us to quantify the 
relationship between a particular variable and an outcome that we care 
about while controlling for other factors. In other words, we can isolate the 
effect of one variable, such as having a certain kind of job, while holding 
the effects of other variables constant. The Whitehall studies used 
regression analysis to measure the health impacts of low job control among 
people who are similar in other ways, such as smoking behavior. (Low-level 
workers do in fact smoke more than their superiors; this explains a 
relatively small amount of the variation in heart disease across the 
Whitehall hierarchy.) 

Most of the studies that you read about in the newspaper are based on 
regression analysis. When researchers conclude that children who spend a 
lot of time in day care are more prone to behavioral problems in elementary 
school than children who spend that time at home, the study has not 
randomly assigned thousands of infants either to day care or to home care 
with a parent. Nor has the study simply compared the elementary school 
behavior of children who had different early childhood experiences without 
recognizing that these populations are likely to be different in other 
fundamental ways. Different families make different child care decisions 
because they are different. Some households have two parents present; 
some don’t. Some have two parents working; some don’t. Some households 
are wealthier or more educated than others. All of these things affect child 
care decisions, and they affect how children in those families will perform 
in elementary school. When done properly, regression analysis can help us 
estimate the effects of day care apart from other things that affect young 
children: family income, family structure, parental education, and so on. 

Now, there are two key phrases in that last sentence. The first is “when 
done properly.” Given adequate data and access to a personal computer, a 
six-year-old could use a basic statistics program to generate regression 
results. Personal computing has made the mechanics of regression analysis 
almost effortless. The problem is that the mechanics of regression analysis 
are not the hard part; the hard part is determining which variables ought to 
be considered in the analysis and how that can best be done. Regression 
analysis is like one of those fancy power tools. It is relatively easy to use, 
but hard to use well—and potentially dangerous when used improperly. 

The second important phrase above is “help us estimate.” Our child care 
study does not give us a “right” answer for the relationship between day 
care and subsequent school performance. Instead, it quantifies the 
relationship observed for a particular group of children over a particular 
stretch of time. Can we draw conclusions that might apply to the broader 
population? Yes, but we will have the same limitations and qualifications as 
we do with any other kind of inference. First, our sample has to be 
representative of the population that we care about. A study of 2,000 young 
children in Sweden will not tell us much about the best policies for early 
childhood education in rural Mexico. And second, there will be variation 
from sample to sample. If we do multiple studies of children and child care, 
each study will produce slightly different findings, even if the 
methodologies are all sound and similar. 

Regression analysis is similar to polling. The good news is that if we 
have a large representative sample and solid methodology, the relationship 
we observe for our sample data is not likely to deviate wildly from the true 
relationship for the whole population. If 10,000 people who exercise three 
or more times a week have sharply lower rates of cardiovascular disease 
than 10,000 people who don’t exercise (but are similar in all other 
important respects), then the chances are pretty good that we will see a 
similar association between exercise and cardiovascular health for the 
broader population. That’s why we do these studies. (The point is not to tell 
those nonexercisers who are sick at the end of the study that they should 
have exercised.) 

The bad news is that we are not proving definitively that exercise 
prevents heart disease. We are instead rejecting the null hypothesis that 
exercise has no association with heart disease, on the basis of some 
statistical threshold that was chosen before the study was conducted. 
Specifically, the authors of the study would report that if exercise is 
unrelated to cardiovascular health, the likelihood of observing such a 
marked difference in heart disease between the exercisers and nonexercisers 
in this large sample would be less than 5 in 100, or below some other 
threshold for statistical significance. 

Let’s pause for a moment and wave our first giant yellow flag. Suppose 
that this particular study compared a large group of individuals who play 
squash regularly with an equal-sized group who get no exercise at all. 
Playing squash does provide a good cardiovascular workout. However, 
we also know that squash players tend to be affluent enough to belong to 
clubs with squash courts. Wealthy individuals may have better access to 
health care, which can also improve cardiovascular health. If our analysis is 
sloppy, we may attribute health benefits to playing squash when in fact the 
real benefit comes from being wealthy enough to play squash (in which 
case playing polo would also be associated with better heart health, even 
though the horse is doing most of the work). 

Or perhaps causality goes the other direction. Could having a healthy 
heart “cause” exercise? Yes. Individuals who are infirm, particularly those 
who have some incipient form of heart disease, will find it much harder to 
exercise. They will certainly be less likely to play squash regularly. Again, 
if the analysis is sloppy or oversimplified, the claim that exercise is good 
for your health may simply reflect the fact that people who start out 
unhealthy find it hard to exercise. In this case, playing squash doesn’t make 
anyone healthier; it merely separates the healthy from the unhealthy. 

There are so many potential regression pitfalls that I’ve devoted the next 
chapter to the most egregious errors. For now, we’ll focus on what can go 
right. Regression analysis has the amazing capacity to isolate a statistical 
relationship that we care about, such as that between job control and heart 
disease, while taking into account other factors that might confuse the 
relationship. 

How exactly does this work? If we know that low-level British civil 
servants smoke more than their superiors, how can we discern which part of 
their poor cardiovascular health is due to their low-level jobs, and which 
part is due to the smoking? These two factors seem inextricably 
intertwined. 

Regression analysis (done properly!) can untangle them. To explain the 
intuition, I need to begin with the basic idea that underlies all forms of 
regression analysis—from the simplest statistical relationships to the 
complex models cobbled together by Nobel Prize winners. At its core, 
regression analysis seeks to find the “best fit” for a linear relationship 
between two variables. A simple example is the relationship between height 
and weight. People who are taller tend to weigh more—though that is 
obviously not always the case. If we were to plot the heights and weights of 
a group of graduate students, the result would look like the scatter plot 
you might recall from Chapter 4: 


[Figure: Scatter Plot for Height and Weight. Horizontal axis: Height (inches).] 


If you were asked to describe the pattern, you might say something along 
the lines of “Weight seems to increase with height.” This is not a terribly 
insightful or specific statement. Regression analysis enables us to go one 
step further and “fit a line” that best describes a linear relationship between 
the two variables. 

Many possible lines are broadly consistent with the height and weight 
data. But how do we know which is the best line for these data? In fact, 
how exactly would we define “best”? Regression analysis typically uses a 
methodology called ordinary least squares, or OLS. The technical details, 
including why OLS produces the best fit, will have to be left to a more 
advanced book. The key point lies in the “least squares” part of the name; 
OLS fits the line that minimizes the sum of the squared residuals. That’s not 
as awfully complicated as it sounds. Each observation in our height and 
weight data set has a residual, which is its vertical distance from the 
regression line, except for those observations that lie directly on the line, for 
which the residual equals zero. (On the diagram below, the residual is 
marked for a hypothetical person A.) It should be intuitive that the larger 
the residuals overall, the worse the fit of the line. The only 
nonintuitive twist with OLS is that the formula takes the square of each 
residual before adding them all up (which increases the weight given to 
observations that lie particularly far from the regression line, or the 
“outliers”). 

Ordinary least squares “fits” the line that minimizes the sum of the 
squared residuals, as illustrated below. 

[Figure: Line of Best Fit for Height and Weight. Horizontal axis: Height (inches).] 



If the technical details have given you a headache, you can be forgiven 
for just grasping at the bottom line, which is that ordinary least squares 
gives us the best description of a linear relationship between two variables. 
The result is not only a line but, as you may recall from high school 
geometry, an equation describing that line. This is known as the regression 
equation, and it takes the following form: y = a + bx, where y is weight in 
pounds; a is the y-intercept of the line (the value for y when x = 0); b is the 
slope of the line; and x is height in inches. The slope of the line we’ve 
fitted, b, describes the “best” linear relationship between height and weight 
for this sample, as defined by ordinary least squares. 
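
For readers who like to see the machinery, here is a minimal Python sketch of what “fitting the line” means for a single explanatory variable. The seven heights and weights are made up for illustration (they are not from any study mentioned in this book), and the function names are hypothetical. The slope and intercept come from the standard closed-form least squares solution.

```python
def fit_ols(xs, ys):
    """Return (a, b) for the line y = a + b*x that minimizes squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b

def sum_squared_residuals(a, b, xs, ys):
    """Add up the squared vertical distances between each point and the line."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Made-up heights (inches) and weights (pounds), for illustration only.
heights = [62, 64, 66, 68, 70, 72, 74]
weights = [120, 150, 140, 175, 165, 200, 190]

a, b = fit_ols(heights, weights)
best = sum_squared_residuals(a, b, heights, weights)
print(f"fitted line: weight = {a:.1f} + {b:.2f} * height")
print(f"sum of squared residuals: {best:.1f}")

# Nudging the intercept or the slope in either direction makes the fit worse.
for da, db in [(5, 0), (-5, 0), (0, 0.1), (0, -0.1)]:
    other = sum_squared_residuals(a + da, b + db, heights, weights)
    print(f"shift a by {da}, b by {db}: sum of squared residuals = {other:.1f}")
```

Every shifted line produces a larger sum of squared residuals than the fitted one, which is all that “best” means here.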

The regression line certainly does not describe every observation in the 
data set perfectly. But it is the best description we can muster for what is 
clearly a meaningful relationship between height and weight. It also means 
that every observation can be explained as WEIGHT = a + b(HEIGHT) + e, 
where e is a “residual” that catches the variation in weight for each 
individual that is not explained by height. Finally, it means that our best 
guess for the weight of any person in the data set would be a + b(HEIGHT). 
Even though most observations do not lie exactly on the regression line, the 
residual still has an expected value of zero since any person in our sample is 
just as likely to weigh more than the regression equation predicts as he is to 
weigh less. 
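
The decomposition WEIGHT = a + b(HEIGHT) + e can be sketched in a few lines of Python. The data below are invented for illustration, and the variable names are assumptions of this example. The point is only that each observed weight splits into a predicted part and a residual, and that the residuals from a least squares line average out to zero.

```python
# Made-up heights (inches) and weights (pounds), for illustration only.
heights = [62, 64, 66, 68, 70, 72, 74]
weights = [120, 150, 140, 175, 165, 200, 190]

n = len(heights)
mean_h = sum(heights) / n
mean_w = sum(weights) / n

# Closed-form least squares slope (b) and intercept (a).
b = (sum((h - mean_h) * (w - mean_w) for h, w in zip(heights, weights))
     / sum((h - mean_h) ** 2 for h in heights))
a = mean_w - b * mean_h

predicted = [a + b * h for h in heights]                  # a + b(HEIGHT)
residuals = [w - p for w, p in zip(weights, predicted)]   # e

for h, w, p, e in zip(heights, weights, predicted, residuals):
    print(f"height {h}: actual {w}, predicted {p:.1f}, residual {e:+.1f}")

mean_residual = sum(residuals) / n
print(f"mean residual: {mean_residual:.10f}")
```

Some residuals are positive and some are negative, but they cancel out on average, which is the sense in which the residual has an expected value of zero.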

Enough of this theoretical jargon! Let’s look at some real height and 
weight data from the Changing Lives study, though I should first clarify 
some basic terminology. The variable that is being explained—weight in 
this case—is known as the dependent variable (because it depends on other 
factors). The variables that we are using to explain our dependent variable 
are known as explanatory variables since they explain the outcome that we 
care about. (Just to make things hard, the explanatory variables are also 
sometimes called independent variables or control variables.) Let’s start by 
using height to explain weight among the Changing Lives participants; later 
we’ll add other potential explanatory factors. There are 3,537 adult 
participants in the Changing Lives study. This is our number of 
observations, or n. (Sometimes a research paper might note that n = 3,537.) 
When we run a simple regression on the Changing Lives data with weight 
as the dependent variable and height as the only explanatory variable, we 
get the following results: 

WEIGHT = -135 + (4.5) × HEIGHT IN INCHES 

a = -135. This is the y-intercept, which has no particular meaning on its 
own. (If you interpret it literally, a person who measures zero inches would 
weigh negative 135 pounds; obviously this is nonsense on several levels.) 
This figure is also known as the constant, because it is the starting point for 
calculating the weight of all observations in the study. 

b = 4.5. Our estimate for b, 4.5, is known as a regression coefficient, or 
in statistics jargon, “the coefficient on height,” because it gives us the best 
estimate of the relationship between height and weight among the Changing 
Lives participants. The regression coefficient has a convenient 
interpretation: a one-unit increase in the independent variable (height) is 
associated with an increase of 4.5 units in the dependent variable (weight). 
For our data sample, this means that a 1-inch increase in height is 
associated with a 4.5 pound increase in weight. Thus, if we had no other 
information, our best guess for the weight of a person who is 5 feet 10 
inches tall (70 inches) in the Changing Lives study would be -135 + 4.5 
(70) = 180 pounds. 
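
The arithmetic is simple enough to write as a tiny Python function. The constant and coefficient are the ones reported above; the function name is just a label chosen for this sketch.

```python
def predicted_weight(height_inches, a=-135.0, b=4.5):
    """Best guess for weight (pounds) from the fitted line: a + b * height."""
    return a + b * height_inches

print(predicted_weight(70))                         # 180.0
print(predicted_weight(71) - predicted_weight(70))  # 4.5
```

The second line shows the interpretation of the coefficient directly: one more inch of height moves the prediction by 4.5 pounds.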

This is our payoff, as we have now quantified the best fit for the linear 
relationship between height and weight for the Changing Lives participants. 
The same basic tools can be used to explore more complex relationships 
and more socially significant questions. For any regression coefficient, you 
will generally be interested in three things: sign, size, and significance. 

Sign. The sign (positive or negative) on the coefficient for an 
independent variable tells us the direction of its association with the 
dependent variable (the outcome we are trying to explain). In the simple 
case above, the coefficient on height is positive. Taller people tend to weigh 
more. Some relationships will work in the other direction. I would expect 
the association between exercise and weight to be negative. If the Changing 
Lives study included data on something like “miles run per month,” I am 
fairly certain that the coefficient on “miles run” would be negative. Running 
more is associated with weighing less. 



Size. How big is the observed effect between the independent variable 
and the dependent variable? Is it of a magnitude that matters? In this case, 
every one inch in height is associated with 4.5 pounds, which is a sizable 
percentage of a typical person’s body weight. In an explanation of why 
some people weigh more than others, height is clearly an important factor. 
In other studies, we may find an explanatory variable that has a statistically 
significant impact on our outcome of interest—meaning that the observed 
effect is not likely to be a product of chance—but that effect may be so 
small as to be trivial or socially insignificant. For example, suppose that we 
are examining determinants of income. Why do some people make more 
money than others? The explanatory variables are likely to be things like 
education, years of work experience, and so on. In a large data set, 
researchers might also find that people with whiter teeth earn $86 more per 
year than other workers, ceteris paribus. (“Ceteris paribus” comes from the 
Latin meaning “other things being equal.”) The positive and statistically 
significant coefficient on the “white teeth” variable assumes that the 
individuals being compared are similar in other respects: same education, 
same work experience, and so on. (I will explain in a moment how we pull 
off this tantalizing feat.) Our statistical analysis has demonstrated that 
whiter teeth are associated with $86 in additional income per year 
and that this finding is not likely to be a mere coincidence. This means (1) 
we’ve rejected the null hypothesis that really white teeth have no 
association with income with a high degree of confidence; and (2) if we 
analyze other data samples, we are likely to find a similar relationship 
between good-looking teeth and higher income. 

So what? We’ve found a statistically significant result, but not one that is 
particularly meaningful. To begin with, $86 per year is not a life-changing 
sum of money. From a public policy standpoint, $86 is also probably less 
than it would cost to whiten an individual’s teeth every year, so we can’t 
even recommend that young workers make such an investment. And, 
although I’m getting a chapter ahead of myself, I’d also be worried about 
some serious methodological problems. For example, having perfect teeth 
may be associated with other personality traits that explain the earnings 
advantage; the earnings effect may be caused by the kind of people who 
care about their teeth, not the teeth themselves. For now, the point is that we 
should take note of the size of the association that we observe between the 
explanatory variable and our outcome of interest. 

Significance. Is the observed result an aberration based on a quirky 
sample of data, or does it reflect a meaningful association that is likely to be 
observed for the population as a whole? This is the same basic question that 
we have been asking for the last several chapters. In the context of height 
and weight, do we think that we would observe a similar positive 
association in other samples that are representative of the population? To 
answer this question, we use the basic tools of inference that have already 
been introduced. Our regression coefficient is based on an observed 
relationship between height and weight for a particular sample of data. If 
we were to test another large sample of data, we would almost certainly get 
a slightly different association between height and weight and therefore a 
different coefficient. The relationship between height and weight observed 
in the Whitehall data (the British civil servants) is likely to be different 
from the relationship observed between height and weight for the 
participants in the Changing Lives study. However, we know from the 
central limit theorem that the mean for a large, properly drawn sample will 
not typically deviate wildly from the mean for the population as a whole. 
Similarly, we can assume that the observed relationship between variables 
like height and weight will not typically bounce around wildly from sample 
to sample, assuming that these samples are large and properly drawn from 
the same population. 

Think about the intuition: It’s highly unlikely (though still possible) that 
we would find that every inch of height is associated with 4.5 additional 
pounds among the Changing Lives participants but that there is no 
association between height and weight in a different representative sample 
of 3,000 adult Americans. 

This should give you the first inkling of how we’ll test whether our 
regression results are statistically significant or not. As with polling and 
other forms of inference, we can calculate a standard error for the 
regression coefficient. The standard error is a measure of the likely 
dispersion we would observe in the coefficient if we were to conduct the 
regression analysis on repeated samples drawn from the same population. If 
we were to measure and weigh a different sample of 3,000 Americans, we 
might find in the subsequent analysis that each inch of height is associated 
with 4.3 pounds. If we did it again for another sample of 3,000 Americans, 
we might find that each inch is associated with 5.2 pounds. Once again, the 
normal distribution is our friend. For large samples of data, such as our 
Changing Lives data set, we can assume that our various coefficients will 
be distributed normally around the “true” association between height and 
weight in the American adult population. On that assumption, we can 
calculate a standard error for the regression coefficient that gives us a sense 
of how much dispersion we should expect in the coefficients from sample to 
sample. I will not delve into the formula for calculating the standard error 
here, both because it will take us off in a direction that involves a lot of 
math and because all basic statistical packages will calculate it for you. 
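
The idea of coefficients bouncing around from sample to sample can be simulated. The Python sketch below is an illustration under stated assumptions, not an analysis of real data: it invents a population in which the “true” relationship is weight = -135 + 4.5 × height plus random noise (the noise level of 30 pounds, the height range, and the sample size of 500 are all made up), draws 200 samples, and fits a line to each.

```python
import random
import statistics

def fit_slope(xs, ys):
    """Least squares slope for a single explanatory variable."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def draw_sample_slope(rng, n, true_a=-135.0, true_b=4.5, noise_sd=30.0):
    """Draw one hypothetical sample of n people and return its fitted slope."""
    heights = [rng.uniform(60, 76) for _ in range(n)]
    weights = [true_a + true_b * h + rng.gauss(0, noise_sd) for h in heights]
    return fit_slope(heights, weights)

rng = random.Random(0)  # fixed seed so the run is repeatable
slopes = [draw_sample_slope(rng, n=500) for _ in range(200)]

print(f"average coefficient across samples: {statistics.mean(slopes):.2f}")
print(f"spread (standard deviation) of coefficients: {statistics.stdev(slopes):.2f}")
print(f"smallest and largest: {min(slopes):.2f}, {max(slopes):.2f}")
```

The individual coefficients differ, but they cluster around the true value of 4.5. The spread of that cluster is what the standard error estimates from a single sample.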

However, I must warn that when we are working with a small sample of 
data—such as a group of 20 adults rather than the 3,000+ persons in the 
Changing Lives study—the normal distribution is no longer willing to be 
our friend. Specifically, if we repeatedly conduct regression analysis on 
different small samples, we can no longer assume that our various 
coefficients will be distributed normally around the “true” association 
between height and weight in the American adult population. Instead, our 
coefficients will still be distributed around the “true” association between 
height and weight for the American adult population in what is known as a 
t-distribution. (Basically the t-distribution is more dispersed than the normal 
distribution and therefore has “fatter tails.”) Nothing else changes; any 
basic statistical software package will easily manage the additional 
complexity associated with using the t-distributions. For this reason, the t- 
distribution will be explained in greater detail in the chapter appendix. 

Sticking with large samples for now (and the normal distribution), the 
most important thing to understand is why the standard error matters. As 
with polling and other forms of inference, we expect that roughly 68 percent of 
our observed regression coefficients will lie within one standard error of the 
true population parameter. Roughly 95 percent will lie within two standard 
errors. And so on. With that, we’re just about home, because now we can do 
a little hypothesis testing. (Seriously, did you think you were already done 
with hypothesis testing?) Once we have a coefficient and a standard error, 
we can test the null hypothesis that there is in fact no relationship between 
the explanatory variable and the dependent variable (meaning that the true 
association between the two variables in the population is zero). 

In our simple height and weight example, we can test how likely it is that 
we would find in our Changing Lives sample that every inch of height is 
associated with 4.5 pounds if there is really no association between height 
and weight in the general population. I’ve run the regression by using a 
basic statistics program; the standard error on the height coefficient is .13. 
This means that if we were to do this analysis repeatedly—say with 100 
different samples—then we would expect our observed regression 
coefficient to be within two standard errors of the true population parameter 
roughly 95 times out of 100. 

We can therefore express our results in two different but related ways. 
First, we can build a 95 percent confidence interval. We can say that 95 
times out of 100, we expect our confidence interval, which is 4.5 ± .26, to 
contain the true population parameter. This is the range between 4.24 and 
4.76. A basic statistics package will calculate this interval as well. Second, 
we can see that our 95 percent confidence interval for the true association 
between height and weight does not include zero. Thus, we can reject the 
null hypothesis that there is no association between height and weight for 
the general population at the 95 percent confidence level. This result can 
also be expressed as being statistically significant at the .05 level; if there 
were truly no association, we would wrongly reject the null hypothesis only 
5 percent of the time. 
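
The confidence interval arithmetic can be written out in a few lines of Python. The coefficient and standard error are the figures reported above; the use of exactly two standard errors follows the rule of thumb in the text (a statistics package would use a slightly more precise multiplier).

```python
coefficient = 4.5      # estimated pounds per inch of height
standard_error = 0.13  # standard error on the height coefficient

margin = 2 * standard_error
lower = coefficient - margin
upper = coefficient + margin
print(f"95 percent confidence interval: {lower:.2f} to {upper:.2f}")

# If zero lies outside the interval, we reject the null hypothesis of no association.
excludes_zero = not (lower <= 0 <= upper)
print(f"interval excludes zero: {excludes_zero}")
```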

In fact, our results are even more extreme than that. The standard error 
(.13) is extremely low relative to the size of the coefficient (4.5). One rough 
rule of thumb is that the coefficient is likely to be statistically significant 
when the coefficient is at least twice the size of the standard error. A 
statistics package also calculates a p-value, which is .000 in this case, 
meaning that there is essentially zero chance of getting an outcome as 
extreme as what we’ve observed (or more so) if there is no true association 
between height and weight in the general population. Remember, we have 
not proved that taller people weigh more in the general population; we have 
merely shown that our results for the Changing Lives sample would be 
highly anomalous if that were not the case. 
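
To see why the reported p-value rounds to .000, we can compute the ratio of the coefficient to its standard error and convert it to a two-sided p-value. The sketch below assumes the normal distribution applies (reasonable for a sample this large); a statistics package would typically use the t-distribution, which gives practically the same answer here.

```python
import math

coefficient = 4.5
standard_error = 0.13

# How many standard errors the coefficient lies from zero.
ratio = coefficient / standard_error
print(f"coefficient / standard error = {ratio:.1f}")
print(f"at least twice the standard error: {ratio >= 2}")

# Two-sided p-value under the normal approximation: P(|Z| >= ratio).
p_value = math.erfc(abs(ratio) / math.sqrt(2))
print(f"p-value = {p_value:.3f}")
```

The coefficient is more than 30 standard errors from zero, so the p-value is far too small to show up in three decimal places.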

Our basic regression analysis produces one other statistic of note: the R², 
which is a measure of the total amount of variation explained by the 
regression equation. We know that we have a broad variation in weight in 
our Changing Lives sample. Many of the persons in the sample weigh more 
than the mean for the group overall; many weigh less. The R² tells us how 
much of that variation around the mean is associated with differences in 
height alone. The answer in our case is .25, or 25 percent. The more 
significant point may be that 75 percent of the variation in weight for our 
sample remains unexplained. There are clearly factors other than height that 
might help us understand the weights of the Changing Lives participants. 
This is where things get more interesting. 
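
For the curious, the R² calculation itself is short. The Python sketch below uses made-up data (not the Changing Lives data, so the result will not be .25) and a hypothetical function name. It compares the variation left over after fitting the line with the total variation around the mean.

```python
def r_squared(xs, ys):
    """Share of the variation in ys around its mean explained by a fitted line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - unexplained / total

# Made-up heights (inches) and weights (pounds), for illustration only.
heights = [62, 64, 66, 68, 70, 72, 74]
weights = [120, 150, 140, 175, 165, 200, 190]

r2 = r_squared(heights, weights)
print(f"R-squared: {r2:.2f}")
print(f"share of variation left unexplained: {1 - r2:.2f}")
```

An R² of 1 would mean every point sits exactly on the line; an R² of zero would mean the line explains nothing beyond the mean.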

I’ll admit that I began this chapter by selling regression analysis as the 
miracle elixir of social science research. So far all I’ve done is use a 
statistics package and an impressive data set to demonstrate that tall people 
tend to weigh more than short people. A short trip to a shopping mall would 
probably have convinced you of the same thing. Now that you understand 
the basics, we can unleash the real power of regression analysis. It’s time to 
take off the training wheels! 

As I’ve promised, regression analysis allows us to unravel complex 
relationships in which multiple factors affect some outcome that we care 
about, such as income, or test scores, or heart disease. When we include 
multiple variables in the regression equation, the analysis gives us an 
estimate of the linear association between each explanatory variable and the 
dependent variable while holding other explanatory variables constant, or 
“controlling for” these other factors. Let’s stick with weight for a while. 
We’ve found an association between height and weight; we know there are 
other factors that can help to explain weight (age, sex, diet, exercise, and so 
on). Regression analysis (often called multiple regression analysis when 
more than one explanatory variable is involved, or multivariate regression 
analysis) will give us a coefficient for each explanatory variable included in 
the regression equation. In other words, among people who are the same sex 
and height, what is the relationship between age and weight? Once we have 
more than one explanatory variable, we can no longer plot these data in two 
dimensions. (Try to imagine a graph that represents the weight, sex, height, 
and age of each participant in the Changing Lives study.) Yet the basic 
methodology is the same as in our simple height and weight example. As 
we add explanatory variables, a statistical package will calculate the 
regression coefficients that minimize the total sum of the squared residuals 
for the regression equation. 
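
For readers who want to peek inside the statistical package, the following is a rough sketch of that minimization in plain Python. It makes several assumptions that are mine and not part of the Changing Lives analysis: the data are simulated, the "true" coefficients and the noise level are numbers I picked for the simulation, and the helper names (`solve`, `ols`) are hypothetical. It solves the so-called normal equations directly, which is fine for a toy example; real packages use more numerically careful methods.

```python
import random

def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols(rows, y):
    """Coefficients that minimize the sum of squared residuals.

    Each row of `rows` starts with 1.0 so that the first coefficient is the intercept.
    """
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

random.seed(0)
rows, y = [], []
for _ in range(2000):
    height = random.gauss(67, 4)        # inches
    age = random.uniform(25, 75)        # years
    female = random.randint(0, 1)       # dummy variable: 1 = female, 0 = male
    # Simulated weight: assumed coefficients plus random noise (standard deviation 15).
    weight = -118 + 4.3 * height + 0.12 * age - 4.8 * female + random.gauss(0, 15)
    rows.append([1.0, height, age, float(female)])
    y.append(weight)

intercept, b_height, b_age, b_female = ols(rows, y)
print(f"height: {b_height:.2f}  age: {b_age:.2f}  female: {b_female:.2f}")
```

Because the data are noisy, the estimated coefficients land near the values used to simulate them rather than exactly on them, which is what happens with any real sample.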

Let’s work with the Changing Lives data for now; then I’ll go back and 
give an intuitive explanation for how this statistical parting of the Red Sea 
could possibly work. We can start by adding one more variable to the 
equation that explains the weights of the Changing Lives participants: age. 
When we run the regression including both height and age as explanatory 
variables for weight, here is what we get. 

WEIGHT = -145 + 4.6 × (HEIGHT IN INCHES) + .1 × (AGE IN YEARS) 

The coefficient on age is .1. That can be interpreted to mean that every 
additional year in age is associated with .1 additional pounds in weight, 
holding height constant. For any group of people who are the same height, 
on average those who are ten years older will weigh one pound more. This 
is not a huge effect, but it’s consistent with what we tend to see in life. The 
coefficient is significant at the .05 level. 

You may have noticed that the coefficient on height has increased 
slightly. Once age is in our regression, we have a more refined 
understanding of the relationship between height and weight. Among 
people who are the same age in our sample, or “holding age constant,” 
every additional inch in height is associated with 4.6 pounds in weight. 

Let’s add one more variable: sex. This will be slightly different because 
sex can only take on two possibilities, male or female. How does one put M 
or F into a regression? The answer is that we use what is called a binary 
variable, or dummy variable. In our data set, we enter a 1 for those 
participants who are female and a 0 for those who are male. (This is not 
meant to be a value judgment.) The sex coefficient can then be interpreted 
as the effect on weight of being female, ceteris paribus. The coefficient is 
-4.8, which is not surprising. We can interpret that to mean that for 
individuals who are the same height and age, women typically weigh 4.8 
pounds less than men. Now we can begin to see some of the power of 
multiple regression analysis. We know that women tend to be shorter than 
men, but our coefficient takes this into account since we have already 
controlled for height. What we have isolated here is the effect of being 
female. The new regression becomes: 

WEIGHT = -118 + 4.3 × (HEIGHT IN INCHES) + .12 × (AGE IN YEARS) 
- 4.8 × (IF SEX IS FEMALE) 

Our best estimate of the weight of a fifty-three-year-old woman who is 5 
feet 5 inches is: -118 + 4.3 (65) + .12 (53) - 4.8 = 163 pounds. 

And our best guess for a thirty-five-year-old male who is 6 feet 3 inches 
is -118 + 4.3 (75) + .12 (35) = 209 pounds. We skip the last term in our 
regression result (-4.8) since this person is not female. 
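
The same two calculations can be written as a tiny Python function. The coefficients come straight from the equation above; the function name and its arguments are my own labels.

```python
def predicted_weight(height_in, age_years, female):
    """Predicted weight in pounds from the three-variable equation above."""
    dummy = 1 if female else 0          # 1 for female, 0 for male
    return -118 + 4.3 * height_in + 0.12 * age_years - 4.8 * dummy

# A fifty-three-year-old woman who is 5 feet 5 inches (65 inches).
print(round(predicted_weight(65, 53, female=True)))    # 163
# A thirty-five-year-old man who is 6 feet 3 inches (75 inches).
print(round(predicted_weight(75, 35, female=False)))   # 209
```

Setting the dummy to zero is exactly what "skipping the last term" means: the -4.8 is multiplied by 0 and drops out.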

Now we can start to test things that are more interesting and less 
predictable. What about education? How might that affect weight? I would 
hypothesize that better-educated individuals are more health conscious and 
therefore will weigh less, ceteris paribus. We also haven’t tested any 
measure of exercise; I assume that, holding other factors constant, the 
people in the sample who get more exercise will weigh less. 

What about poverty? Does being low-income in America have effects 
that manifest themselves in weight? The Changing Lives study asks 
whether the participants are receiving food stamps, which is a good 
measure of poverty in America. Finally, I’m interested in race. We know 
that people of color have different life experiences in the United States 
because of their race. There are cultural and residential factors associated 
with race in America that have implications for weight. Many cities are still 
characterized by a high degree of racial segregation; African Americans 
might be more likely than other residents to live in “food deserts,” which 
are areas with limited access to grocery stores that carry fruits, vegetables, 
and other fresh produce. 

We can use regression analysis to separate out the independent effect of 
each of the potential explanatory factors described above. For example, we 
can isolate the association between race and weight, holding constant other 
socioeconomic factors like educational background and poverty. Among 
people who are high school graduates and eligible for food stamps, what is 
the statistical association between weight and being black? 



At this point, our regression equation is so long that it would be 
cumbersome to print the results in their entirety here. Academic papers 
typically insert large tables that summarize the results of various regression 
equations. I have included a table with the complete results of this 
regression equation in the appendix to this chapter. In the meantime, here 
are the highlights of what happens when we add education, exercise, 
poverty (as measured by receiving food stamps), and race to our equation. 

All of our original variables (height, age, and sex) are still significant. 
The coefficients change little as we add explanatory variables. All of our 
new variables are statistically significant at the .05 level. The R² on the 
regression has climbed from .25 to .29. (Remember, an R² of zero means 
that our regression equation does no better than the mean at predicting the 
weight of any individual in the sample; an R² of 1 means that the regression 
equation perfectly predicts the weight of every person in the sample.) A lot 
of the variation in weight across individuals remains unexplained. 

Education turns out to be negatively associated with weight, as I had 
hypothesized. Among participants in the Changing Lives study, each year of 
education is associated with -1.3 pounds. 

Not surprisingly, exercise is also negatively associated with weight. The 
Changing Lives study includes an index that evaluates each participant in 
the study on his or her level of physical activity. Those individuals who are 
in the bottom quintile of physical activity weigh, on average, 4.5 pounds 
more than other adults in the sample, ceteris paribus. Those in the bottom 
quintile for physical activity weigh, on average, nearly 9 pounds more than 
adults in the top quintile for physical activity. 

Individuals receiving food stamps (the proxy for poverty in this 
regression) are heavier than other adults. Food stamp recipients weigh an 
average of 5.6 pounds more than other Changing Lives participants, ceteris 
paribus. 

The race variable turns out to be particularly interesting. Even after one 
controls for all the other variables described up to this point, race still 
matters a lot when it comes to explaining weight. The non-Hispanic black 
adults in the Changing Lives sample weigh, on average, roughly 10 pounds 
more than the other adults in the sample. Ten pounds is a lot of weight, both 
in absolute terms and compared with the effects of the other explanatory 
variables in the regression equation. This is not a quirk of the data. The 
p-value on the dummy variable for non-Hispanic blacks is .000 and the 95 
percent confidence interval stretches from 7.7 pounds to 16.1 pounds. 

What is going on? The honest answer is that I have no idea. Let me 
reiterate a point that was buried earlier in a footnote: I’m just playing 
around with data here to illustrate how regression analysis works. The 
analytics presented here are to true academic research what street hockey is 
to the NHL. If this were a real research project, there would be weeks or 
months of follow-on analysis to probe this finding. What I can say is that I 
have demonstrated why multiple regression analysis is the best tool we have 
for finding meaningful patterns in large, complex data sets. We started with 
a ridiculously banal exercise: quantifying the relationship between height 
and weight. Before long, we were knee-deep in issues with real social 
significance. 

In that vein, I can offer you a real study that used regression analysis to 
probe a socially significant issue: gender discrimination in the workplace. 
The curious thing about discrimination is that it’s hard to observe directly. 
No employer ever states explicitly that someone is being paid less because 
of his or her race or gender or that someone has not been hired for 
discriminatory reasons (which would presumably leave the person in a 
different job with a lower salary). Instead, what we observe are gaps in pay 
by race and gender that may be the result of discrimination: whites earn 
more than blacks; men earn more than women; and so on. The 
methodological challenge is that these observed gaps may also be the result 
of underlying differences in workers that have nothing to do with workplace 
discrimination, such as the fact that women tend to choose more part-time 
work. How much of the wage gap is due to factors associated with 
productivity on the job, and how much of the gap, if any, is due to labor 
force discrimination? No one can claim that this is a trivial question. 

Regression analysis can help us answer it. However, our methodology 
will be slightly more roundabout than it was with our analysis explaining 
weight. Since we cannot measure discrimination directly, we will examine 
other factors that traditionally explain wages, such as education, experience, 
occupational field, and so on. The case for discrimination is circumstantial: 
If a significant wage gap remains after controlling for other factors that 
typically explain wages, then discrimination is a likely culprit. The larger 
the unexplained portion of any wage gap, the more suspicious we should 
be. As an example, let’s look at a paper by three economists examining the 
wage trajectories of a sample of roughly 2,500 men and women who 
graduated with MBAs from the Booth School of Business at the University 
of Chicago. 1 Upon graduation, male and female graduates have very similar 
average starting salaries: $130,000 for men and $115,000 for women. After 
ten years in the workforce, however, a huge gap has opened up; women on 
average are earning a striking 45 percent less than their male classmates: 
$243,000 versus $442,000. In a broader sample of more than 18,000 MBA 
graduates who entered the workforce between 1990 and 2006, being female 
is associated with 29 percent lower earnings. What is happening to women 
once they enter the labor force? 

According to the authors of the study (Marianne Bertrand of the Booth 
School of Business and Claudia Goldin and Lawrence Katz of Harvard), 
discrimination is not a likely explanation for most of the gap. The gender 
wage gap fades away as the authors add more explanatory variables to the 
analysis. For example, men take more finance classes in the MBA program 
and graduate with higher grade point averages. When these data are 
included as control variables in the regression equation, the unexplained 
portion of the gap in male-female earnings drops to 19 percent. When 
variables are added to the equation to account for post-MBA work 
experience, particularly spells out of the labor force, the unexplained 
portion of the male-female wage gap drops to 9 percent. And when 
explanatory variables are added for other work characteristics, such as 
employer type and hours worked, the unexplained portion of the gender 
wage gap falls to under 4 percent. 

For workers who have been in the labor force more than ten years, the 
authors can ultimately explain all but 1 percent of the gender wage gap with 
factors unrelated to discrimination on the job. They conclude, “We identify 
three proximate reasons for the large and rising gender gap in earnings: 
differences in training prior to MBA graduation; differences in career 
interruptions; and differences in weekly hours. These three determinants 
can explain the bulk of gender differences across the years following MBA 
completion.” 


I hope that I’ve convinced you of the value of multiple regression analysis, 
particularly the research insights that stem from being able to isolate the 
effect of one explanatory variable while controlling for other confounding 
factors. I have not yet provided an intuitive explanation for how this 
statistical “miracle elixir” works. When we use regression analysis to 
evaluate the relationship between education and weight, ceteris paribus, 
how does a statistical package control for factors like height, sex, age, and 
income when we know that our Changing Lives participants are not 
identical in these other respects? 

To get your mind around how we can isolate the effect on weight of a 
single variable, say, education, imagine the following situation. Assume that 
all of the Changing Lives participants are convened in one place—say, 
Framingham, Massachusetts. Now assume that the men and women are 
separated. And then assume that both the men and the women are further 
divided by height. There will be a room of six-foot tall men. Next door, 
there will be a room of 6-foot 1-inch men, and so on for both genders. If we 
have enough participants in our study, we can further subdivide each of 
those rooms by income. Eventually we will have lots of rooms, each of 
which contains individuals who are identical in all respects except for 
education and weight, which are the two variables we care about. There 
would be a room of forty-five-year-old 5-foot 5-inch men who earn $30,000 
to $40,000 a year. Next door would be all the forty-five-year-old 5-foot 5- 
inch women who earn $30,000 to $40,000 a year. And so on (and on and 
on). 

There will still be some variation in weight in each room; people who are 
the same sex and height and have the same income will still weigh different 
amounts—though presumably there will be much less variation in weight in 
each room than there is for the overall sample. Our goal now is to see how 
much of the remaining variation in weight in each room can be explained 
by education. In other words, what is the best linear relationship between 
education and weight in each room? 

The final challenge, however, is that we do not want different 
coefficients in each “room.” The whole point of this exercise is to calculate 
a single coefficient that best expresses the relationship between education 
and weight for the entire sample, while holding other factors constant. What 
we would like to calculate is the single coefficient for education that we can 
use in every room to minimize the sum of the squared residuals for all of the 
rooms combined. What coefficient for education minimizes the square of 
the unexplained weight for every individual across all the rooms? That 
becomes our regression coefficient because it is the best explanation of the 
linear relationship between education and weight for this sample when we 
hold sex, height, and income constant. 

As an aside, you can see why large data sets are so useful. They allow us 
to control for many factors while still having many observations in each 
“room.” Obviously a computer can do all of this in a split second without 
herding thousands of people into different rooms. 
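
Here is a rough sketch of the "rooms" idea in Python, using simulated people rather than real ones. Everything specific in it is an assumption made for the illustration: six rooms, 300 people per room, an education effect of -1.3 pounds per year, and rooms that differ in both average schooling and average weight. The code measures each person relative to the averages of his or her own room and then fits a single slope to all of those deviations at once, which is the one coefficient that minimizes the squared residuals across every room combined.

```python
import random
from collections import defaultdict

random.seed(1)
TRUE_EFFECT = -1.3  # assumed pounds per year of education, for the simulation only

people = []  # (room, education, weight)
for room in range(6):                 # each room stands for one sex/height/income cell
    for _ in range(300):
        education = 10 + room + random.gauss(0, 2)   # rooms differ in schooling
        weight = 150 + 10 * room + TRUE_EFFECT * education + random.gauss(0, 8)
        people.append((room, education, weight))

def slope(pairs):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    return (sum((x - mx) * (y - my) for x, y in pairs)
            / sum((x - mx) ** 2 for x, _ in pairs))

# Ignoring the rooms: one line through everybody.
naive = slope([(e, w) for _, e, w in people])

# Respecting the rooms: express everyone relative to their own room's averages,
# then fit one slope to all of those deviations together.
by_room = defaultdict(list)
for room, e, w in people:
    by_room[room].append((e, w))
deviations = []
for members in by_room.values():
    mean_e = sum(e for e, _ in members) / len(members)
    mean_w = sum(w for _, w in members) / len(members)
    deviations += [(e - mean_e, w - mean_w) for e, w in members]
within = slope(deviations)

print(f"ignoring rooms: {naive:.2f}   holding room constant: {within:.2f}")
```

In this simulation the heavier rooms also happen to be the better-educated ones, so the line that ignores the rooms gets the sign of the education effect wrong, while the room-by-room comparison lands close to the assumed -1.3.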

Let’s finish the chapter where we started, with the connection between 
stress on the job and coronary heart disease. The Whitehall studies of 
British civil servants sought to measure the association between grade of 
employment and death from coronary heart disease over the ensuing years. 
One of the early studies followed 17,530 civil servants for seven and a half 
years. 2 The authors concluded, “Men in the lower employment grades were 
shorter, heavier for their height, had higher blood pressure, higher plasma 
glucose, smoked more, and reported less leisure-time physical activity than 
men in the higher grades. Yet when allowance was made for the influence 
on mortality of all of these factors plus plasma cholesterol, the inverse 
association between grade of employment and [coronary heart disease] 
mortality was still strong.” The “allowance” they refer to for these other 
known risk factors is done by means of regression analysis. The study 
demonstrates that holding other health factors constant (including height, 
which is a decent proxy for early childhood health and nutrition), working 
in a “low grade” job can literally kill you. 

Skepticism is always a good first response. I wrote at the outset of the 
chapter that “low-control” jobs are bad for your health. That may or may 
not be synonymous with being low on the administrative totem pole. A 
follow-up study using a second sample of 10,308 British civil servants 
sought to drill down on this distinction. 3 The workers were once again 
divided into administrative grades—high, intermediate, and low—only this 
time the participants were also given a fifteen-item questionnaire that 
evaluated their level of “decision latitude or control.” These included 
questions such as “Do you have a choice in deciding how you do your job?” 
and categorical responses (ranging from “never” to “often”) to statements 
such as “I can decide when to take a break.” The researchers found that the 
“low-control” workers were at significantly higher risk of developing 
coronary heart disease over the course of the study than “high-control” 
workers. Yet researchers also found that workers with rigorous job demands 
were at no greater risk of developing heart disease, nor were workers who 
reported low levels of social support on the job. Lack of control seems to be 
the killer, literally. 

The Whitehall studies have two characteristics typically associated with 
strong research. First, the results have been replicated elsewhere. In the 
public health literature, the “low-control” idea evolved into a term known 
as “job strain,” which characterizes jobs with “high psychological workload 
demands” and “low decision latitude.” Between 1981 and 1993, thirty-six 
studies were published on the subject; most found a significant positive 
association between job strain and heart disease. 4 

Second, researchers sought and found corroborating biological evidence 
to explain the mechanism by which this particular kind of stress on the job 
causes poor health. Work conditions that involve rigorous demands but low 
control can cause physiological responses (such as the release of stress- 
related hormones) that increase the risk of heart disease over the long run. 
Even animal research plays a role; low-status monkeys and baboons (who 
bear some resemblance to civil servants at the bottom of the authority 
chain) have physiological differences from their high-status peers that put 
them at greater cardiovascular risk. 5 

All else equal, it’s better not to be a low-status baboon, which is a point I 
try to make to my children as often as possible, particularly my son. The 
larger message here is that regression analysis is arguably the most 
important tool that researchers have for finding meaningful patterns in large 
data sets. We typically cannot do controlled experiments to learn about job 
discrimination or factors that cause heart disease. Our insights into these 
socially significant issues and many others come from the statistical tools 
covered in this chapter. In fact, it would not be an exaggeration to say that a 
high proportion of all important research done in the social sciences over 
the past half century (particularly since the advent of cheap computing 
power) draws on regression analysis. 


Regression analysis supersizes the scientific method; we are healthier, 
safer, and better informed as a result. 

So what could possibly go wrong with this powerful and impressive tool? 
Read on. 

APPENDIX TO CHAPTER 11 

The t-distribution 

Life gets a little trickier when we are doing our regression analysis (or other 
forms of statistical inference) with a small sample of data. Suppose we were 
analyzing the relationship between weight and height on the basis of a 
sample of only 25 adults, rather than using a huge data set like the 
Changing Lives study. Logic suggests that we should be less confident 
about generalizing our results to the entire adult population from a sample 
of 25 than from a sample of 3,000. One of the themes throughout the book 
has been that smaller samples tend to generate more dispersion in 
outcomes. Our sample of 25 will still give us meaningful information, as 
would a sample of 5 or 10—but how meaningful? 

The t-distribution answers that question. If we analyze the association 
between height and weight for repeated samples of 25 adults, we can no 
longer assume that the various coefficients we get for height will be 
distributed normally around the “true” coefficient for height in the adult 
population. They will still be distributed around the true coefficient for the 
whole population, but the shape of that distribution will not be our familiar 
bell-shaped normal curve. Instead, we have to assume that repeated 
samples of just 25 will produce more dispersion around the true population 
coefficient—and therefore a distribution with “fatter tails.” And repeated 
samples of 10 will produce even more dispersion than that—and therefore 
even fatter tails. The t-distribution is actually a series, or “family,” of 
probability density functions that vary according to the size of our sample. 
Specifically, the more data we have in our sample, the more “degrees of 
freedom” we have when determining the appropriate distribution against 
which to evaluate our results. In a more advanced class, you will learn 
exactly how to calculate degrees of freedom; for our purposes, they are 
roughly equal to the number of observations in our sample. For instance, a 
basic regression analysis with a sample of 10 and a single explanatory 
variable has 8 degrees of freedom (10 observations minus the two estimated 
parameters, the intercept and the slope). The more degrees of freedom we have, 
the more confident we can be that our sample represents the true 
population, and the “tighter” our distribution will be, as the following 
diagram illustrates. 
When the number of degrees of freedom gets large, the t-distribution 
converges to the normal distribution. That’s why when we are working with 
large data sets, we can use the normal distribution for our assorted 
calculations. 

The t-distribution merely adds nuance to the same process of statistical 
inference that we have been using throughout the book. We are still 
formulating a null hypothesis and then testing it against some observed 
data. If the data we observe would be highly unlikely if the null hypothesis 
were true, then we reject the null hypothesis. The only thing that changes 
with the t-distribution is the underlying probabilities for evaluating the 
observed outcomes. The “fatter” the tail in a particular probability 
distribution (e.g., the t-distribution for eight degrees of freedom), the more 
dispersion we would expect in our observed data just as a matter of chance, 
and therefore the less confident we can be in rejecting our null hypothesis. 

For example, suppose we are running a regression equation, and the null 
hypothesis is that the coefficient on a particular variable is zero. Once we 
get the regression results, we would calculate a t-statistic, which is the ratio 
of the observed coefficient to the standard error for that coefficient.* This 
t-statistic is then evaluated against whatever t-distribution is appropriate for 
the size of the data sample (since this is largely what determines the number 
of degrees of freedom). When the t-statistic is sufficiently large, meaning 
that our observed coefficient is far from what the null hypothesis would 
predict, we can reject the null hypothesis at some level of statistical 
significance. Again, this is the same basic process of statistical inference 
that we have been employing throughout the book. 
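
As a quick worked example, the sketch below applies that ratio to the Sex row of the regression table in this appendix (coefficient -5.7, standard error 1.7). The function names are my own. The 1.96 multiplier for the 95 percent confidence interval assumes a sample large enough that the normal distribution applies, as it does for the Changing Lives data.

```python
def t_statistic(coefficient, std_error, null_value=0.0):
    """Distance of the observed coefficient from the null value, in standard errors."""
    return (coefficient - null_value) / std_error

def confidence_interval_95(coefficient, std_error, critical=1.96):
    """Large-sample 95 percent confidence interval: coefficient +/- 1.96 standard errors."""
    half_width = critical * std_error
    return coefficient - half_width, coefficient + half_width

t = t_statistic(-5.7, 1.7)
low, high = confidence_interval_95(-5.7, 1.7)
print(f"t = {t:.1f}, 95% CI = {low:.1f} to {high:.1f}")   # t = -3.4, 95% CI = -9.0 to -2.4
```

Because the published coefficients and standard errors are rounded, repeating this for other rows of the table will sometimes differ from the printed t-statistic in the last digit.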

The fewer the degrees of freedom (and therefore the “fatter” the tails of 
the relevant t-distribution), the higher the t-statistic will have to be in order 
for us to reject the null hypothesis at some given level of significance. In 
the hypothetical regression example described above, if we had four 
degrees of freedom, we would need a t-statistic of at least 2.13 to reject the 
null hypothesis at the .05 level (in a one-tailed test). 

However, if we have 20,000 degrees of freedom (which essentially 
allows us to use the normal distribution), we would need only a t-statistic of 
1.65 to reject the null hypothesis at the .05 level in the same one-tailed test. 
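
If you do not have a t-table handy, these thresholds can be approximated numerically. The following is a minimal sketch using only Python's standard library: it integrates the t-distribution's density with Simpson's rule and then searches for the cutoff that leaves 5 percent in the upper tail. The function names, the number of integration steps, and the search range are all choices I made for the illustration; a statistics package would give the same answers with far less effort.

```python
import math

def t_pdf(x, df):
    """Probability density of the t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def upper_tail(x, df, steps=1000):
    """P(T > x) for x >= 0, using Simpson's rule on the density from 0 to x."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def critical_value(df, alpha=0.05):
    """Smallest t-statistic that rejects the null in a one-tailed test at level alpha."""
    lo, hi = 0.0, 50.0
    for _ in range(40):               # bisection: halve the search interval each pass
        mid = (lo + hi) / 2
        if upper_tail(mid, df) > alpha:
            lo = mid                  # too much probability left in the tail; move right
        else:
            hi = mid
    return (lo + hi) / 2

for df in (4, 8, 30, 20000):
    print(f"{df:>6} degrees of freedom: {critical_value(df):.3f}")
```

The printed cutoffs shrink as the degrees of freedom grow, from about 2.13 with four degrees of freedom toward the normal distribution's value of roughly 1.645.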

Regression Equation for Weight 

Variable                              Coefficient  Standard Error  t-statistic  p-value (two-tailed test)  95% Confidence Interval
Height                                4.4          .2              21.4         .000                       4.0 to 4.8
Age                                   .08          .03             2.2          .026                       .01 to .2
Sex                                   -5.7         1.7             -3.4         .001                       -9.0 to -2.4
Years of Educational Attainment       -.7          .2              -3.5         .000                       -1.1 to -.3
Bottom Quintile of Physical Activity  3.7          1.4             2.6          .009                       .9 to 6.5
Dummy for Receiving Food Stamps       5.6          2.1             2.7          .007                       1.5 to 9.7
Non-Hispanic Black                    9.7          1.3             7.2          .000                       7.0 to 12.3
Intercept                             -117






* You should consider this exercise “fun with data” rather than an authoritative exploration of any of 
the relationships described in the subsequent regression equations. The purpose here is to provide an 
intuitive example of how regression analysis works, not to do meaningful research on Americans’ 
weights. 

* “Parameter” is a fancy term for any statistic that describes a characteristic of some population; the 
mean weight for all adult men is a parameter of that population. So is the standard deviation. In the 
example here, the true association between height and weight for the population is a parameter of that 
population. 

* When the null hypothesis is that a regression coefficient is zero (as is most often the case), the ratio 
of the observed regression coefficient to the standard error is known as the t-statistic. This will also 
be explained in the chapter appendix. 

* Broader discriminatory forces in society may affect the careers that women choose or the fact that 
they are more likely than men to interrupt their careers to take care of children. However, these 
important issues are distinct from the narrower question of whether women are being paid less than 
men to do the same jobs. 

* These studies differ slightly from the regression equations introduced earlier in the chapter. The 
outcome of interest, or dependent variable, is binary in these studies. A participant either has some 
kind of heart-related health problem during the period of study or he does not. As a result, the 
researchers use a tool called multivariate logistic regression. The basic idea is the same as the 
ordinary least squares models described in this chapter. Each coefficient expresses the effect of a 
particular explanatory variable on the dependent variable while holding the effects of other variables 
in the model constant. The key difference is that the variables in the equation all affect the likelihood 
that some event happens, such as having a heart attack during the period of study. In this study, for 
example, workers in the low control group are 1.99 times as likely to have “any coronary event” over 
the period of study as workers in the high control group after controlling for other coronary risk 
factors. 

* The more general formula for calculating a t-statistic is the following: 

$$ t = \frac{b - b_0}{se_b} $$

where b is the observed coefficient, b_0 is the null hypothesis for that coefficient, and se_b is the 
standard error for the observed coefficient b. 


CHAPTER 12 


Common Regression Mistakes 
The mandatory warning label 


Here is one of the most important things to remember when doing research 
that involves regression analysis: Try not to kill anyone. You can even put a 
little Post-it note on your computer monitor: “Do not kill people with your 
research.” Because some very smart people have inadvertently violated that 
rule. 

Beginning in the 1990s, the medical establishment coalesced around the 
idea that older women should take estrogen supplements to protect against 
heart disease, osteoporosis, and other conditions associated with 
menopause. 1 By 2001, some 15 million women were being prescribed 
estrogen in the belief that it would make them healthier. Why? Because 
research at the time—using the basic methodology laid out in the last 
chapter—suggested this was a sensible medical strategy. In particular, a 
longitudinal study of 122,000 women (the Nurses’ Health Study) found a 
negative association between estrogen supplements and heart attacks. 
Women taking estrogen had one-third as many heart attacks as women who 
were not taking estrogen. This was not a couple of teenagers using dad’s 
computer to check out pornography and run regression equations. The 
Nurses’ Health Study is run by the Harvard Medical School and the 
Harvard School of Public Health. 

Meanwhile, scientists and physicians offered a medical theory for why 
hormone supplements might be beneficial for female health. A woman’s 
ovaries produce less estrogen as she ages; if estrogen is important to the 
body, then making up for this deficit in old age could be protective of a 
woman’s long-term health. Hence the name of the treatment: hormone 
replacement therapy. Some researchers even began to suggest that older 
men should receive an estrogen boost. 2 


And then, while millions of women were being prescribed hormone 
replacement therapy, estrogen was subjected to the most rigorous form of 
scientific scrutiny: clinical trials. Rather than searching a large data set like 
the Nurses’ Health Study for statistical associations that may or may not be 
causal, a clinical trial consists of a controlled experiment. One sample is 
given a treatment, such as hormone replacement; another sample is given a 
placebo. Clinical trials showed that women taking estrogen had a higher 
incidence of heart disease, stroke, blood clots, breast cancer, and other 
adverse health outcomes. Estrogen supplements did have some benefits, but 
those benefits were far outweighed by other risks. Beginning in 2002, 
doctors were advised not to prescribe estrogen for their aging female 
patients. The New York Times Magazine asked a delicate but socially 
significant question: How many women died prematurely or suffered 
strokes or breast cancer because they were taking a pill that their doctors 
had prescribed to keep them healthy? 

The answer: “A reasonable estimate would be tens of thousands.” 3 

Regression analysis is the hydrogen bomb of the statistics arsenal. Every 
person with a personal computer and a large data set can be a researcher in 
his or her own home or cubicle. What could possibly go wrong? All kinds 
of things. Regression analysis provides precise answers to complicated 
questions. These answers may or may not be accurate. In the wrong hands, 
regression analysis will yield results that are misleading or just plain wrong. 
And, as the estrogen example illustrates, even in the right hands this 
powerful statistical tool can send us speeding dangerously in the wrong 
direction. The balance of this chapter will explain the most common 
regression “mistakes.” I put “mistakes” in quotation marks, because, as with 
all other kinds of statistical analysis, clever people can knowingly exploit 
these methodological points to nefarious ends. 

Here is a “Top Seven” list of the most common abuses of an otherwise 
extraordinary tool. 

Using regression to analyze a nonlinear relationship. Have you ever read
the warning label on a hair dryer—the part that cautions, Do Not Use in the
Bath Tub? And you think to yourself, “What kind of moron uses a hair
dryer in the bath tub?” It’s an electrical appliance; you don’t use electrical
appliances around water. They’re not designed for that. If regression
analysis had a similar warning label, it would say, Do Not Use When There
Is Not a Linear Association between the Variables That You Are Analyzing.
Remember, a regression coefficient describes the slope of the “line of best 
fit” for the data; a line that is not straight will have a different slope in 
different places. As an example, consider the following hypothetical 
relationship between the number of golf lessons that I take during a month 
(an explanatory variable) and my average score for an eighteen-hole round 
during that month (the dependent variable). As you can see from the scatter 
plot, there is no consistent linear relationship. 

Effect of Golf Lessons on Score

There is a pattern, but it cannot be easily described with a single straight
line. The first few golf lessons appear to bring my score down rapidly. 
There is a negative association between lessons and my scores for this 
stretch; the slope is negative. More lessons yield lower scores (which is 
good in golf). 

But then when I reach the point where I’m spending between $200 and 
$300 a month on lessons, the lessons do not seem to have much effect at all. 
There is no clear association over this stretch between additional instruction 
and my golf scores; the slope is zero. 

And finally, the lessons appear to become counterproductive. Once I’m 
spending $300 a month on instruction, incremental lessons are associated 
with higher scores; the slope is positive over this stretch. (I’ll discuss the
distinct possibility that the bad golf may be causing the lessons, rather than
the other way around, later in the chapter.) 

The most important point here is that we cannot accurately summarize 
the relationship between lessons and scores with a single coefficient. The 
best interpretation of the pattern described above is that golf lessons have 
several different linear relationships with my scores. You can see that; a 
statistics package will not. If you feed these data into a regression equation, 
the computer will give you a single coefficient. That coefficient will not 
accurately reflect the true relationship between the variables of interest. The 
results you get will be the statistical equivalent of using a hair dryer in the 
bath tub. 
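To see what that single coefficient looks like in practice, here is a minimal sketch in Python. The lesson counts and scores are invented to mimic the down, flat, up pattern described above (they are not taken from the scatter plot), and `ols_slope` is a hypothetical helper that computes the ordinary least squares slope by hand.

```python
def ols_slope(xs, ys):
    """Slope of the ordinary least squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Invented data: lessons per month and average score for that month.
# Scores fall for the first few lessons, level off, then rise again.
lessons = list(range(13))
scores = [100, 96, 92, 88, 84, 84, 84, 84, 84, 88, 92, 96, 100]

overall = ols_slope(lessons, scores)            # one coefficient for everything
early = ols_slope(lessons[0:5], scores[0:5])    # lessons 0 through 4
middle = ols_slope(lessons[4:9], scores[4:9])   # lessons 4 through 8
late = ols_slope(lessons[8:13], scores[8:13])   # lessons 8 through 12

print(f"all data: {overall:.2f}")
print(f"early: {early:.2f}  middle: {middle:.2f}  late: {late:.2f}")
```

With these made-up numbers, the single coefficient comes out at essentially zero, which would suggest that lessons do nothing, even though the three stretches have slopes of roughly negative four, zero, and positive four strokes per lesson.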

Regression analysis is meant to be used when the relationship between 
variables is linear. A textbook or an advanced course in statistics will walk 
you through the other core assumptions underlying regression analysis. As 
with any other tool, the further one deviates from its intended use, the less 
effective, or even potentially dangerous, it’s going to be. 

Correlation does not equal causation. Regression analysis can only 
demonstrate an association between two variables. As I have mentioned 
before, we cannot prove with statistics alone that a change in one variable is 
causing a change in the other. In fact, a sloppy regression equation can 
produce a large and statistically significant association between two 
variables that have nothing to do with one another. Suppose we were 
searching for potential causes for the rising rate of autism in the United 
States over the last two decades. Our dependent variable—the outcome we 
are seeking to explain—would be some measure of the incidence of
autism by year, such as the number of diagnosed cases for every 1,000
children of a certain age. If we were to include annual per capita income in 
China as an explanatory variable, we would almost certainly find a positive 
and statistically significant association between rising incomes in China and 
rising autism rates in the U.S. over the past twenty years. 

Why? Because they both have been rising sharply over the same period. 
Yet I highly doubt that a sharp recession in China would reduce the autism 
rate in the United States. (To be fair, if I observed a strong relationship 
between rapid economic growth in China and autism rates in China alone, I
might begin to search for some environmental factor related to economic
growth, such as industrial pollution, that might explain the association.) 
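The trend problem is easy to reproduce. The sketch below uses two entirely made-up series, one standing in for an autism rate and one for per capita income in China, that share nothing except an upward drift over twenty years. Neither series is real data, and `pearson_r` is a hypothetical helper written out by hand.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

years = range(20)

# Invented series 1: cases per 1,000 children, rising steadily with a small wobble.
autism_per_1000 = [1.0 + 0.3 * t + 0.2 * (-1) ** t for t in years]

# Invented series 2: per capita income growing 8 percent a year.
income_china = [1000 * 1.08 ** t for t in years]

r = pearson_r(autism_per_1000, income_china)
print(f"correlation between the two unrelated trends: {r:.2f}")
```

The two series were generated independently, yet the correlation is very high simply because both rise over the same period. A regression of one on the other would report a strong association that means nothing.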

The kind of false association between two variables that I have just 
illustrated is just one example of a more general phenomenon known as 
spurious causation. There are several other ways in which an association 
between A and B can be wrongly interpreted. 

Reverse causality. A statistical association between A and B does not prove 
that A causes B. In fact, it’s entirely plausible that B is causing A. I alluded 
to this possibility earlier in the golf lesson example. Suppose that when I 
build a complex model to explain my golf scores, the variable for golf 
lessons is consistently associated with worse scores. The more lessons I 
take, the worse I shoot! One explanation is that I have a really, really bad 
golf instructor. A more plausible explanation is that I tend to take more 
lessons when I’m playing poorly; bad golf is causing more lessons, not the 
other way around. (There are some simple methodological fixes to a 
problem of this nature. For example, I might include golf lessons in one 
month as an explanatory variable for golf scores in the next month.) 

As noted earlier in the chapter, causality may go in both directions. 
Suppose you do some research demonstrating that states that spend more 
money on K-12 education have higher rates of economic growth than states 
that spend less on K-12 education. A positive and significant association 
between these two variables does not provide any insight into which 
direction the relationship happens to run. Investments in K-12 education 
could cause economic growth. On the other hand, states that have strong 
economies can afford to spend more on K-12 education, so the strong 
economy could be causing the education spending. Or, education spending 
could boost economic growth, which makes possible additional education 
spending—the causality could be going in both ways. 

The point is that we should not use explanatory variables that might be 
affected by the outcome that we are trying to explain, or else the results will 
become hopelessly tangled. For example, it would be inappropriate to use 
the unemployment rate in a regression equation explaining GDP growth, 
since unemployment is clearly affected by the rate of GDP growth. Or, to 
think of it another way, a regression analysis finding that lowering 
unemployment will boost GDP growth is a silly and meaningless finding,
since boosting GDP growth is usually required in order to reduce
unemployment. 

We should have reason to believe that our explanatory variables affect 
the dependent variable, and not the other way around. 

Omitted variable bias. You should be skeptical the next time you see a huge 
headline proclaiming, “Golfers More Prone to Heart Disease, Cancer, and 
Arthritis!” I would not be surprised if golfers have a higher incidence of all 
of those diseases than nongolfers; I also suspect that golf is probably good 
for your health because it provides socialization and modest exercise. How 
can I reconcile those two statements? Very easily. Any study that attempts 
to measure the effects of playing golf on health must control properly for 
age. In general, people play more golf when they get older, particularly in 
retirement. Any analysis that leaves out age as an explanatory variable is 
going to miss the fact that golfers, on average, will be older than 
nongolfers. Golf isn’t killing people; old age is killing people, and they 
happen to enjoy playing golf while it does. I suspect that when age is 
inserted into the regression analysis as a control variable, we will get a 
different outcome. Among people who are the same age, golf may be mildly 
preventive of serious illnesses. That’s a pretty big difference. 

In this example, age is an important “omitted variable.” When we leave 
age out of a regression equation explaining heart disease or some other 
adverse health outcome, the “playing golf” variable takes on two 
explanatory roles rather than just one. It tells us the effect of playing golf on 
heart disease, and it tells us the effect of being old on heart disease (since 
golfers tend to be older than the rest of the population). In the statistics 
lingo, we would say that the golf variable is “picking up” the effect of age. 
The problem is that these two effects are commingled. At best, our results are
a jumbled mess. At worst, we wrongly assume that golf is bad for your 
health, when in fact the opposite is likely to be true. 
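A small table of invented counts shows how the golf variable can “pick up” the effect of age. In this sketch the age groups, head counts, and disease rates are all assumptions chosen for illustration: golfers are mostly older, and within each age group golfers are slightly healthier. Comparing within age groups stands in here for adding age as a control variable.

```python
# (age group, plays golf) -> (number of people, cases of heart disease)
# All counts are invented for illustration.
counts = {
    ("under 65", False): (800, 40),      # 5 percent
    ("under 65", True): (200, 8),        # 4 percent
    ("65 and over", False): (200, 60),   # 30 percent
    ("65 and over", True): (800, 200),   # 25 percent
}

def disease_rate(plays_golf, age_group=None):
    """Share of people with heart disease, optionally within one age group."""
    people = cases = 0
    for (group, golfer), (n, sick) in counts.items():
        if golfer == plays_golf and (age_group is None or group == age_group):
            people += n
            cases += sick
    return cases / people

# Ignoring age: golfers look far less healthy.
print("all ages:", disease_rate(True), "vs", disease_rate(False))

# Holding age constant: golfers look slightly healthier in both groups.
for group in ("under 65", "65 and over"):
    print(group + ":", disease_rate(True, group), "vs", disease_rate(False, group))
```

In these made-up numbers the golfers’ overall rate is about double that of the nongolfers, yet golfers do better in both age groups. The overall gap comes entirely from the fact that most of the golfers are in the older group.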

Regression results will be misleading and inaccurate if the regression 
equation leaves out an important explanatory variable, particularly if other 
variables in the equation “pick up” that effect. Suppose we are trying to 
explain school quality. This is an important outcome to understand: What 
makes good schools? Our dependent variable—the quantifiable measure of 
quality—would most likely be test scores. We would almost certainly 



examine school spending as one explanatory variable in hopes of 
quantifying the relationship between spending and test scores. Do schools 
that spend more get better results? If school spending were the only 
explanatory variable, I have no doubt that we would find a large and 
statistically significant relationship between spending and test scores. Yet 
that finding, and the implication that we can spend our way to better 
schools, is deeply flawed. 

There are many potentially significant omitted variables here, but the 
crucial one is parental education. Well-educated families tend to live in 
affluent areas that spend a lot of money on their schools; such families also 
tend to have children who score well on tests (and poor families are more 
likely to have students who struggle). If we do not have some measure of 
the socioeconomic status of the student body as a control variable, our 
regression results will probably show a large positive association between 
school spending and test scores—when in fact, those results may be a 
function of the kind of students who are walking in the school door, not the 
money that is being spent in the building. 

I remember a college professor’s pointing out that SAT scores are highly 
correlated with the number of cars that a family owns. He insinuated that 
the SAT was therefore an unfair and inappropriate tool for college 
admissions. The SAT has its flaws but the correlation between scores and 
family cars is not the one that concerns me most. I do not worry much that 
rich families can get their kids into college by purchasing three extra 
automobiles. The number of cars in a family’s garage is a proxy for their 
income, education, and other measures of socioeconomic status. The fact 
that wealthy kids do better on the SAT than poor kids is not news. (As noted 
earlier, the mean SAT critical reading score for students from families with 
a household income over $200,000 is 134 points higher than the mean score 
for students in households with income below $20,000.) 4 The bigger
concern should be whether or not the SAT is “coachable.” How much can 
students improve their scores by taking private SAT prep classes? Wealthy 
families clearly are better able to send their children to test prep classes. 
Any causal improvement between these classes and SAT scores would 
favor students from wealthy families relative to more disadvantaged
students of equal abilities (who presumably also could have raised their
scores with a prep class but never had that opportunity). 

Highly correlated explanatory variables (multicollinearity). If a regression 
equation includes two or more explanatory variables that are highly 
correlated with each other, the analysis will not necessarily be able to 
discern the true relationship between each of those variables and the 
outcome that we are trying to explain. An example will make this clearer. 
Assume we are trying to gauge the effect of illegal drug use on SAT scores. 
Specifically, we have data on whether the participants in our study have 
ever used cocaine and also on whether they have ever used heroin. (We 
would presumably have many other control variables as well.) What is the 
impact of cocaine use on SAT scores, holding other factors constant, 
including heroin use? And what is the impact of heroin use on SAT scores, 
controlling for cocaine use and other factors? 

The coefficients on heroin and cocaine use might not actually be able to 
tell us that. The methodological challenge is that people who have used 
heroin have likely also used cocaine. If we put both variables in the 
equation, we will have very few individuals who have used one drug but 
not the other, which leaves us very little variation in the data with which to 
calculate their independent effects. Think back for a moment to the mental 
imagery used to explain regression analysis in the last chapter. We divide 
our data sample into different “rooms” in which each observation is 
identical except for one variable, which then allows us to isolate the effect 
of that variable while controlling for other potential confounding factors. 
We may have 692 individuals in our sample who have used both cocaine 
and heroin. However, we may have only 3 individuals who have used 
cocaine but not heroin and 2 individuals who have used heroin and not 
cocaine. Any inference about the independent effect of just one drug or the 
other is going to be based on these tiny samples. 
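To put a number on how tangled the two variables are, the sketch below computes the correlation between the cocaine and heroin indicators (the phi coefficient for two yes/no variables). The 692, 3, and 2 come from the example above; the count of people who used neither drug is not given in the text, so the 303 used here is purely an assumption that brings the sample to 1,000.

```python
from math import sqrt

both = 692          # used cocaine and heroin (from the example)
cocaine_only = 3    # used cocaine but not heroin (from the example)
heroin_only = 2     # used heroin but not cocaine (from the example)
neither = 303       # ASSUMPTION: not stated in the text

cocaine_yes = both + cocaine_only
cocaine_no = heroin_only + neither
heroin_yes = both + heroin_only
heroin_no = cocaine_only + neither

# Phi coefficient: the Pearson correlation between two 0/1 variables.
phi = (both * neither - cocaine_only * heroin_only) / sqrt(
    cocaine_yes * cocaine_no * heroin_yes * heroin_no
)

# Observations that can separate one drug's effect from the other's.
informative = cocaine_only + heroin_only
total = both + cocaine_only + heroin_only + neither

print(f"correlation between the two drug variables: {phi:.3f}")
print(f"people who used exactly one drug: {informative} of {total}")
```

Under that assumption the two explanatory variables are almost perfectly correlated, and only 5 people out of 1,000 carry any information about the separate effect of each drug.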

We are unlikely to get meaningful coefficients on either the cocaine or 
the heroin variable; we may also obscure the larger and more important 
relationship between SAT scores and using either one of these drugs. When 
two explanatory variables are highly correlated, researchers will usually use 
one or the other in the regression equation, or they may create some kind of 
composite variable, such as “used cocaine or heroin.” For example, when
researchers want to control for a student’s overall socioeconomic
background, they may include variables for both “mother’s education” and 
“father’s education,” since this inclusion provides important insight into the 
educational background of the household. However, if the goal of the 
regression analysis is to isolate the effect of either a mother’s or a father’s 
education, then putting both variables into the equation is more likely to 
confuse the issue than to clarify it. The correlation between a husband’s and 
a wife’s educational attainments is so high that we cannot depend on 
regression analysis to give us coefficients that meaningfully isolate the 
effect of either parent’s education (just as it is hard to separate the impact of 
cocaine use from the impact of heroin use). 

Extrapolating beyond the data. Regression analysis, like all forms of 
statistical inference, is designed to offer us insights into the world around 
us. We seek patterns that will hold true for the larger population. However, 
our results are valid only for a population that is similar to the sample on 
which the analysis has been done. In the last chapter, I created a regression 
equation to predict weight based on a number of independent variables. The 
R² of my final model was .29, which means that it did a decent job of
explaining the variation in weight for a large sample of individuals—all of 
whom happened to be adults. 

So what happens if we use our regression equation to predict the likely 
weight of a newborn? Let’s try it. My daughter was 21 inches when she was 
born. We’ll say that her age at birth was zero; she had no education and did 
not exercise. She was white and female. The regression equation based on 
the Changing Lives data predicts that her weight at birth should have been 
negative 19.6 pounds. (She weighed 8½ pounds.)

The authors of one of the Whitehall studies referred to in the last chapter 
were strikingly explicit in drawing their narrow conclusion: “Low control in 
the work environment is associated with an increased risk of future 
coronary heart disease among men and women employed in government 
offices” 5 (italics added).

Data mining (too many variables). If omitting important variables is a 
potential problem, then presumably adding as many explanatory variables 
as possible to a regression equation must be the solution. Nope. 


Your results can be compromised if you include too many variables, 
particularly extraneous explanatory variables with no theoretical 
justification. For example, one should not design a research strategy built 
around the following premise: Since we don’t know what causes autism, we 
should put as many potential explanatory variables as possible in the 
regression equation just to see what might turn up as statistically 
significant; then maybe we’ll get some answers. If you put enough junk 
variables in a regression equation, one of them is bound to meet the 
threshold for statistical significance just by chance. The further danger is 
that junk variables are not always easily recognized as such. Clever 
researchers can always build a theory after the fact for why some curious 
variable that is really just nonsense turns up as statistically significant. 

To make this point, I often do the same coin flipping exercise that I 
explained during the probability discussion. In a class of forty students or 
so, I’ll have each student flip a coin. Any student who flips tails is 
eliminated; the rest flip again. In the second round, those who flip tails are 
once again eliminated. I continue the rounds of flipping until one student 
has flipped five or six heads in a row. You may recall some of the silly 
follow-up questions: “What’s your secret? Is it in the wrist? Can you teach 
us to flip heads all the time? Maybe it’s that Harvard sweatshirt you’re 
wearing.” 

Obviously the string of heads is just luck; the students have all watched it 
happen. However, that is not necessarily how the result could or would be 
interpreted in a scientific context. The probability of flipping five heads in a 
row is 1/32, or .03. This is comfortably below the .05 threshold we typically 
use to reject a null hypothesis. Our null hypothesis in this case is that the 
student has no special talent for flipping heads; the lucky string of heads 
(which is bound to happen for at least one student when I start with a large 
group) allows us to reject the null hypothesis and adopt the alternative 
hypothesis: This student has a special ability to flip heads. After he has 
achieved this impressive feat, we can study him for clues about his flipping 
success—his flipping form, his athletic training, his extraordinary 
concentration while the coin is in the air, and so on. And it is all nonsense. 
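The arithmetic behind the classroom exercise can be checked in a few lines. This sketch assumes fair coins, independent flips, and a class of exactly forty students; the variable names are illustrative.

```python
# Chance that one particular student flips five heads in a row.
p_five_heads = 0.5 ** 5

# Chance that at least one of forty students does so.
students = 40
p_at_least_one = 1 - (1 - p_five_heads) ** students

print(f"one given student: {p_five_heads:.3f}")
print(f"at least one of {students}: {p_at_least_one:.2f}")
```

Any single student has about a 3 percent chance of the streak, but the chance that somebody in the room manages it is above 70 percent. The “significant” result is close to guaranteed before the first coin is flipped.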

This phenomenon can plague even legitimate research. The accepted 
convention is to reject a null hypothesis when we observe something that 
would happen by chance only 1 in 20 times or less if the null hypothesis
were true. Of course, if we conduct 20 studies, or if we include 20 junk
variables in a single regression equation, then on average we will get 1 
bogus statistically significant finding. The New York Times Magazine 
captured this tension wonderfully in a quotation from Richard Peto, a 
medical statistician and epidemiologist: “Epidemiology is so beautiful and 
provides such an important perspective on human life and death, but an 
incredible amount of rubbish is published.” 6 
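The “1 bogus finding in 20” arithmetic is worth making explicit. The sketch below assumes that all 20 null hypotheses are true and that the 20 tests are independent of one another, which real junk variables in a single equation may not be.

```python
alpha = 0.05   # conventional significance threshold
tests = 20     # number of studies or junk variables

# Average number of false positives across the 20 tests.
expected_false_positives = alpha * tests

# Chance of at least one false positive among the 20 tests.
p_any_false_positive = 1 - (1 - alpha) ** tests

print(f"expected bogus findings: {expected_false_positives:.1f}")
print(f"chance of at least one: {p_any_false_positive:.2f}")
```

On average there is one bogus finding, and the chance of getting at least one is roughly two in three.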

Even the results of clinical trials, which are usually randomized 
experiments and therefore the gold standard of medical research, should be 
viewed with some skepticism. In 2011, the Wall Street Journal ran a front-page
story on what it described as one of the “dirty little secrets” of medical
research: “Most results, including those that appear in top-flight
peer-reviewed journals, can’t be reproduced.” 7 (A peer-reviewed journal is a
publication in which studies and articles are reviewed for methodological 
soundness by other experts in the same field before being approved for 
publication; such publications are considered to be the gatekeepers for 
academic research.) One reason for this “dirty little secret” is the positive 
publication bias described in Chapter 7. If researchers and medical journals 
pay attention to positive findings and ignore negative findings, then they 
may well publish the one study that finds a drug effective and ignore the 
nineteen in which it has no effect. Some clinical trials may also have small 
samples (such as for a rare disease), which magnifies the chances that
random variation in the data will get more attention than it deserves. On top 
of that, researchers may have some conscious or unconscious bias, either 
because of a strongly held prior belief or because a positive finding would 
be better for their career. (No one ever gets rich or famous by proving what 
doesn’t cure cancer.) 

For all of these reasons, a shocking amount of expert research turns out 
to be wrong. John Ioannidis, a Greek doctor and epidemiologist, examined 
forty-nine studies published in three prominent medical journals. 8 Each 
study had been cited in the medical literature at least a thousand times. Yet 
roughly one-third of the research was subsequently refuted by later work. 
(For example, some of the studies he examined promoted estrogen 
replacement therapy.) Dr. Ioannidis estimates that roughly half of the 
scientific papers published will eventually turn out to be wrong. 9 His
research was published in the Journal of the American Medical Association,
one of the journals in which the articles he studied had appeared. This does 
create a certain mind-bending irony: If Dr. Ioannidis’s research is correct, 
then there is a good chance that his research is wrong. 

Regression analysis is still an awesome statistical tool. (Okay, perhaps my 
description of it as a “miracle elixir” in the last chapter was a little 
hyperbolic.) Regression analysis enables us to find key patterns in large 
data sets, and those patterns are often the key to important research in 
medicine and the social sciences. Statistics gives us objective standards for 
evaluating these patterns. When used properly, regression analysis is an 
important part of the scientific method. Consider this chapter to be the 
mandatory warning label. 

All of the assorted specific warnings on that label can be boiled down to 
two key lessons. First, designing a good regression equation—figuring out 
what variables should be examined and where the data should come from— 
is more important than the underlying statistical calculations. This process 
is referred to as estimating the equation, or specifying a good regression 
equation. The best researchers are the ones who can think logically about 
what variables ought to be included in a regression equation, what might be 
missing, and how the eventual results can and should be interpreted. 

Second, like most other statistical inference, regression analysis builds 
only a circumstantial case. An association between two variables is like a 
fingerprint at the scene of the crime. It points us in the right direction, but 
it’s rarely enough to convict. (And sometimes a fingerprint at the scene of a 
crime doesn’t belong to the perpetrator.) Any regression analysis needs a 
theoretical underpinning: Why are the explanatory variables in the 
equation? What phenomena from other disciplines can explain the observed 
results? For instance, why do we think that wearing purple shoes would 
boost performance on the math portion of the SAT or that eating popcorn 
can help prevent prostate cancer? The results need to be replicated, or at 
least consistent with other findings. 

Even a miracle elixir won’t work when not taken as directed. 


* There are more sophisticated methods that can be used to adapt regression analysis for use with
nonlinear data. Before using those tools, however, you need to appreciate why using the standard
ordinary least squares approach with nonlinear data will give you a meaningless result.



CHAPTER 13 


Program Evaluation 
Will going to Harvard change your life? 


Brilliant researchers in the social sciences are not brilliant because they can 
do complex calculations in their heads, or because they win more money on 
Jeopardy than less brilliant researchers do (though both these feats may be 
true). Brilliant researchers—those who appreciably change our knowledge 
of the world—are often individuals or teams who find creative ways to do 
“controlled” experiments. To measure the effect of any treatment or 
intervention, we need something to measure it against. How would going to 
Harvard affect your life? Well, to answer that question, we have to know 
what happens to you after you go to Harvard—and what happens to you 
after you don’t go to Harvard. Obviously we can’t have data on both. Yet 
clever researchers find ways to compare some treatment (e.g., going to 
Harvard) with the counterfactual, which is what would have happened in 
the absence of that treatment. 

To illustrate this point, let’s ponder a seemingly simple question: Does 
putting more police officers on the street deter crime? This is a socially 
significant question, as crime imposes huge costs on society. If a greater 
police presence lowers crime, either through deterrence or by catching and 
imprisoning bad guys, then investments in additional police officers could 
have large returns. On the other hand, police officers are relatively 
expensive; if they have little or no impact on crime reduction, then society 
could make better use of its resources elsewhere (perhaps with investments 
in crime-fighting technology, such as surveillance cameras). 

The challenge is that our seemingly simple question—what is the causal 
effect of more police officers on crime?—turns out to be very difficult to 
answer. By this point in the book, you should recognize that we cannot 
answer this question simply by examining whether jurisdictions with a high 
number of police officers per capita have lower rates of crime. Zurich is not
Los Angeles. Even a comparison of large American cities will be deeply
flawed; Los Angeles, New York, Houston, Miami, Detroit, and Chicago are 
all different places with different demographics and crime challenges. 

Our usual approach would be to attempt to specify a regression equation 
that controls for these differences. Alas, even multiple regression analysis is 
not going to save us here. If we attempt to explain crime rates (our 
dependent variable) by using police officers per capita as an explanatory 
variable (along with other controls), we will have a serious reverse causality 
problem. We have a solid theoretical reason to believe that putting more 
police officers on the street will reduce crime, but it’s also possible that 
crime could “cause” police officers, in the sense that cities experiencing 
crime waves will hire more police officers. We could easily find a positive 
but misleading association between crime and police: the places with the 
most police officers have the worst crime problems. Of course, the places 
with lots of doctors also tend to have the highest concentration of sick 
people. These doctors aren’t making people sick; they are located in places 
where they are needed most (and at the same time sick people are moving 
to places where they can get appropriate medical care). I suspect that there 
are disproportionate numbers of oncologists and cardiologists in Florida; 
banishing them from the state will not make the retiree population healthier. 

Welcome to program evaluation, which is the process by which we seek 
to measure the causal effect of some intervention—anything from a new 
cancer drug to a job placement program for high school dropouts. Or 
putting more police officers on the street. The intervention that we care 
about is typically called the “treatment,” though that word is used more 
expansively in a statistical context than in normal parlance. A treatment can 
be a literal treatment, as in some kind of medical intervention, or it can be 
something like attending college or receiving job training upon release from 
prison. The point is that we are seeking to isolate the effect of that single 
factor; ideally we would like to know how the group receiving that 
treatment fares compared with some other group whose members are 
identical in all other respects but for the treatment. 

Program evaluation offers a set of tools for isolating the treatment effect 
when cause and effect are otherwise elusive. Here is how Jonathan Klick 
and Alexander Tabarrok, researchers at the University of Pennsylvania and 
George Mason University, respectively, studied how putting more police
officers on the street affects the crime rate. Their research strategy made use
of the terrorism alert system. Specifically, Washington, D.C., responds to 
“high alert” days for terrorism by putting more officers in certain areas of 
the city, since the capital is a natural terrorism target. We can assume that 
there is no relationship between street crime and the terrorism threat, so this 
boost in the D.C. police presence is unrelated to the conventional crime 
rate, or “exogenous.” The researchers’ most valuable insight was 
recognizing the natural experiment here: What happens to ordinary crime 
on terrorism “high alert” days? 

The answer: The number of crimes committed when the terrorism threat 
was Orange (high alert and more police) was roughly 7 percent lower than 
when the terrorism threat level was Yellow (elevated alert but no extra law 
enforcement precautions). The authors also found that the decrease in crime 
was sharpest in the police district that gets the most police attention on 
high-alert days (because it includes the White House, the Capitol, and the 
National Mall). The important takeaway is that we can answer tricky but 
socially meaningful questions—we just have to be clever about it. Here are 
some of the most common approaches for isolating a treatment effect. 

Randomized, controlled experiments. The most straightforward way to 
create a treatment and control group is to—wait for it—create a treatment 
and control group. There are two big challenges to this approach. First, 
there are many kinds of experiments that we cannot perform on people. 
This constraint (I hope) is not going away anytime soon. As a result, we can 
do controlled experiments on human subjects only when there is reason to 
believe that the treatment effect has a potentially positive outcome. This is 
often not the case (e.g., “treatments” like experimenting with drugs or 
dropping out of high school), which is why we need the strategies 
introduced in the balance of the chapter. 

Second, there is a lot more variation among people than among 
laboratory rats. The treatment effect that we are testing could easily be 
confounded by other variations in the treatment and control groups; you are 
bound to have tall people, short people, sick people, healthy people, males, 
females, criminals, alcoholics, investment bankers, and so on. How can we 
ensure that differences across these other characteristics don’t mess up the 
results? I have good news: This is one of those rare instances in life in 
which the best approach involves the least work! The optimal way to create 
any treatment and control group is to distribute the study participants 
randomly across the two groups. The beauty of randomization is that it will 
generally distribute the non-treatment-related variables more or less evenly 
between the two groups—both the characteristics that are obvious, such as 
sex, race, age, and education, and the nonobservable characteristics that 
might otherwise mess up the results. 

Think about it: If we have 1,000 females in our prospective sample, then 
when we split the sample randomly into two groups, the most likely 
outcome is that 500 females will end up in each. Obviously we can’t expect 
that split exactly, but once again probability is our friend. The probability is 
low that one group will get a disproportionate number of women (or a 
disproportionate number of individuals with any other characteristic). For 
example, if we randomly split a sample that includes 1,000 women, 
there is less than a 1 percent chance of getting fewer than 450 women in 
one group or the other. Obviously the bigger the samples, the more effective 
randomization will be in creating two broadly similar groups. 
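
For readers who like to check the arithmetic, here is a minimal sketch of that calculation. It assumes each of the 1,000 women is assigned to one of the two groups by an independent, fair coin flip; that assignment rule is my assumption for the illustration, and the function name is invented for this example.

```python
from math import comb


def prob_fewer_than(threshold: int, n: int) -> float:
    """Exact P(X < threshold) when X ~ Binomial(n, 1/2)."""
    # Count the coin-flip outcomes that put fewer than `threshold`
    # of the n women into one particular group.
    favorable = sum(comb(n, k) for k in range(threshold))
    return favorable / 2 ** n


n_women = 1000  # the 1,000 women from the example in the text

# Chance that one specific group ends up with fewer than 450 women.
p_one_group = prob_fewer_than(450, n_women)

# "One group or the other": by symmetry the second group is just as
# likely to fall short, and both cannot fall short at the same time.
p_either_group = 2 * p_one_group

print(f"One specific group below 450: {p_one_group:.5f}")
print(f"Either group below 450:       {p_either_group:.5f}")
```

Under these assumptions the probability comes out well below 1 percent, which is consistent with the claim above.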

Medical trials typically aspire to do randomized, controlled experiments. 
Ideally these clinical trials are double-blind, meaning that neither the patient 
nor the physician knows who is receiving the treatment and who is getting a 
placebo. This is obviously impossible with treatments such as surgical 
procedures (the heart surgeon will, one hopes, know which patients are 
getting bypass surgery). Even with surgical procedures, however, it may 
still be possible to keep patients from learning whether they are in the 
treatment or the control group. One of my favorite studies involved an 
evaluation of a certain kind of knee surgery to alleviate pain. The treatment 
group was given the surgery. The control group was given a “sham” surgery 
in which the surgeon made three small incisions in the knee and “pretended 
to operate.”* It turned out that the real surgery was no more effective than 
the sham surgery in relieving knee pain. 1 

Randomized trials can be used to test some interesting phenomena. For 
example, do prayers offered by strangers improve postsurgical outcomes? 
Reasonable people have widely varying views on religion, but a controlled 
study published in the American Heart Journal examined whether patients 
recovering from heart bypass surgery would 
have fewer postoperative complications if a large group of strangers prayed 
for their safe and speedy recovery. 2 The study involved 1,800 patients and 
members of three religious congregations from across the country. The 
patients, all of whom received coronary bypass surgery, were divided into 
three groups: one group was not prayed for; one group was prayed for and 
was told so; the third group was prayed for, but the participants in that 
group were told that they might or might not receive prayers (thereby 
controlling for a prayer placebo effect). Meanwhile, the members of the 
religious congregations were told to offer prayers for specific patients by 
first name and the first initial of their last name (e.g., Charlie W.). The 
congregants were given latitude in how they prayed, so long as the prayer 
included the phrase “for a successful surgery with a quick, healthy recovery 
and no complications.” 

And? Will prayer be the cost-effective solution to America’s health care 
challenges? Probably not. The researchers did not find any difference in the 
rate of complications within thirty days of surgery for those who were 
offered prayers compared with those who were not. Critics of the study 
pointed out a potential omitted variable: prayers coming from other sources. 
As the New York Times summarized, “Experts said the study could not 
overcome perhaps the largest obstacle to prayer study: the unknown amount 
of prayer each person received from friends, families, and congregations 
around the world who pray daily for the sick and dying.” 

Experimenting on humans can get you arrested, or perhaps hauled off to 
appear before some international criminal tribunal. You should be aware of 
this. However, there is still room in the social sciences for randomized, 
controlled experiments involving “human subjects.” One famous and 
influential experiment is the Tennessee Project STAR experiment, which 
tested the effect of smaller class sizes on student learning. The relationship 
between class size and learning is hugely important. Nations around the 
world are struggling to improve educational outcomes. If smaller classes 
promote more effective learning, ceteris paribus, then society ought to 
invest in hiring more teachers to bring class sizes down. At the same time, 
hiring teachers is expensive; if students in smaller classes are doing better 
for reasons unrelated to the size of the class, then we could end up wasting 
an enormous amount of money. 


The relationship between class size and student achievement is 
surprisingly hard to study. Schools with small classes generally have greater 
resources, meaning that both the students and the teachers are likely to be 
different from students and teachers in schools with larger classes. And 
within schools, smaller classes tend to be smaller for a reason. A principal 
may assign difficult students to a small class, in which case we might find a 
spurious negative association between smaller classes and student 
achievement. Or veteran teachers may choose to teach small classes, in 
which case the benefit of small classes may come from the teachers who 
choose to teach them rather than from the lower pupil-teacher ratio. 

Beginning in 1985, Tennessee’s Project STAR did a controlled 
experiment to test the effects of smaller classes. 3 (Lamar Alexander was 
governor of Tennessee at the time; he later went on to become secretary of 
education under President George H. W. Bush.) In kindergarten, students in 
seventy-nine different schools were randomly assigned to either a small 
class (13-17 students), a regular class (22-25 students), or a regular class 
with both a regular teacher and a teacher’s aide. Teachers were also 
randomly assigned to the different classrooms. Students stayed in the class 
type to which they were randomly assigned through third grade. Assorted 
life realities chipped away at the randomization. Some students entered the 
system in the middle of the experiment; others left. Some students were 
moved from class to class for disciplinary reasons; some parents lobbied 
successfully to have students moved to smaller classes. And so on. 

Still, Project STAR remains the only randomized test of the effects of 
smaller classes. The results turned out to be statistically and socially 
significant. Overall, students in the small classes performed 0.15 standard 
deviations better on standardized tests than students in the regular-size 
classes; black students in small classes had gains that were twice as large. 
Now the bad news. The Project STAR experiment cost roughly $12 million. 
The study on the effect of prayer on postsurgical complications cost $2.4 
million. The finest studies are like the finest of anything else: They cost big 
bucks. 

Natural experiment. Not everybody has millions of dollars lying around to 
create a large, randomized trial. A more economical alternative is to exploit 
a natural experiment, which happens when random circumstances somehow 
create something approximating a randomized, controlled experiment. This 
was the case with our Washington, D.C., police example at the beginning of 
the chapter. Life sometimes creates a treatment and control group by 
accident; when that occurs, researchers are eager to leap on the results. 
Consider the striking but complicated link between education and longevity. 
People who get more education tend to live longer, even after controlling 
for things like income and access to health care. As the New York Times has 
noted, “The one social factor that researchers agree is consistently linked to 
longer lives in every country where it has been studied is education. It is 
more important than race; it obliterates any effects of income.” 4 But so far, 
that’s just a correlation. Does more education, ceteris paribus, cause better 
health? If you think of the education itself as the “treatment,” will getting 
more education make you live longer? 

This would appear to be a nearly impossible question to study, since 
people who choose to get more education are different from people who 
don’t. The difference between high school graduates and college graduates 
is not just four years of schooling. There could easily be some unobservable 
characteristics shared by people who pursue education that also explain 
their longer life expectancy. If that is the case, offering more education to 
those who would have chosen less education won’t actually improve their 
health. The improved health would not be a function of the incremental 
education; it would be a function of the kind of people who pursue that 
incremental education. 

We cannot conduct a randomized experiment to solve this conundrum, 
because that would involve making some participants leave school earlier 
than they would like. (You try explaining to someone that he can’t go to 
college—ever—because he is in the control group.) The only possible test 
of the causal effect of education on longevity would be some kind of 
experiment that forced a large segment of the population to stay in school 
longer than its members might otherwise choose. That’s at least morally 
acceptable since we expect a positive treatment effect. Still, we can’t force 
kids to stay in school; that’s not the American way. 

Oh, but it is. Every state has some kind of minimum schooling law, and 
at different points in history those laws have changed. That kind of 
exogenous change in schooling attainment (meaning that it’s not caused by 
the individuals being studied) is exactly the kind of thing that makes 
researchers swoon with excitement. Adriana Lleras-Muney, a graduate 
student at Columbia, saw the research potential in the fact that different 
states have changed their minimum schooling laws at different points in 
time. She went back in history and studied the relationship between when 
states changed their minimum schooling laws and later changes in life 
expectancy in those states (by trolling through lots and lots of census data). 
She still had a methodological challenge; if the residents of a state live 
longer after the state raises its minimum schooling law, we cannot attribute 
the longevity to the extra schooling. Life expectancy is generally going up 
over time. People lived longer in 1900 than in 1850, no matter what the 
states did. 

However, Lleras-Muney had a natural control: states that did not change 
their minimum schooling laws. Her work approximates a giant laboratory 
experiment in which the residents of Illinois are forced to stay in school for 
seven years while their neighbors in Indiana can leave school after six 
years. The difference is that this controlled experiment was made possible 
by a historical accident—hence the term “natural experiment.” 

What happened? Life expectancy of those adults who reached age thirty-five 
was extended by an extra one and a half years just by their attending 
one additional year of school. 5 Lleras-Muney’s results have been replicated 
in other countries where variations in mandatory schooling laws have 
created similar natural experiments. Some skepticism is in order. We still do 
not understand the mechanism by which additional schooling leads to 
longer lives. 

Nonequivalent control. Sometimes the best available option for studying a 
treatment effect is to create nonrandomized treatment and control groups. 
Our hope/expectation is that the two groups are broadly similar even though 
circumstances have not allowed us the statistical luxury of randomizing. 
The good news is that we have a treatment and a control group. The bad 
news is that any nonrandom assignment creates at least the potential for 
bias. There may be unobserved differences between the treatment and 
control groups related to how participants are assigned to one group or the 
other. Hence the name “nonequivalent control.” 


A nonequivalent control group can still be a very helpful tool. Let’s 
ponder the question posed in the title of this chapter: Is there a significant 
life advantage to attending a highly selective college or university? 
Obviously the Harvard, Princeton, and Dartmouth graduates of the world do 
very well. On average, they earn more money and have more expansive life 
opportunities than students who attend less selective institutions. (A 2008 
study by PayScale.com found that the median pay for Dartmouth graduates 
with ten to twenty years of work experience was $134,000, the highest of 
any undergraduate institution; Princeton was second with a median of 
$131,000.) 6 As I hope you realize by this point, these impressive numbers 
tell us absolutely nothing about the value of a Dartmouth or Princeton 
education. Students who attend Dartmouth and Princeton are talented when 
they apply; that’s why they get accepted. They would probably do well in 
life no matter where they went to college. 

What we don’t know is the treatment effect of attending a place like 
Harvard or Yale. Do the graduates of these elite institutions do well in life 
because they were hyper-talented when they walked onto the campus? Or 
do these colleges and universities add value by taking talented individuals 
and making them even more productive? Or both? 

We cannot conduct a randomized experiment to answer this question. 
Few high school students would agree to be randomly assigned to a college; 
nor would Harvard and Dartmouth be particularly keen about taking the 
students randomly assigned to them. We appear to be left without any 
mechanism for testing the value of the treatment effect. Cleverness to the 
rescue! Economists Stacy Dale and Alan Krueger found a way to answer 
this question by exploiting the fact that many students apply to multiple 
colleges. Some of those students are accepted at a highly selective school 
and choose to attend that school; others are accepted at a highly selective 
school but choose to attend a less selective college or university instead. 
Bingo! Now we have a treatment group (those students who attended highly 
selective colleges and universities) and a nonequivalent control group 
(those students who were talented enough to be accepted by such a school 
but opted to attend a less selective institution instead).† 7 

Dale and Krueger studied longitudinal data on the earnings of both 
groups. This is not a perfect apples-and-apples comparison, and earnings 
are clearly not the only life outcome that matters, but their findings should 
assuage the anxieties of overwrought high school students and their parents. 
Students who attended more selective colleges earned roughly the same as 
students of seemingly similar ability who attended less selective schools. 
The one exception was students from low-income families, who earned 
more if they attended a selective college or university. The Dale and 
Krueger approach is an elegant way to sort out the treatment effect 
(spending four years at an elite institution) from the selection effect (the 
most talented students are admitted to those institutions). In a summary of 
the research for the New York Times, Alan Krueger indirectly answered the 
question posed in the title of this chapter, “Recognize that your own 
motivation, ambition, and talents will determine your success more than the 
college name on your diploma.” 8 
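
To see how a selection effect can masquerade as a treatment effect, consider a toy calculation. Every number below is invented for illustration (earnings are in thousands of dollars) and is not drawn from the Dale and Krueger data. The sketch assumes, purely for the sake of argument, that earnings depend on the student and not on the school.

```python
from statistics import mean

# Each row: (admitted to a selective school, attended one, earnings in $1,000s).
# All values are hypothetical.
graduates = [
    (True, True, 130), (True, True, 120), (True, True, 125),
    (True, False, 128), (True, False, 122), (True, False, 125),
    (False, False, 80), (False, False, 90), (False, False, 85),
]


def mean_earnings(rows):
    """Average earnings for a list of graduate rows."""
    return mean([earnings for _, _, earnings in rows])


attended = [g for g in graduates if g[1]]
everyone_else = [g for g in graduates if not g[1]]
admitted_but_declined = [g for g in graduates if g[0] and not g[1]]

# Naive comparison: attendees versus all other graduates.
naive_gap = mean_earnings(attended) - mean_earnings(everyone_else)

# Nonequivalent control: attendees versus students who were also
# admitted but chose a less selective school.
matched_gap = mean_earnings(attended) - mean_earnings(admitted_but_declined)

print(f"Naive earnings gap:   {naive_gap}")
print(f"Matched earnings gap: {matched_gap}")
```

In this made-up data the naive gap is large while the matched gap is zero. That is the pattern you would expect to see if the apparent advantage came from who gets admitted rather than from where they enroll.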

Difference in differences. One of the best ways to observe cause and effect 
is to do something and then see what happens. This is, after all, how infants 
and toddlers (and sometimes adults) learn about the world. My children 
were very quick to learn that if they hurled pieces of food across the kitchen 
(cause), the dog would race eagerly after them (effect). Presumably the 
same power of observation can help inform the rest of life. If we cut taxes 
and the economy improves, then the tax cuts must have been responsible. 

Maybe. The enormous potential pitfall with this approach is that life 
tends to be more complex than throwing chicken nuggets across the 
kitchen. Yes, we may have cut taxes at a specific point in time, but there 
were other “interventions” unfolding during roughly the same stretch: More 
women were going to college, the Internet and other technological 
innovations were raising the productivity of American workers, the Chinese 
currency was undervalued, the Chicago Cubs fired their general manager, 
and so on. Whatever happened after the tax cut cannot be attributed solely 
to the tax cut. The challenge with any “before and after” kind of analysis is 
that just because one thing follows another does not mean that there is a 
causal relationship between the two. 

A “difference in differences” approach can help us identify the effects of 
some intervention by doing two things. First, we examine the “before” and 
“after” data for whatever group or jurisdiction has received the treatment, 
such as the unemployment figures for a county that has implemented a job 
training program. Second, we compare those data with the unemployment 
figures over the same time period for a similar county that did not 
implement any such program. 

The important assumption is that the two groups used for the analysis are 
largely comparable except for the treatment; as a result, any significant 
difference in outcomes between the two groups can be attributed to the 
program or policy being evaluated. For example, suppose that one county in 
Illinois implements a job training program to combat high unemployment. 
Over the ensuing two years, the unemployment rate continues to rise. Does 
that make the program a failure? Who knows? 

Figure: Effect of Job Training on Unemployment in County A (horizontal axis: Time) 


Other broad economic forces may be at work, including the possibility of 
a prolonged economic slump. A difference-in-differences approach would 
compare the change in the unemployment rate over time in the county we 
are evaluating with the unemployment rate for a neighboring county with 
no job training program; the two counties must be similar in all other 
important ways: industry mix, demographics, and so on. How does the 
unemployment rate in the county with the new job training program change 
over time relative to the county that did not implement such a program? We 
can reasonably infer the treatment effect of the program by comparing the 
changes in the two counties over the period of study—the “difference in 
differences.” The other county in this study is effectively acting as a control 
group, which allows us to take advantage of the data collected before and 
after the intervention. If the control group is good, it will be exposed to the 
same broader forces as our treatment group. The difference-in-differences 
approach can be particularly enlightening when the treatment initially 
appears ineffective (unemployment is higher after the program is 
implemented than before), yet the control group shows us that the trend 
would have been even worse in the absence of the intervention. 
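
The arithmetic behind a difference in differences is simple enough to sketch in a few lines. The unemployment rates below are hypothetical numbers chosen to illustrate the case just described; they are not taken from any real county, and the function name is invented for this example.

```python
def difference_in_differences(treated_before, treated_after,
                              control_before, control_after):
    """Change in the treated group minus change in the control group."""
    treated_change = treated_after - treated_before
    control_change = control_after - control_before
    return treated_change - control_change


# Hypothetical unemployment rates, in percent.
county_a_before, county_a_after = 9.0, 10.0   # has the job training program
county_b_before, county_b_after = 9.0, 12.0   # similar county, no program

effect = difference_in_differences(
    county_a_before, county_a_after,
    county_b_before, county_b_after,
)

# Unemployment rose 1 point in County A but 3 points in County B,
# so the estimated effect of the program is -2 percentage points.
print(f"Estimated treatment effect: {effect} percentage points")
```

The estimate is only as good as the assumption that County B would have tracked County A in the absence of the program.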

Figure: Effect of Job Training on Unemployment in County A, with County B as a Comparison (horizontal axis: Time) 


Discontinuity analysis. One way to create a treatment and control group 
is to compare the outcomes for some group that barely qualified for an 
intervention or treatment with the outcomes for a group that just missed the 
cutoff for eligibility and did not receive the treatment. Those individuals 
who fall just above and just below some arbitrary cutoff, such as an exam 
score or a minimum household income, will be nearly identical in many 
important respects; the fact that one group received the treatment and the 
other didn’t is essentially arbitrary. As a result, we can compare their 
outcomes in ways that provide meaningful results about the effectiveness of 
the relevant intervention. 

Suppose a school district requires summer school for struggling students. 
The district would like to know whether the summer program has any long-term 
academic value. As usual, a simple comparison between the students 
who attend summer school and those who do not would be worse than 
useless. The students who attend summer school are there because they are 
struggling. Even if the summer school program is highly effective, the 
participating students will probably still do worse in the long run than the 
students who were not required to take summer school. What we want to 
know is how the struggling students perform after taking summer school 
compared with how they would have done if they had not taken summer 
school. Yes, we could do some kind of controlled experiment in which 
struggling students are randomly selected to attend summer school or not, 
but that would involve denying the control group access to a program that 
we think would be helpful. 

Instead, the treatment and control groups are created by comparing those 
students who just barely fell below the threshold for summer school with 
those who just barely escaped it. Think about it: the students who fail a 
midterm are appreciably different from students who do not fail the 
midterm. But students who get a 59 percent (a failing grade) are not 
appreciably different from those students who get a 60 percent (a passing 
grade). If those who fail the midterm are enrolled in some treatment, such 
as mandatory tutoring for the final exam, then we would have a reasonable 
treatment and control group if we compared the final exam scores of those 
who barely failed the midterm (and received tutoring) with the scores of 
those who barely passed the midterm (and did not get tutoring). 
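
Here is a rough sketch of the midterm example. The scores are invented, the two-point window that defines “barely” is my assumption, and a real analysis would use far more data and a more careful model near the cutoff.

```python
from statistics import mean

CUTOFF = 60     # passing midterm score, as in the example above
BANDWIDTH = 2   # assumption: "barely" means within two points of the cutoff

# Hypothetical (midterm score, final exam score) pairs.
# Students below the cutoff received tutoring.
students = [
    (45, 55), (52, 61), (58, 72), (59, 74), (59, 70),
    (60, 66), (60, 69), (61, 69), (70, 78), (85, 90),
]


def naive_estimate(records, cutoff):
    """Mean final score of everyone tutored minus everyone not tutored."""
    tutored = [final for mid, final in records if mid < cutoff]
    untutored = [final for mid, final in records if mid >= cutoff]
    return mean(tutored) - mean(untutored)


def discontinuity_estimate(records, cutoff, bandwidth):
    """Same comparison, restricted to students just either side of the cutoff."""
    barely_failed = [final for mid, final in records
                     if cutoff - bandwidth <= mid < cutoff]
    barely_passed = [final for mid, final in records
                     if cutoff <= mid < cutoff + bandwidth]
    return mean(barely_failed) - mean(barely_passed)


naive = naive_estimate(students, CUTOFF)
near_cutoff = discontinuity_estimate(students, CUTOFF, BANDWIDTH)

print(f"Naive comparison:       {naive:+.1f} points")
print(f"Near-cutoff comparison: {near_cutoff:+.1f} points")
```

With these made-up scores the naive comparison makes tutoring look harmful, because the tutored students were weaker to begin with. The comparison near the cutoff points the other way.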

This approach was used to determine the effectiveness of incarceration 
for juvenile offenders as a deterrent to future crime. Obviously this kind of 
analysis cannot simply compare the recidivism rates of juvenile offenders 
who are imprisoned with the recidivism rates for juvenile offenders who 
received lighter sentences. The juvenile offenders who are sent to prison 
typically commit more serious crimes than the juvenile offenders who 
receive lighter sentences; that’s why they go to prison. Nor can we create a 
treatment and control group by distributing prison sentences randomly 
(unless you want to risk twenty-five years in the big house the next time 
you make an illegal right turn on red). Randi Hjalmarsson, a researcher now 
at the University of London, exploited rigid sentencing guidelines for 
juvenile offenders in the state of Washington to gain insight into the causal 
effect of a prison sentence on future criminal behavior. Specifically, she 
compared the recidivism rate for those juvenile offenders who were “just 
barely” sentenced to prison with the recidivism rate for those juveniles who 
“just barely” got a pass (which usually involved a fine or probation). 9 

The Washington criminal justice system creates a grid for each convicted 
offender that is used to administer a sentence. The x-axis measures the 
offender’s prior adjudicated offenses. For example, each prior felony counts 
as one point; each prior misdemeanor counts as one-quarter point. The point 
total is rounded down to a whole number (which will matter in a moment). 
Meanwhile, the y-axis measures the severity of the current offense on a 
scale from E (least serious) to A+ (most serious). A convicted juvenile’s 
sentence is literally calculated by finding the appropriate box on the grid: 
An offender with two points’ worth of prior offenses who commits a Class 
B felony will receive fifteen to thirty-six months in a juvenile jail. A 
convicted offender with only one point worth of prior offenses who 
commits the same crime will not be sent to jail. That discontinuity is what 
motivated the research strategy. Hjalmarsson compared the outcomes for 
convicted offenders who fell just above and below the threshold for a jail 
sentence. As she explains in the paper, “If there are two individuals with a 
current offense class of C+ and [prior] adjudication scores of 2¾ and 3, 
then only the latter individual will be sentenced to state incarceration.” 
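
The rounding rule is what creates the cutoff, so it is worth seeing in action. The sketch below implements only the point rule described above (one point per prior felony, one-quarter point per prior misdemeanor, rounded down). The particular mixes of prior offenses are hypothetical, and the sentencing grid itself is not reproduced here.

```python
from fractions import Fraction
from math import floor

FELONY_POINTS = Fraction(1)          # each prior felony counts as one point
MISDEMEANOR_POINTS = Fraction(1, 4)  # each prior misdemeanor, one-quarter point


def prior_offense_points(felonies: int, misdemeanors: int):
    """Return (raw score, score rounded down to a whole number)."""
    raw = felonies * FELONY_POINTS + misdemeanors * MISDEMEANOR_POINTS
    return raw, floor(raw)


# Two hypothetical offenders who differ by a single prior misdemeanor.
raw_a, points_a = prior_offense_points(felonies=2, misdemeanors=3)
raw_b, points_b = prior_offense_points(felonies=2, misdemeanors=4)

print(f"Offender A: raw score {raw_a}, grid score {points_a}")
print(f"Offender B: raw score {raw_b}, grid score {points_b}")
```

Offender A has a raw score of two and three-quarters, which rounds down to two. Offender B has a raw score of exactly three. One extra misdemeanor moves B into a different column of the grid, even though the two offenders have nearly identical histories.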

For research purposes, those two individuals are essentially the same— 
until one of them goes to jail. And at that point, their behavior does appear 
to diverge sharply. The juvenile offenders who go to jail are significantly 
less likely to be convicted of another crime (after they are released from 
jail). 

We care about what works. This is true in medicine, in economics, in 
business, in criminal justice—in everything. Yet causality is a tough nut to 
crack, even in cases where cause and effect seems stunningly obvious. To 
understand the true impact of a treatment, we need to know the 
“counterfactual,” which is what would have happened in the absence of that 
treatment or intervention. Often the counterfactual is difficult or impossible 
to observe. Consider a nonstatistics example: Did the U.S. invasion of Iraq 
make America safer? 

There is only one intellectually honest answer: We will never know. The 
reason we will never know is that we do not know—and cannot know— 
what would have happened if the United States had not invaded Iraq. True, 
the United States did not find weapons of mass destruction. But it is 
possible that on the day after the United States did not invade Iraq, Saddam 
Hussein could have climbed into the shower and said to himself, “I could 
really use a hydrogen bomb. I wonder if the North Koreans will sell me 
one?” After that, who knows? 

Of course, it’s also possible that Saddam Hussein could have climbed 
into that same shower on the day after the United States did not invade Iraq 
and said to himself, “I could really use—” at which point he slipped on a 
bar of soap, hit his head on an ornate marble fixture, and died. In that case, 
the world would have been rid of Saddam Hussein without the enormous 
costs associated with the U.S. invasion. Who knows what would have 
happened? 

The purpose of any program evaluation is to provide some kind of 
counterfactual against which a treatment or intervention can be measured. 
In the case of a randomized, controlled experiment, the control group is the 
counterfactual. In cases where a controlled experiment is impractical or 
immoral, we need to find some other way of approximating the 
counterfactual. Our understanding of the world depends on finding clever 
ways to do that. 


* The participants did know that they were participating in a clinical trial and might receive the sham 
surgery. 

* Researchers love to use the word “exploit.” It has a specific meaning in terms of taking advantage 
of some data-related opportunity. For example, when researchers find some natural experiment that 
creates a treatment and control group, they will describe how they plan to “exploit the variation in the 
data.” 

† There is potential for bias here. Both groups of students are talented enough to get into a highly 
selective school. However, one group of students chose to go to such a school, and the other group 
did not. The group of students who chose to attend a less selective school may be less motivated, less 
hardworking, or different in some other ways that we cannot observe. If Dale and Krueger had found 
that students who attend a highly selective school had higher lifetime earnings than students who 
were accepted at such a school but went to a less selective college instead, we still could not be 
certain whether the difference was due to the selective school or to the kind of student who opted to 
attend such a school when given a choice. This potential bias turns out to be unimportant in the Dale 
and Krueger study, however, because of its direction. Dale and Krueger find that the students who 
attended highly selective schools did not earn significantly more in life than students who were 
accepted but went elsewhere despite the fact that the students who declined to attend a highly 
selective school may have had attributes that caused them to earn less in life apart from their 
education. If anything, the bias here causes the findings to overstate the pecuniary benefits of 
attending a highly selective college—which turn out to be insubstantial anyway. 


Conclusion 

Five questions that statistics can help answer 


Not that long ago, information was much harder to gather and far more 
expensive to analyze. Imagine studying the information from one million 
credit card transactions in the era—only a few decades back—when there 
were merely paper receipts and no personal computers for analyzing the 
accumulated data. During the Great Depression, there were no official 
statistics with which to gauge the depth of the economic problems. 
Government did not collect official information on either gross domestic 
product (GDP) or unemployment, meaning that politicians were attempting 
to do the economic equivalent of navigating through a forest without a 
compass. Herbert Hoover declared that the Great Depression was over in 
1930, on the basis of the inaccurate and outdated data that were available. 
He told the country in his State of the Union address that two and a half million 
Americans were out of work. In fact, five million Americans were jobless, 
and unemployment was climbing by one hundred thousand every week. As 
James Surowiecki recently observed in The New Yorker, “Washington was 
making policy in the dark.” 1 

We are now awash in data. For the most part, that is a good thing. The 
statistical tools introduced in this book can be used to address some of our 
most significant social challenges. In that vein, I thought it fitting to finish 
the book with questions, not answers. As we try to digest and analyze 
staggering quantities of information, here are five important (and 
admittedly random) questions whose socially significant answers will 
involve many of the tools introduced in this book. 


WHAT IS THE FUTURE OF FOOTBALL? 


In 2009, Malcolm Gladwell posed a question in a New Yorker article that 
first struck me as needlessly sensationalist and provocative: How different 
are dog fighting and football? 2 The connection between the two activities 
stemmed from the fact that quarterback Michael Vick, who had served time 
in prison for his involvement in a dog-fighting ring, had been reinstated in 
the National Football League just as information was beginning to emerge 
that football-related head trauma may be associated with depression, 
memory loss, dementia, and other neurological problems later in life. 
Gladwell’s central premise was that both professional football and dog 
fighting are inherently devastating to the participants. By the end of the 
article, I was convinced that he had raised an intriguing point. 

Here is what we know. There is mounting evidence that concussions and 
other brain injuries associated with playing football can cause serious and 
permanent neurological damage. (Similar phenomena have been observed 
in boxers and hockey players.) Many prominent former NFL players have 
shared publicly their post-football battles with depression, memory loss, 
and dementia. Perhaps the most poignant was Dave Duerson, a former 
safety and Super Bowl winner for the Chicago Bears, who committed 
suicide by shooting himself in the chest; he left explicit instructions for his 
family to have his brain studied after his death. 

In a phone survey of a thousand randomly selected former NFL players 
who had played at least three years in the league, 6.1 percent of the former 
players over fifty reported that they had received a diagnosis of “dementia, 
Alzheimer’s disease, or other memory-related disease.” That’s five times 
the national average for that age group. For younger players, the rate of 
diagnosis was nineteen times the national average. Hundreds of former NFL 
players have now sued both the league and the makers of football helmets 
for allegedly hiding information about the dangers of head trauma. 3 

One of the researchers studying the impacts of brain trauma is Ann 
McKee, who runs the neuropathology laboratory at the Veterans Hospital in 
Bedford, Massachusetts. (Coincidentally, McKee also does the 
neuropathology work for the Framingham Heart Study.) Dr. McKee has 
documented the buildup of abnormal proteins called tau in the brains of 
athletes who have suffered brain trauma, such as boxers and football 
players. This leads to a condition known as chronic traumatic 
encephalopathy, or CTE, which is a progressive neurological disorder that 
has many of the same manifestations as Alzheimer’s. 

Meanwhile, other researchers have been documenting the connection 
between football and brain trauma. Kevin Guskiewicz, who runs the Sports 
Concussion Research Program at the University of North Carolina, has 
installed sensors on the inside of the helmets of North Carolina football 
players to record the force and nature of blows to the head. According to his 
data, players routinely receive blows to the head with a force equivalent to 
hitting the windshield in a car crash at twenty-five miles per hour. 

Here is what we don’t know. Is the brain injury evidence uncovered so 
far representative of the long-term neurological risks that all professional 
football players face? Or might this just be a “cluster” of adverse outcomes 
that is a statistical aberration? Even if it turns out that football players do 
face significantly higher risks of neurological disorder later in life, we 
would still have to probe the causality. Might the kind of men who play 
football (and boxing and hockey) be prone to this kind of problem? Is it 
possible that some other factors, such as steroid use, are contributing to the 
neurological problems later in life? 

If the accumulating evidence does suggest a clear, causal link between 
playing football and long-term brain injury, one overriding question will 
have to be addressed by players (and the parents of younger players), 
coaches, lawyers, NFL officials, and perhaps even government regulators: 
Is there a way to play the game of football that reduces most or all of the 
head trauma risk? If not, then what? This is the point behind Malcolm 
Gladwell’s comparison of football and dog fighting. He explains that dog 
fighting is abhorrent to the public because the dog owner willingly submits 
his dog to a contest that culminates in suffering and destruction. “And 
why?” he asks. “For the entertainment of an audience and the chance of a 
payday. In the nineteenth century, dog fighting was widely accepted by the 
American public. But we no longer find that kind of transaction morally 
acceptable in a sport.” 

Nearly every kind of statistical analysis described in this book is 
currently being used to figure out whether or not professional football as we 
know it now has a future. 



WHAT (IF ANYTHING) IS CAUSING THE 
DRAMATIC RISE IN THE INCIDENCE OF AUTISM? 


In 2012, the Centers for Disease Control reported that 1 in 88 American 
children has been diagnosed with an autism spectrum disorder (on the basis 
of data from 2008). The rate of diagnosis had climbed from 1 in 110 in 
2006, and 1 in 150 in 2002—or nearly a doubling in less than a decade. 
Autism spectrum disorders (ASDs) are a group of developmental 
disabilities characterized by atypical development in socialization, 
communication, and behavior. The “spectrum” indicates that autism 
encompasses a broad range of behaviorally defined conditions. 5 Boys are 
five times as likely to be diagnosed with an ASD as girls (meaning that the 
incidence for boys is even higher than 1 in 88). 
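
The arithmetic behind these figures can be checked with a short sketch. It uses only the rates quoted above, plus one assumption that is not in the text: that boys make up half of all children. Under that assumption, the overall rate is the average of the boys' rate and the girls' rate.

```python
from fractions import Fraction

# Diagnosis rates quoted in the text.
rate_2002 = Fraction(1, 150)
rate_2008 = Fraction(1, 88)

# How much did the rate of diagnosis grow between the 2002 and 2008 data?
growth = rate_2008 / rate_2002
print(f"2008 rate is {float(growth):.2f} times the 2002 rate")

# Boys are five times as likely to be diagnosed as girls.
ratio = 5
share_boys = Fraction(1, 2)  # ASSUMPTION (not from the text): half of children are boys

# overall = share_boys * boys + (1 - share_boys) * girls, with boys = ratio * girls
girls = rate_2008 / (share_boys * ratio + (1 - share_boys))
boys = ratio * girls

print(f"girls: about 1 in {float(1 / girls):.0f}")
print(f"boys:  about 1 in {float(1 / boys):.0f}")
```

Under these assumptions the implied rate for boys is about 1 in 53, and the 2008 rate is about 1.7 times the 2002 rate, which is a large increase though somewhat short of a full doubling.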

The first intriguing statistical question is whether we are experiencing an 
epidemic of autism, an “epidemic of diagnosis,” or some combination of the 
two? 6 In previous decades, children with an autism spectrum disorder had 
symptoms that might have gone undiagnosed, or their developmental 
challenges might have been described more generally as a “learning 
disability.” Doctors, parents, and teachers are now much more aware of the 
symptoms of ASDs, which naturally leads to more diagnoses regardless of 
whether or not the incidence of autism is on the rise. 

In any case, the shockingly high incidence of ASDs represents a serious 
challenge for families, for schools, and for the rest of society. The average 
lifetime cost of managing an autism spectrum disorder for a single 
individual is $3.5 million. Despite what is clearly an epidemic, we know 
amazingly little about what causes the condition. Thomas Insel, director of 
the National Institute of Mental Health, has said, “Is it cell phones? 
Ultrasound? Diet sodas? Every parent has a theory. At this point, we just 
don’t know.” 8 

What is different or unique about the lives and backgrounds of children 
with ASDs? What are the most significant physiological differences 
between children with and without an ASD? Is the incidence of ASDs 
different across countries? If so, why? Traditional statistical detective work 
is turning up clues. 

One recent study by researchers at the University of California at Davis 
identified ten locations in California with autism rates that are double the 
rates of surrounding areas; each of the autism clusters is a neighborhood 
with a concentration of white, highly educated parents. 9 Is that a clue, or a 
coincidence? Or might it reflect that relatively privileged families are more 
likely to have an autism spectrum disorder diagnosed? The same 
researchers are also conducting a study in which they will collect dust 
samples from the homes of 1,300 families with an autistic child to test for 
chemicals or other environmental contaminants that may play a causal role. 

Meanwhile, other researchers have identified what appears to be a 
genetic component to autism by studying ASDs among identical and 
fraternal twins. 10 The likelihood that two children in the same family have 
an ASD is higher among identical twins (who share the same genetic 
makeup) than among fraternal twins (whose genetic similarity is the same 
as for regular siblings). This finding does not rule out significant 
environmental factors, or perhaps the interaction between environmental 
and genetic factors. After all, heart disease has a significant genetic 
component, but clearly smoking, diet, exercise, and many other behavioral 
and environmental factors all matter, too. 

One of the most important contributions of statistical analysis so far has 
been to debunk false causes, many of which have arisen because of a 
confusion between correlation and causation. An autism spectrum disorder 
often appears suddenly between a child’s first and second birthdays. This 
has led to a widespread belief that childhood vaccinations, particularly the 
triple vaccine for measles, mumps, and rubella (MMR), are causing the 
rising incidence of autism. Dan Burton, a member of Congress from 
Indiana, told the New York Times, “My grandson received nine shots in one 
day, seven of which contained thimerosal, which is 50 percent mercury as 
you know, and he became autistic a short time later.” 11 

Scientists have soundly refuted the false association between thimerosal 
and ASDs. Autism rates did not decline when thimerosal was removed from 
the MMR vaccine, nor are autism rates lower in countries that never used 
this vaccine. Nonetheless, the false connection persists, which has caused 
some parents to refuse to vaccinate their children. Ironically, this offers no 
protection against autism while putting children at risk of contracting other 
serious diseases (and contributing to the spread of those diseases in the 
population). 


Autism poses one of the greatest medical and social challenges of our day. 
We understand so little about the disorder relative to its huge (and possibly 
growing) impact on our collective well-being. Researchers are using every 
tool in this book (and lots more) to change that. 

HOW CAN WE IDENTIFY AND REWARD 
GOOD TEACHERS AND SCHOOLS? 

We need good schools. And we need good teachers in order to have good 
schools. Thus, it follows logically that we ought to reward good teachers 
and good schools while firing bad teachers and closing bad schools. 

How exactly do we do that? 

Test scores give us an objective measure of student performance. Yet we 
know that some students will do much better on standardized tests than 
others for reasons that have nothing to do with what is going on inside a 
classroom or a school. The seemingly simple solution is to evaluate schools 
and teachers on the basis of the progress that their students make over some 
period of time. What did students know when they started in a certain 
classroom with a particular teacher? What did they know a year later? The 
difference is the “value added” in that classroom. 

We can even use statistics to get a more refined sense of this value added 
by taking into account the demographic characteristics of the students in a 
given classroom, such as race, income, and performance on other tests 
(which can be a measure of aptitude). If a teacher makes significant gains 
with students who have typically struggled in the past, then he or she can be 
deemed as highly effective. 
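
The basic "value added" idea can be sketched in a few lines. The records below are invented purely for illustration, and the function name `value_added` is hypothetical. The sketch computes only the simple version (average gain per teacher); the more refined version described above would also adjust for student characteristics, typically with a regression, which is not shown here.

```python
from statistics import mean

# Made-up records for illustration only:
# (teacher, score at the start of the year, score a year later)
records = [
    ("Teacher A", 52, 61),
    ("Teacher A", 70, 76),
    ("Teacher A", 45, 57),
    ("Teacher B", 88, 90),
    ("Teacher B", 91, 95),
    ("Teacher B", 79, 82),
]

def value_added(records):
    """Average test-score gain per teacher (end score minus start score)."""
    gains = {}
    for teacher, start, end in records:
        gains.setdefault(teacher, []).append(end - start)
    return {teacher: mean(g) for teacher, g in gains.items()}

result = value_added(records)
for teacher, gain in result.items():
    print(f"{teacher}: average gain of {gain} points")
```

In this invented example, Teacher B's students have the higher raw scores, but Teacher A's students gain more over the year. That is the distinction a value-added measure is meant to capture.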

Voila! We can now evaluate teacher quality with statistical precision. 
And the good schools, of course, are just the ones full of effective teachers. 

How do these handy statistical evaluations work in practice? In 2012, 
New York City took the plunge and published ratings of all 18,000 public 
school teachers on the basis of a “value-added assessment” that measured 
gains in their students’ test scores while taking into account various student 
characteristics. 12 The Los Angeles Times published a similar set of rankings 
for Los Angeles teachers in 2010. 

In both New York and LA, the reaction has been loud and mixed. Arne 
Duncan, the U.S. secretary of education, has generally been supportive of 
these kinds of value-added assessments. They provide information where 
none previously existed. After the Los Angeles data were published, 
Secretary Duncan told the New York Times, “Silence is not an option.” The 
Obama administration has provided financial incentives for states to 
develop value-added indicators for paying and promoting teachers. 
Proponents of these evaluation measures rightfully point out that they are a 
huge potential improvement over systems in which all teachers are paid 
according to a uniform salary schedule that gives zero weight to any 
measure of performance in the classroom. 

On the other hand, many experts have warned that these kinds of teacher 
assessment data have large margins of error and can deliver misleading 
results. The union representing New York City teachers spent more than 
$100,000 on a newspaper advertising campaign built around the headline 
“This Is No Way to Rate a Teacher.” 13 Opponents argue that the value-added 
assessments create false precision that will be abused by parents and 
public officials who do not understand the limitations of this kind of 
assessment. 

This appears to be a case where everybody is right—up to a point. Doug 
Staiger, an economist at Dartmouth College who works extensively with 
value-added data for teachers, warns that these data are inherently “noisy.” 
The results for a given teacher are often based on a single test taken on a 
single day by a single group of students. All kinds of factors can lead to 
random fluctuations—anything from a particularly difficult group of 
students to a broken air-conditioning unit clanking away in the classroom 
on test day. The correlation in performance from year to year for a single 
teacher, as measured by these indicators, is only about .35. (Interestingly, the 
correlation in year-to-year performance for Major League baseball players 
is also around .35, as measured by batting average for hitters and earned run 
average for pitchers.) 14 

The teacher effectiveness data are useful, says Staiger, but they are just 
one tool in the process for evaluating teacher performance. The data get 
“less noisy” when authorities have more years of data for a particular 
teacher with different classrooms of students (just as we can tell more about 
an athlete when we have data for more games and more seasons). In the 
case of the New York City teacher ratings, principals in the system had been 
prepped on the appropriate use of the value-added data and the inherent 
limitations. The public did not get that briefing. As a result, the teacher 
assessments are too often viewed as a definitive guide to the “good” and 
“bad” teachers. We like rankings—just think U.S. News & World Report 
college rankings—even when the data do not support such precision. 
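
A small simulation can illustrate why more years of data make the ratings "less noisy." This is a sketch under a deliberately simple assumption that is not taken from Staiger's work: each teacher has a stable true effect, and each year's rating equals that effect plus independent random noise, with the variances chosen so that two single-year ratings correlate at about .35. The function names are hypothetical.

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def simulate(n_teachers=20000, n_years=6, r=0.35, seed=0):
    """Simulated yearly ratings: stable true effect plus independent yearly noise."""
    rng = random.Random(seed)
    # With variance r for the true effect and 1 - r for the noise,
    # any two single-year ratings have an expected correlation of r.
    sd_true, sd_noise = r ** 0.5, (1 - r) ** 0.5
    ratings = []
    for _ in range(n_teachers):
        true_effect = rng.gauss(0, sd_true)
        ratings.append([true_effect + rng.gauss(0, sd_noise) for _ in range(n_years)])
    return ratings

ratings = simulate()

# Correlation between one year's rating and the next year's rating.
one_year = pearson([t[0] for t in ratings], [t[1] for t in ratings])

# Correlation between the average of years 1-3 and the average of years 4-6.
three_year = pearson([mean(t[:3]) for t in ratings], [mean(t[3:]) for t in ratings])

print(f"single year vs. single year:      {one_year:.2f}")
print(f"3-year average vs. 3-year average: {three_year:.2f}")
```

Under this model the single-year correlation comes out near .35, while the correlation between two three-year averages is roughly .6. Averaging does not change the teacher, only how much of the rating is noise.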

Staiger offers a final warning of a different sort: We had better be certain 
that the outcomes we are measuring, such as the results of a given 
standardized test, truly track with what we care about in the long run. Some 
unique data from the Air Force Academy suggest, not surprisingly, that the 
test scores that glimmer now may not be gold in the future. The Air Force 
Academy, like the other military academies, randomly assigns its cadets to 
different sections of standardized core courses, such as introductory 
calculus. This randomization eliminates any potential selection effect when 
comparing the effectiveness of professors; over time, we can assume that all 
professors get students with similar aptitudes (unlike most universities, 
where students of different abilities can select into or out of different 
courses). The Air Force Academy also uses the same syllabi and exams in 
every section of a particular course. Scott Carrell and James West, 
professors at the University of California at Davis and the Air Force 
Academy, exploited this elegant arrangement to answer one of the most 
important questions in higher education: Which professors are most 
effective? 15 

The answer: The professors with less experience and fewer degrees from 
fancy universities. These professors have students who typically do better 
on the standardized exams for the introductory courses. They also get better 
student evaluations for their courses. Clearly these young, motivated 
instructors are more committed to their teaching than the old, crusty 
professors with PhDs from places like Harvard. The old guys must be using 
the same yellowing teaching notes that they used in 1978; they probably 
think PowerPoint is an energy drink—except that they don’t know what an 
energy drink is either. Obviously the data tell us that we should fire these 
old codgers, or at least let them retire gracefully. 

But hold on. Let’s not fire anybody yet. The Air Force Academy study 
had another relevant finding—about student performance over a longer 
horizon. Carrell and West found that in math and science the students who 
had more experienced (and more highly credentialed) instructors in the 
introductory courses do better in their mandatory follow-on courses than 
students who had less experienced professors in the introductory courses. 
One logical interpretation is that less experienced instructors are more 
likely to “teach to the test” in the introductory course. This produces 
impressive exam scores and happy students when it comes to filling out the 
instructor evaluation. 

Meanwhile, the old, crusty professors (whom we nearly fired just one 
paragraph ago) focus less on the exam and more on the important concepts, 
which are what matter most in follow-on courses and in life after the Air 
Force Academy. 

Clearly we need to evaluate teachers and professors. We just have to 
make sure that we do it right. The long-term policy challenge, rooted in 
statistics, is to develop a system that rewards a teacher’s real value added in 
the classroom. 

WHAT ARE THE BEST TOOLS 
FOR FIGHTING GLOBAL POVERTY? 

We know strikingly little about how to make poor countries less poor. True, 
we understand the things that distinguish rich countries from poor countries, 
such as their education levels and the quality of their governments. And it is 
also true that we have watched countries like India and China transform 
themselves economically over the last several decades. But even with this 
knowledge, it is not obvious what steps we can take to make places like 
Mali or Burkina Faso less poor. Where should we begin? 

French economist Esther Duflo is transforming our knowledge of global 
poverty by retrofitting an old tool for new purposes: the randomized, 
controlled experiment. Duflo, who teaches at MIT, literally conducts 
experiments on different interventions to improve the lives of the poor in 
developing countries. For example, one of the longstanding problems with 
schools in India is absenteeism among teachers, particularly in small, rural 
schools with only a single teacher. Duflo and her coauthor Rema Hanna 
tested a clever, technology-driven solution on a random sample of 60 
one-teacher schools in the Indian state of Rajasthan. 16 Teachers in these 60 
experimental schools were offered a bonus for good attendance. Here is the 
creative part: The teachers were given cameras with tamperproof date and 
time stamps. They proved that they had shown up each day by having their 
picture taken with their students. 17 

Absenteeism dropped by half among teachers in the experimental schools 
compared with teachers in a randomly selected control group of 60 schools. 
Student test scores went up, and more students graduated into the next level 
of education. (I bet the photos are adorable, too!) 
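
The logic of such an experiment can be sketched as follows. The absence probabilities and the length of the school year below are made up for illustration; they are not the study's data. The point is only the design: schools are assigned to the two groups at random, so a difference in absence rates can be attributed to the intervention rather than to preexisting differences between schools.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Randomly split 120 schools into a treatment group and a control group of 60 each.
schools = list(range(120))
rng.shuffle(schools)
treatment, control = schools[:60], schools[60:]

SCHOOL_DAYS = 200                                # made-up length of the school year
P_ABSENT = {"control": 0.40, "treatment": 0.20}  # made-up daily absence probabilities

def absence_rate(group, p_absent):
    """Share of teacher-days on which the teacher was absent across a group of schools."""
    absences = sum(rng.random() < p_absent
                   for _school in group
                   for _day in range(SCHOOL_DAYS))
    return absences / (len(group) * SCHOOL_DAYS)

control_rate = absence_rate(control, P_ABSENT["control"])
treatment_rate = absence_rate(treatment, P_ABSENT["treatment"])

print(f"control absence rate:   {control_rate:.1%}")
print(f"treatment absence rate: {treatment_rate:.1%}")
print(f"difference:             {control_rate - treatment_rate:.1%} (percentage points)")
```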

One of Duflo’s experiments in Kenya involved giving a randomly 
selected group of farmers a small subsidy to buy fertilizer right after the 
harvest. Prior evidence suggested that fertilizer raises crop yields 
appreciably. Farmers were aware of this benefit, but when it came time to 
put a new crop into the ground, they often did not have enough money left 
over from the last crop to buy fertilizer. This perpetuates what is known as a 
“poverty trap” since the subsistence farmers are too poor to make 
themselves less poor. Duflo and her coauthors found that a tiny subsidy— 
free fertilizer delivery—offered to farmers when they still had cash after the 
harvest increased fertilizer use by 10 to 20 percentage points compared with 
use in a control group. 

Esther Duflo has even waded into the gender war. Who is more 
responsible when it comes to handling the family’s finances, men or 
women? In rich countries, this is the kind of thing that couples can squabble 
over in marriage counseling. In poor countries, it can literally determine 
whether the children get enough to eat. Anecdotal evidence going back to 
the dawn of civilization suggests that women place a high priority on the 
health and welfare of their children, while men are more inclined to drink 
up their wages at the local pub (or whatever the caveman equivalent was). 
At worst, this anecdotal evidence merely reinforces age-old stereotypes. At 
best, it is a hard thing to prove, because a family’s finances are comingled 
to some extent. How can we separate out how husbands and wives choose 
to spend communal resources? 

Duflo did not shy away from this delicate question. 11 To the contrary, she 
found a fascinating natural experiment. In Côte d’Ivoire, women and men in 
a family typically share responsibility for some crops. For longstanding 
cultural reasons, men and women also cultivate different cash crops of their 
own. (Men grow cocoa, coffee, and some other things; women grow 
plantains, coconuts, and a few other crops.) The beauty of this arrangement 
from a research standpoint is that the men’s crops and the women’s crops 
respond to rainfall patterns in different ways. In years in which cocoa and 
coffee do well, men have more disposable income to spend. In years in 
which plantains and coconuts do well, the women have more extra cash. 

Now we need merely broach a delicate question: Are the children in 
these families better off in years in which the men’s crops do well, or in the 
years when the women have a particularly bountiful harvest? 

The answer: When the women do well, they spend some of their extra 
cash on more food for the family. The men don’t. Sorry, guys. 

In 2010, Duflo was awarded the John Bates Clark Medal. This prize is 
presented by the American Economic Association to the best economist 
under the age of forty. Among economist geeks, this prize is considered to 
be more prestigious than the Nobel Prize in Economics because it was 
historically awarded only every two years. (Beginning with Duflo’s award 
in 2010, the medal is now presented annually.) In any case, the Clark Medal 
is the MVP award for people with thick glasses (metaphorically speaking). 

Duflo is doing program evaluation. Her work, and the work of others 
now using her methods, is literally changing the lives of the poor. From a 
statistical standpoint, Duflo’s work has encouraged us to think more broadly 
about how randomized, controlled experiments—long thought to be the 
province of the laboratory sciences—can be used more widely to tease out 
causal relationships in many other areas of life. 

WHO GETS TO KNOW WHAT ABOUT YOU? 

Last summer, we hired a new babysitter. When she arrived at the house, I 
began to explain our family background: “I am a professor, my wife is a 
teacher. . .” 

“Oh, I know,” the sitter said with a wave of the hand. “I Googled you.” 

I was simultaneously relieved that I did not have to finish my spiel and 
mildly alarmed by how much of my life could be cobbled together from a 
short Internet search. Our capacity to gather and analyze huge quantities of 
data—the marriage of digital information with cheap computing power and 
the Internet—is unique in human history. We are going to need some new 
rules for this new era. 


Let’s put the power of data in perspective with just one example from the 
retailer Target. Like most companies, Target strives to increase profits by 
understanding its customers. To do that, the company hires statisticians to 
do the kind of “predictive analytics” described earlier in the book; they use 
sales data combined with other information on consumers to figure out who 
buys what and why. Nothing about this is inherently bad, for it means that 
the Target near you is likely to have exactly what you want. 

But let’s drill down for a moment on just one example of the kinds of 
things that the statisticians working in the windowless basement at 
corporate headquarters can figure out. Target has learned that pregnancy is a 
particularly important time in terms of developing shopping patterns. 
Pregnant women develop “retail relationships” that can last for decades. As 
a result, Target wants to identify pregnant women, particularly those in their 
second trimester, and get them into their stores more often. A writer for the 
New York Times Magazine followed the predictive analytics team at Target 
as it sought to find and attract pregnant shoppers. 20 

The first part is easy. Target has a baby shower registry in which 
pregnant women register for baby gifts in advance of the birth of their 
children. These women are already Target shoppers, and they’ve effectively 
told the store that they are pregnant. But here is the statistical twist: Target 
figured out that other women who demonstrate the same shopping patterns 
are probably pregnant, too. For example, pregnant women often switch to 
unscented lotions. They begin to buy vitamin supplements. They start 
buying extra-big bags of cotton balls. The Target predictive analytics gurus 
identified twenty-five products that together made possible a “pregnancy 
prediction score.” The whole point of this analysis was to send pregnant 
women pregnancy-related coupons in hopes of hooking them as long-term 
Target shoppers. 
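
To make the idea of a "prediction score" concrete, here is a minimal sketch of how purchases might be combined into a single number. Everything specific in it is an assumption made for illustration: the text does not describe Target's actual model, the weights and intercept are invented, and the logistic form is simply one common way to turn a weighted sum into a score between 0 and 1. Only the three example products come from the passage above.

```python
import math

# Hypothetical weights for illustration only (not Target's model).
WEIGHTS = {
    "unscented lotion": 1.2,
    "vitamin supplements": 0.9,
    "extra-big bag of cotton balls": 0.8,
}
INTERCEPT = -3.0  # hypothetical baseline: most shoppers are not pregnant

def pregnancy_score(basket):
    """Score between 0 and 1 built from which signal products are in the basket."""
    z = INTERCEPT + sum(w for item, w in WEIGHTS.items() if item in basket)
    return 1 / (1 + math.exp(-z))  # logistic function

no_signals = pregnancy_score({"bowling socks"})
all_signals = pregnancy_score(set(WEIGHTS))

print(f"no signal products:  {no_signals:.2f}")
print(f"all signal products: {all_signals:.2f}")
```

In practice the weights would be estimated from data, for example from the shopping histories of customers who signed up for the baby registry, rather than written by hand.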

How good was the model? The New York Times Magazine reported a 
story about a man from Minneapolis who walked into a Target store and 
demanded to see a manager. The man was irate that his high school 
daughter was being bombarded with pregnancy-related coupons from 
Target. “She’s still in high school and you’re sending her coupons for baby 
clothes and cribs? Are you trying to encourage her to get pregnant?” the 
man asked. 


The store manager apologized profusely. He even called the father 
several days later to apologize again. Only this time, the man was less irate; 
it was his turn to be apologetic. “It turns out there’s been some activities in 
my house I haven’t been completely aware of,” the father said. “She’s due 
in August.” 

The Target statisticians had figured out that his daughter was pregnant 
before he did. 

That is their business . . . and also not their business. It can feel more 
than a little intrusive. For that reason, some companies now mask how 
much they know about you. For example, if you are a pregnant woman in 
your second trimester, you may get some coupons in the mail for cribs and 
diapers—along with a discount on a riding lawn mower and a coupon for 
free bowling socks with the purchase of any pair of bowling shoes. To you, 
it just seems fortuitous that the pregnancy-related coupons came in the mail 
along with the other junk. In fact, the company knows that you don’t bowl 
or cut your own lawn; it’s merely covering its tracks so that what it knows 
about you doesn’t seem so spooky. 

Facebook, a company with virtually no physical assets, has become one 
of the most valuable companies in the world. To investors (as opposed to 
users), Facebook has one enormous asset: data. Investors don’t love 
Facebook because it allows them to reconnect with their prom dates. They 
love Facebook because every click of the mouse yields data about where 
users live, where they shop, what they buy, who they know, and how they 
spend their time. To users, who are hoping to reconnect with their prom 
dates, the corporate data gathering can overstep the boundaries of privacy. 

Chris Cox, Facebook’s vice president of product, told the New York 
Times, “The challenge of the information age is what to do with it.” 21 

Yep. 

And in the public arena, the marriage of data and technology gets even 
trickier. Cities around the world have installed thousands of security 
cameras in public places, some of which will soon have facial recognition 
technology. Law enforcement authorities can follow any car anywhere it 
may go (and keep extensive records of where it has been) by attaching a 
global positioning device to the vehicle and then tracking it by satellite. Is 
this a cheap and efficient way to monitor potential criminal activity? Or is 
this the government using technology to trample on our personal liberty? In 
2012, the U.S. Supreme Court decided unanimously that it was the latter, 
ruling that law enforcement officials can no longer attach tracking devices 
to private vehicles without a warrant.* 

Meanwhile, governments around the world maintain huge DNA 
databases that are a powerful tool for solving crimes. Whose DNA should 
be in the database? That of all convicted criminals? That of every person 
who is arrested (whether or not eventually convicted)? Or a sample from 
every one of us? 

We are just beginning to wrestle with the issues that lie at the intersection 
of technology and personal data—none of which were terribly relevant 
when government information was stored in dusty basement filing cabinets 
rather than in digital databases that are potentially searchable by anyone 
from anywhere. Statistics is more important than ever before because we 
have more meaningful opportunities to make use of data. Yet the formulas 
will not tell us which uses of data are appropriate and which are not. Math 
cannot supplant judgment. 

In that vein, let’s finish the book with some word association: fire, knives, 
automobiles, hair removal cream. Each one of these things serves an 
important purpose. Each one makes our lives better. And each one can 
cause some serious problems when abused. 

Now you can add statistics to that list. Go forth and use data wisely and 
well! 


* I was ineligible for the 2010 prize since I was over forty. Also, I’d done nothing to deserve it. 

* The United States v. Jones. 


Appendix 

Statistical software 


I suspect that you won’t be doing your statistical analysis with a pencil, 
paper, and calculator. Here is a quick tour of the software packages most 
commonly used for the kinds of tasks described in this book. 

Microsoft Excel 

Microsoft Excel is probably the most widely used program to compute 
simple statistics such as mean and standard deviation. Excel can also do 
basic regression analysis. Most computers come loaded with Microsoft 
Office, so Excel is probably sitting on your desk right now. Excel is user- 
friendly compared with more sophisticated statistical software packages. 
The basic statistical calculations can be done by means of the formula bar. 

Excel cannot perform some of the advanced tasks that more specialized 
programs can do. However, there are Excel extensions that you can buy 
(and some that you can download for free) that will expand the program’s 
statistical capabilities. One huge advantage to Excel is that it offers simple 
ways to display two-dimensional data with visually appealing graphics. 
These graphics can be easily dropped into Microsoft PowerPoint and 
Microsoft Word. 

Stata 

Stata is a statistical package used worldwide by research professionals; its 
interface has a serious, academic feel. Stata has a wide range of capabilities 
to do basic tasks, such as creating data tables and calculating descriptive 
statistics. Of course, that is not why university professors and other serious 
researchers choose Stata. The software is designed to handle sophisticated 
statistical tests and data modeling that are far beyond the kinds of things 
described in this book. 

Stata is a great fit for those who have a solid understanding of statistics 
(a basic understanding of programming also helps) and those who do not 
need fancy formatting, just the answers to their statistical queries. Stata is 
not the best choice if your goal is to produce quick graphics from the data. 
Expert users say that Stata can produce nice graphics but that Excel is easier 
to use for that purpose. 

Stata offers several different stand-alone software packages. You can 
either license the product for a year (after a year, the software no longer 
works on your computer) or license it forever. One of the cheapest options 
is Stata/IC, which is designed for “students and researchers with 
moderate-sized datasets.” There is a discount for users who are in the education 
sector. Even then, a single-user annual license for Stata/IC is $295 and a 
perpetual license is $595. If you plan to launch a satellite to Mars and need 
to do some really serious number crunching, you can look into more 
advanced Stata packages, which can cost thousands of dollars. 

SAS 

SAS has a broad appeal not only to professional researchers but also to 
business analysts and engineers because of its broad range of analytical 
capabilities. SAS sells two different statistical packages. The first is called 
SAS Analytics Pro, which can read data in virtually any format and perform 
advanced data analysis. The software also has good data visualization tools, 
such as advanced mapping capabilities. It’s not cheap. Even for those in the 
education and government sectors, a single commercial or individual 
license for this package is $8,500, plus an annual license fee. 

The second SAS statistical package is SAS Visual Data Discovery. It has 
an easy-to-use interface that requires no knowledge of coding or 
programming, while still providing advanced data analysis capabilities. As 
its name suggests, this package is meant to allow the user to easily explore 
data with interactive visualization. You can also export the data animations 
into presentations, Web pages, and other documents. This one is not cheap 
either. A single commercial or individual license for this package is $9,810, 
plus an annual license fee. 


SAS sells some specialized management tools, such as a product that 
uses statistics to detect fraud and financial crimes. 

R 

This may sound like a character in a James Bond movie. In fact, R is a 
popular statistical package that is free or “open source.” It can be 
downloaded and easily installed on your computer in a matter of minutes. 
There is also an active “R community” that shares code and can offer help 
and guidance when needed. 

Not only is R the cheapest option, but it is also one of the most malleable 
of all of the packages described here. Depending on your perspective, this 
flexibility is either frustrating or one of R’s great assets. If you are new to 
statistical software, the program offers almost no structure. The interface 
will not help you along much. On the other hand, programmers (and even 
people who have just a basic familiarity with coding principles) can find the 
lack of structure liberating. Users are free to tell the program to do exactly 
what they want it to do, including having it work with outside programs. 

IBM SPSS 

IBM SPSS has something for everyone, from hard-core statisticians to less 
statistically rugged business analysts. IBM SPSS is good for beginners 
because it offers a menu-driven interface. IBM SPSS also offers a range of 
tools or “modules” that are designed to perform specific functions, such as 
IBM SPSS Forecasting, IBM SPSS Advanced Statistics, IBM SPSS 
Visualization Designer, and IBM SPSS Regression. The modules can be 
purchased individually or combined into packages. 

The most basic package offered is IBM SPSS Statistics Standard Edition, 
which allows you to calculate simple statistics and perform basic data 
analysis, such as identifying trends and building predictive models. A single 
fixed-term commercial license is $2,250. The premium package, which 
includes most of the modules, is $6,750. Discounts are available for those 
who work in the education sector. 


* See http://www.stata.com/. 

† See http://www.sas.com/technologies/analytics/statistics/. 

‡ See http://www-01.ibm.com/software/analytics/spss/products/statistics/. 
INTRODUCTION 


Virtually everything in life is, to some extent, uncertain. This may seem like a bit of an exaggeration, 
but to see the truth of it you can try a quick experiment. At the start of the day, write down 
something you think will happen in the next half-hour, hour, three hours, and six hours. Then see 
how many of these things happen exactly like you imagined. You'll quickly realize that your day is 
full of uncertainties. Even something as predictable as "I will brush my teeth" or "I'll have a cup of 
coffee" may not, for some reason or another, happen as you expect. 

For most of the uncertainties in life, we’re able to get by quite well by planning our day. For 
example, even though traffic might make your morning commute longer than usual, you can make a 
pretty good estimate about what time you need to leave home in order to get to work on time. If 
you have a super-important morning meeting, you might leave earlier to allow for delays. We all 
have an innate sense of how to deal with uncertain situations and reason about uncertainty. When 
you think this way, you’re starting to think probabilistically. 

WHY LEARN STATISTICS? 

The subject of this book, Bayesian statistics, helps us get better at reasoning about uncertainty, just 
as studying logic in school helps us to see the errors in everyday logical thinking. Given that 
virtually everyone deals with uncertainty in their daily life, as we just discussed, this makes the 
audience for this book pretty wide. Data scientists and researchers already using statistics will 
benefit from a deeper understanding and intuition for how these tools work. Engineers and 
programmers will learn a lot about how they can better quantify decisions they have to make (I've 
even used Bayesian analysis to identify causes of software bugs!). Marketers and salespeople can 
apply the ideas in this book when running A/B tests, trying to understand their audience, and 
better assessing the value of opportunities. Anyone making high-level decisions should have at least 
a basic sense of probability so they can make quick back-of-the-envelope estimates about the costs 
and benefits of uncertain decisions. I wanted this book to be something a CEO could study on a 
flight and develop a solid enough foundation by the time they land to better assess choices that 
involve probabilities and uncertainty. 

I honestly believe that everyone will benefit from thinking about problems in a Bayesian way. With 
Bayesian statistics, you can use mathematics to model that uncertainty so you can make better 
choices given limited information. For example, suppose you need to be on time for work for a 
particularly important meeting and there are two different routes you could take. The first route is 
usually faster, but has pretty regular traffic back-ups that can cause huge delays. The second route 
takes longer in general but is less prone to traffic. Which route should you take? What type of 
information would you need to decide this? And how certain can you be in your choice? Even just a 
small amount of added complexity requires some extra thought and technique. 
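
To make the route question concrete, here is a minimal simulation sketch. Every number in it (typical travel times, how often a back-up happens, how long it lasts, and the 35-minute time budget) is a made-up assumption for illustration, not something given in the text, and the simple "one possible delay per trip" model is likewise just one way to frame the problem.

```python
import random


def simulate_commute(base_minutes, delay_probability, delay_minutes, rng):
    """Return one simulated commute time in minutes (hypothetical model)."""
    delayed = rng.random() < delay_probability
    return base_minutes + (delay_minutes if delayed else 0)


def on_time_rate(base_minutes, delay_probability, delay_minutes,
                 budget_minutes, trials=100_000, seed=0):
    """Estimate the fraction of simulated trips that fit within the time budget."""
    rng = random.Random(seed)
    on_time = sum(
        simulate_commute(base_minutes, delay_probability, delay_minutes, rng)
        <= budget_minutes
        for _ in range(trials)
    )
    return on_time / trials


# Assumed numbers: route 1 is usually faster but has frequent, large back-ups;
# route 2 is slower in general but rarely delayed.
route_1 = on_time_rate(base_minutes=20, delay_probability=0.25,
                       delay_minutes=40, budget_minutes=35)
route_2 = on_time_rate(base_minutes=30, delay_probability=0.05,
                       delay_minutes=10, budget_minutes=35)

print(f"Route 1 on time: {route_1:.3f}")
print(f"Route 2 on time: {route_2:.3f}")
```

Under these assumed numbers the slower route arrives on time more often, but different assumptions could easily reverse that, which is exactly why the information you have matters.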



Typically when people think of statistics, they think of scientists working on a new drug, 
economists following trends in the market, analysts predicting the next election, baseball managers 
trying to build the best team with fancy math, and so on. While all of these are certainly fascinating 
uses of statistics, understanding the basics of Bayesian reasoning can help you in far more areas in 
everyday life. If you've ever questioned some new finding reported in the news, stayed up late 
browsing the web wondering if you have a rare disease, or argued with a relative over their 
irrational beliefs about the world, learning Bayesian statistics will help you reason better. 

WHAT IS "BAYESIAN" STATISTICS? 

You may be wondering what all this "Bayesian” stuff is. If you’ve ever taken a statistics class, it was 
likely based on frequentist statistics. Frequentist statistics is founded on the idea that probability 
represents the frequency with which something happens. If the probability of getting heads in a 
single coin toss is 0.5, that means after a single coin toss we can expect to get one-half of a head of a 
coin (with two tosses we can expect to get one head, which makes more sense). 
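
The frequency view can be sketched in a few lines of code. The function names below are illustrative, not from the book, and the simulation assumes a fair coin modeled with a pseudo-random number generator. It shows the expected number of heads (tosses times probability) alongside the observed fraction of heads, which tends to settle near 0.5 as the number of tosses grows.

```python
import random


def expected_heads(num_tosses, p_heads=0.5):
    """Expected number of heads: number of tosses times the probability of heads."""
    return num_tosses * p_heads


def heads_frequency(num_tosses, p_heads=0.5, seed=42):
    """Fraction of simulated tosses that come up heads."""
    rng = random.Random(seed)
    heads = sum(rng.random() < p_heads for _ in range(num_tosses))
    return heads / num_tosses


print(expected_heads(1))  # one toss: half a head "on average"
print(expected_heads(2))  # two tosses: one head on average

for n in (10, 1_000, 100_000):
    print(n, heads_frequency(n))
```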

Bayesian statistics, on the other hand, is concerned with how probabilities represent how uncertain 
we are about a piece of information. In Bayesian terms, if the probability of getting heads in a coin 
toss is 0.5, that means we are equally unsure about whether we’ll get heads or tails. For problems 
like coin tosses, both frequentist and Bayesian approaches seem reasonable, but when you’re 
quantifying your belief that your favorite candidate will win the next election, the Bayesian 
interpretation makes much more sense. After all, there’s only one election, so speaking about how 
frequently your favorite candidate will win doesn’t make much sense. When doing Bayesian 
statistics, we’re just trying to accurately describe what we believe about the world given the 
information we have. 

One particularly nice thing about Bayesian statistics is that, because we can view it simply as 
reasoning about uncertain things, all of the tools and techniques of Bayesian statistics make 
intuitive sense. 

Bayesian statistics is about looking at a problem you face, figuring out how you want to describe it 
mathematically, and then using reason to solve it. There are no mysterious tests that give results 
that you aren’t quite sure of, no distributions you have to memorize, and no traditional experiment 
designs you must perfectly replicate. Whether you want to figure out the probability that a new 
web page design will bring you more customers, if your favorite sports team will win the next game, 
or if we really are alone in the universe, Bayesian statistics will allow you to start reasoning about 
these things mathematically using just a few simple rules and a new way of looking at problems. 

WHAT'S IN THIS BOOK 

Here’s a quick breakdown of what you’ll find in this book. 

Part I: Introduction to Probability 

Chapter 1: Bayesian Thinking and Everyday Reasoning This first chapter introduces you to 
Bayesian thinking and shows you how similar it is to everyday methods of thinking critically about 
a situation. We’ll explore the probability that a bright light outside your window at night is a UFO 
based on what you already know and believe about the world. 

Chapter 2: Measuring Uncertainty In this chapter you’ll use coin toss examples to assign actual 
values to your uncertainty in the form of probabilities: a number from 0 to 1 that represents how 
certain you are in your belief about something. 



Chapter 3: The Logic of Uncertainty In logic we use AND, NOT, and OR operators to combine true 
or false facts. It turns out that probability has similar notions of these operators. We'll investigate 
how to reason about the best mode of transport to get to an appointment, and the chances of you 
getting a traffic ticket. 

Chapter 4: Creating a Binomial Probability Distribution Using the rules of probability as logic, 
in this chapter, you'll build your own probability distribution, the binomial distribution, which you 
can apply to many probability problems that share a similar structure. You’ll try to predict the 
probability of getting a specific famous statistician collectable card in a Gacha card game. 

Chapter 5: The Beta Distribution Here you’ll learn about your first continuous probability 
distribution and get an introduction to what makes statistics different from probability. The 
practice of statistics involves trying to figure out what unknown probabilities might be based on 
data. In this chapter’s example, we’ll investigate a mysterious coin-dispensing box and the chances 
of making more money than you lose. 

Part II: Bayesian Probability and Prior Probabilities 

Chapter 6: Conditional Probability In this chapter, you'll condition probabilities based on your 
existing information. For example, knowing whether someone is male or female tells us how likely 
they are to be color blind. You’ll also be introduced to Bayes’ theorem, which allows us to reverse 
conditional probabilities. 

Chapter 7: Bayes' Theorem with LEGO Here you’ll gain a better intuition for Bayes’ theorem by 
reasoning about LEGO bricks! This chapter will give you a spatial sense of what Bayes’ theorem is 
doing mathematically. 

Chapter 8: The Prior, Likelihood, and Posterior of Bayes’ Theorem Bayes’ theorem is typically 
broken into three parts, each of which performs its own function in Bayesian reasoning. In this 
chapter, you’ll learn what they’re called and how to use them by investigating whether an apparent 
break-in was really a crime or just a series of coincidences. 

Chapter 9: Bayesian Priors and Working with Probability Distributions This chapter explores 
how we can use Bayes’ theorem to better understand the classic asteroid scene from Star Wars: The 
Empire Strikes Back, through which you’ll gain a stronger understanding of prior probabilities in 
Bayesian statistics. You’ll also see how you can use entire distributions as your prior. 

Part III: Parameter Estimation 

Chapter 10: Introduction to Averaging and Parameter Estimation Parameter estimation is the 
method we use to formulate a best guess for an uncertain value. The most basic tool in parameter 
estimation is to simply average your observations. In this chapter we'll see why this works by 
analyzing snowfall levels. 

Chapter 11: Measuring the Spread of Our Data Finding the mean is a useful first step in 
estimating parameters, but we also need a way to account for how spread out our observations are. 
Here you’ll be introduced to mean absolute deviation (MAD), variance, and standard deviation as 
ways to measure how spread out our observations are. 

Chapter 12: The Normal Distribution By combining our mean and standard deviation, we get a 
very useful distribution for making estimates: the normal distribution. In this chapter, you’ll learn 
how to use the normal distribution to not only estimate unknown values but also to know how 
certain you are in those estimates. You’ll use these new skills to time your escape during a bank 
heist. 



Chapter 13: Tools of Parameter Estimation: The PDF, CDF, and Quantile Function Here you’ll 
learn about the PDF, CDF, and quantile function to better understand the parameter estimations 
you’re making. You'll estimate email conversion rates using these tools and see what insights each 
provides. 

Chapter 14: Parameter Estimation with Prior Probabilities The best way to improve our 
parameter estimates is to include a prior probability. In this chapter, you'll see how adding prior 
information about the past success of email click-through rates can help us better estimate the true 
conversion rate for a new email. 

Chapter 15: From Parameter Estimation to Hypothesis Testing: Building a Bayesian A/B 
Test Now that we can estimate uncertain values, we need a way to compare two uncertain values in 
order to test a hypothesis. You’ll create an A/B test to determine how confident you are in a new 
method of email marketing. 

Part IV: Hypothesis Testing: The Heart of Statistics 

Chapter 16: Introduction to the Bayes Factor and Posterior Odds: The Competition of 
Ideas Ever stay up late, browsing the web, wondering if you might have a super-rare disease? This 
chapter will introduce another approach to testing ideas that will help you determine how worried 
you should actually be! 

Chapter 17: Bayesian Reasoning in The Twilight Zone How much do you believe in psychic 
powers? In this chapter, you'll develop your own mind-reading skills by analyzing a situation from a 
classic episode of The Twilight Zone. 

Chapter 18: When Data Doesn't Convince You Sometimes data doesn’t seem to be enough to 
change someone’s mind about a belief or help you win an argument. Learn how you can change a 
friend’s mind about something you disagree on and why it’s not worth your time to argue with your 
belligerent uncle! 

Chapter 19: From Hypothesis Testing to Parameter Estimation Here we come full circle back to 
parameter estimation by looking at how to compare a range of hypotheses. You’ll derive your first 
example of statistics, the beta distribution, using the tools that we’ve covered for simple hypothesis 
tests to analyze the fairness of a particular fairground game. 

Appendix A: A Quick Introduction to R This quick appendix will teach you the basics of the R 
programming language. 

Appendix B: Enough Calculus to Get By Here we’ll cover just enough calculus to get you 
comfortable with the math used in the book. 

BACKGROUND FOR READING THE BOOK 

The only requirement of this book is basic high school algebra. If you flip forward, you’ll see a few 
instances of math, but nothing particularly onerous. We’ll be using a bit of code written in the R 
programming language, which I’ll provide and talk through, so there’s no need to have learned R 
beforehand. We’ll also touch on calculus, but again no prior experience is required, and the 
appendixes will give you enough information to cover what you’ll need. 

In other words, this book aims to help you start thinking about problems in a mathematical way 
without requiring significant mathematical background. When you finish reading it, you may find 
yourself inadvertently writing down equations to describe problems you see in everyday life! 

If you do happen to have a strong background in statistics (even Bayesian statistics), I believe you’ll 
still have a fun time reading through this book. I have always found that the best way to understand 
a field well is to revisit the fundamentals over and over again, each time in a different light. Even as 
the author of this book, I found plenty of things that surprised me just in the course of the writing 
process! 


NOW OFF ON YOUR ADVENTURE! 

As you’ll soon see, aside from being very useful, Bayesian statistics can be a lot of fun! To help you 
learn Bayesian reasoning we’ll be taking a look at LEGO bricks, The Twilight Zone, Star Wars, and 
more. You'll find that once you begin thinking probabilistically about problems, you’ll start using 
Bayesian statistics all over the place. This book is designed to be a pretty quick and enjoyable read, 
so turn the page and let’s begin our adventure in Bayesian statistics! 



PART I 

INTRODUCTION TO PROBABILITY 



1 

BAYESIAN THINKING AND EVERYDAY REASONING 


In this first chapter, I'll give you an overview of Bayesian reasoning, the formal process we use to 
update our beliefs about the world once we've observed some data. We’ll work through a scenario 
and explore how we can map our everyday experience to Bayesian reasoning. 

The good news is that you were already a Bayesian even before you picked up this book! Bayesian 
statistics is closely aligned with how people naturally use evidence to create new beliefs and reason 
about everyday problems; the tricky part is breaking down this natural thought process into a 
rigorous, mathematical one. 

In statistics, we use particular calculations and models to more accurately quantify probability. For 
now, though, we won’t use any math or models; we’ll just get you familiar with the basic concepts 
and use our intuition to determine probabilities. Then, in the next chapter, we’ll put exact numbers 
to probabilities. Throughout the rest of the book, you'll learn how we can use rigorous 
mathematical techniques to formally model and reason about the concepts we’ll cover in this 
chapter. 

REASONING ABOUT STRANGE EXPERIENCES 

One night you are suddenly awakened by a bright light at your window. You jump up from bed and 
look out to see a large object in the sky that can only be described as saucer shaped. You are 
generally a skeptic and have never believed in alien encounters, but, completely perplexed by the 
scene outside, you find yourself thinking, Could this be a UFO?! 

Bayesian reasoning involves stepping through your thought process when you’re confronted with a 
situation to recognize when you’re making probabilistic assumptions, and then using those 
assumptions to update your beliefs about the world. In the UFO scenario, you’ve already gone 
through a full Bayesian analysis because you: 

1. Observed data 

2. Formed a hypothesis 

3. Updated your beliefs based on the data 

This reasoning tends to happen so quickly that you don’t have any time to analyze your own 
thinking. You created a new belief without questioning it: whereas before you did not believe in the 
existence of UFOs, after the event you’ve updated your beliefs and now think you’ve seen a UFO. 

In this chapter, you'll focus on structuring your beliefs and the process of creating them so you can 
examine it more formally, and we’ll look at quantifying this process in chapters to come. 



Let’s look at each step of reasoning in turn, starting with observing data. 

Observing Data 

Founding your beliefs on data is a key component of Bayesian reasoning. Before you can draw any 
conclusions about the scene (such as claiming what you see is a UFO), you need to understand the 
data you’re observing, in this case: 

• An extremely bright light outside your window 

• A saucer-shaped object hovering in the air 

Based on your past experience, you would describe what you saw out your window as "surprising.” 
In probabilistic terms, we could write this as: 

P(bright light outside window, saucer-shaped object in sky) = very low 

where P denotes probability and the two pieces of data are listed inside the parentheses. You would 
read this equation as: "The probability of observing bright lights outside the window and a saucer-shaped 
object in the sky is very low." In probability theory, we use a comma to separate events 
when we’re looking at the combined probability of multiple events. Note that this data does not 
contain anything specific about UFOs; it’s simply made up of your observations—this will be 
important later. 

We can also examine probabilities of single events, which would be written as: 

P(rain) = likely 

This equation is read as: "The probability of rain is likely." 

For our UFO scenario, we’re determining the probability of both events occurring together. The 
probability of one of these two events occurring on its own would be entirely different. For 
example, the bright lights alone could easily be a passing car, so on its own this event is more 
probable than when it is coupled with seeing a saucer-shaped object (and the 
saucer-shaped object would still be surprising even on its own). 
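
A small sketch can make the difference between combined and single-event probabilities concrete. The tallies below are entirely hypothetical (imagined counts over 10,000 nights), chosen only to show that the probability of both events together can never exceed the probability of either event alone.

```python
def probability(count, total):
    """Estimate a probability as a simple fraction of observed cases."""
    if total <= 0:
        raise ValueError("total must be positive")
    return count / total


# Hypothetical tallies over 10,000 imagined nights (assumed numbers).
nights = 10_000
nights_with_bright_light = 300   # e.g., passing cars
nights_with_saucer_shape = 2
nights_with_both = 1

p_light = probability(nights_with_bright_light, nights)
p_saucer = probability(nights_with_saucer_shape, nights)
p_both = probability(nights_with_both, nights)

print("P(bright light) =", p_light)
print("P(saucer-shaped object) =", p_saucer)
print("P(bright light, saucer-shaped object) =", p_both)
```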

So how are we determining this probability? Right now we’re using our intuition—that is, our 
general sense of the likelihood of perceiving these events. In the next chapter, we'll see how we can 
come up with exact numbers for our probabilities. 

Holding Prior Beliefs and Conditioning Probabilities 

You are able to wake up in the morning, make your coffee, and drive to work without doing a lot of 
analysis because you hold prior beliefs about how the world works. Our prior beliefs are collections 
of beliefs we’ve built up over a lifetime of experiences (that is, of observing data). You believe that 
the sun will rise because the sun has risen every day since you were born. Likewise, you might have 
a prior belief that when the light is red for oncoming traffic at an intersection, and your light is 
green, it’s safe to drive through the intersection. Without prior beliefs, we would go to bed terrified 
each night that the sun might not rise tomorrow, and stop at every intersection to carefully inspect 
oncoming traffic. 

Our prior beliefs say that seeing bright lights outside the window at the same time as seeing a 
saucer-shaped object is a rare occurrence on Earth. However, if you lived on a distant planet 
populated by vast numbers of flying saucers, with frequent interstellar visitors, the probability of 
seeing lights and saucer-shaped objects in the sky would be much higher. 

In a formula we enter prior beliefs after our data, separated with a | like so: 



P(bright light outside window, saucer-shaped object in sky | experience on Earth) = very low 


We would read this equation as: "The probability of observing bright lights and a saucer-shaped 
object in the sky, given our experience on Earth, is very low." 

The probability outcome is called a conditional probability because we are conditioning the 
probability of one event occurring on the existence of something else. In this case, we’re 
conditioning the probability of our observation on our prior experience. 
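
As a rough sketch of what conditioning means, the code below estimates the same observation under two different sets of prior experience. The counts are invented for illustration (they are assumptions, not data from the text): one set stands in for experience on Earth, the other for the distant planet full of flying saucers described above.

```python
def conditional_probability(count_data_and_condition, count_condition):
    """Estimate P(D | X) as (# cases with both D and X) / (# cases with X)."""
    if count_condition <= 0:
        raise ValueError("the condition must have been observed at least once")
    return count_data_and_condition / count_condition


# Hypothetical counts: nights on which lights and a saucer-shaped object were
# seen, out of 10,000 nights lived in each setting.
p_d_given_earth = conditional_probability(1, 10_000)
p_d_given_saucer_planet = conditional_probability(6_000, 10_000)

print("P(D | experience on Earth) =", p_d_given_earth)
print("P(D | experience on saucer planet) =", p_d_given_saucer_planet)
```

The data D is the same in both cases; only the experience we condition on changes, and with it the probability.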

In the same way we used P for probability, we typically use shorter variable names for events and 
conditions. If you’re unfamiliar with reading equations, they can seem too terse at first. After a 
while, though, you'll find that shorter variable names aid readability and help you to see how 
equations generalize to larger classes of problems. We’ll assign all of our data to a single variable, D: 

D = bright light outside window, saucer-shaped object in sky 

So from now on when we refer to the probability of our set of data, we'll simply say P(D). 

Likewise, we use the variable X to represent our prior belief, like so: 

X = experience on Earth 

We can now write this equation as P(D | X). This is much easier to write and doesn’t change the 
meaning. 

Conditioning on Multiple Beliefs 

We can add more than one piece of prior knowledge, too, if more than one variable is going to 
significantly affect the probability. Suppose that it’s July 4th and you live in the United States. From 
prior experience you know that fireworks are common on the Fourth of July. Given your experience 
on Earth and the fact that it’s July 4th, the probability of seeing lights in the sky is less unlikely, and 
even the saucer-shaped object could be related to some fireworks display. You could rewrite this 
equation as: 

P(bright light outside window, saucer-shaped object in sky | July 4th, experience on Earth) = low 

Taking both these experiences into account, our conditional probability changed from "very low" to 
"low." 

Assuming Prior Beliefs in Practice 

In statistics, we don’t usually explicitly include a condition for all of our existing experiences, 
because it can be assumed. For that reason, in this book we won’t include a separate variable for 
this condition. However, in Bayesian analysis, it’s essential to keep in mind that our understanding 
of the world is always conditioned on our prior experience in the world. For the rest of this chapter, 
we’ll keep the "experience on Earth" variable around to remind us of this. 

Forming a Hypothesis 

So far we have our data, D (that we have seen a bright light and a saucer-shaped object), and our 
prior experience, X. In order to explain what you saw, you need to form some kind of hypothesis: a 
model about how the world works that makes a prediction. Hypotheses can come in many forms. 
All of our basic beliefs about the world are hypotheses: 

• If you believe the Earth rotates, you predict the sun will rise and set at certain times. 

• If you believe that your favorite baseball team is the best, you predict they will win 
more than the other teams. 

• If you believe in astrology, you predict that the alignment of the stars will describe 
people and events. 

Hypotheses can also be more formal or sophisticated: 

• A scientist may hypothesize that a certain treatment will slow the growth of cancer. 

• A quantitative analyst in finance may have a model of how the market will behave. 

• A deep neural network may predict which images are animals and which ones are 
plants. 

All of these examples are hypotheses because they have some way of understanding the world and 
use that understanding to make a prediction about how the world will behave. When we think of 
hypotheses in Bayesian statistics, we are usually concerned with how well they predict the data we 
observe. 

When you see the evidence and think A UFO!, you are forming a hypothesis. The UFO hypothesis is 
likely based on countless movies and television shows you've seen in your prior experience. We 
would define our first hypothesis as: 

H₁ = A UFO is in my back yard! 

But what is this hypothesis predicting? If we think of this situation backward, we might ask, "If 
there was a UFO in your back yard, what would you expect to see?" And you might answer, "Bright 
lights and a saucer-shaped object." Because H₁ predicts the data D, when we observe our data given 
our hypothesis, the probability of the data increases. Formally we write this as: 

P(D | H₁, X) ≫ P(D | X) 

This equation says: "The probability of seeing bright lights and a saucer-shaped object in the sky, 
given my belief that this is a UFO and my prior experience, is much higher [indicated by the double 
greater-than sign ≫] than just seeing bright lights and a saucer-shaped object in the sky without 
explanation." Here we’ve used the language of probability to demonstrate that our hypothesis 
explains the data. 

Spotting Hypotheses in Everyday Speech 

It’s easy to see a relationship between our everyday language and probability. Saying something is 
"surprising,” for example, might be the same as saying it has low-probability data based on our 
prior experiences. Saying something "makes sense" might indicate we have high-probability data 
based on our prior experiences. This may seem obvious once pointed out, but the key to 
probabilistic reasoning is to think carefully about how you interpret data, create hypotheses, and 
change your beliefs, even in an ordinary, everyday scenario. Without H₁, you’d be in a state of 
confusion because you have no explanation for the data you observed. 

GATHERING MORE EVIDENCE AND UPDATING YOUR BELIEFS 

Now you have your data and a hypothesis. However, given your prior experience as a skeptic, that 
hypothesis still seems pretty outlandish. In order to improve your state of knowledge and draw 
more reliable conclusions, you need to collect more data. This is the next step in statistical 
reasoning, as well as in your own intuitive thinking. 

To collect more data, we need to make more observations. In our scenario, you look out your 
window to see what you can observe: 

As you look toward the bright light outside, you notice more lights in the area. You also see that the 
large saucer-shaped object is held up by wires, and notice a camera crew. You hear a loud clap and 
someone call out "Cut!" 

You have, very likely, instantly changed your mind about what you think is happening in this scene. 
Your inference before was that you might be witnessing a UFO. Now, with this new evidence, you 
realize it looks more like someone is shooting a movie nearby. 

With this thought process, your brain has once again performed some sophisticated Bayesian 
analysis in an instant! Let’s break down what happened in your head in order to reason about 
events more carefully. 

You started with your initial hypothesis: 

H₁ = A UFO has landed! 

In isolation, this hypothesis, given your experience, is extremely unlikely: 

P(H₁ | X) = very, very low 

However, it was the only useful explanation you could come up with given the data you had 
available. When you observed additional data, you immediately realized that there’s another 
possible hypothesis—that a movie is being filmed nearby: 

H₂ = A film is being made outside your window 

In isolation, the probability of this hypothesis is also intuitively very low (unless you happen to live 
near a movie studio): 

P(H₂ | X) = very low 

Notice that we set the probability of H₁ as "very, very low" and the probability of H₂ as just "very 
low." This corresponds to your intuition: if someone came up to you, without any data, and asked, 
"Which do you think is more likely, a UFO appearing at night in your neighborhood or a movie being 
filmed next door?" you would say the movie scenario is more likely than a UFO appearance. 

Now we just need a way to take our new data into account when changing our beliefs. 

COMPARING HYPOTHESES 

You first accepted the UFO hypothesis, despite it being unlikely, because you didn’t initially have 
any other explanation. Now, however, there’s another possible explanation—a movie being 
filmed—so you have formed an alternate hypothesis. Considering alternate hypotheses is the 
process of comparing multiple theories using the data you have. 

When you see the wires, film crew, and additional lights, your data changes. Your updated data are: 

D_updated = bright lights, saucer-shaped object, wires, film crew, other lights, etc. . . . 



On observing this extra data, you change your conclusion about what was happening. Let’s break 
this process down into Bayesian reasoning. Your first hypothesis, H_1, gave you a way to explain your
data and end your confusion, but with your additional observations H_1 no longer explains the data
well. We can write this as: 


$$P(D_{\text{updated}} \mid H_1, X) = \text{very, very low}$$


You now have a new hypothesis, H_2, which explains the data much better, written as follows:

$$P(D_{\text{updated}} \mid H_2, X) \gg P(D_{\text{updated}} \mid H_1, X)$$

The key here is to understand that we’re comparing how well each of these hypotheses explains the 
observed data. When we say, "The probability of the data, given the second hypothesis, is much 
greater than the first,” we’re saying that what we observed is better explained by the second 
hypothesis. This brings us to the true heart of Bayesian analysis: the test of your beliefs is how well 
they explain the world. We say that one belief is more accurate than another because it provides a 
better explanation of the world we observe. 

Mathematically, we express this idea as the ratio of the two probabilities: 

$$\frac{P(D_{\text{updated}} \mid H_2, X)}{P(D_{\text{updated}} \mid H_1, X)}$$

When this ratio is a large number, say 1,000, it means "H_2 explains the data 1,000 times better
than H_1." Because H_2 explains the data many times better than H_1, we update our beliefs
from H_1 to H_2. This is exactly what happened when you changed your mind about the likely
explanation for what you observed. You now believe that what you’ve seen is a movie being made 
outside your window, because this is a more likely explanation of all the data you observed. 
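
As a rough sketch of this comparison in Python: the two likelihood values below are made-up placeholders (the chapter only describes them qualitatively), and the variable names are hypothetical, chosen just for illustration.

```python
# Hypothetical likelihoods: how probable the updated data is under each hypothesis.
# These numbers are illustrative assumptions, not values from the text.
likelihood_h1 = 0.0001  # P(D_updated | H_1, X): the UFO hypothesis
likelihood_h2 = 0.1     # P(D_updated | H_2, X): the movie hypothesis

# The ratio says how many times better H_2 explains the data than H_1.
ratio = likelihood_h2 / likelihood_h1
print(f"H_2 explains the data about {ratio:.0f} times better than H_1")
```

With these placeholder values the ratio comes out to roughly 1,000, matching the "large number" case described above. Different assumed likelihoods would give a different ratio, but the comparison works the same way.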

DATA INFORMS BELIEF; BELIEF SHOULD NOT INFORM DATA 

One final point worth stressing is that the only absolute in all these examples is your data. Your 
hypotheses change, and your experience in the world, X, may be different from someone else’s, but 
the data, D, is shared by all. 

Consider the following two formulas. The first is one we’ve used throughout this chapter: 

$$P(D \mid H, X)$$

which we read as "The probability of the data given my hypotheses and experience in the world,” or 
more plainly, "How well my beliefs explain what I observe.” 

But there is a reversal of this, common in everyday thinking, which is: 

$$P(H \mid D, X)$$

We read this as "The probability of my beliefs given the data and my experiences in the world,” or 
"How well what I observe supports what I believe.” 

In the first case, we change our beliefs according to data we gather and observations we make 
about the world that describe it better. In the second case, we gather data to support our existing 
beliefs. Bayesian thinking is about changing your mind and updating how you understand the
world. The data we observe is all that is real, so our beliefs ultimately need to shift until they align
with the data. 

In life, too, your beliefs should always be mutable. 

As the film crew packs up, you notice that all the vans bear military insignia. The crew takes off 
their coats to reveal army fatigues and you overhear someone say, "Well, that should have fooled 
anyone who saw that... good thinking.” 

With this new evidence, your beliefs may shift again! 


WRAPPING UP 


Let’s recap what you've learned. Your beliefs start with your existing experience of the world, X. 
When you observe data, D, it either aligns with your experience, P(D | X) = very high, or it surprises
you, P(D | X) = very low. To understand the world, you rely on beliefs you have about what you
observe, or hypotheses, H. Oftentimes a new hypothesis can help you explain the data that surprises
you, P(D | H, X) >> P(D | X). When you gather new data or come up with new ideas, you can create
more hypotheses, H_1, H_2, H_3, ... You update your beliefs when a new hypothesis explains your data
much better than your old hypothesis: 


$$\frac{P(D \mid H_2, X)}{P(D \mid H_1, X)} = \text{large number}$$


Finally, you should be far more concerned with data changing your beliefs than with ensuring data 
supports your beliefs, P(H | D).

With these foundations set up, you’re ready to start adding numbers into the mix. In the rest of Part 
I, you’ll model your beliefs mathematically to precisely determine how and when you should
change them. 


EXERCISES 

Try answering the following questions to see how well you understand Bayesian reasoning. The 
solutions can be found at https://nostarch.com/learnbayes/ . 

1. Rewrite the following statements as equations using the mathematical notation you 
learned in this chapter: 

• The probability of rain is low 

• The probability of rain given that it is cloudy is high 

• The probability of you having an umbrella given it is raining is much greater 
than the probability of you having an umbrella in general. 

2. Organize the data you observe in the following scenario into a mathematical notation, 
using the techniques we’ve covered in this chapter. Then come up with a hypothesis to 
explain this data: 

You come home from work and notice that your front door is open and the side window is 
broken. As you walk inside, you immediately notice that your laptop is missing. 

3. The following scenario adds data to the previous one. Demonstrate how this new 
information changes your beliefs and come up with a second hypothesis to explain the data, 
using the notation you’ve learned in this chapter. 



A neighborhood child runs up to you and apologizes profusely for accidentally throwing a 
rock through your window. They claim that they saw the laptop and didn’t want it stolen so 
they opened the front door to grab it, and your laptop is safe at their house. 
MEASURING UNCERTAINTY



In Chapter 1 we looked at some basic reasoning tools we use intuitively to understand how data 
informs our beliefs. We left a crucial issue unresolved: how can we quantify these tools? In 
probability theory, rather than describing beliefs with terms like very low and high, we need to 
assign real numbers to these beliefs. This allows us to create quantitative models of our 
understanding of the world. With these models, we can see just how much the evidence changes 
our beliefs, decide when we should change our thinking, and gain a solid understanding of our 
current state of knowledge. In this chapter, we will apply this concept to quantify the probability of 
an event. 

WHAT IS A PROBABILITY? 

The idea of probability is deeply ingrained in our everyday language. Whenever you say something 
such as "That seems unlikely!" or "I would be surprised if that’s not the case" or "I’m not sure about 
that," you're making a claim about probability. Probability is a measurement of how strongly we 
believe things about the world. 

In the previous chapter we used abstract, qualitative terms to describe our beliefs. To really analyze 
how we develop and change beliefs, we need to define exactly what a probability is by more 
formally quantifying P(X )—that is, how strongly we believe in X. 

We can consider probability an extension of logic. In basic logic we have two values, true and false, 
which correspond to absolute beliefs. When we say something is true, it means that we are 
completely certain it is the case. While logic is useful for many problems, very rarely do we believe 
anything to be absolutely true or absolutely false; there is almost always some level of uncertainty 
in every decision we make. Probability allows us to extend logic to work with uncertain values 
between true and false. 

Computers commonly represent true as 1 and false as 0, and we can use this model with probability 
as well. P(X) = 0 is the same as saying that X = false, and P(X) = 1 is the same as X = true. Between 0
and 1 we have an infinite range of possible values. A value closer to 0 means we are more certain 
that something is false, and a value closer to 1 means we’re more certain something is true. It’s 
worth noting that a value of 0.5 means that we are completely unsure whether something is true or 
false. 

Another important part of logic is negation. When we say "not true" we mean false. Likewise, saying
"not false" means true. We want probability to work the same way, so we make sure that the
probability of X and the negation of the probability of X sum to 1 (in other words, values are
either X, or not X). We can express this using the following equation:

$$P(X) + \neg P(X) = 1$$

NOTE

The ¬ symbol means “negation” or “not.”

Using this logic, we can always find the negation of P(X) by subtracting it from 1. So, for example,
if P(X) = 1, then its negation, 1 - P(X), must equal 0, conforming to our basic logic rules. And if P(X) =
0, then its negation 1 - P(X) = 1.
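
The negation rule is simple enough to write as a one-line function. The sketch below is only an illustration. The function name `negation` and the range check are our own additions, not something defined in the text.

```python
def negation(p):
    """Return the probability of NOT X, given p = P(X)."""
    if not 0 <= p <= 1:
        raise ValueError("a probability must be between 0 and 1")
    return 1 - p

print(negation(1))     # complete certainty that X is true leaves 0 for NOT X
print(negation(0))     # and the reverse
print(negation(0.25))  # anything in between works the same way
```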

The next question is how to quantify that uncertainty. We could arbitrarily pick values: say 0.95 
means very certain, and 0.05 means very uncertain. However, this doesn’t help us determine 
probability much more than the abstract terms we’ve used before. Instead, we need to use formal 
methods to calculate our probabilities. 

CALCULATING PROBABILITIES BY COUNTING OUTCOMES OF 
EVENTS 

The most common way to calculate probability is to count outcomes of events. We have two sets of 
outcomes that are important. The first is all possible outcomes of an event. For a coin toss, this 
would be "heads” or "tails." The second is the count of the outcomes you’re interested in. If you’ve 
decided that heads means you win, the outcomes you care about are those involving heads (in the 
case of a single coin toss, just one event). The events you’re interested in can be anything: flipping a 
coin and getting heads, catching the flu, or a UFO landing outside your bedroom. Given these two 
sets of outcomes, the ones you’re interested in and all the possible ones, all we care about is
the ratio of the number of outcomes we’re interested in to the total number of possible outcomes.

We’ll use the simple example of a coin flip, where the only possible outcomes are the coin landing 
on heads or landing on tails. The first step is to make a count of all the possible events, which in this 
case is only two: heads or tails. In probability theory, we use Ω (the capital Greek letter omega) to
indicate the set of all events:

$$\Omega = \{\text{heads}, \text{tails}\}$$

We want to know the probability of getting a heads in a single coin toss, written as P(heads). We 
therefore look at the number of outcomes we care about, 1, and divide that by the total number of 
possible outcomes, 2: 

$$\frac{\{\text{heads}\}}{\{\text{heads}, \text{tails}\}}$$

For a single coin toss, we can see that there is one outcome we care about out of two possible 
outcomes. So the probability of heads is just: 

$$\frac{1}{2}$$

Now let’s ask a trickier question: what is the probability of getting at least one heads when we toss 
two coins? Our list of possible events is more complicated; it’s not just {heads, tails} but rather all
possible pairs of heads and tails: 




$$\Omega = \{(\text{heads}, \text{heads}), (\text{heads}, \text{tails}), (\text{tails}, \text{tails}), (\text{tails}, \text{heads})\}$$


To figure out the probability of getting at least one heads, we look at how many of our pairs match 
our condition, which in this case is: 

{(heads, heads),(heads, tails),(tails, heads)} 

As you can see, the set of events we care about has 3 elements, and there are 4 possible pairs we 
could get. This means that P(at least one heads) = 3/4. 
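
The same count can be done mechanically. Here is a minimal sketch using Python’s standard library. The variable names are our own, and it assumes every pair of results is equally likely, as the counting method requires.

```python
from fractions import Fraction
from itertools import product

# Sample space for two coin tosses: every ordered pair of results.
omega = list(product(["heads", "tails"], repeat=2))

# Event of interest: at least one of the two tosses is heads.
event = [outcome for outcome in omega if "heads" in outcome]

p_at_least_one_heads = Fraction(len(event), len(omega))
print(len(event), len(omega), p_at_least_one_heads)
```

Changing `repeat=2` to a larger number extends the count to more coins, although the list of outcomes doubles with every extra toss.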

These are simple examples, but if you can count the events you care about and the total possible 
events, you can come up with a quick and easy probability. As you can imagine, as examples get 
more complicated, manually counting each possible outcome becomes unfeasible. Solving harder 
probability problems of this nature often involves a field of mathematics called combinatorics. 

In Chapter 4 we’ll see how we can use combinatorics to solve a slightly more complex problem. 

CALCULATING PROBABILITIES AS RATIOS OF BELIEFS 

Counting events is useful for physical objects, but it’s not so great for the vast majority of real-life 
probability questions we might have, such as:

• "What’s the probability it will rain tomorrow?”

• "Do you think she’s the president of the company?"

• "Is that a UFO!?"

Nearly every day you make countless decisions based on probability, but if someone asked you to 
solve "How likely do you think you are to make your train on time?" you couldn’t calculate it with the
method just described. 

This means we need another approach to probability that can be used to reason about these more 
abstract problems. As an example, suppose you’re chatting about random topics with a friend. Your 
friend asks if you’ve heard of the Mandela effect and, since you haven’t, proceeds to tell you: "It’s 
this weird thing where large groups of people misremember events. For example, many people 
recall Nelson Mandela dying in prison in the 80s. But the wild thing is that he was released from 
prison, became president of South Africa, and didn’t die until 2013!" Skeptically, you turn to your 
friend and say, "That sounds like internet pop psychology. I don’t think anyone seriously 
misremembered that; I bet there’s not even a Wikipedia entry on it!” 

From this, you want to measure P(No Wikipedia article on Mandela effect). Let’s assume you are in 
an area with no cell phone reception, so you can’t quickly verify the answer. You have a high 
certainty of your belief that there is no such article, and therefore you want to assign a high 
probability for this belief, but you need to formalize that probability by assigning it a number from 
0 to 1. Where do you start? 

You decide to put your money where your mouth is, telling your friend: "There’s no way that’s real. 
How about this: you give me $5 if there is no article on the Mandela effect, and I’ll give you $100 if 
there is one!" Making bets is a practical way that we can express how strongly we hold our beliefs.
You believe that the article’s existence is so unlikely that you’ll give your friend $100 if you are 
wrong and only get $5 from them if you are right. Because we’re talking about quantitative values 
regarding our beliefs, we can start to figure out an exact probability for your belief that there is no 
Wikipedia article on the Mandela effect. 



Using Odds to Determine Probability 

Your friend’s hypothesis is that there is an article about the Mandela effect: H_{article}. And you have an
alternate hypothesis: H_{no article}.

We don’t have concrete probabilities yet, but your bet expresses how strongly you believe in your 
hypothesis by giving the odds of the bet. Odds are a common way to represent beliefs as a ratio of 
how much you would be willing to pay if you were wrong about the outcome of an event to how 
much you’d want to receive for being correct. For example, say the odds of a horse winning a race 
are 12 to 1. That means if you pay $1 to take the bet, the track will pay you $12 if the horse wins. 
While odds are commonly expressed as "m to n" we can also view them as a simple ratio: m/n. 
There is a direct relationship between odds and probabilities. 

We can express your bet in terms of odds as "100 to 5.” So how can we turn this into probability? 
Your odds represent how many times more strongly you believe there isn't an article than you 
believe there is an article. We can write this as the ratio of your belief in there being no article,
P(H_{no article}), to your friend’s belief that there is one, P(H_{article}), like so:

$$\frac{P(H_{\text{no article}})}{P(H_{\text{article}})} = \frac{100}{5} = 20$$

From the ratio of these two hypotheses, we can see that your belief in the hypothesis that there is 
no article is 20 times greater than your belief in your friend’s hypothesis. We can use this fact to 
work out the exact probability for your hypothesis using some high school algebra. 

Solving for the Probabilities 

We start writing our equation in terms of the probability of your hypothesis, since this is what we 
are interested in knowing: 

$$P(H_{\text{no article}}) = 20 \times P(H_{\text{article}})$$

We can read this equation as "The probability that there is no article is 20 times greater than the 
probability there is an article." 

There are only two possibilities: either there is a Wikipedia article on the Mandela effect or there 
isn’t. Because our two hypotheses cover all possibilities, we know that the probability of an article is 
just 1 minus the probability of no article, so we can substitute P(H_{article}) with its value in terms
of P(H_{no article}) in our equation like so:

$$P(H_{\text{no article}}) = 20 \times (1 - P(H_{\text{no article}}))$$

Next we can expand 20 × (1 - P(H_{no article})) by multiplying both parts in the parentheses by 20 and we
get:

$$P(H_{\text{no article}}) = 20 - 20 \times P(H_{\text{no article}})$$

We can remove the P(H_{no article}) term from the right side of the equation by adding 20 × P(H_{no article}) to
both sides to isolate P(H_{no article}) on the left side of the equation:

$$21 \times P(H_{\text{no article}}) = 20$$

And we can divide both sides by 21, finally arriving at:

$$P(H_{\text{no article}}) = \frac{20}{21}$$


Now you have a nice, clearly defined value between 0 and 1 to assign as a concrete, quantitative 
probability to your belief in the hypothesis that there is no article on the Mandela effect. We can 
generalize this process of converting odds to probability using the following equation:

$$P(H) = \frac{O(H)}{1 + O(H)}$$


Often in practice, when you’re confronted with assigning a probability to an abstract belief, it can be 
very helpful to think of how much you would bet on that belief. You would likely take a billion to 1 
bet that the sun will rise tomorrow, but you might take much lower odds for your favorite baseball 
team winning. In either case, you can calculate an exact number for the probability of that belief 
using the steps we just went through. 
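
The odds-to-probability conversion can be sketched in a few lines of Python. The function name `odds_to_probability` is hypothetical. It takes the odds in favor of a hypothesis, written O(H), and returns O(H) / (1 + O(H)), using exact fractions so the result matches the algebra.

```python
from fractions import Fraction

def odds_to_probability(odds):
    """Convert odds O(H) in favor of a hypothesis into P(H) = O(H) / (1 + O(H))."""
    odds = Fraction(odds)
    return odds / (1 + odds)

# The Mandela effect bet: $100 against $5, so the odds are 100/5 = 20.
p_no_article = odds_to_probability(Fraction(100, 5))
print(p_no_article)         # the exact fraction
print(float(p_no_article))  # the same value as a decimal
```

Running this gives 20/21, a little over 0.95, which agrees with the value we derived by hand.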


Measuring Beliefs in a Coin Toss 

We now have a method for determining the probability of abstract ideas using odds, but the real 
test of the robustness of this method is whether or not it still works with our coin toss, which we 
calculated by counting outcomes. Rather than thinking about a coin toss as an event, we can 
rephrase the question as "How strongly do I believe the next coin toss will be heads?” Now we’re 
not talking about P(heads) but rather a hypothesis or belief about the coin toss, P(H_{heads}).

Just like before, we need an alternate hypothesis to compare our belief with. We could say the
alternate hypothesis is simply not getting heads, ¬H_{heads}, but the option of getting tails, H_{tails}, is closer to
our everyday language, so we’ll use that. At the end of the day what we care about most is making
sense. However, it is important for this discussion to acknowledge that:

$$H_{\text{tails}} = \neg H_{\text{heads}}, \text{ and } P(H_{\text{tails}}) = 1 - P(H_{\text{heads}})$$

We can look at how to model our beliefs as the ratio between these competing hypotheses: 

$$\frac{P(H_{\text{heads}})}{P(H_{\text{tails}})}$$

Remember that we want to read this as "How many times greater do I believe that the outcome will 
be heads than I do that it will be tails?” As far as bets go, since each outcome is equally uncertain, 
the only fair odds are 1 to 1. Of course, we can pick any odds as long as the two values are equal: 2 
to 2, 5 to 5, or 10 to 10. All of these have the same ratio: 

$$\frac{P(H_{\text{heads}})}{P(H_{\text{tails}})} = \frac{10}{10} = \frac{5}{5} = \frac{2}{2} = \frac{1}{1} = 1$$

Given that the ratio of these is always the same, we can simply repeat the process we used to 
calculate the probability of there being no Wikipedia article on the Mandela effect. We know that 
our probability of heads and probability of tails must sum to 1, and we know that the ratio of these 
two probabilities is also 1. So, we have two equations that describe our probabilities: 

$$P(H_{\text{heads}}) + P(H_{\text{tails}}) = 1, \text{ and } \frac{P(H_{\text{heads}})}{P(H_{\text{tails}})} = 1$$



If you walk through the process we used when reasoning about the Mandela effect, solving in terms 
of P(H_{heads}), you should find the only possible solution to this problem is 1/2. This is exactly the same
result we arrived at with our first approach to calculating probabilities of events, and it shows that
our method for calculating the probability of a belief is robust enough to use for the probability of 
events! 
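
For readers who want to see the algebra written out, here is a worked sketch that follows the same moves as the Mandela effect derivation. The only inputs are the two equations stated above.

```latex
\begin{align*}
\frac{P(H_{\text{heads}})}{P(H_{\text{tails}})} &= 1 \\
P(H_{\text{heads}}) &= 1 \times P(H_{\text{tails}}) = 1 - P(H_{\text{heads}}) \\
2 \times P(H_{\text{heads}}) &= 1 \\
P(H_{\text{heads}}) &= \frac{1}{2}
\end{align*}
```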

With these two methods in hand, it’s reasonable to ask which one you should use in which 
situation. The good news is, since we can see they are equivalent, you can use whichever method is 
easiest for a given problem. 

WRAPPING UP 

In this chapter we explored two different types of probabilities: those of events and those of beliefs. 
For events, we define probability as the ratio of the number of outcomes we care about to the number
of all possible outcomes.

While this is the most common definition of probability, it is difficult to apply to beliefs because 
most practical, everyday probability problems do not have clear-cut outcomes and so aren’t 
intuitively assigned discrete numbers. 

To calculate the probability of beliefs, then, we need to establish how many times more we believe 
in one hypothesis over another. One good test of this is how much you would be willing to bet on 
your belief—for example, if you made a bet with a friend in which you’d give them $1,000 for proof 
that UFOs exist and would receive only $1 from them for proof that UFOs don’t exist. Here you are 
saying you believe UFOs do not exist 1,000 times more than you believe they do exist. 

With these tools in hand, you can calculate the probability for a wide range of problems. In the next 
chapter you’ll learn how you can apply the basic operators of logic, AND and OR, to our 
probabilities. But before moving on, try using what you’ve learned in this chapter to complete the 
following exercises. 

EXERCISES 

Try answering the following questions to make sure you understand how we can assign real values 
between 0 and 1 to our beliefs. Solutions to the questions can be found 
at https://nostarch.com/learnbayes/.

1. What is the probability of rolling two six-sided dice and getting a value greater than 
7? 

2. What is the probability of rolling three six-sided dice and getting a value greater than 
7? 

3. The Yankees are playing the Red Sox. You’re a diehard Sox fan and bet your friend 
they’ll win the game. You’ll pay your friend $30 if the Sox lose and your friend will have to 
pay you only $5 if the Sox win. What is the probability you have intuitively assigned to the 
belief that the Red Sox will win? 
THE LOGIC OF UNCERTAINTY



In Chapter 2, we discussed how probabilities are an extension of the true and false values in logic 
and are expressed as values between 0 and 1. The power of probability is in the ability to express
an infinite range of possible values between these extremes. In this chapter, we’ll discuss how the
rules of logic, based on logical operators, also apply to probability. In traditional logic, there
are three important operators: 

• AND

• OR

• NOT

With these three simple operators we can reason about any argument in traditional logic. For 
example, consider this statement: If it is raining AND I am going outside, I will need an umbrella. This 
statement contains just one logical operator: AND. Because of this operator we know that if it’s true 
that it is raining, AND it is true that I am going outside, I'll need an umbrella. 

We can also phrase this statement in terms of our other operators: If it is NOT raining OR if I am 
NOT going outside, I will NOT need an umbrella. In this case we are using basic logical operators and 
facts to make a decision about when we do and don’t need an umbrella. 

However, this type of logical reasoning works well only when our facts have absolute true or false 
values. This case is about deciding whether I need an umbrella right now, so we can know for 
certain if it’s currently raining and whether I'm going out, and therefore I can easily determine if I 
need an umbrella. Suppose instead we ask, "Will I need an umbrella tomorrow?" In this case our 
facts become uncertain, because the weather forecast gives me only a probability for rain tomorrow 
and I may be uncertain whether or not I need to go out. 

This chapter will explain how we can extend our three logical operators to work with probability, 
allowing us to reason about uncertain information the same way we can with facts in traditional 
logic. We’ve already seen how we can define NOT for probabilistic reasoning: 

$$\neg P(X) = 1 - P(X)$$

In the rest of this chapter we’ll see how we can use the two remaining operators, AND and OR, to 
combine probabilities and give us more accurate and useful data. 


COMBINING PROBABILITIES WITH AND 

In statistics we use AND to talk about the probability of combined events. For example, the 
probability of: 

• Rolling a 6 AND flipping a heads 

• It raining AND you forgetting your umbrella 

• Winning the lottery AND getting struck by lightning 

To understand how we can define AND for probability, we’ll start with a simple example involving
a coin and a six-sided die.


Solving a Combination of Two Probabilities 

Suppose we want to know the probability of getting a heads in a coin flip AND rolling a 6 on a die. 
We know that the probability of each of these events individually is: 

$$P(\text{heads}) = \frac{1}{2}, \quad P(\text{six}) = \frac{1}{6}$$

Now we want to know the probability of both of these things occurring, written as: 

P(heads, six) = ? 


We can calculate this the same way we did in Chapter 2: we count the outcomes we care about and 
divide that by the total outcomes. 

For this example, let’s imagine these events happening in sequence. When we flip the coin we have 
two possible outcomes, heads and tails, as depicted in Figure 3-1. 



Now, for each possible coin flip there are six possible results for the roll of our die, as depicted 
in Figure 3-2. 




Figure 3-2: Visualizing the possible outcomes from a coin toss and the roll of a die 

Using this visualization, we can just count our possible solutions. There are 12 possible outcomes of 
flipping a coin and rolling a die, and we care about only one of these outcomes, so: 

$$P(\text{heads}, \text{six}) = \frac{1}{12}$$
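
The branches of the tree can also be listed programmatically. This is a small sketch of that idea. The names `coin`, `die`, and `outcomes` are our own, and it assumes a fair coin and a fair die so that every branch is equally likely.

```python
from fractions import Fraction
from itertools import product

coin = ["heads", "tails"]
die = [1, 2, 3, 4, 5, 6]

# Every (coin, die) pair, mirroring the branches of the tree.
outcomes = list(product(coin, die))

# Only one pair matches "heads AND six".
matches = [o for o in outcomes if o == ("heads", 6)]

p_heads_and_six = Fraction(len(matches), len(outcomes))
print(len(outcomes), p_heads_and_six)
```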



Now we have a solution for this particular problem. However, what we really want is a general rule 
that will help us calculate this for any number of probability combinations. Let’s see how to expand 
our solution. 


Applying the Product Rule of Probability 

We'll use the same problem for this example: what is the probability of flipping a heads and rolling 
a 6? First we need to figure out the probability of flipping a heads. Looking at our branching paths, 
we can figure out how many paths split off given the probabilities. We care only about the paths 
that include heads. Because the probability of heads is 1/2, we eliminate half of our possibilities. 
Then, if we look only at the remaining branch of possibilities for the heads, we can see that there is 
only a 1/6 chance of getting the result we want: rolling a 6 on a six-sided die. In Figure 3-3 we can 
visualize this reasoning and see that there is only one outcome we care about. 




Figure 3-3: Visualizing the probability of both getting a heads and rolling a 6

If we multiply these two probabilities, we can see that:

$$\frac{1}{2} \times \frac{1}{6} = \frac{1}{12}$$


This is exactly the answer we had before, but rather than counting all possible events, we counted 
only the probabilities of the events we care about by following along the branches. This is easy 
enough to do visually for such a simple problem, but the real value of showing you this is that it 
illustrates a general rule for combining probabilities with AND: 

$$P(A, B) = P(A) \times P(B)$$

Because we are multiplying our results, also called taking the product of these results, we refer to 
this as the product rule of probability. 

This rule can then be expanded to include more probabilities. If we think of P(A,B) as a single
probability, we can combine it with a third probability, P(C), by repeating this process:

$$P(P(A, B), C) = P(A, B) \times P(C) = P(A) \times P(B) \times P(C)$$

So we can use our product rule to combine an unlimited number of events to get our final 
probability. 
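
A sketch of the product rule for any number of events might look like the following. The function name `probability_of_all` is hypothetical, and the rule as written here assumes the events do not influence one another, as is the case for a coin flip and a die roll.

```python
from fractions import Fraction
from math import prod

def probability_of_all(*probabilities):
    """Product rule: multiply the probabilities of the combined events.

    Assumes the events do not influence one another, as with a coin and a die.
    """
    return prod(probabilities)

p_heads = Fraction(1, 2)
p_six = Fraction(1, 6)

print(probability_of_all(p_heads, p_six))           # heads AND six
print(probability_of_all(p_heads, p_six, p_heads))  # plus a second coin landing heads
```

The first call reproduces the 1/12 we counted from the tree, and the second shows how a third event simply adds one more factor.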

Example: Calculating the Probability of Being Late 

Let’s look at an example of using the product rule for a slightly more complex problem than rolling 
dice or flipping coins. Suppose you promised to meet a friend for coffee at 4:30 on the other side of 
town, and you plan to take public transportation. It’s currently 3:30. Thankfully the station you’re at 
has both a train and bus that can take you where you need to go: 

• The next bus comes at 3:45 and takes 45 minutes to get you to the coffee shop. 

• The next train comes at 3:50, and will get you within a 10-minute walk in 30 minutes.

Both the train and the bus will get you there at 4:30 exactly. Because you’re cutting it so close, any
delay will make you late. The good news is that, since the bus arrives before the train, if the bus is 
late and the train is not you’ll still be on time. If the bus is on time and the train is late, you’ll also be 
fine. The only situation that will make you late is if both the bus and the train are late to arrive. How 
can you figure out the probability of being late? 

First, you need to establish the probability of both the train being late and the bus being late. Let’s 
assume the local transit authority publishes these numbers (later in the book, you’ll learn how to 
estimate this from data). 


$$P(\text{Late}_{\text{train}}) = 0.15$$
$$P(\text{Late}_{\text{bus}}) = 0.2$$


The published data tells us that 15 percent of the time the train is late, and 20 percent of the time 
the bus is late. Since you’ll be late only if both the bus and the train are late, we can use the product 
rule to solve this problem: 

$$P(\text{Late}) = P(\text{Late}_{\text{train}}) \times P(\text{Late}_{\text{bus}}) = 0.15 \times 0.2 = 0.03$$

Even though there’s a pretty reasonable chance that either the bus or the train will be late, the 
probability that they will both be late is significantly less, at only 0.03. We can also say there is a 3 
percent chance that both will be late. With this calculation done, you can be a little less stressed 
about being late. 
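
Here is the same calculation as a short sketch, with the complementary chance of arriving on time added using the NOT rule. The variable names are our own, and the calculation assumes, as the example does implicitly, that a late bus tells you nothing about whether the train is late.

```python
from fractions import Fraction

# Published lateness rates from the example.
p_late_train = Fraction(15, 100)
p_late_bus = Fraction(20, 100)

# You are late only if the train AND the bus are both late.
p_late = p_late_train * p_late_bus

# NOT late: one minus the probability of being late.
p_on_time = 1 - p_late

print(float(p_late), float(p_on_time))
```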



COMBINING PROBABILITIES WITH OR 

The other essential rule of logic is combining probabilities with OR, some examples of which 
include: 

• Catching the flu OR getting a cold 

• Flipping a heads on a coin OR rolling a 6 on a die 

• Getting a flat tire OR running out of gas 

The probability of one event OR another event occurring is slightly more complicated because the 
events can either be mutually exclusive or not mutually exclusive. Events are mutually exclusive if 
one event happening implies the other possible events cannot happen. For example, the possible 
outcomes of rolling a die are mutually exclusive because a single roll cannot yield both a 1 and a 6. 
However, say a baseball game will be cancelled if it is either raining or the coach is sick; these 
events are not mutually exclusive because it is perfectly possible that the coach is sick and it rains. 


Calculating OR for Mutually Exclusive Events 


The process of combining two events with OR feels logically intuitive. If you’re asked, "What is the 
probability of getting heads or tails on a coin toss?" you would say, "1." We know that: 

$$P(\text{heads}) = \frac{1}{2}, \quad P(\text{tails}) = \frac{1}{2}$$


Intuitively, we might just add the probability of these events together. We know this works because 
heads and tails are the only possible outcomes, and the probability of all possible outcomes must 
equal 1. If the probabilities of all possible events did not equal 1, then we would have some 
outcome that was missing. So how do we know that there would need to be a missing outcome if 
the sum was less than 1? 

Suppose we know that the probability of heads is P(heads) = 1/2, but someone claimed that the 
probability of tails was P(tails) = 1/3. We also know from before that the probability of not getting 
heads must be: 

$$\text{NOT } P(\text{heads}) = 1 - \frac{1}{2} = \frac{1}{2}$$


Since the probability of not getting heads is 1/2 and the claimed probability for tails is only 1/3, 
either there is a missing event or our probability for tails is incorrect. 

From this we can see that, as long as events are mutually exclusive, we can simply add up the
probabilities of the events to get the probability of one event OR the other happening.
Another example of this is rolling a die. We know that the
probability of rolling a 1 is 1/6, and the same is true for rolling a 2: 

$$P(\text{one}) = \frac{1}{6}, \quad P(\text{two}) = \frac{1}{6}$$


So we can perform the same operation, adding the two probabilities, and see that the combined 
probability of rolling either a 1 OR a 2 is 2/6, or 1/3: 

$$\frac{1}{6} + \frac{1}{6} = \frac{2}{6} = \frac{1}{3}$$



Again, this makes intuitive sense. 



This addition rule applies only to combinations of mutually exclusive outcomes. In probabilistic 
terms, mutually exclusive means that: 

$$P(A) \text{ AND } P(B) = 0$$

That is, the probability of getting both A AND B together is 0. We see that this holds for our 
examples: 

• It is impossible to flip one coin and get both heads and tails. 

• It is impossible to roll both a 1 and a 2 on a single roll of a die. 
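The addition rule for mutually exclusive events can be checked by counting directly from the sample space. The following is a minimal R sketch, assuming a fair six-sided die; the variable names are illustrative and not from the text.

```r
# Sample space for one roll of a fair six-sided die; each face has probability 1/6.
faces <- 1:6
p_face <- rep(1 / 6, length(faces))

# The two events we want to combine with OR.
is_one <- faces == 1
is_two <- faces == 2

p_one <- sum(p_face[is_one])
p_two <- sum(p_face[is_two])

# No face is both a 1 and a 2, so the AND probability is 0 (mutually exclusive).
p_both <- sum(p_face[is_one & is_two])

# Probability of a 1 OR a 2, counted directly from the sample space.
p_either <- sum(p_face[is_one | is_two])

print(p_both)
print(p_either)
print(isTRUE(all.equal(p_either, p_one + p_two)))
```

Because the AND probability is 0, the directly counted OR probability matches the simple sum of the two individual probabilities.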

To really understand combining probabilities with OR, we need to look at the case where events 
are not mutually exclusive. 

Using the Sum Rule for Non-Mutually Exclusive Events 

Again using the example of rolling a die and flipping a coin, let’s look at the probability of either 
flipping heads OR rolling a 6. Many newcomers to probability may naively assume that adding 
probabilities will work in this case as well. Given that we know that P(heads) = 1/2 and P(six) = 
1/6, it might initially seem plausible that the probability of either of these events is simply 4/6. It 
becomes obvious that this doesn’t work, however, when we consider the possibility of either 
flipping a heads or rolling a number less than 6. Because P(less than six) = 5/6, adding these 
probabilities together gives us 8/6, which is greater than 1! Since this violates the rule that 
probabilities must be between 0 and 1, we must have made a mistake. 

The trouble is that flipping a heads and rolling a 6 are not mutually exclusive. As we know from 
earlier in the chapter, P(heads, six) = 1/12. Because the probability of both events happening at the 
same time is not 0, we know they are, by definition, not mutually exclusive. 

The reason that adding our probabilities doesn’t work for non-mutually exclusive events is that 
doing so double-counts the events where both things happen. As an example of overcounting, 
let’s look at all of the outcomes of our combined coin toss and die roll that contain heads: 

Heads — 1 
Heads — 2 
Heads — 3 
Heads — 4 
Heads — 5 
Heads — 6 

These outcomes represent 6 out of the 12 possible outcomes, which we expect since P(heads) = 
1/2. Now let’s look at all outcomes that include rolling a 6: 

Heads — 6 
Tails — 6 

These outcomes represent the 2 out of 12 possible outcomes that will result in us rolling a 6, which 
again we expect because P(six) = 1/6. Since there are six outcomes that satisfy the condition of 
flipping a heads and two that satisfy the condition of rolling a 6, we might be tempted to say that 
there are eight outcomes that represent getting either heads or rolling a 6. However, we would be 
double-counting because Heads — 6 appears in both lists. There are, in fact, only 7 out of 12 unique 
outcomes. If we naively add P(heads) and P(six), we end up overcounting. 
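The double-counting described above can be reproduced by enumerating all 12 combined outcomes. This is a small R sketch, assuming a fair coin and a fair die; the data frame and variable names are illustrative.

```r
# All 12 combinations of one coin toss and one die roll.
outcomes <- expand.grid(coin = c("heads", "tails"), die = 1:6,
                        stringsAsFactors = FALSE)
n_total <- nrow(outcomes)

has_heads <- outcomes$coin == "heads"
has_six <- outcomes$die == 6

n_heads <- sum(has_heads)             # outcomes containing heads
n_six <- sum(has_six)                 # outcomes containing a 6
n_both <- sum(has_heads & has_six)    # the outcome that appears in both lists
n_either <- sum(has_heads | has_six)  # unique outcomes with heads OR a 6

# Naive addition overcounts by exactly the number of shared outcomes.
print(c(naive = n_heads + n_six, unique = n_either, shared = n_both))
print(n_either / n_total)
```

The naive count is 8, while the unique count is 7, and the difference is the single shared outcome Heads and 6.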



To correct our probabilities, we must add up all of our probabilities and then subtract the 
probability of both events occurring. This leads us to the rule for combining non-mutually exclusive 
probabilities with OR, known as the sum rule of probability: 


$$P(A) \text{ OR } P(B) = P(A) + P(B) - P(A, B)$$


We add the probability of each event happening and then subtract the probability of both events 
happening, to ensure we are not counting these probabilities twice since they are a part of 
both P(A) and P(B). So, using our die roll and coin toss example, the probability of rolling a 6 
or flipping a heads is: 


$$P(\text{heads}) \text{ OR } P(\text{six}) = P(\text{heads}) + P(\text{six}) - P(\text{heads}, \text{six}) = \frac{1}{2} + \frac{1}{6} - \frac{1}{12} = \frac{7}{12}$$


Let’s take a look at a final OR example to really cement this idea. 


Example: Calculating the Probability of Getting a Hefty Fine 

Imagine a new scenario. You were just pulled over for speeding while on a road trip. You realize 
you haven’t been pulled over in a while and may have forgotten to put either your new registration 
or your new insurance card in the glove box. If either one of these is missing, you’ll get a more 
expensive ticket. Before you open the glove box, how can you assign a probability that you’ll have 
forgotten one or the other of your cards and you’ll get the higher ticket? 

You’re pretty confident that you put your registration in the car, so you assign a 0.7 probability to 
your registration being in the car. However, you’re also pretty sure that you left your insurance 
card on the counter at home, so you assign only a 0.2 chance that your new insurance card is in the 
car. So we know that: 


P(registration) = 0.7 
P(insurance) = 0.2 

However, these values are the probabilities that you do have these things in the glove box. You’re 
worried about whether either one is missing. To get the probabilities of missing items, we simply 
use negation: 

$$P(\text{Missing}_{\text{reg}}) = 1 - P(\text{registration}) = 0.3$$
$$P(\text{Missing}_{\text{ins}}) = 1 - P(\text{insurance}) = 0.8$$

If we try using our addition method, instead of the complete sum rule, to get the combined 
probability, we see that we have a probability greater than 1: 

$$P(\text{Missing}_{\text{reg}}) + P(\text{Missing}_{\text{ins}}) = 1.1$$

This is because these events are non-mutually exclusive: it’s entirely possible that you have 
forgotten both cards. Therefore, using this method we’re double-counting. That means we need to 
figure out the probability that you’re missing both cards so we can subtract it. We can do this with 
the product rule: 

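$$P(\text{Missing}_{\text{reg}}, \text{Missing}_{\text{ins}}) = 0.3 \times 0.8 = 0.24$$

Now we can use the sum rule to determine the probability that either one of these cards is missing, 
just as we worked out the probability of flipping a heads or rolling a 6: 

$$P(\text{Missing}) = P(\text{Missing}_{\text{reg}}) + P(\text{Missing}_{\text{ins}}) - P(\text{Missing}_{\text{reg}}, \text{Missing}_{\text{ins}}) = 0.3 + 0.8 - 0.24 = 0.86$$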


With an 0.86 probability that one of these important pieces of paper is missing from your glove box, 
you should make sure to be extra nice when you greet the officer! 
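The arithmetic in this example can be reproduced in a few lines of R. This is only a sketch: the variable names are illustrative, and using the product rule here assumes, as the example does, that forgetting one card is independent of forgetting the other.

```r
# Probabilities that each card IS in the glove box (from the example).
p_registration <- 0.7
p_insurance <- 0.2

# Negation: probability that each card is missing.
p_missing_reg <- 1 - p_registration
p_missing_ins <- 1 - p_insurance

# Product rule (assumes the two events are independent).
p_missing_both <- p_missing_reg * p_missing_ins

# Sum rule: probability that at least one card is missing.
p_missing_either <- p_missing_reg + p_missing_ins - p_missing_both

# Cross-check: "at least one missing" is the negation of "both present".
p_check <- 1 - p_registration * p_insurance

print(p_missing_either)
print(p_check)
```

Both routes give the same value, which is a quick sanity check on the sum rule.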

WRAPPING UP 

In this chapter you developed a complete logic of uncertainty by adding rules for combining 
probabilities with AND and OR. Let’s review the logical rules we have covered so far. 

In Chapter 2, you learned that probabilities are measured on a scale of 0 to 1, 0 
being false (definitely not going to happen), and 1 being true (definitely going to happen). The next 
important logical rule involves combining two probabilities with AND. We do this using the product 
rule, which simply states that to get the probability of two events occurring together, P(A) and P(B), 
we just multiply them together: 

$$P(A, B) = P(A) \times P(B)$$

The final rule involves combining probabilities with OR using the sum rule. The tricky part of the 
sum rule is that if we add non-mutually exclusive probabilities, we’ll end up overcounting for the 
case where they both occur, so we have to subtract the probability of both events occurring 
together. The sum rule uses the product rule to solve this (remember, for mutually exclusive 
events, P(A, B) = 0): 

$$P(A \text{ OR } B) = P(A) + P(B) - P(A, B)$$

These rules, along with those covered in Chapter 2, allow us to express a very large range of 
problems. We’ll be using these as the foundation for our probabilistic reasoning throughout the rest 
of the book. 

EXERCISES 

Try answering the following questions to make sure you understand the rules of logic as they apply 
to probability. The solutions can be found at https://nostarch.com/learnbayes/. 

1. What is the probability of rolling a 20 three times in a row on a 20-sided die? 

2. The weather report says there’s a 10 percent chance of rain tomorrow, and you forget 
your umbrella half the time you go out. What is the probability that you’ll be caught in the 
rain without an umbrella tomorrow? 

3. Raw eggs have a 1/20,000 probability of having salmonella. If you eat two raw eggs, 
what is the probability you ate a raw egg with salmonella? 

4. What is the probability of either flipping two heads in two coin tosses or rolling three 
6s in three six-sided dice rolls? 
CREATING A BINOMIAL PROBABILITY DISTRIBUTION 



In Chapter 3, you learned some basic rules of probability corresponding to the common logical 
operators: AND, OR, and NOT. In this chapter we’re going to use these rules to build our 
first probability distribution, a way of describing all possible events and the probability of each one 
happening. Probability distributions are often visualized to make statistics more palatable to a 
wider audience. We’ll arrive at our probability distribution by defining a function that generalizes a 
particular group of probability problems, meaning we'll create a distribution to calculate the 
probabilities for a whole range of situations, not just one particular case. 

We generalize in this way by looking at the common elements of each problem and abstracting 
them out. Statisticians use this approach to make solving a wide range of problems much easier. 
This can be especially useful when problems are very complex, or some of the necessary details 
may be unknown. In these cases, we can use well-understood probability distributions as estimates 
for real-world behavior that we don’t fully understand. 

Probability distributions are also very useful for asking questions about ranges of possible values. 
For example, we might use a probability distribution to determine the probability that a customer 
makes between $30,000 and $45,000 a year, the probability of an adult being taller than 6’ 10”, or 
the probability that between 25 percent and 35 percent of people who visit a web page will sign up 
for an account there. Many probability distributions involve very complex equations and can take 
some time to get used to. However, all the equations for probability distributions are derived from 
the basic rules of probability covered in the previous chapters. 

STRUCTURE OF A BINOMIAL DISTRIBUTION 

The distribution you’ll learn about here is the binomial distribution, used to calculate the probability 
of a certain number of successful outcomes, given a number of trials and the probability of the 
successful outcome. The "bi” in the term binomial refers to the two possible outcomes that we’re 
concerned with: an event happening and an event not happening. If there are more than two 
outcomes, the distribution is called multinomial. Example problems that follow a binomial 
distribution include the probability of: 

• Flipping two heads in three coin tosses 

• Buying 1 million lottery tickets and winning at least once 

• Rolling fewer than three 20s in 10 rolls of a 20-sided die 

Each of these problems shares a similar structure. Indeed, all binomial distributions involve 
three parameters: 


k The number of outcomes we care about 
n The total number of trials 
p The probability of the event happening 

These parameters are the inputs to our distribution. So, for example, when we’re calculating the 
probability of flipping two heads in three coin tosses: 

• k = 2, the number of events we care about, in this case flipping a heads 

• n = 3, the number of times the coin is flipped 

• p = 1/2, the probability of flipping a heads in a coin toss 

We can build out a binomial distribution to generalize this kind of problem, so we can easily solve 
any problem involving these three parameters. The shorthand notation to express this distribution 
looks like this: 

$$B(k; n, p)$$

For the example of three coin tosses, we would write B(2; 3, 1/2). The B is short 
for binomial distribution. Notice that the k is separated from the other parameters by a semicolon. 
This is because when we are talking about a distribution of values, we usually care about all values 
of k for a fixed n and p. So B(k; n, p) denotes each value in our distribution, but the entire 
distribution is usually referred to by simply B(n, p). 

Let’s take a look at this more closely and see how we can build a function that allows us to 
generalize all of these problems into the binomial distribution. 

UNDERSTANDING AND ABSTRACTING OUT THE DETAILS OF 
OUR PROBLEM 

One of the best ways to see how creating distributions can simplify your probabilities is to start 
with a concrete example and try to solve that, and then abstract out as many of the variables as you 
can. We’ll continue with the example of calculating the probability of flipping two heads in three 
coin tosses. 

Since the number of possible outcomes is small, we can quickly figure out the results we care about 
with just pencil and paper. There are three possible outcomes with two heads in three tosses: 

HHT, HTH, THH 

Now it may be tempting to just solve this problem by enumerating all the other possible outcomes 
and dividing the number we care about by the total number of possible outcomes (in this case, 8). 
That would work fine for solving just this problem, but our aim here is to solve any problem that 
involves desiring a set of outcomes, from a number of trials, with a given probability that the event 
occurs. If we did not generalize and solved only this one instance of the problem, changing these 
parameters would mean we have to solve the new problem again. For example, just saying, "What is 
the probability of getting two heads in four coin tosses?" means we need to come up with yet 
another unique solution. Instead, we’ll use the rules of probability to reason about this problem. 

To start generalizing, we’ll break this problem down into smaller pieces we can solve right now, 
and reduce those pieces into manageable equations. As we build up the equations, we'll put them 
together to create a generalized function for the binomial distribution. 

The first thing to note is that each outcome we care about will have the same probability. Each 
outcome is just a permutation, or reordering, of the others: 



$$P(\{\text{heads, heads, tails}\}) = P(\{\text{heads, tails, heads}\}) = P(\{\text{tails, heads, heads}\})$$

Since this is true, we’ll simply call it: 

P(Desired Outcome) 

There are three outcomes, but only one of them can possibly happen and we don’t care which. And 
because it’s only possible for one outcome to occur, we know that these are mutually exclusive, 
denoted as: 

$$P(\{\text{heads, heads, tails}\}, \{\text{heads, tails, heads}\}, \{\text{tails, heads, heads}\}) = 0$$

This makes using the sum rule of probability easy. Now we can summarize this nicely as: 

$$P(\{\text{heads, heads, tails}\} \text{ or } \{\text{heads, tails, heads}\} \text{ or } \{\text{tails, heads, heads}\}) = P(\text{Desired Outcome}) + P(\text{Desired Outcome}) + P(\text{Desired Outcome})$$

Of course adding these three is just the same as: 

3 x P(Desired Outcome) 

We’ve got a condensed way of referencing the outcomes we care about, but the trouble as far as 
generalizing goes is that the value 3 is specific to this problem. We can fix this by simply replacing 3 
with a variable called N_outcomes. This leaves us with a pretty nice generalization: 

$$B(k; n, p) = N_{\text{outcomes}} \times P(\text{Desired Outcome})$$

Now we have to figure out two subproblems: how to count the number of outcomes we care about, 
and how to determine the probability for a single outcome. Once we have these fleshed out, we’ll be 
all set! 
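Before deriving the two pieces, it may help to see the decomposition confirmed by brute force for the three-toss case. The R sketch below assumes a fair coin; `n_outcomes` and `p_desired_outcome` are illustrative names standing in for N_outcomes and P(Desired Outcome).

```r
# Enumerate all 8 sequences of three coin tosses.
tosses <- expand.grid(first = c("H", "T"), second = c("H", "T"),
                      third = c("H", "T"), stringsAsFactors = FALSE)

# Count the heads in each sequence.
heads_per_sequence <- rowSums(tosses == "H")

# N_outcomes: number of sequences with exactly two heads.
n_outcomes <- sum(heads_per_sequence == 2)

# P(Desired Outcome): any single sequence of three fair tosses.
p_desired_outcome <- (1 / 2)^3

# B(2; 3, 1/2) = N_outcomes x P(Desired Outcome)
p_two_heads <- n_outcomes * p_desired_outcome

print(n_outcomes)
print(p_two_heads)
```

The result agrees with dividing the three outcomes we care about by the eight possible outcomes.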

COUNTING OUR OUTCOMES WITH THE BINOMIAL 
COEFFICIENT 

First we need to figure out how many outcomes there are for a given k (the outcomes we care 
about) and n (the number of trials). For small numbers we can simply count. If we were looking at 
four heads in five coin tosses, we know there are five outcomes we care about: 

THHHH, HTHHH, HHTHH, HHHTH, HHHHT 

But it doesn’t take much for this to become too difficult to do by hand—for example, "What is the 
probability of rolling two 6s in three rolls of a six-sided die?” 

This is still a binomial problem, because the only two possible outcomes are getting a 6 or not 
getting a 6, but there are far more events that count as "not getting a 6.” If we start enumerating we 
quickly see this gets tedious, even for a small problem involving just three rolls of a die: 

6 - 6 - 1 
6 - 6 - 2 
6 - 6 - 3 
... 
4 - 6 - 6 
5 - 6 - 6 


Clearly, enumerating all of the possible solutions will not scale to even reasonably trivial problems. 
The solution is combinatorics. 
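To get a feel for how quickly hand enumeration grows, here is a short R sketch that lists every result of three rolls of a six-sided die and counts those with exactly two 6s. The variable names are illustrative.

```r
# Every possible result of three rolls of a six-sided die.
rolls <- expand.grid(first = 1:6, second = 1:6, third = 1:6)
n_total <- nrow(rolls)

# Number of 6s in each result.
sixes_per_result <- rowSums(rolls == 6)

# Results containing exactly two 6s.
n_two_sixes <- sum(sixes_per_result == 2)

print(n_total)
print(n_two_sixes)
```

Even this small problem has 216 possible results, 15 of which contain exactly two 6s, so listing them by hand is already impractical.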


Combinatorics: Advanced Counting with the Binomial Coefficient 

We can gain some insight into this problem if we take a look at a field of mathematics 
called combinatorics. This is simply the name for a kind of advanced counting. 

There is a special operation in combinatorics, called the binomial coefficient, that represents 
counting the number of ways we can select k from n —that is, selecting the outcomes we care about 
from the total number of trials. The notation for the binomial coefficient looks like this: 

$$\binom{n}{k}$$

We read this expression as "n choose k." So, for our example, we would represent "in three tosses 
choose two heads" as: 

$$\binom{3}{2}$$

The definition of this operation is: 

$$\binom{n}{k} = \frac{n!}{k! \times (n-k)!}$$


The ! means factorial, which is the product of all the numbers up to and including the number 
before the ! symbol, so 5! = (5 x 4 x 3 x 2 x 1). 

Most mathematical programming languages indicate the binomial coefficient using 
the choose() function. For example, with the mathematical language R, we would compute the 
binomial coefficient for the case of flipping two heads in three tosses with the following call: 

choose(3,2) 
>>3 
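The factorial definition can be written out directly and compared with R's built-in `choose()`. In this sketch, `binomial_coefficient` is a hypothetical helper name introduced only for illustration; in practice `choose()` is preferable because factorials overflow quickly for large n.

```r
# Binomial coefficient written straight from the factorial definition.
binomial_coefficient <- function(n, k) {
  factorial(n) / (factorial(k) * factorial(n - k))
}

# Two heads in three tosses.
two_of_three <- binomial_coefficient(3, 2)

# Four heads in five tosses (the five sequences listed earlier).
four_of_five <- binomial_coefficient(5, 4)

print(two_of_three)
print(four_of_five)

# Both agree with the built-in function.
print(isTRUE(all.equal(two_of_three, choose(3, 2))))
print(isTRUE(all.equal(four_of_five, choose(5, 4))))
```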


With this general operation for calculating the number of outcomes we care about, we can update 
our generalized formula like so: 




$$B(k; n, p) = \binom{n}{k} \times P(\text{Desired Outcome})$$




Recall that P(Desired Outcome) is the probability of any one of the combinations of getting two 
heads in three coin tosses. In the preceding equation, we use this value as a placeholder, but we 
don’t actually know how to calculate what this value is. The only missing piece of our puzzle is 
solving P(Desired Outcome). After that, we’ll be able to easily generalize an entire class of problems! 





Calculating the Probability of the Desired Outcome 

All we have left to figure out is the P(Desired Outcome), which is the probability of any of the 
possible events we care about. So far we've been using P(Desired Outcome) as a variable to help 
organize our solution to this problem, but now we need to figure out exactly how to calculate this 
value. Let’s look at the probability of getting two heads in five tosses. We’ll focus on a single case of 
an outcome that meets this condition: HHTTT. 

We know the probability of flipping a heads in a single toss is 1/2, but to generalize the problem 
we’ll work with it as P(heads) so we won’t be stuck with a fixed value for our probability. Using the 
product rule and negation from the previous chapter, we can describe this problem as: 

P(heads, heads, not heads, not heads, not heads) 

Or, more verbosely, as: "The probability of flipping heads, heads, not heads, not heads, and not 
heads.” 

Negation tells us that we can represent "not heads” as 1 - P(heads). Then we can use the product 
rule to solve the rest: 

P(heads, heads, not heads, not heads, not heads) = P(heads) x P(heads) x (1 - P(heads)) x (1 
- P(heads)) x (1 - P(heads)) 

Let’s simplify the multiplication by using exponents: 

$$P(\text{heads})^2 \times (1 - P(\text{heads}))^3$$

If we put this all together, we see that: 

$$P(\text{two heads in five tosses}) = \binom{5}{2} \times P(\text{heads})^2 \times (1 - P(\text{heads}))^3$$

You can see that the exponents for P(heads)^2 and (1 - P(heads))^3 are just the number of heads and the 
number of not heads in that scenario. These equate to k, the number of outcomes we care about, 
and n-k, the number of trials minus the outcomes we care about. We can put all of this together to 
create this much more general formula, which eliminates numbers specific to this case: 




$$\binom{n}{k} \times P(\text{heads})^k \times (1 - P(\text{heads}))^{n-k}$$


Now let’s generalize it for any probability, not just heads, by replacing P(heads) with just p. This 
gives us a general solution for k, the number of outcomes we care about; n, the number of trials; 
and p, the probability of the individual outcome: 




$$B(k; n, p) = \binom{n}{k} \times p^k \times (1 - p)^{n-k}$$


Now that we have this equation, we can solve any problem related to outcomes of a coin toss. For 
example, we could calculate the probability of flipping exactly 12 heads in 24 coin tosses like so: 
$$B\left(12; 24, \frac{1}{2}\right) = \binom{24}{12} \times \left(\frac{1}{2}\right)^{12} \times \left(1 - \frac{1}{2}\right)^{12} = 0.1612$$



Before you learned about the binomial distribution, solving this problem would have been much 
trickier! 
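The general formula translates almost symbol for symbol into R. The sketch below uses a hypothetical helper name, `binomial_pmf`, and compares it against `dbinom()`, R's built-in binomial density function.

```r
# B(k; n, p) = choose(n, k) * p^k * (1 - p)^(n - k)
binomial_pmf <- function(k, n, p) {
  choose(n, k) * p^k * (1 - p)^(n - k)
}

# Exactly 12 heads in 24 tosses of a fair coin.
p_12_of_24 <- binomial_pmf(12, 24, 1 / 2)
print(round(p_12_of_24, 4))

# Two heads in three tosses, the running example from earlier.
p_2_of_3 <- binomial_pmf(2, 3, 1 / 2)
print(p_2_of_3)

# The hand-written version agrees with R's built-in dbinom().
print(isTRUE(all.equal(p_12_of_24, dbinom(12, size = 24, prob = 1 / 2))))
```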

This formula, which is the basis of the binomial distribution, is called a Probability Mass Function 
(PMF). The mass part of the name comes from the fact that we can use it to calculate the amount of 
probability for any given k using a fixed n and p, so this is the mass of our probability. 

For example, we can plug in all the possible values for k in 10 coin tosses into our PMF and visualize 
what the binomial distribution looks like for all possible values, as shown in Figure 4-1. 

Binomial Distribution for 10 Coin Flips 
Figure 4-1: Bar graph showing the probability of getting k in 10 coin flips 

We can also look at the same distribution for the probability of getting a 6 when rolling a six-sided 
die 10 times, shown in Figure 4-2. 






Binomial Distribution for 10 Rolls of a Six-Sided Die 
Figure 4-2: The probability of getting a 6 when rolling a six-sided die 10 times 
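The values behind both figures can be computed by evaluating the PMF at every k from 0 to 10. This is a sketch using R's built-in `dbinom()`; the variable names are illustrative, and the plotting step is left out.

```r
# Every possible number of successes in 10 trials.
k <- 0:10

# PMF values for heads in 10 coin flips, and for 6s in 10 die rolls.
coin_pmf <- dbinom(k, size = 10, prob = 1 / 2)
die_pmf <- dbinom(k, size = 10, prob = 1 / 6)

# Each distribution covers all possible values of k, so it sums to 1.
print(sum(coin_pmf))
print(sum(die_pmf))

# The most probable number of successes in each case.
coin_mode <- k[which.max(coin_pmf)]
die_mode <- k[which.max(die_pmf)]
print(c(coin = coin_mode, die = die_mode))
```

Passing either vector to `barplot()` with `names.arg = k` should produce a bar graph along the lines of the figures above.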

As you can see, a probability distribution is a way of generalizing an entire class of problems. Now 
that we have our distribution, we have a powerful method to solve a wide range of problems. But 
always remember that we derived this distribution from our simple rules of probability. Let’s put it 
to the test. 

EXAMPLE: GACHA GAMES 

Gacha games are a genre of mobile games, particularly popular in Japan, in which players are able 
to purchase virtual cards with in-game currency. The catch is that all cards are given at random, so 
when players purchase cards they can’t choose which ones they receive. Since not all cards are 
equally desirable, players are encouraged to keep pulling cards from the stack until they hit the one 
they want, in a fashion similar to a slot machine. We’ll see how the binomial distribution can help us 
to decide to take a particular risk in an imaginary Gacha game. 

Here’s the scenario. You have a new mobile game, Bayesian Battlers. The current set of cards you 
can pull from is called a banner. The banner contains some average cards and some featured cards 
that are more valuable. As you may suspect, all of the cards in Bayesian Battlers are famous 
probabilists and statisticians. The top cards in this banner are as follows, each with its respective 
probability of being pulled: 





• Thomas Bayes: 0.721% 

• E. T. Jaynes: 0.720% 

• Harold Jeffreys: 0.718% 

• Andrew Gelman: 0.718% 

• John Kruschke: 0.714% 

These featured cards account for only 0.03591 of the total probability. Since probability must sum 
to 1, the chance of pulling the less desirable cards is the other 0.96409. Additionally, we treat the 
pile of cards that we pull from as effectively infinite, meaning that pulling a specific card does not 
change the probability of getting any other card—the card you pull here does not then disappear 
from the pile. This is different than if you were to pull a physical card from a single deck of cards 
without shuffling the card back in. 
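The totals quoted above follow from adding the five featured-card percentages. A short R sketch, with illustrative names for the vector entries:

```r
# Pull probabilities for the featured cards, in percent (from the banner list).
featured_pct <- c(bayes = 0.721, jaynes = 0.720, jeffreys = 0.718,
                  gelman = 0.718, kruschke = 0.714)

# Convert the total from a percentage to a probability.
p_featured <- sum(featured_pct) / 100

# Everything else in the banner takes up the remaining probability.
p_other <- 1 - p_featured

print(p_featured)
print(p_other)
```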

You really want the E. T. Jaynes card to complete your elite Bayesian team. Unfortunately, you have 
to purchase the in-game currency, Bayes Bucks, in order to pull cards. It costs one Bayes Buck to 
pull one card, but there’s a special on right now allowing you to purchase 100 Bayes Bucks for only 
$10. That’s the maximum you are willing to spend on this game, and only if you have at least an 
even chance of pulling the card you want. This means you’ll buy the Bayes Bucks only if the 
probability of getting that awesome E. T. Jaynes card is greater than or equal to 0.5. 

Of course we can plug our probability of getting the E. T. Jaynes card into our formula for the 
binomial distribution to see what we get: 

$$\binom{100}{1} \times 0.00720^1 \times (1 - 0.00720)^{99} = 0.352$$



Our result is less than 0.5, so we should give up. But wait—we forgot something very important! In 
the preceding formula we calculated only the probability of getting exactly one E. T. Jaynes card. But 
we might pull two E. T. Jaynes cards, or even three! So what we really want to know is the 
probability of getting one or more. We could write this out as: 



$$\binom{100}{1} \times 0.00720^1 \times (1 - 0.00720)^{99} + \binom{100}{2} \times 0.00720^2 \times (1 - 0.00720)^{98} + \binom{100}{3} \times 0.00720^3 \times (1 - 0.00720)^{97} + \cdots$$

And so on, for the 100 cards you can pull with your Bayes Bucks, but this gets really tedious, so 
instead we use the special mathematical notation Σ (the capital Greek letter sigma): 

$$\sum_{k=1}^{100} \binom{100}{k} \times 0.00720^k \times (1 - 0.00720)^{100-k}$$

The Σ is the summation symbol; the number at the bottom represents the value we start with and 
the number at the top represents the value we end with. So the preceding equation is simply adding 
up the values for the binomial distribution for every value of k from 1 to n, for a p of 0.00720. 

We’ve made writing this problem down much easier, but now we actually need to compute this 
value. Rather than pulling out your calculator to solve this problem, now is a great time to start 
using R. In R, we can use the pbinom() function to automatically sum up all these values for k in our 
PMF. Figure 4-3 shows how we use pbinom() to solve our specific problem. 

pbinom(0, 100, 0.00720, lower.tail=FALSE) 

• The second argument is the number of trials, the n parameter in our binomial distribution. 

• The third argument is the probability of our observation, the p parameter in our binomial distribution. 

• When lower.tail is FALSE, we are looking at the sum of values greater than our k argument. When it is TRUE (or left out), we are looking at values less than or equal to k. 

Figure 4-3: Using the pbinom() function to solve our Bayesian Battlers problem 


The pbinom() function takes three required arguments and an optional fourth called lower.tail (which 
defaults to TRUE). When the fourth argument is TRUE, the function sums up all of the 
probabilities less than or equal to the first argument. When lower.tail is set to FALSE, it sums up the 
probabilities strictly greater than the first argument. By setting the first argument to 0, we are 
looking at the probability of getting one or more E. T. Jaynes cards. We set lower.tail to FALSE because 
that means we want values greater than the first argument (by default, we get values less than or 
equal to the first argument). The second argument represents n, the number of trials, and the third 
argument represents p, the probability of success. 
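The effect of lower.tail is easiest to see on a tiny case. The sketch below uses three fair coin tosses rather than the card problem, with illustrative variable names, and compares `pbinom()` with an explicit sum of `dbinom()` values.

```r
# P(X <= 0): no heads at all in three fair tosses (lower.tail defaults to TRUE).
p_at_most_zero <- pbinom(0, 3, 0.5)

# P(X > 0): one or more heads in three fair tosses.
p_more_than_zero <- pbinom(0, 3, 0.5, lower.tail = FALSE)

# The same upper tail, summed term by term from the PMF.
p_summed <- sum(dbinom(1:3, size = 3, prob = 0.5))

print(p_at_most_zero)
print(p_more_than_zero)
print(isTRUE(all.equal(p_more_than_zero, p_summed)))

# The two tails split the whole distribution, so together they cover probability 1.
print(p_at_most_zero + p_more_than_zero)
```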


If we plug in our numbers here and set lower.tail to FALSE as shown in Figure 4-3, R will calculate your 
probability of getting at least one E. T. Jaynes card for your 100 Bayes Bucks: 

$$\sum_{k=1}^{100} \binom{100}{k} \times 0.00720^k \times (1 - 0.00720)^{100-k} = 0.515$$


Even though the probability of getting exactly one E. T. Jaynes card is only 0.352, the probability of 
getting at least one E. T. Jaynes card is high enough for you to risk it. So shell out that $10 and 
complete your set of elite Bayesians! 
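The result can be cross-checked in more than one way. The R sketch below (variable names are illustrative) computes the probability of exactly one card, then the probability of at least one card by an explicit sum, by `pbinom()`, and by negating the probability of pulling no E. T. Jaynes card in all 100 pulls.

```r
n_pulls <- 100
p_jaynes <- 0.00720

# Exactly one E. T. Jaynes card in 100 pulls.
p_exactly_one <- dbinom(1, size = n_pulls, prob = p_jaynes)

# One or more cards: add up the PMF for k = 1, 2, ..., 100.
p_sum <- sum(dbinom(1:n_pulls, size = n_pulls, prob = p_jaynes))

# The same quantity using the upper tail of pbinom().
p_tail <- pbinom(0, n_pulls, p_jaynes, lower.tail = FALSE)

# The same quantity using negation: 1 - P(no Jaynes card in any pull).
p_complement <- 1 - (1 - p_jaynes)^n_pulls

print(round(c(exactly_one = p_exactly_one, sum = p_sum,
              tail = p_tail, complement = p_complement), 3))
```

All three routes to "at least one" agree, and the negation form is a useful shortcut whenever the question is about one or more successes.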


WRAPPING UP 

In this chapter we saw that we can use our rules of probability (combined with a trick from 
combinatorics) to create a general rule that solves an entire class of problems. Any problem that 
asks for the probability of k outcomes in n trials, where the probability of each 
outcome is p, can be solved easily using the binomial distribution: 

$$B(k; n, p) = \binom{n}{k} \times p^k \times (1 - p)^{n-k}$$



Perhaps surprisingly, there is nothing more to this rule than counting and applying our rules of 
probability. 


EXERCISES 

Try answering the following questions to make sure you’ve grasped binomial distributions fully. 
The solutions can be found at https://nostarch.com/learnbayes/. 

1. What are the parameters of the binomial distribution for the probability of rolling 
either a 1 or a 20 on a 20-sided die, if we roll the die 12 times? 

2. There are four aces in a deck of 52 cards. If you pull a card, return the card, then 
reshuffle and pull a card again, how many ways can you pull just one ace in five pulls? 

3. For the example in question 2, what is the probability of pulling five aces in 10 pulls 
(remember the card is shuffled back in the deck when it is pulled)? 

4. When you’re searching for a new job, it’s always helpful to have more than one offer 
on the table so you can use it in negotiations. If you have a 1/5 probability of receiving a job 
offer when you interview, and you interview with seven companies in a month, what is the 
probability you’ll have at least two competing offers by the end of that month? 

5. You get a bunch of recruiter emails and find out you have 25 interviews lined up in 
the next month. Unfortunately, you know this will leave you exhausted, and the probability 
of getting an offer will drop to 1/10 if you’re tired. You really don’t want to go on this many 
interviews unless you are at least twice as likely to get at least two competing offers. Are you 
more likely to get at least two offers if you go for 25 interviews, or stick to just 7? 
THE BETA DISTRIBUTION 



This chapter builds on the ideas behind the binomial distribution from the previous chapter to 
introduce another probability distribution, the beta distribution. You use the beta distribution to 
estimate the probability of an event for which you've already observed a number of trials and the 
number of successful outcomes. For example, you would use it to estimate the probability of 
flipping a heads when so far you have observed 100 tosses of a coin and 40 of those were heads. 
While exploring the beta distribution, we'll also look at the differences between probability and 
statistics. Often in probability texts, we are given the probabilities for events explicitly. However, in 
real life, this is rarely the case. Instead, we are given data, which we use to come up with estimates 
for probabilities. This is where statistics comes in: it allows us to take data and make estimates 
about what probabilities we’re dealing with. 

A STRANGE SCENARIO: GETTING THE DATA 

Here’s the scenario for this chapter. One day you walk into a curiosity shop. The owner greets you 
and, after you browse for a bit, asks if there is anything in particular you’re looking for. You respond 
that you'd love to see the strangest thing he has to show you. He smiles and pulls something out 
from behind the counter. You're handed a black box, about the size of a Rubik’s Cube, that seems 
impossibly heavy. Intrigued, you ask, "What does it do?" 

The owner points out a small slit on the top of the box and another on the bottom. "If you put a 
quarter in the top," he tells you, "sometimes two come out the bottom!" Excited to try this out, you 
grab a quarter from your pocket and put it in. You wait and nothing happens. Then the shop owner 
says, "And sometimes it just eats your quarter. I've had this thing a while, and I've never seen it run 
out of quarters or get too full to take more!" 

Perplexed by this but eager to make use of your newfound probability skills, you ask, "What’s the 
probability of getting two quarters?" The owner replies quizzically, "I have no idea. As you can see, 
it’s just a black box, and there are no instructions. All I know is how it behaves. Sometimes you get 
two quarters back, and sometimes it eats your quarter." 

Distinguishing Probability, Statistics, and Inference 

While this is a somewhat unusual everyday problem, it’s actually an extremely common type of 
probability problem. In all of the examples so far, outside of the first chapter, we’ve known the 
probability of all the possible events, or at least how much we’d be willing to bet on them. In real 
life we are almost never sure what the exact probability of any event is; instead, we just have 
observations and data. 

This is commonly considered the division between probability and statistics. In probability, we 
know exactly how probable all of our events are, and what we are concerned with is how likely 
certain observations are. For example, we might be told that there is 1/2 probability of getting 
heads in a fair coin toss and want to know the probability of getting exactly 7 heads in 20 coin 
tosses. 

In statistics, we would look at this problem backward: assuming you observe 7 heads in 20 coin 
tosses, what is the probability of getting heads in a single coin toss? As you can see, in this example 
we don’t know what the probability is. In a sense, statistics is probability in reverse. The task of 
figuring out probabilities given data is called inference, and it is the foundation of statistics. 
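
The probability direction of the coin example can be checked directly. The following is a minimal sketch that assumes only base R's dbinom() function (the binomial probability mass function); the variable names are ours, chosen for illustration.

```r
# Probability question: the coin's behavior is known (fair coin).
p_heads <- 0.5

# Probability of exactly 7 heads in 20 tosses
p_seven_of_twenty <- dbinom(7, size = 20, prob = p_heads)
print(p_seven_of_twenty)   # about 0.074

# The same value written out with the binomial formula
by_hand <- choose(20, 7) * p_heads^7 * (1 - p_heads)^13
print(by_hand)
```

The statistics direction starts from the 7 observed heads instead and asks which values of the heads probability are plausible, which is what the rest of this chapter builds toward.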

Collecting Data 

The heart of statistical inference is data! So far we have only a single sample from the strange box: 
you put in a quarter and got nothing back. All we know at this point is that it’s possible to lose your 
money. The shopkeeper said you can win, but we don’t know that for sure yet. 

We want to estimate the probability that the mysterious box will deliver two quarters, and to do 
that, we first need to see how frequently you win after a few more tries. 

The shopkeeper informs you that he’s just as curious as you are and will gladly donate a roll of 
quarters—containing $10 worth of quarters, or 40 quarters—provided you return any winnings to 
him. You put a quarter in, and happily, two more quarters pop out! Now we have two pieces of data: 
the mystical box does in fact pay out sometimes, and sometimes it eats the coin. 

Given our two observations, one where you lose the quarter and another where you win, you might 
guess naively that P(two quarters) = 1/2. Since our data is so limited, however, there is still a range 
of probabilities we might consider for the true rate at which this mysterious box returns two coins. 
To gather more data, you’ll use the rest of the quarters in the roll. In the end, including your first 
quarter, you get: 

14 wins 
27 losses 

Without doing any further analysis, you might intuitively want to update your guess that P(two 
quarters) = 1/2 to P(two quarters) = 14/41. But what about your original guess—does your new 
data mean it’s impossible that 1/2 is the real probability? 


Calculating the Probability of Probabilities 


To help solve this problem, let’s look at our two possible probabilities. These are just our 
hypotheses about the rate at which the magic box returns two quarters: 


$$P(\text{two coins}) = \frac{1}{2}, \qquad P(\text{two coins}) = \frac{14}{41}$$


To simplify, we’ll assign each hypothesis a variable: 

$$H_1 \text{ is } P(\text{two coins}) = \frac{1}{2}$$

$$H_2 \text{ is } P(\text{two coins}) = \frac{14}{41}$$



Intuitively, most people would say that $H_2$ is more likely because this is exactly what we observed,
but we need to demonstrate this mathematically to be sure.

We can think of this problem in terms of how well each hypothesis explains what we saw, so in
plain English: "How probable is what we observed if $H_1$ were true versus if $H_2$ were true?" As it
turns out, we can easily calculate this using the binomial distribution from Chapter 4. In this case,
we know that n = 41 and k = 14, and for now, we’ll assume that p takes the value given by $H_1$ or $H_2$.
We’ll use D as a variable for our data. When we plug these numbers into the binomial distribution,
we get the following results (recall that you can do this with the formula for the binomial
distribution in Chapter 4):

$$P(D \mid H_1) = B\left(14; 41, \frac{1}{2}\right) \approx 0.016$$

$$P(D \mid H_2) = B\left(14; 41, \frac{14}{41}\right) \approx 0.130$$


In other words, if $H_1$ were true and the probability of getting two coins was 1/2, then the
probability of observing 14 occasions where we get two coins out of 41 trials would be about 0.016.
However, if $H_2$ were true and the real probability of getting two coins out of the box was 14/41,
then the probability of observing the same outcomes would be about 0.130.
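
If you'd like to reproduce these two numbers, the following sketch uses R's built-in dbinom() function in place of writing out the binomial formula by hand. The variable names are our own and are not part of the original text.

```r
n_trials <- 41   # quarters put into the box
n_wins <- 14     # times two quarters came out

# P(D | H1): probability of the data if the true rate were 1/2
p_d_given_h1 <- dbinom(n_wins, size = n_trials, prob = 1/2)

# P(D | H2): probability of the data if the true rate were 14/41
p_d_given_h2 <- dbinom(n_wins, size = n_trials, prob = 14/41)

print(round(p_d_given_h1, 3))   # 0.016
print(round(p_d_given_h2, 3))   # 0.13, that is, 0.130

# How many times more probable the data are under H2 than under H1
likelihood_ratio <- p_d_given_h2 / p_d_given_h1
print(likelihood_ratio)
```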

This shows us that the data (observing 14 cases of getting two coins out of 41 trials) are roughly
8 times more probable under $H_2$ than under $H_1$ (0.130/0.016 ≈ 8)! However, it also shows that
neither hypothesis is impossible and that there are, of course, many other hypotheses we could make
based on our data. For example, we might read our data as $H_3$: P(two coins) = 15/42. If we wanted
to look for a pattern, we could also pick every probability from 0.1 to 0.9, incrementing by 0.1;
calculate the probability of the observed data in each distribution; and develop our hypothesis from
that. Figure 5-1 illustrates what each value looks like in the latter case.



[Figure 5-1 plot: "Probability of different values for p given observation"; vertical axis: Probability]

Figure 5-1: Visualization of different hypotheses about the rate of getting two quarters
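
The grid of hypotheses behind a plot like Figure 5-1 can be sketched in a few lines of R. This is our own illustration, not the code used to draw the original figure, and it assumes base R only.

```r
# Candidate values of p from 0.1 to 0.9 in steps of 0.1
hypotheses <- seq(0.1, 0.9, by = 0.1)

# Probability of observing 14 wins in 41 trials under each candidate
likelihoods <- dbinom(14, size = 41, prob = hypotheses)
print(data.frame(p = hypotheses, likelihood = round(likelihoods, 4)))

# The candidate on this coarse grid that explains the data best
best_hypothesis <- hypotheses[which.max(likelihoods)]
print(best_hypothesis)   # 0.3

# Uncomment to draw a rough version of the plot:
# plot(hypotheses, likelihoods, xlab = "p", ylab = "Probability")
```

On this coarse grid the best candidate is 0.3, the grid point that best explains the observed rate of 14/41.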

Even with all these hypotheses, there’s no way we could cover every possible eventuality because
we’re not working with a finite number of hypotheses. So let’s try to get more information by
testing more distributions. If we repeat the last experiment, this time testing every value from
0.01 to 0.99 in increments of only 0.01, we get the results in Figure 5-2.

Figure 5-2: We see a definite pattern emerging when we look at more hypotheses.

We may not be able to test every possible hypothesis, but it’s clear a pattern is emerging here: we 
see something that looks like a distribution representing what we believe is the behavior of the 
black box. 

This seems like valuable information; we can easily see where the probability is highest. Our goal, 
however, is to model our beliefs in all possible hypotheses (that is, the full probability distribution 
of our beliefs). There are still two problems with our approach. First, because there’s an infinite 
number of possible hypotheses, incrementing by smaller and smaller amounts doesn’t accurately 
represent the entire range of possibilities—we’re always missing an infinite amount. In practice, 
this isn’t a huge problem because we often don’t care about the extremes like 0.000001 and 
0.0000011, but the data would be more useful if we could represent this infinite range of 
possibilities a bit more accurately. 

Second, if you looked at the graph closely, you may have noticed a larger problem here: there are at
least 10 dots above 0.1 right now, and we have an infinite number of points to add. This means that
our probabilities don’t sum to 1! From the rules of probability, we know that the probabilities of all
our possible hypotheses must sum to 1. If they add up to less than 1, it means that some hypotheses
are not covered. If they add up to more than 1, we would be violating the rule that probabilities must be
between 0 and 1. Even though there are infinitely many possibilities here, we still need them all to
sum to 1. This is where the beta distribution comes in.
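
To see this second problem concretely, the sketch below adds up the values plotted for the hypotheses spaced 0.01 apart. Dividing by the total at the end is only a rough, discrete stand-in for proper normalization (our assumption, for illustration); it still ignores every value of p between the grid points.

```r
# Candidate values of p from 0.01 to 0.99 in steps of 0.01
hypotheses <- seq(0.01, 0.99, by = 0.01)
likelihoods <- dbinom(14, size = 41, prob = hypotheses)

# Number of plotted points that sit above 0.1
print(sum(likelihoods > 0.1))

# The values already add up to more than 1
total <- sum(likelihoods)
print(total)

# Rescaling forces this particular grid to sum to 1
normalized <- likelihoods / total
print(sum(normalized))   # 1
```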














THE BETA DISTRIBUTION 


To solve both of these problems, we'll be using the beta distribution. Unlike the binomial 
distribution, which breaks up nicely into discrete values, the beta distribution represents a 
continuous range of values, which allows us to represent our infinite number of possible 
hypotheses. 


We define the beta distribution with a probability density function (PDF), which is very similar to 
the probability mass function we use in the binomial distribution, but is defined for continuous 
values. Here is the formula for the PDF of the beta distribution: 


$$\text{Beta}(p; \alpha, \beta) = \frac{p^{\alpha - 1} \times (1 - p)^{\beta - 1}}{\text{beta}(\alpha, \beta)}$$


Now this looks like a much more terrifying formula than the one for our binomial distribution! But 
it’s actually not that different. We won’t build this formula entirely from scratch like we did with the 
probability mass function, but let’s break down some of what’s happening here. 


Breaking Down the Probability Density Function 

Let’s first take a look at our parameters: p, α (lowercase Greek letter alpha), and β (lowercase Greek
letter beta).

p Represents the probability of an event. This corresponds to our different hypotheses for the
possible probabilities for our black box.

α Represents how many times we observe an event we care about, such as getting two quarters
from the box.

β Represents how many times the event we care about didn’t happen. For our example, this is the
number of times that the black box ate the quarter.


The total number of trials is α + β. This is different from the binomial distribution, where we
have k observations we’re interested in and a finite number of n total trials.


The top part of the PDF function should look pretty familiar because it’s almost the same as the 
binomial distribution’s PMF, which looks like this: 


$$B(k; n, p) = \binom{n}{k} \times p^{k} \times (1 - p)^{n - k}$$


In the PDF, rather than $p^{k} \times (1 - p)^{n-k}$, we have $p^{\alpha-1} \times (1 - p)^{\beta-1}$, where we
subtract 1 from the exponent terms. We also have another function in the denominator of our
equation: the beta function (note the lowercase), for which the beta distribution is named. We
subtract 1 from the exponent and use the beta function to normalize our values; this is the part
that ensures our distribution sums to 1. The beta function is the integral from 0 to 1 of
$p^{\alpha-1} \times (1 - p)^{\beta-1}$. We'll talk about integrals more in the next section, but you can think
of this as the sum of all the possible values of $p^{\alpha-1} \times (1 - p)^{\beta-1}$ when p is every number
between 0 and 1. A discussion of how subtracting 1 from the exponents and dividing by the beta
function normalizes our values is beyond the scope of this chapter; for now, you just need to know
that this allows our values to sum to 1, giving us a workable probability.

What we get in the end is a function that describes the probability of each possible hypothesis for
our true belief in the probability of getting two coins from the box, given that we have observed α
examples of one outcome and β examples of another. Remember that we arrived at the beta
distribution by comparing how well different binomial distributions, each with its own
probability p, described our data. In other words, the beta distribution represents how well all
possible binomial distributions describe the data observed.
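
As a sanity check on the formula, here is a small sketch that writes the PDF out directly and compares it with R's built-in dbeta() function. The function name beta_pdf and its argument names are hypothetical, chosen for this example; beta() is base R's beta function.

```r
# The beta PDF written out term by term.
# n_success plays the role of alpha, n_failure the role of beta.
beta_pdf <- function(p, n_success, n_failure) {
  numerator <- p^(n_success - 1) * (1 - p)^(n_failure - 1)
  numerator / beta(n_success, n_failure)   # beta() normalizes the values
}

# Compare against R's own implementation at a few values of p
p_values <- c(0.2, 0.34, 0.5)
by_hand <- beta_pdf(p_values, 14, 27)
built_in <- dbeta(p_values, 14, 27)

print(by_hand)
print(built_in)
print(all.equal(by_hand, built_in))   # TRUE
```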

Applying the Probability Density Function to Our Problem 

When we plug in our values for our black box data and visualize the beta distribution, shown
in Figure 5-3, we see that it looks like a smooth version of the plot in Figure 5-2. This illustrates the
PDF of Beta(14,27).

[Figure 5-3 plot: "Distribution for Beta(14,27)"; horizontal axis: p]

Figure 5-3: Visualizing the beta distribution for our data collected about the black box


As you can see, most of the plot’s density is less than 0.5, as we would expect given that our data 
shows that fewer than half of the quarters placed in the black box returned two quarters. 

The plot also shows that it’s very unlikely the black box will return two quarters at least half the 
time, which is the point at which we break even if we continually put quarters in the box. We’ve 
figured out that we’re more likely to lose money than make money through the box, without 
sacrificing too many quarters. While we can see the distribution of our beliefs by looking at a plot,
we’d still like to be able to quantify exactly how strongly we believe that the true rate at which
the box returns two quarters is less than 0.5. To do this, we need just a bit of
calculus (and some R).







Quantifying Continuous Distributions with Integration 

The beta distribution is fundamentally different from the binomial distribution in that with the 
latter, we are looking at the distribution of k, the number of outcomes we care about, which is 
always something we can count. For the beta distribution, however, we are looking at the 
distribution of p, for which we have an infinite number of possible values. This leads to an 
interesting problem that might be familiar if you’ve studied calculus before (but it’s okay if you
haven’t!). For our example of α = 14 and β = 27, we want to know: what is the probability that the
chance of getting two coins is 1/2?

While it’s easy to ask the likelihood of an exact value with the binomial distribution thanks to its 
finite number of outcomes, this is a really tricky question for a continuous distribution. We know 
that the fundamental rule of probability is that the sum of all our values must be 1, but each of our 
individual values is infinitely small, meaning the probability of any specific value is in practice 0. 
This may seem strange if you aren’t familiar with continuous functions from calculus, so as a quick 
explanation: this is just the logical consequence of having something made up of an infinite number 
of pieces. Imagine, for example, you divide a 1-pound bar of chocolate (pretty big!) into two pieces. 
Each piece would then weigh 1/2 a pound. If you divided it into 10 pieces, each piece would weigh 
1/10 a pound. As the number of pieces you divide the chocolate into grows, each piece becomes so 
small you can’t even see it. For the case where the number of pieces goes to infinity, eventually 
those pieces disappear! 

Even though the individual pieces disappear, we can still talk about ranges. For example, even if we 
divided a 1-pound bar of chocolate into infinitely many pieces, we can still add up the weight of the 
pieces in one half of the chocolate bar. Similarly, when talking about probability in continuous 
distributions, we can sum up ranges of values. But if every specific value is 0, then isn’t the sum just 
0 as well? 

This is where calculus comes in: in calculus, there’s a special way of summing up infinitely small
values called the integral. If we want to know the probability that the rate at which the box returns
two coins is less than 0.5 (that is, the value is somewhere between 0 and 0.5), we can sum it up like this:

$$\int_{0}^{0.5} \frac{p^{14-1} \times (1 - p)^{27-1}}{\text{beta}(14, 27)} \, dp$$

If you’re rusty on calculus, the stretched-out S is the continuous function equivalent to Σ for
discrete functions. It’s just a way to express that we want to add up all the little bits of our function 
(see Appendix B for a quick overview of the basic principles of calculus). 

If this math is starting to look too scary, don’t worry! We’ll use R to calculate this for us. R includes a 
function called dbeta() that is the PDF for the beta distribution. This function takes three arguments, 
corresponding to p, α, and β. We use this together with R’s integrate() function to perform this
integration automatically. Here we calculate the probability that the chance of getting two coins
from the box is less than 0.5, given the data:

> integrate(function(p) dbeta(p,14,27),0,0.5) 

The result is as follows: 


0.9807613 with absolute error < 5.9e-06 






The "absolute error" message appears because computers can’t perfectly calculate integrals so 
there is always some error, though usually it is far too small for us to worry about. This result from 
R tells us that there is a 0.98 probability that, given our evidence, the true probability of getting two 
coins out of the black box is less than 0.5. This means it would not be a good idea to put any more
quarters in the box, since you very likely won’t break even.
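
The same area can be cross-checked without numerical integration. The sketch below assumes base R's pbeta() function, the cumulative distribution function of the beta distribution, which the text above does not use; it also shows how little probability a narrow range around 0.5 carries, in line with the chocolate bar argument.

```r
# Area under the Beta(14, 27) density between 0 and 0.5, two ways
by_integration <- integrate(function(p) dbeta(p, 14, 27), 0, 0.5)
by_cdf <- pbeta(0.5, 14, 27)

print(by_integration$value)
print(by_cdf)

# Probability that the rate lies within a shrinking range around 0.5
half_widths <- c(0.1, 0.01, 0.001)
range_probs <- pbeta(0.5 + half_widths, 14, 27) -
  pbeta(0.5 - half_widths, 14, 27)
print(range_probs)   # shrinks toward 0 as the range narrows
```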

REVERSE-ENGINEERING THE GACHA GAME 

In real-life situations, we almost never know the true probabilities for events. That’s why the beta 
distribution is one of our most powerful tools for understanding our data. In the Gacha game 
in Chapter 4, we knew the probability of each card we wanted to pull. In reality, the game
developers are very unlikely to give players this information, for many reasons (such as not 
wanting players to calculate how unlikely they are to get the card they want). Now suppose we are 
playing a new Gacha game called Frequentist Fighters! and it also features famous statisticians. This 
time, we are pulling for the Bradley Efron card. 

We don’t know the rates for the card, but we really want that card—and more than one if possible. 
We spend a ridiculous amount of money and find that from 1,200 cards pulled, we received only 5 
Bradley Efron cards. Our friend is thinking of spending money on the game but only wants to do it if 
there is a better than 0.7 probability that the chance of pulling a Bradley Efron is greater than 
0.005. 

Our friend has asked us to figure out whether he should spend the money and pull. Our data tells us 
that of 1,200 cards pulled, only 5 were Bradley Efron, so we can visualize this as Beta(5,1195), 
shown in Figure 5-4 (remember that the total cards pulled is α + β).



[Figure 5-4 plot: "Pulling a Bradley Efron Card, Beta(5,1195)"]

Figure 5-4: The beta distribution for getting a Bradley Efron card given our data

From our visualization we can see that nearly all the probability density is below 0.01. We need to 
know exactly how much is above 0.005, the value that our friend cares about. We can solve this by 
integrating over the beta distribution in R, as earlier: 


integrate(function(x) dbeta(x,5,1195),0.005,1) 
0.29 


This tells us the probability that the rate of pulling a Bradley Efron card is 0.005 or greater, given 
the evidence we have observed, is only 0.29. Our friend will pull for this card only if the probability 
is around 0.7 or greater, so based on the evidence from our data collection, our friend should not try 
his luck. 
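
The whole decision can be written out as a short sketch. It uses pbeta() with lower.tail = FALSE (base R) as an alternative to integrate(), which is our substitution rather than the method shown above; the variable names are hypothetical.

```r
n_pulls <- 1200   # total cards pulled
n_efron <- 5      # Bradley Efron cards received

n_success <- n_efron             # alpha
n_failure <- n_pulls - n_efron   # beta, 1195

# Probability that the true pull rate is greater than 0.005
p_rate_above <- pbeta(0.005, n_success, n_failure, lower.tail = FALSE)
print(round(p_rate_above, 2))   # 0.29

# Our friend only pulls if that probability is better than 0.7
friend_threshold <- 0.7
should_pull <- p_rate_above > friend_threshold
print(should_pull)   # FALSE
```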

WRAPPING UP 

In this chapter, you learned about the beta distribution, which is closely related to the binomial 
distribution but behaves quite differently. We built up to the beta distribution by observing how 
well an increasing number of possible binomial distributions explained our data. Because our 
number of possible hypotheses was infinite, we needed a continuous probability distribution that
could describe all of them. The beta distribution allows us to represent how strongly we believe in
all possible probabilities for the data we observed. This enables us to perform statistical inference 
on observed data by determining which probabilities we might assign to an event and how strongly 
we believe in each one: a probability of probabilities. 

The major difference between the beta distribution and the binomial distribution is that the beta 
distribution is a continuous probability distribution. Because there are an infinite number of values 
in the distribution, we cannot sum results the same way we do in a discrete probability distribution. 
Instead, we need to use calculus to sum ranges of values. Fortunately, we can use R instead of 
solving tricky integrals by hand. 

EXERCISES 

Try answering the following questions to make sure you understand how we can use the beta
distribution to estimate probabilities. The solutions can be found
at https://nostarch.com/learnbayes/.

1. You want to use the beta distribution to determine whether or not a coin you have is 
a fair coin—meaning that the coin gives you heads and tails equally. You flip the coin 10 
times and get 4 heads and 6 tails. Using the beta distribution, what is the probability that the 
coin will land on heads more than 60 percent of the time? 

2. You flip the coin 10 more times and now have 9 heads and 11 tails total. What is the 
probability that the coin is fair, using our definition of fair, give or take 5 percent? 

3. Data is the best way to become more confident in your assertions. You flip the coin 
200 more times and end up with 109 heads and 111 tails. Now what is the probability that 
the coin is fair, give or take 5 percent? 



PART II 

BAYESIAN PROBABILITY AND PRIOR PROBABILITIES

CONDITIONAL PROBABILITY



So far, we have dealt only with independent probabilities. Probabilities are independent when the 
outcome of one event does not affect the outcome of another. For example, flipping heads on a coin 
doesn’t impact whether or not a die will roll a 6. Calculating probabilities that are independent is 
much easier than calculating probabilities that aren’t, but independent probabilities often don’t 
reflect real life. For example, the probability that your alarm doesn’t go off and the probability that 
you’re late for work are not independent. If your alarm doesn’t go off, you are far more likely to be 
late for work than you would otherwise be. 

In this chapter, you'll learn how to reason about conditional probabilities, where probabilities are 
not independent but rather depend on the outcome of particular events. I'll also introduce you to 
one of the most important applications of conditional probability: Bayes’ theorem. 


INTRODUCING CONDITIONAL PROBABILITY 


In our first example of conditional probabilities, we’ll look at flu vaccines and possible 
complications of receiving them. When you get a flu vaccine, you’re typically handed a sheet of 
paper that informs you of the various risks associated with it. One example is an increased 
incidence of Guillain-Barré syndrome (GBS), a very rare condition that causes the body’s immune
system to attack the nervous system, leading to potentially life-threatening complications. 
According to the Centers for Disease Control and Prevention (CDC), the probability of contracting 
GBS in a given year is 2 in 100,000. We can represent this probability as follows: 


$$P(\text{GBS}) = \frac{2}{100{,}000}$$


Normally the flu vaccine increases your probability of getting GBS only by a trivial amount. In 2010, 
however, there was an outbreak of swine flu, and the probability of getting GBS if you received the 
flu vaccine that year rose to 3/100,000. In this case, the probability of contracting GBS directly 
depended on whether or not you got the flu vaccine, and thus it is an example of a conditional 
probability. We express conditional probabilities as P(A | B), or the probability of A given B.
Mathematically, we can express the chance of getting GBS as:

$$P(\text{GBS} \mid \text{flu vaccine}) = \frac{3}{100{,}000}$$


We read this expression in English as "The probability of having GBS, given that you got the flu 
vaccine, is 3 in 100,000." 




Why Conditional Probabilities Are Important 

Conditional probabilities are an essential part of statistics because they allow us to demonstrate 
how information changes our beliefs. In the flu vaccine example, if you don’t know whether or not 
someone got the vaccine, you can say that their probability of getting GBS is 2/100,000 since this is 
the probability that any given person picked out of the population would have GBS that year. If the 
year is 2010 and a person tells you that they got the flu shot, you know that the true probability is 
3/100,000. We can also look at this as a ratio of these two probabilities, like so: 

$$\frac{P(\text{GBS} \mid \text{flu vaccine})}{P(\text{GBS})} = \frac{3/100{,}000}{2/100{,}000} = 1.5$$

So if you had the flu shot in 2010, we have enough information to believe you’re 50 percent more 
likely to get GBS than a stranger picked at random. Fortunately, on an individual level, the 
probability of getting GBS is still very low. But if we’re looking at populations as a whole, we would 
expect 50 percent more people to have GBS in a population of people that had the flu vaccine than 
in the general population. 
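
The arithmetic behind the "50 percent more" statement is short enough to sketch in R. The population size below is a hypothetical round number chosen only to turn the probabilities into expected counts; it does not come from the CDC figures.

```r
p_gbs <- 2 / 100000                 # P(GBS)
p_gbs_given_vaccine <- 3 / 100000   # P(GBS | flu vaccine), 2010

# Ratio of the conditional probability to the baseline
gbs_ratio <- p_gbs_given_vaccine / p_gbs
print(gbs_ratio)   # 1.5

# Expected cases in a hypothetical population of one million people
population <- 1e6
expected_general <- p_gbs * population
expected_vaccinated <- p_gbs_given_vaccine * population
print(c(expected_general, expected_vaccinated))   # 20 and 30
```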

There are also other factors that can increase the probability of getting GBS. For example, males and 
older adults are more likely to have GBS. Using conditional probabilities, we can add all of this 
information to better estimate the likelihood that an individual gets GBS. 

Dependence and the Revised Rules of Probability 

As a second example of conditional probabilities, we'll use color blindness, a vision deficiency that 
makes it difficult for people to discern certain colors. In the general population, about 4.25 percent 
of people are color blind. The vast majority of cases of color blindness are genetic. Color blindness 
is caused by a defective gene in the X chromosome. Because males have only a single X chromosome 
and females have two, men are about 16 times more likely to suffer adverse effects of a defective X 
chromosome and therefore to be color blind. So while the rate of color blindness for the entire 
population is 4.25 percent, it is only 0.5 percent in females but 8 percent in males. For all of our 
calculations, we’ll be making the simplifying assumption that the male/female split of the 
population is exactly 50/50. Let’s represent these facts as conditional probabilities: 

P(color blind) = 0.0425 
P(color blind | female) = 0.005 
P(color blind | male) = 0.08 

Given this information, if we pick a random person from the population, what’s the probability that 
they are male and color blind? 

In Chapter 3, we learned how we can combine probabilities with AND using the product rule.
According to the product rule, we would expect the result of our question to be: 

P(male, color blind) = P(male) x P(color blind) = 0.5 x 0.0425 = 0.02125 

But a problem arises when we use the product rule with conditional probabilities. The problem 
becomes clearer if we try to find the probability that a person is female and color blind: 

P(female, color blind) = P(female) x P(color blind) = 0.5 x 0.0425 = 0.02125 

This can’t be right because the two probabilities are the same! We know that, while the probability 
of picking a male or a female is the same, if we pick a female, the probability that she is color blind
should be much lower than for a male. Our formula should account for the fact that if we pick
our person at random, then the probability that they are color blind depends on whether they are 
male or female. The product rule given in Chapter 3 works only when the probabilities are 
independent. Being male (or female) and color blind are dependent probabilities. 

So the true probability of finding a male who is color blind is the probability of picking a male 
multiplied by the probability that he is color blind. Mathematically, we can write this as: 

P(male, color blind) = P(male) x P(color blind | male) = 0.5 x 0.08 = 0.04 

We can generalize this solution to rewrite our product rule as follows: 

P(A, B) = P(A) x P(B | A)

This definition works for independent probabilities as well, because for independent 
probabilities P(B) = P(B | A). This makes intuitive sense when you think about flipping heads and
rolling a 6; because P(six) is 1/6 independent of the coin toss, P(six | heads) is also 1/6. 

We can also update our definition of the sum rule to account for this fact: 

P(A or B) = P(A) + P(B) - P(A) x P(B | A)

Now we can still easily use our rules of probabilistic logic from Part I and handle conditional 
probabilities. 
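
Both revised rules can be checked against the color blindness numbers. This is a minimal sketch using only the probabilities given in the text; the variable names are ours.

```r
p_male <- 0.5
p_female <- 0.5
p_cb <- 0.0425              # P(color blind)
p_cb_given_male <- 0.08     # P(color blind | male)
p_cb_given_female <- 0.005  # P(color blind | female)

# The independent product rule gives the same (wrong) answer for both sexes
naive_male <- p_male * p_cb
naive_female <- p_female * p_cb
print(c(naive_male, naive_female))   # both 0.02125

# Revised product rule: P(A, B) = P(A) x P(B | A)
p_male_and_cb <- p_male * p_cb_given_male
p_female_and_cb <- p_female * p_cb_given_female
print(c(p_male_and_cb, p_female_and_cb))   # 0.04 and 0.0025

# The two joint probabilities add back up to P(color blind)
print(p_male_and_cb + p_female_and_cb)   # 0.0425

# Revised sum rule: P(A or B) = P(A) + P(B) - P(A) x P(B | A)
p_male_or_cb <- p_male + p_cb - p_male_and_cb
print(p_male_or_cb)   # 0.5025
```

The sum rule result makes sense as a count: it covers all males (0.5) plus the color blind females (0.0025).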

An important thing to note about conditional probabilities and dependence is that, in practice, 
knowing how two events are related is often difficult. For example, we might ask about the 
probability of someone owning a pickup truck and having a work commute of over an hour. While 
we can come up with plenty of reasons one might be dependent on the other—maybe people with 
pickup trucks tend to live in more rural areas and commute less—we might not have the data to 
support this. Assuming that two events are independent (even when they likely aren’t) is a very 
common practice in statistics. But, as with our example for picking a color blind male, this 
assumption can sometimes give us very wrong results. While assuming independence is often a 
practical necessity, never forget how much of an impact dependence can have. 

CONDITIONAL PROBABILITIES IN REVERSE AND BAYES' THEOREM

One of the most amazing things we can do with conditional probabilities is reversing the condition 
to calculate the probability of the event we’re conditioning on; that is, we can use P(A | B) to arrive
at P(B | A). As an example, say you’re emailing a customer service rep at a company that sells color
blindness-correcting glasses. The glasses are a little pricey, and you mention to the rep that you’re 
worried they might not work. The rep replies, "I’m also color blind, and I have a pair myself—they 
work really well!" 

We want to figure out the probability that this rep is male. However, the rep provides no 
information except an ID number. So how can we figure out the probability that the rep is male? 
We know that P(color blind | male) = 0.08 and that P(color blind | female) = 0.005, but how can we 
determine P(male | color blind)? Intuitively, we know that it is much more likely that the customer 
service rep is in fact male, but we need to quantify that to be sure. 

Thankfully, we have all the information we need to solve this problem, and we know that we are 
solving for the probability that someone is male, given that they are color blind: 



P(male | color blind) = ? 

The heart of Bayesian statistics is data, and right now we have only one piece of data (other than 
our existing probabilities): we know that the customer support rep is color blind. Our next step is to 
look at the portion of the total population that is color blind; then, we can figure out what portion of 
that subset is male. 

To help reason about this, let’s add a new variable N, which represents the total population of 
people. As stated before, we first need to calculate the total subset of the population that is color 
blind. We know P(color blind), so we can write this part of the equation like so: 

$$P(\text{male} \mid \text{color blind}) = \frac{?}{P(\text{color blind}) \times N}$$

Next we need to calculate the number of people who are male and color blind. This is easy to do 
since we know P(male) and P(color blind | male), and we have our revised product rule. So we can 
simply multiply this probability by the population: 

P(male) x P(color blind | male) x N 


So the probability that the customer service rep is male, given that they’re color blind, is: 

$$P(\text{male} \mid \text{color blind}) = \frac{P(\text{male}) \times P(\text{color blind} \mid \text{male}) \times N}{P(\text{color blind}) \times N}$$

Our population variable N is on both the top and the bottom of the fraction, so the Ns cancel out: 

$$P(\text{male} \mid \text{color blind}) = \frac{P(\text{male}) \times P(\text{color blind} \mid \text{male})}{P(\text{color blind})}$$


We can now solve our problem since we know each piece of information: 


$$P(\text{male} \mid \text{color blind}) = \frac{P(\text{male}) \times P(\text{color blind} \mid \text{male})}{P(\text{color blind})} = \frac{0.5 \times 0.08}{0.0425} = 0.941$$


Given the calculation, we know there is a 94.1 percent chance that the customer service rep is in 
fact male! 
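
The counting argument with N can be sketched directly. The function name and the population sizes below are hypothetical; the point is only that the answer does not depend on which N we pick.

```r
# Count people instead of working with probabilities
bayes_by_counting <- function(n_people) {
  n_color_blind <- 0.0425 * n_people           # P(color blind) x N
  n_male_color_blind <- 0.5 * 0.08 * n_people  # P(male) x P(color blind | male) x N
  n_male_color_blind / n_color_blind           # share of color blind people who are male
}

# N cancels out, so any population size gives the same answer
print(bayes_by_counting(1000))   # about 0.941
print(bayes_by_counting(1e6))    # about 0.941
```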


INTRODUCING BAYES' THEOREM 


There is nothing actually specific to our case of color blindness in the preceding formula, so we 
should be able to generalize it to any given A and B probabilities. If we do this, we get the most 
foundational formula in this book, Bayes’ theorem:

$$P(A \mid B) = \frac{P(A)P(B \mid A)}{P(B)}$$
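
Written as a function, the theorem is a single line. The sketch below is our own illustration; the function name bayes_theorem and its argument names are hypothetical and simply mirror the symbols in the formula.

```r
# P(A | B) = P(A) x P(B | A) / P(B)
bayes_theorem <- function(p_a, p_b_given_a, p_b) {
  p_a * p_b_given_a / p_b
}

# A = male, B = color blind
p_male_given_cb <- bayes_theorem(p_a = 0.5, p_b_given_a = 0.08, p_b = 0.0425)

# A = female, B = color blind
p_female_given_cb <- bayes_theorem(p_a = 0.5, p_b_given_a = 0.005, p_b = 0.0425)

print(round(p_male_given_cb, 3))     # 0.941
print(round(p_female_given_cb, 3))   # 0.059

# Under the 50/50 assumption the two cases cover everyone, so they sum to 1
print(p_male_given_cb + p_female_given_cb)   # 1
```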


To understand why Bayes’ theorem is so important, let’s look at a general form of this problem. Our 
beliefs describe the world we know, so when we observe something, its conditional probability 
represents the likelihood of what we've seen given what we believe, or: 


P(observed | belief) 



For example, suppose you believe in climate change, and therefore you expect that the area where 
you live will have more droughts than usual over a 10-year period. Your belief is that climate 
change is taking place, and your observation is the number of droughts in your area; let’s say there 
were 5 droughts in the last 10 years. Determining how likely it is that you’d see exactly 5 droughts 
in the past 10 years if there were climate change during that period may be difficult. One way to do 
this would be to consult an expert in climate science and ask them the probability of droughts given 
that their model assumes climate change. 

At this point, all you’ve done is ask, "What is the probability of what I've observed, given that I 
believe climate change is true?” But what you want is some way to quantify how strongly you 
believe climate change is really happening, given what you have observed. Bayes’ theorem allows 
you to reverse P(observed | belief), which you asked the climate scientist for, and solve for the 
likelihood of your beliefs given what you've observed, or: 

P(belief | observed)

In this example, Bayes’ theorem allows you to transform your observation of five droughts in a 
decade into a statement about how strongly you believe in climate change after you have observed 
these droughts. The only other pieces of information you need are the general probability of 5 
droughts in 10 years (which could be estimated with historical data) and your initial certainty of 
your belief in climate change. And while most people would have a different initial probability for 
climate change, Bayes’ theorem allows you to quantify exactly how much the data changes any 
belief. 

For example, if the expert says that 5 droughts in 10 years is very likely if we assume that climate 
change is happening, most people will change their previous beliefs to favor climate change a little, 
whether they’re skeptical of climate change or they’re Al Gore.

However, suppose that the expert told you that in fact, 5 droughts in 10 years was very unlikely 
given your assumption that climate change is happening. In that case, your prior belief in climate 
change would weaken slightly given the evidence. The key takeaway here is that Bayes’ theorem 
ultimately allows evidence to change the strength of our beliefs. 
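To make the drought example concrete, here is a small sketch in Python. Every number in it is hypothetical and chosen only for illustration: the text gives no values for the priors, the expert's answers, or the general probability of 5 droughts in 10 years. The function name `posterior` is also ours.

```python
def posterior(prior, likelihood, p_data):
    """Bayes' theorem: P(belief | observed) = P(belief) * P(observed | belief) / P(observed)."""
    return prior * likelihood / p_data

# Hypothetical inputs (not from the text).
p_data = 0.4                                  # general probability of 5 droughts in 10 years
priors = {"skeptic": 0.2, "believer": 0.6}    # initial certainty in climate change

# Two hypothetical answers from the expert for P(observed | belief):
# 0.6 stands in for "likely" and 0.1 for "very unlikely".
for likelihood in (0.6, 0.1):
    for name, prior in priors.items():
        updated = posterior(prior, likelihood, p_data)
        print(f"{name}: likelihood={likelihood}, prior={prior}, posterior={updated:.2f}")
```

With these made-up inputs, both people end up more convinced when the expert's likelihood is higher than P(observed) and less convinced when it is lower, even though they start from different priors.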

Bayes’ theorem allows us to take our beliefs about the world, combine them with data, and then 
transform this combination into an estimate of the strength of our beliefs given the evidence we’ve 
observed. Very often our beliefs are just our initial certainty in an idea; this is the P(A) in Bayes’ 
theorem. We often debate topics such as whether gun control will reduce violence, whether 
increased testing increases student performance, or whether public health care will reduce overall 
health care costs. But we seldom think about how evidence should change our minds or the minds 
of those we’re debating. Bayes’ theorem allows us to observe evidence about these beliefs and 
quantify exactly how much this evidence changes our beliefs. 

Later in this book, you’ll see how we can compare beliefs as well as cases where data can 
surprisingly fail to change beliefs (as anyone who has argued with relatives over dinner can attest!). 
In the next chapter, we’re going to spend a bit more time with Bayes’ theorem. We’ll derive it once 
more, but this time with LEGO; that way, we can clearly visualize how it works. We’ll also explore 
how we can understand Bayes’ theorem in terms of more specifically modeling our existing beliefs 
and how data changes them. 

WRAPPING UP 

In this chapter, you learned about conditional probabilities, which are probabilities of an event
that depend on another event. Conditional probabilities are more complicated to work with than
independent probabilities (we had to update our product rule to account for dependencies), but
they lead us to Bayes’ theorem, which is fundamental to understanding how we can use data to
update what we believe about the world.

EXERCISES 

Try answering the following questions to see how well you understand conditional probability and 
Bayes’ theorem. The solutions can be found at https://nostarch.com/learnbayes/ . 

• What piece of information would we need in order to use Bayes’ theorem to 
determine the probability that someone in 2010 who had GBS also had the flu vaccine that 
year? 

• What is the probability that a random person picked from the population is female 
and is not color blind?

• What is the probability that a male who received the flu vaccine in 2010 is either 
color blind or has GBS? 
BAYES' THEOREM WITH LEGO



In the previous chapter, we covered conditional probability and arrived at a very important idea in
probability, Bayes’ theorem, which states:

$$P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}$$

Notice that here we've made a very small change from Chapter 6, writing P(B | A)P(A) instead
of P(A)P(B | A); the meaning is identical, but sometimes changing the terms around can help clarify
different approaches to problems.

With Bayes’ theorem, we can reverse conditional probabilities, so when we know the
probability P(B | A), we can work out P(A | B). Bayes’ theorem is foundational to statistics because it
allows us to go from having the probability of an observation given a belief to determining the 
strength of that belief given the observation. For example, if we know the probability of sneezing 
given that you have a cold, we can work backward to determine the probability that you have a cold 
given that you sneezed. In this way, we use evidence to update our beliefs about the world. 

In this chapter, we’ll use LEGO to visualize Bayes’ theorem and help solidify the mathematics in 
your mind. To do this, let’s pull out some LEGO bricks and put some concrete questions to our 
equation. Figure 7-1 shows a 6 x 10 area of LEGO bricks; that’s a 60-stud area (studs are the 
cylindrical bumps on LEGO bricks that connect them to each other). 






Figure 7-1: A 6 x 10-stud LEGO area to help us visualize the space of possible events 


We can imagine this as the space of 60 possible, mutually exclusive events. For example, the blue 
studs could represent 40 students who passed an exam and the red studs 20 students who failed 
the exam in a class of 60. In the 60-stud area, there are 40 blue studs, so if we put our finger on a 
random spot, the probability of touching a blue brick is defined like this:


$$P(\text{blue}) = \frac{40}{60} = \frac{2}{3}$$


We would represent the probability of touching a red brick as follows: 

$$P(\text{red}) = \frac{20}{60} = \frac{1}{3}$$


The probability of touching either a blue or a red brick, as you would expect, is 1: 


P(blue) + P(red) = 1 


This means that red and blue bricks alone can describe our entire set of possible events. 

Now let’s put a yellow brick on top of these two bricks to represent some other possibility (for
example, the students that pulled an all-nighter studying and didn’t sleep), so it looks like
Figure 7-2.



Figure 7-2: Placing a 2 x 3 LEGO brick on top of the 6 x 10-stud LEGO area 


Now if we pick a stud at random, the probability of touching the yellow brick is: 

$$P(\text{yellow}) = \frac{6}{60} = \frac{1}{10}$$


But if we add P(yellow) to P(red) + P(blue), we’d get a result greater than 1, and that’s impossible!
The issue, of course, is that our yellow studs all sit on top of the space of red and blue studs, so the
probability of getting a yellow brick is conditional on whether we’re on a blue or red space. As we
know from the previous chapter, we can express this conditional probability as P(yellow | red),
or the probability of yellow given red. Given our example from earlier, this would be the probability
that a student pulled an all-nighter, given that they had failed an exam.
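The stud counts so far can be checked with exact fractions. The following is a minimal sketch using Python's standard `fractions` module; the variable names are ours, and the counts are the ones given in the text.

```python
from fractions import Fraction

total_studs = 60
blue_studs = 40
red_studs = 20
yellow_studs = 6    # the 2 x 3 brick placed on top of the red and blue area

p_blue = Fraction(blue_studs, total_studs)
p_red = Fraction(red_studs, total_studs)
p_yellow = Fraction(yellow_studs, total_studs)

print(p_blue, p_red, p_yellow)      # 2/3 1/3 1/10
print(p_blue + p_red)               # 1
print(p_blue + p_red + p_yellow)    # 11/10, greater than 1
```

The last sum exceeds 1 because the yellow studs are not a third, separate group: each one sits on top of a red or a blue stud.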


WORKING OUT CONDITIONAL PROBABILITIES VISUALLY 


Let’s go back to our LEGO bricks and work out P(yellow | red). Figure 7-3 gives us a bit of visual 
insight into the problem. 



Figure 7-3: Visualizing P(yellow | red)

Let’s walk through the process for determining P(yellow | red) by working with our physical 
representation: 

1. Split the red section off from the blue. 



2. Get the area of the entire red space; it’s a 2 x 10-stud area, so that’s 20 studs. 

3. Get the area of the yellow block on the red space, which is 4 studs. 

4. Divide the area of the yellow block by the area of the red block. 

This gives us P(yellow | red) = 4/20 = 1/5.
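The four steps above can be sketched in Python by modeling the studs as grid coordinates. The layout is an assumption made for illustration, since the text gives no coordinates: rows 0 to 3 are blue, rows 4 and 5 are red, and the yellow brick covers rows 3 to 5 in the first two columns, so that it sits on 4 red studs and 2 blue ones.

```python
from fractions import Fraction

ROWS, COLS = 6, 10

# Assumed layout: rows 0-3 blue, rows 4-5 red (a 2 x 10 red area).
base = {(r, c): ("red" if r >= 4 else "blue")
        for r in range(ROWS) for c in range(COLS)}

# Assumed position of the 2 x 3 yellow brick: rows 3-5, columns 0-1.
yellow = {(r, c) for r in range(3, 6) for c in range(0, 2)}

# Step 1: split the red section off from the blue.
red = {stud for stud, color in base.items() if color == "red"}

# Step 2: get the area of the entire red space.
red_area = len(red)

# Step 3: get the area of the yellow block that sits on the red space.
yellow_on_red = len(yellow & red)

# Step 4: divide the yellow-on-red area by the red area.
p_yellow_given_red = Fraction(yellow_on_red, red_area)

print(red_area, yellow_on_red, p_yellow_given_red)   # 20 4 1/5
```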

Great! We have arrived at the conditional probability of yellow given red. So far, so good. So what if
we now reverse that conditional probability and ask what is P(red | yellow)? In plain English, if we
know we are on a yellow space, what is the probability that it’s red underneath? Or, in our test
example, what is the probability that a student failed the exam, given that they pulled an
all-nighter?

Looking at Figure 7-3, you may have intuitively figured out P(red | yellow) by reasoning, "There are
6 yellow studs, 4 of which are over red, so the probability of choosing a yellow that’s over a red
block is 4/6." If you did follow this line of thinking, then congratulations! You just independently
discovered Bayes’ theorem. But let’s quantify that with math to make sure it’s right.


WORKING THROUGH THE MATH 

Getting from our intuition to Bayes’ theorem will require a bit of work. Let’s begin formalizing our 
intuition by coming up with a way to calculate that there are 6 yellow studs. Our minds arrive at 
this conclusion through spatial reasoning, but we need to use a mathematical approach. To solve 
this, we just take the probability of being on a yellow stud multiplied by the total number of studs: 

$$\text{numberOfYellowStuds} = P(\text{yellow}) \times \text{totalStuds} = \frac{1}{10} \times 60 = 6$$


The next part of our intuitive reasoning is that 4 of the yellow studs are over red, and this requires a 
bit more work to prove mathematically. First, we have to establish how many red studs there are; 
luckily, this is the same process as calculating yellow studs: 

$$\text{numberOfRedStuds} = P(\text{red}) \times \text{totalStuds} = \frac{1}{3} \times 60 = 20$$

We’ve also already figured out the ratio of red studs covered by yellow as P(yellow | red). To make 
this a count—rather than a probability—we multiply it by the number of red studs that we just 
calculated: 

$$\text{numberOfRedUnderYellow} = P(\text{yellow} \mid \text{red}) \times \text{numberOfRedStuds} = \frac{1}{5} \times 20 = 4$$


Finally, we get the ratio of the red studs covered by yellow to the total number of yellow: 

$$P(\text{red} \mid \text{yellow}) = \frac{\text{numberOfRedUnderYellow}}{\text{numberOfYellowStuds}} = \frac{4}{6} = \frac{2}{3}$$


This lines up with our intuitive analysis. However, it doesn’t quite look like a Bayes’ theorem 
equation, which should have the following structure: 


$$P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}$$


To get there we’ll have to go back and expand the terms in this equation, like so: 

$$P(\text{red} \mid \text{yellow}) = \frac{P(\text{yellow} \mid \text{red}) \times \text{numberOfRedStuds}}{P(\text{yellow}) \times \text{totalStuds}}$$



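We know that numberOfRedStuds = P(red) × totalStuds, so we can substitute that in:

$$P(\text{red} \mid \text{yellow}) = \frac{P(\text{yellow} \mid \text{red}) \, P(\text{red}) \times \text{totalStuds}}{P(\text{yellow}) \times \text{totalStuds}}$$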


Finally, we just need to cancel out totalStuds from the equation, which gives us: 


$$P(\text{red} \mid \text{yellow}) = \frac{P(\text{yellow} \mid \text{red}) \, P(\text{red})}{P(\text{yellow})}$$


From intuition, we have arrived back at Bayes’ theorem! 
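As a check on the derivation, the sketch below computes P(red | yellow) two ways with exact fractions: once by counting studs, following the steps in this section, and once with the final Bayes' theorem form in which totalStuds has canceled. The snake_case names are ours and mirror the names used in the equations.

```python
from fractions import Fraction

total_studs = 60
p_yellow = Fraction(6, 60)             # P(yellow)
p_red = Fraction(20, 60)               # P(red)
p_yellow_given_red = Fraction(4, 20)   # P(yellow | red)

# Route 1: turn probabilities back into stud counts, then take the ratio.
number_of_yellow_studs = p_yellow * total_studs
number_of_red_studs = p_red * total_studs
number_of_red_under_yellow = p_yellow_given_red * number_of_red_studs
by_counting = number_of_red_under_yellow / number_of_yellow_studs

# Route 2: Bayes' theorem, with no stud counts at all.
by_bayes = p_yellow_given_red * p_red / p_yellow

print(by_counting, by_bayes)   # 2/3 2/3
```

Both routes agree, which is what canceling totalStuds from the numerator and denominator predicts.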


WRAPPING UP 

Conceptually, Bayes’ theorem follows from intuition, but that doesn’t mean that the formalization of 
Bayes’ theorem is obvious. The benefit of our mathematical work is that it extracts reason out of 
intuition. We’ve confirmed that our original, intuitive beliefs are consistent, and now we have a 
powerful new tool to deal with problems in probability that are more complicated than LEGO 
bricks. 

In the next chapter, we’ll take a look at how to use Bayes’ theorem to reason about and update our 
beliefs using data. 


EXERCISES 

Try answering the following questions to see if you have a solid understanding of how we can use 
Bayes’ Theorem to reason about conditional probabilities. The solutions can be found 
at https://nostarch.com/learnbayes/ .

1. Kansas City, despite its name, sits on the border of two US states: Missouri and 
Kansas. The Kansas City metropolitan area consists of 15 counties, 9 in Missouri and 6 in 
Kansas. The entire state of Kansas has 105 counties and Missouri has 114. Use Bayes' 
theorem to calculate the probability that a relative who just moved to a county in the Kansas 
City metropolitan area also lives in a county in Kansas. Make sure to show P(Kansas) 
(assuming your relative either lives in Kansas or Missouri), P(Kansas City metropolitan 
area), and P(Kansas City metropolitan area | Kansas). 

2. A deck of cards has 52 cards with suits that are either red or black. There are four 
aces in a deck of cards: two red and two black. You remove a red ace from the deck and 
shuffle the cards. Your friend pulls a black card. What is the probability that it is an ace? 
THE PRIOR, LIKELIHOOD, AND POSTERIOR OF BAYES' THEOREM



Now that we've covered how to derive Bayes’ theorem using spatial reasoning, let’s examine how 
we can use Bayes’ theorem as a probability tool to logically reason about uncertainty. In this 
chapter, we'll use it to calculate and quantify how likely our belief is, given our data. To do so, we’ll 
use the three parts of the theorem—the posterior probability, likelihood, and prior probability—all 
of which will come up frequently in your adventures with Bayesian statistics and probability. 

THE THREE PARTS 

Bayes’ theorem allows us to quantify exactly how much our observed data changes our beliefs. In 
this case, what we want to know is: P(belief | data). In plain English, we want to quantify how
strongly we hold our beliefs given the data we’ve observed. The technical term for this part of the 
formula is the posterior probability, and it’s what we’ll use Bayes’ theorem to solve for. 

To solve for the posterior, we need the next part: the probability of the data given our beliefs about 
the data, or P(data | belief). This is known as the likelihood, because it tells us how likely the data is 
given our belief. 

Finally, we want to quantify how likely our initial belief is in the first place, or P(belief). This part of 
Bayes’ theorem is called the prior probability, or simply "the prior," because it represents the 
strength of our belief before we see the data. The likelihood and the prior combine to produce a 
posterior. Typically we need to use the probability of the data, P(data), in order to normalize our 
posterior so it accurately reflects a probability from 0 to 1. However, in practice, we don't always 
need P(data), so this value doesn’t have a special name. 

As you know by now, we refer to our belief as a hypothesis, H, and we represent our data with the 
variable D. Figure 8-1 shows each part of Bayes’ theorem. 



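$$\underbrace{P(\text{belief} \mid \text{data})}_{\text{posterior probability}} = \frac{\overbrace{P(\text{data} \mid \text{belief})}^{\text{likelihood}} \times \overbrace{P(\text{belief})}^{\text{prior probability}}}{\underbrace{P(\text{data})}_{\text{normalizes our probabilities}}}$$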


Figure 8-1: The parts of Bayes’ theorem 
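The parts named in Figure 8-1 map directly onto the arguments of a small function. This is a sketch rather than a standard API: the function name `posterior`, its argument names, and the range check are our own choices. The usage lines reuse the LEGO numbers from Chapter 7.

```python
from fractions import Fraction

def posterior(prior, likelihood, p_data):
    """Return P(H | D) = P(D | H) * P(H) / P(D).

    prior      -- P(H), belief in the hypothesis before seeing the data
    likelihood -- P(D | H), probability of the data if the hypothesis is true
    p_data     -- P(D), probability of the data, which normalizes the result
    """
    if not 0 < p_data <= 1:
        raise ValueError("P(D) must be greater than 0 and at most 1")
    return likelihood * prior / p_data

# LEGO example from Chapter 7: H = red, D = yellow.
p_red_given_yellow = posterior(
    prior=Fraction(1, 3),        # P(red)
    likelihood=Fraction(1, 5),   # P(yellow | red)
    p_data=Fraction(1, 10),      # P(yellow)
)
print(p_red_given_yellow)   # 2/3
```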

In this chapter, we’ll investigate a crime, combining these pieces to reason about the situation. 


INVESTIGATING THE SCENE OF A CRIME 

Let’s suppose you come home from work one day and find your window broken, your front door 
open, and your laptop missing. Your first thought is probably "I've been robbed!" But how did you 
come to this conclusion, and more importantly, how can you quantify this belief? 

Your immediate hypothesis is that you have been robbed, so H = I’ve been robbed. We want a
probability that describes how likely it is that you’ve been robbed, so the posterior we want to solve 
for given our data is: 

P(robbed | broken window, open front door, missing laptop) 

To solve this problem, we’ll fill in the missing pieces from Bayes’ theorem. 

Solving for the Likelihood 

First, we need to solve for the likelihood, which in this case is the probability that the same 
evidence would have been observed if you were in fact robbed—in other words, how closely the 
evidence lines up with the hypothesis: 

P(broken window, open front door, missing laptop | robbed)

What we’re asking is, "If you were robbed, how likely is it that you would see the evidence you saw 
here?" You can imagine a wide range of scenarios where not all of this evidence was present at a 
robbery. For example, a clever thief might have picked the lock on your door, stolen your laptop, 
then locked the door behind them and not needed to break a window. Or they might have just 
smashed the window, taken the laptop, and then climbed right back out the window. The evidence 
we’ve seen seems intuitively like it would be pretty common at the scene of a robbery, so we’ll say 
there’s a 3/10 probability that if you were robbed, you would come home and find this evidence. 

It’s important to note that, even though we’re making a guess in this example, we could also do 
some research to get a better estimate. We could go to the local police department and ask for 
statistics about evidence at crime scenes involving robbery, or read through news reports of recent 
robberies. This would give us a more accurate estimate for the likelihood that if you were robbed 
you would see this evidence. 



The incredible thing about Bayes’ theorem is that we can use it both for organizing our casual 
beliefs and for working with large data sets of very exact probabilities. Even if you don’t think 3/10 
is a good estimate, you can always go back to the calculations—as we will do—and see how the 
value changes given a different assumption. For example, if you think that the probability of seeing 
this evidence given a robbery is just 3/100, you can easily go back and plug in those numbers 
instead. Bayesian statistics lets people disagree about beliefs in a measurable way. Because we are 
dealing with our beliefs in a quantitative way, you can recalculate everything we do in this chapter 
to see if this different probability has a substantial impact on any of the final outcomes. 


Calculating the Prior 

Next, we need to determine the probability that you would get robbed at all. This is our prior. Priors 
are extremely important, because they allow us to use background information to adjust a 
likelihood. For example, suppose the scene described earlier happened on a deserted island where 
you are the only inhabitant. In this case, it would be nearly impossible for you to get robbed (by a 
human, at least). In another example, if you owned a home in a neighborhood with a high crime 
rate, robberies might be a frequent occurrence. For simplicity, let’s set our prior for being robbed 
as: 

$$P(\text{robbed}) = \frac{1}{1{,}000}$$



Remember, we can always adjust these figures later given different or additional evidence. 

We have nearly everything we need to calculate the posterior; we just need to normalize the data. 
Before moving on, then, let’s look at the unnormalized posterior: 


$$P(\text{robbed}) \times P(\text{broken window, open front door, missing laptop} \mid \text{robbed}) = \frac{3}{10{,}000}$$


This value is incredibly small, which is surprising since intuition tells us that the probability of your 
house being robbed given the evidence you observed seems very, very high. But we haven’t yet 
looked at the probability of observing our evidence. 
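The unnormalized posterior is just the prior multiplied by the likelihood. A short sketch with the two values chosen in this chapter (variable names are ours):

```python
from fractions import Fraction

prior_robbed = Fraction(1, 1000)     # P(robbed)
likelihood_robbed = Fraction(3, 10)  # P(broken window, open front door, missing laptop | robbed)

# Prior times likelihood, before dividing by P(D).
unnormalized_posterior = prior_robbed * likelihood_robbed

print(unnormalized_posterior)          # 3/10000
print(float(unnormalized_posterior))   # 0.0003
```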


Normalizing the Data 


What’s missing from our equation is the probability of the data you observed whether or not you 
were robbed. In our example, this is the probability that you observe that your window is broken, 
the door is open, and your laptop is missing all at once, regardless of the cause. As of now, our 
equation looks like this: 



$$P(\text{robbed} \mid \text{broken window, open front door, missing laptop}) = \frac{\frac{1}{1{,}000} \times \frac{3}{10}}{P(D)}$$


The reason the probability in the numerator is so low is that we haven’t normalized it with the 
probability that you would find this strange evidence. 

We can see how our posterior changes as we change our P(D) in Table 8-1.


Table 8-1: How P(D) Affects the Posterior

P(D)     Posterior
0.050    0.006
0.010    0.030
0.005    0.060
0.001    0.300



As the probability of our data decreases, our posterior probability increases. This is because as the
data we observe becomes increasingly unlikely, a typically unlikely explanation does a better job of
explaining the event (see Figure 8-2).

[Plot: P(robbed | window, door, laptop)]



Figure 8-2: As the probability of the data decreases, the posterior probability increases. 
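The rows of Table 8-1 can be reproduced by dividing the unnormalized posterior by each candidate P(D). The sketch below uses only the values from this chapter; the variable names are ours.

```python
# P(robbed) x P(D | robbed), using the values chosen in this chapter.
unnormalized_posterior = (1 / 1000) * (3 / 10)

rows = []
for p_d in (0.050, 0.010, 0.005, 0.001):
    # Dividing by a smaller P(D) gives a larger posterior.
    rows.append((p_d, unnormalized_posterior / p_d))

print("P(D)     Posterior")
for p_d, post in rows:
    print(f"{p_d:.3f}    {post:.3f}")
```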

Consider this extreme example: the only way your friend could become a millionaire is if they won 
the lottery or inherited money from some family member they didn’t know existed. Your friend 
becoming a millionaire is therefore shockingly unlikely. However, you find out that your 
friend did become a millionaire. The possibility that your friend won the lottery then becomes much 
more likely, because it is one of the only two ways they could have become a millionaire. 





















Being robbed is, of course, only one possible explanation for what you observed, and there are 
many more explanations. However, if we don’t know the probability of the evidence, we can’t figure 
out how to normalize all these other possibilities. So what is our P(D)? That’s the tricky part.

The common problem with P(D) is that it’s very difficult to accurately calculate in many real-world
cases. With every other part of the formula, even though we just guessed at a value for this
exercise, we can collect real data to provide a more concrete probability. For our prior, P(robbed),
we might simply look at historical crime data and pin down a probability that a given house on your
street would be robbed on any given day. Likewise, we could, theoretically, investigate past robberies
and come up with a more accurate likelihood for observing the evidence you did given a robbery.
But how could we ever really even guess at P(broken window, open front door, missing laptop)?

Instead of researching the probability of the data you observed, we could try to calculate the
probabilities of all other possible events that could explain your observations. Since they must sum
to 1, we could work backward and find P(D). But for the case of this particular evidence, there’s a
virtually limitless number of possibilities.
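One way to write down the idea of working backward, offered here only as a sketch: suppose we could list explanations H_1, ..., H_n that never overlap and that together cover every possible cause of the evidence (an assumption that, as just noted, rarely holds in practice). The law of total probability would then give P(D) as a sum of likelihoods weighted by their priors:

$$P(D) = \sum_{i=1}^{n} P(D \mid H_i)\,P(H_i), \qquad \text{where} \quad \sum_{i=1}^{n} P(H_i) = 1$$

The index n and the labels H_i are notation introduced for this illustration. The difficulty described above is that for real evidence we usually cannot enumerate every H_i, so the sum cannot be completed.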

We’re a bit stuck without P(D). In Chapters 6 and 7, where we calculated the probability that a
customer service rep was male and the probability of choosing different colored LEGO studs,
respectively, we had plenty of information about P(D). This allowed us to come up with an exact
probability of our belief in our hypothesis given what we observed. Without P(D) we cannot come
up with a value for P(robbed | broken window, open front door, missing laptop). However, we’re not
completely lost.

The good news is that in some cases we don’t need to explicitly know P(D), because we often just
want to compare hypotheses. In this example, we’ll compare how likely it is that you were robbed
with another possible explanation. We can do this by looking at the ratio of our unnormalized
posterior distributions. Because P(D) would be a constant, we can safely remove it without
changing our analysis.

So, instead of calculating P(D), for the remainder of this chapter we’ll develop an alternative
hypothesis, calculate its posterior, and then compare it to the posterior from our original 
hypothesis. While this means we can’t come up with an exact probability of being robbed as the 
only possible explanation for the evidence you observed, we can still use Bayes’ theorem to play 
detective and investigate other possibilities. 

CONSIDERING ALTERNATIVE HYPOTHESES 

Let’s come up with another hypothesis to compare with our original one. Our new hypothesis 
consists of three events: 

1. A neighborhood kid hit a baseball through the front window. 

2. You left your door unlocked. 

3. You forgot that you brought your laptop to work and it’s still there. 

We’ll refer to each of these explanations simply by its number in our list, and refer to them
collectively as H₂ so that P(H₂) = P(1, 2, 3). Now we need to solve for the likelihood and prior of
this hypothesis.

The Likelihood for Our Alternative Hypothesis 

Recall that, for our likelihood, we want to calculate the probability of what you observed given our
hypothesis, or P(D | H₂). Interestingly (and logically, as you’ll see), the likelihood for this
explanation turns out to be 1: P(D | H₂) = 1



If all the events in our hypothesis did happen, then your observations of a broken window, 
unlocked door, and missing laptop would be certain. 


The Prior for Our Alternative Hypothesis 

Our prior represents the possibility of all three events happening. This means we need to first work 
out the probability of each of these events and then use the product rule to determine the prior. For 
this example, we’ll assume that each of these possible outcomes is conditionally independent. 

The first part of our hypothesis is that a neighborhood kid hit a baseball through the front window. 
While this is common in movies, I've personally never heard of it happening. I have known far more 
people who have been robbed, though, so let’s say that a baseball being hit through the window is 
half as likely as the probability of getting robbed we used earlier: 

$$P(1) = \frac{1}{2{,}000}$$

The second part of our hypothesis is that you left the door unlocked. This is fairly common; let’s say 
this happens about once a month, so: 

$$P(2) = \frac{1}{30}$$


Finally, let’s look at leaving your laptop at work. While bringing a laptop to work and leaving it 
there might be common, completely forgetting you took it in the first place is less common. Maybe 
this happens about once a year: 

$$P(3) = \frac{1}{365}$$

Since we’ve given each of these pieces of H₂ a probability, we can now calculate our prior
probability by applying the product rule: 

$$P(H_2) = \frac{1}{2{,}000} \times \frac{1}{30} \times \frac{1}{365} = \frac{1}{21{,}900{,}000}$$



As you can see, the prior probability of all three events happening is extremely low. Now we need a 
posterior for each of our hypotheses to compare. 
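The product rule step can be verified with exact fractions. The three probabilities are the chapter's guesses; the variable names are ours, and the multiplication relies on the independence assumption stated above.

```python
from fractions import Fraction

p_baseball = Fraction(1, 2000)       # P(1): baseball hit through the window
p_unlocked = Fraction(1, 30)         # P(2): you left the door unlocked
p_forgot_laptop = Fraction(1, 365)   # P(3): you forgot the laptop is at work

# Product rule for independent events: multiply the three probabilities.
prior_h2 = p_baseball * p_unlocked * p_forgot_laptop

print(prior_h2)   # 1/21900000
```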


The Posterior for Our Alternative Hypothesis 

We know that our likelihood, P(D | H₂), equals 1, so if our second hypothesis were to be true, we
would be certain to see our evidence. If we ignored the prior probability of our second hypothesis,
it would look like the posterior probability for our new hypothesis is much stronger than it is for
our original hypothesis that you were robbed (since we aren’t as likely to see the data even if we
were robbed). We can now see how the prior radically alters our unnormalized posterior probability:

$$P(D \mid H_2) \times P(H_2) = 1 \times \frac{1}{21{,}900{,}000} = \frac{1}{21{,}900{,}000}$$


Now we want to compare our posterior beliefs as well as the strength of our hypotheses with a 
ratio. You'll see that we don’t need P(D) to do this.



COMPARING OUR UNNORMALIZED POSTERIORS 


First, we want to compare the ratio of the two posteriors. A ratio tells us how many times more
likely one hypothesis is than the other. We’ll define our original hypothesis as H₁, and the ratio
looks like this:


$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)}$$


Next let’s expand this using Bayes’ theorem for each of these. We’ll write Bayes’ theorem as P(H)
× P(D | H) × 1/P(D) to make the formula easier to read in this context:


$$\frac{P(H_1) \times P(D \mid H_1) \times \frac{1}{P(D)}}{P(H_2) \times P(D \mid H_2) \times \frac{1}{P(D)}}$$


Notice that both the numerator and denominator contain 1/P(D), which means we can remove that
and maintain the ratio. This is why P(D) doesn’t matter when we compare hypotheses. Now we
have a ratio of the unnormalized posteriors. Because the posterior tells us how strong our belief is,
this ratio of posteriors tells us how many times better H₁ explains our data than H₂ without
knowing P(D). Let’s cancel out the P(D) and plug in our numbers:

$$\frac{P(H_1) \times P(D \mid H_1)}{P(H_2) \times P(D \mid H_2)} = \frac{\frac{3}{10{,}000}}{\frac{1}{21{,}900{,}000}} = 6{,}570$$

What this means is that H₁ explains what we observed 6,570 times better than H₂. In other words,
our analysis shows that our original hypothesis (H₁) explains our data much, much better than our
alternate hypothesis (H₂). This also aligns well with our intuition: given the scene you observed, a
robbery certainly sounds like a more likely assessment.
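The ratio, and the claim that P(D) drops out of it, can both be checked with exact fractions. In this sketch the priors and likelihoods are the ones used in the chapter, while the trial values of P(D) in the loop are arbitrary placeholders picked only to show that the ratio does not depend on them.

```python
from fractions import Fraction

# H1: you were robbed.
prior_h1 = Fraction(1, 1000)
likelihood_h1 = Fraction(3, 10)

# H2: baseball through the window, unlocked door, laptop left at work.
prior_h2 = Fraction(1, 21_900_000)
likelihood_h2 = Fraction(1)

# Ratio of the unnormalized posteriors.
ratio = (prior_h1 * likelihood_h1) / (prior_h2 * likelihood_h2)
print(ratio)   # 6570

# Dividing both posteriors by the same P(D) leaves the ratio unchanged.
for p_d in (Fraction(1, 20), Fraction(1, 200), Fraction(1, 1000)):  # arbitrary trial values
    normalized_h1 = prior_h1 * likelihood_h1 / p_d
    normalized_h2 = prior_h2 * likelihood_h2 / p_d
    assert normalized_h1 / normalized_h2 == ratio
```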

We’d like to express this property of the unnormalized posterior mathematically to be able to use it
for comparison. For that, we use the following version of Bayes’ theorem, where the symbol ∝
means "proportional to":

$$P(H \mid D) \propto P(H) \times P(D \mid H)$$


We can read this as: "The posterior—that is, the probability of the hypothesis given the data— 
is proportional to the prior probability of H multiplied by the probability of the data given H." 

This form of Bayes’ theorem is extremely useful whenever we want to compare the probability of 
two ideas but can’t easily calculate P(D). We cannot come up with a meaningful value for the 
probability of our hypothesis in isolation, but we’re still using a version of Bayes’ theorem to 
compare hypotheses. Comparing hypotheses means that we can always see exactly how much 
stronger one explanation of what we’ve observed is than another. 


WRAPPING UP 

This chapter explored how Bayes’ theorem provides a framework for modeling our beliefs about
the world, given data that we have observed. For Bayesian analysis, Bayes’ theorem consists of
three major parts: the posterior probability, P(H | D); the prior probability, P(H); and the
likelihood, P(D | H).

The probability of the data itself, P(D), is notably absent from this list, because we often won’t
need it to perform our analysis if all we’re worried about is comparing beliefs.


EXERCISES 


Try answering the following questions to see if you have a solid understanding of the different parts 
of Bayes’ Theorem. The solutions can be found at https://nostarch.com/learnbayes/ .

1. As mentioned, you might disagree with the original probability assigned to the
likelihood:

$$P(\text{broken window, open front door, missing laptop} \mid \text{robbed}) = \frac{3}{10}$$

How much does this change our strength in believing H₁ over H₂?


2. How unlikely would you have to believe being robbed is (our prior for H₁) in order
for the ratio of H₁ to H₂ to be even?




BAYESIAN PRIORS AND WORKING WITH PROBABILITY DISTRIBUTIONS



Prior probabilities are the most controversial aspect of Bayes’ theorem, because they’re frequently 
considered subjective. In practice, however, they often demonstrate how to apply vital background 
information to fully reason about an uncertain situation. 

In this chapter, we’ll look at how to use a prior to solve a problem, and at ways to use probability 
distributions to numerically describe our beliefs as a range of possible values rather than single 
values. Using probability distributions instead of single values is useful for two major reasons. 
First, in reality there is often a wide range of possible beliefs we might have and consider. Second, 
representing ranges of probabilities allows us to state our confidence in a set of hypotheses. We 
explored both of these examples when examining the mysterious black box in Chapter 5 . 

C-3PO'S ASTEROID FIELD DOUBTS

As an example, we'll use one of the most memorable errors in statistical analysis from a scene 
in Star Wars: The Empire Strikes Back. When Han Solo, attempting to evade enemy fighters, flies 
the Millennium Falcon into an asteroid field, the ever-knowledgeable C-3PO informs Han that
probability isn’t on his side. C-3PO says, "Sir, the possibility of successfully navigating an asteroid
field is approximately 3,720 to 1!" 

"Never tell me the odds!" replies Han. 

Superficially, this is just a fun movie dismissing "boring" data analysis, but there’s actually an 
interesting dilemma here. We the viewers know that Han can pull it off, but we probably also don’t 
disagree with C-3PO’s analysis. Even Han believes it’s dangerous, saying, "They'd have to be crazy 
to follow us." Plus, none of the pursuing TIE fighters make it through, which provides pretty strong 
evidence that C-3PO’s numbers aren’t totally off. 

What C-3PO is missing in his calculations is that Han is a badass! C-3PO isn’t wrong, he’s just 
forgetting to add essential information. The question now is: can we find a way to avoid C-3PO’s 
error without dismissing probability entirely, as Han proposes? To answer this question, we need 
to model both how C-3PO thinks and what we believe about Han, then blend those models using 
Bayes’ theorem. 

We'll start with C-3PO’s reasoning in the next section, and then we’ll capture Han’s badassery. 



DETERMINING C-3PO'S BELIEFS

C-3PO isn’t just making up numbers. He’s fluent in over 6 million forms of communication, and that
takes a lot of data to support, so we can assume that he has actual data to back up his claim of 
"approximately 3,720 to 1." Because C-3PO provides the approximate odds of successfully 
navigating an asteroid field, we know that the data he has gives him only enough information to 
suggest a range of possible rates of success. To represent that range, we need to look at 
a distribution of beliefs regarding the probability of success, rather than a single value representing 
the probability. 

To C-3PO, the only possible outcomes are successfully navigating the asteroid field or not. We’ll 
determine the various possible probabilities of success, given C-3PO’s data, using the beta 
distribution you learned about in Chapter 5. We’re using the beta distribution because it correctly
models a range of possible probabilities for an event, given information we have on the rate of 
successes and failures. 

Recall that the beta distribution is parameterized with an $\alpha$ (the number of observed successes) and a $\beta$
(the number of observed failures):

$$P(\text{RateOfSuccess} \mid \text{Successes and Failures}) = \text{Beta}(\alpha, \beta)$$

This distribution tells us which rates of success are most likely given the data we have. 

To figure out C-3PO’s beliefs, we’ll make some assumptions about where his data comes from. Let’s 
say that C-3PO has records of 2 people surviving the asteroid field, and 7,440 people ending their 
trip in a glorious explosion! Figure 9-1 shows a plot of the probability density function that 
represents C-3PO’s belief in the true rate of success. 
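
As a rough sketch of what that density looks like numerically, the snippet below evaluates Beta(2, 7440) using only Python's standard library. The helper name `beta_pdf` is an illustrative choice, not something from the text, and the 2 and 7,440 are the assumed records described above.

```python
import math

def beta_pdf(p, a, b):
    """Density of Beta(a, b) at a probability p, for 0 < p < 1."""
    # Work in log space so the large gamma terms do not overflow.
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))

successes, failures = 2, 7440   # C-3PO's assumed records from the text

mean_rate = successes / (successes + failures)
mode_rate = (successes - 1) / (successes + failures - 2)  # valid when both parameters exceed 1

print(f"mean rate of success:  {mean_rate:.6f}")
print(f"most likely rate:      {mode_rate:.6f}")
print(f"failures per success:  {failures / successes:.0f} to 1")
print(f"density at the mode:   {beta_pdf(mode_rate, successes, failures):.1f}")
```

Nearly all of the density sits at tiny rates of success, which is why the picture looks so grim for an ordinary pilot.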



[Plot: Distribution of C-3PO's likelihood of surviving]

Figure 9-1: A beta distribution representing C-3PO’s belief that Han will survive


For any ordinary pilot entering an asteroid field, this looks bad. In Bayesian terms, C-3PO’s estimate
of the true rate of success given observed data, 3,720:1, is the likelihood, which we discussed
in Chapter 8. Next, we need to determine our prior.

ACCOUNTING FOR HAN'S BADASSERY 

The problem with C-3PO’s analysis is that his data is on all pilots, but Han is far from your average 
pilot. If we can’t put a number to Han’s badassery, then our analysis is broken—not just because 
Han makes it through the asteroid field, but because we believe he’s going to. Statistics is a tool that 
aids and organizes our reasoning and beliefs about the world. If our statistical analysis not only 
contradicts our reasoning and beliefs, but also fails to change them, then something is wrong with 
our analysis. 

We have a prior belief that Han will make it through the asteroid field, because Han has survived 
every improbable situation so far. What makes Han Solo legendary is that no matter how unlikely 
survival seems, he always succeeds! 

The prior probability is often very controversial for data analysts outside of Bayesian analysis.
Many people feel that just "making up" a prior is not objective. But this scene is an object lesson in
why dismissing our prior beliefs is even more absurd. Imagine watching Empire for the first time,
getting to this scene, and having a friend sincerely tell you, "Welp, Han is dead now." There’s not a
chance you'd think it was true. Remember that C-3PO isn’t entirely wrong about how unlikely
survival is: if your friend said, "Welp, those TIE fighters are dead now," you would likely chuckle in
agreement.

Right now, we have many reasons for believing Han will survive, but no numbers to back up that 
belief. Let’s try to put something together. 

We'll start with some sort of upper bound on Han’s badassery. If we believed Han absolutely could 
not die, the movie would become predictable and boring. At the other end, our belief that Han will 
succeed is stronger than C-3PO’s belief that he won’t, so let’s say that our belief that Han will 
survive is 20,000 to 1. 

Figure 9-2 shows the distribution for our prior probability that Han will make it. 

[Plot: Distribution of our prior belief of Han Solo surviving; x-axis: Probability of success]

Figure 9-2: The beta distribution representing the range of our prior belief in Han Solo’s survival

This is another beta distribution, which we use for two reasons. First, our beliefs are very 
approximate, so we need to concede a variable rate of survival. Second, a beta distribution will 
make future calculations much easier. 

Now, with our likelihood and prior in hand, we can calculate our posterior probability in the next 
section. 
















CREATING SUSPENSE WITH A POSTERIOR 

We have now established what C-3PO believes (the likelihood), and we’ve modeled our own beliefs 
in Han (the prior), but we need a way to combine these. By combining beliefs, we create 
our posterior distribution. In this case, the posterior models our sense of suspense upon learning the 
likelihood from C-3PO: the purpose of C-3PO’s analysis is in part to poke fun at his analytical 
thinking, but also to create a sense of real danger. Our prior alone would leave us completely 
unconcerned for Han, but when we adjust it based on C-3PO’s data, we develop a new belief that 
accounts for the real danger. 

The formula for the posterior is actually very simple and intuitive. Given that we have only a 
likelihood and a prior, we can use the proportional form of Bayes’ theorem that we discussed in the 
previous chapter: 

$$\text{Posterior} \propto \text{Likelihood} \times \text{Prior}$$


Remember, using this proportional form of Bayes’ theorem means that our posterior distribution
doesn’t necessarily sum to 1. But we’re lucky because there’s an easy way to combine beta
distributions that will give us a normalized posterior when all we have is the likelihood and the
prior. Combining our two beta distributions (one representing C-3PO’s data, the likelihood, and
the other our prior belief in Han’s ability to survive anything, our prior) in this way is remarkably
easy:


$$\text{Beta}(\alpha_{\text{posterior}}, \beta_{\text{posterior}}) = \text{Beta}(\alpha_{\text{likelihood}} + \alpha_{\text{prior}}, \beta_{\text{likelihood}} + \beta_{\text{prior}})$$


We just add the alphas for our prior and likelihood and the betas for our prior and likelihood, and we
arrive at a normalized posterior. Because this is so simple, working with the beta distribution is 
very convenient for Bayesian statistics. To determine our posterior for Han making it through the 
asteroid field, we can perform this simple calculation: 


$$\text{Beta}(20002, 7441) = \text{Beta}(2 + 20000, 7440 + 1)$$

Now we can visualize our new distribution for our data. Figure 9-3 plots our final posterior belief.

[Plot: Beta(2 + 20000, 7440 + 1)]

Figure 9-3: Combining our likelihood with our prior gives us a more intriguing posterior.


By combining the C-3PO belief with our Han-is-a-badass belief, we find that we have a far more
reasonable position. Our posterior belief is a roughly 73 percent chance of survival, which means 
we still think Han has a good shot of making it, but we’re also still in suspense. 
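
The add-the-parameters rule can be sketched in a few lines of Python. This is only an illustration: the function name `combine_betas` is made up here, the likelihood uses the 2 survivals and 7,440 explosions from C-3PO's assumed records, and the prior encodes our 20,000-to-1 belief as Beta(20000, 1). The mean of the resulting beta distribution serves as the single-number summary.

```python
def combine_betas(likelihood, prior):
    """Combine two beta distributions by adding their alphas and their betas."""
    alpha_likelihood, beta_likelihood = likelihood
    alpha_prior, beta_prior = prior
    return (alpha_likelihood + alpha_prior, beta_likelihood + beta_prior)

likelihood = (2, 7440)   # C-3PO's records: 2 survivals, 7,440 explosions
prior = (20000, 1)       # our 20,000-to-1 belief that Han survives

alpha_post, beta_post = combine_betas(likelihood, prior)

# The mean of Beta(alpha, beta) is alpha / (alpha + beta).
posterior_mean = alpha_post / (alpha_post + beta_post)

print(f"posterior: Beta({alpha_post}, {beta_post})")
print(f"expected chance of survival: {posterior_mean:.3f}")
```

The expected chance of survival lands at roughly 73 percent, in line with the figure quoted above.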

What’s really useful is that we don’t simply have a raw probability for how likely Han is to make it, 
but rather a full distribution of possible beliefs. For many examples in the book, we’ve stuck to 
simply using a single value for our probabilities, but in practice, using a full distribution helps us to 
be flexible with the strength of our beliefs. 

WRAPPING UP 

In this chapter, you learned how important background information is to analyzing the data in front 
of you. C-3PO’s data provided us with a likelihood function that didn’t match up with our prior
understanding of Han’s abilities. Rather than simply dismissing C-3PO, as Han famously does, we
combine C-3PO’s likelihood with our prior to come up with an adjusted belief about the possibility
of Han’s success. In Star Wars: The Empire Strikes Back, this uncertainty is vital for the tension the
scene creates. If we completely believed C-3PO’s data or our own prior, we would either be nearly
certain that Han would die or be nearly certain that he would survive without trouble.







You also saw that you can use probability distributions, rather than a single probability, to express 
a range of possible beliefs. In later chapters in this book, you’ll look at these distributions in more 
detail to explore the uncertainty of your beliefs in a more nuanced way. 

EXERCISES 

Try answering the following questions to see if you understand how to combine prior probability 
and likelihood distributions to come up with an accurate posterior distribution; solutions to the 
questions can be found at https://nostarch.com/learnbayes/.

1. A friend finds a coin on the ground, flips it, and gets six heads in a row and then one 
tails. Give the beta distribution that describes this. Use integration to determine the 
probability that the true rate of flipping heads is between 0.4 and 0.6, reflecting that the coin 
is reasonably fair. 

2. Come up with a prior probability that the coin is fair. Use a beta distribution such that 
there is at least a 95 percent chance that the true rate of flipping heads is between 0.4 and
0.6.

3. Now see how many more heads (with no more tails) it would take to convince you 
that there is a reasonable chance that the coin is not fair. In this case, let’s say that this 
means that our belief in the rate of the coin being between 0.4 and 0.6 drops below 0.5. 



PART III 

PARAMETER ESTIMATION

10

INTRODUCTION TO AVERAGING AND PARAMETER ESTIMATION



This chapter introduces you to parameter estimation, an essential part of statistical inference where 
we use our data to guess the value of an unknown variable. For example, we might want to estimate 
the probability of a visitor on a web page making a purchase, the number of jelly beans in a jar at a 
carnival, or the location and momentum of a particle. In all of these cases, we have an unknown 
value we want to estimate, and we can use information we have observed to make a guess. We refer 
to these unknown values as parameters, and the process of making the best guess about these 
parameters as parameter estimation. 

We’ll focus on averaging, which is the most basic form of parameter estimation. Nearly everyone 
understands that taking an average of a set of observations is the best way to estimate a true value, 
but few people really stop to ask why this works—if it really does at all. We need to prove that we 
can trust averaging, because in later chapters, we build it into more complex forms of parameter 
estimation. 

ESTIMATING SNOWFALL 

Imagine there was a heavy snow last night and you’d like to figure out exactly how much snow fell, 
in inches, in your yard. Unfortunately, you don’t have a snow gauge that will give you an accurate 
measurement. Looking outside, you see that the wind has blown the snow around a bit overnight, 
meaning it isn’t uniformly smooth. You decide to use a ruler to measure the depth at seven roughly 
random locations in your yard. You come up with the following measurements (in inches): 

6.2, 4.5, 5.7, 7.6, 5.3, 8.0, 6.9 

The snow has clearly shifted around quite a bit and your yard isn’t perfectly level either, so your 
measurements are all pretty different. Given that, how can we use these measurements to make a 
good guess as to the actual snowfall? 

This simple problem is a great example case for parameter estimation. The parameter we’re 
estimating is the actual depth of the snowfall from the previous night. Note that, since the wind has 
blown the snow around and you don’t have a snow gauge, we can never know the exact amount of
snow that fell. Instead, we have a collection of data that we can combine using probability, to 
determine the contribution of each observation to our estimate, in order to help us make the best 
possible guess. 


Averaging Measurements to Minimize Error 

Your first instinct is probably to average these measurements. In grade school, we learn to average
elements by adding them up and dividing the sum by the total number of elements. So if there
are $n$ measurements, each labeled as $m_i$, where $i$ is the index of the measurement, we get:

$$\text{average} = \frac{m_1 + m_2 + m_3 + \cdots + m_n}{n}$$

If we plug in our data, we get the following solution: 

$$\frac{6.2 + 4.5 + 5.7 + 7.6 + 5.3 + 8.0 + 6.9}{7} = 6.31$$

So, given our seven observations, our best guess is that about 6.31 inches of snow fell. 
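
The same arithmetic as a minimal Python sketch (the variable names are illustrative):

```python
# The seven ruler measurements from the yard, in inches.
measurements = [6.2, 4.5, 5.7, 7.6, 5.3, 8.0, 6.9]

# Grade-school average: add everything up, then divide by the count.
average = sum(measurements) / len(measurements)

print(f"best guess: about {average:.2f} inches")
```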

Averaging is a technique embedded in our minds from childhood, so its application to this problem 
seems obvious, but in actuality, it’s hard to reason about why it works and what it has to do with 
probability. After all, each of our measurements is different, and all of them are likely different from 
the true value of the snow that fell. For many centuries, even great mathematicians feared that 
averaging data compounds all of these erroneous measurements, making for a very inaccurate 
estimate. 

When we estimate parameters, it’s vital that we understand why we’re making a decision; 
otherwise, we risk using an estimate that may be unintentionally biased or otherwise wrong in a 
systematic way. One error commonly made in statistics is to blindly apply procedures without 
understanding them, which frequently leads to applying the wrong solution to a problem. 
Probability is our tool for reasoning about uncertainty, and parameter estimation is perhaps the 
most common process for dealing with uncertainty. Let’s dive a little deeper into averaging to see if 
we can become more confident that it is the correct path. 

Solving a Simplified Version of Our Problem 

Let’s simplify our snowfall problem a bit: rather than imagining all possible depths of snow, 
imagine the snow falling into nice, uniform blocks so that your yard forms a simple
two-dimensional grid. Figure 10-1 shows this perfectly even, 6-inch-deep snowfall, visualized from the
side (rather than as a bird’s-eye view). 



[Plot: A simplified view of uniform snowfall]

Figure 10-1: Visualizing a perfectly uniform, discrete snowfall


This is the perfect scenario. We don’t have an unlimited number of possible measurements; instead, 
we sample our six possible locations, and each location has only one possible measurement—6 
inches. Obviously, averaging works in this case, because no matter how we sample from this data, 
our answer will always be 6 inches. 

Compare that to Figure 10-2, which illustrates the data when we include the windblown snow
against the left side of your house. 



[Plot: Non-uniform snowfall]

Figure 10-2: Representing the snow shifted by the wind


Now, rather than having a nice, smooth surface, we’ve introduced some uncertainty into our 
problem. Of course, we’re cheating because we can easily count each block of snow and know 
exactly how much snow has fallen, but we can use this example to explore how we would reason 
about an uncertain situation. Let’s start investigating our problem by measuring each of the blocks 
in your yard: 


















8, 7, 6, 6, 5, 4 


Next, we want to associate some probabilities with each value. Since we’re cheating and know the 
true value of the snowfall is 6 inches, we'll also record the difference between the observation and 
the true value, known as the error value (see Table 10-1).


Table 10-1: Our Observations, and Their Frequencies and Differences from Truth

Observation   Difference from truth   Probability
8              2                      1/6
7              1                      1/6
6              0                      2/6
5             -1                      1/6
4             -2                      1/6


Looking at the distance from the true measurement for each possible observation, we can see that 
the probability of overestimating by a certain value is balanced out by the probability of an 
undervalued measurement. For example, there is a 1/6 probability of picking a measurement that 
is 2 inches higher than the true value, but there’s an equally probable chance of picking a 
measurement that is 2 inches lower than the true measurement. This leads us to our first key 
insight into why averaging works: errors in measurement tend to cancel each other out. 

Solving a More Extreme Case 

With such a smooth distribution of errors, the previous scenario might not have convinced you that 
errors cancel out in more complex situations. To demonstrate how this effect still holds in other 
cases, let’s look at a much more extreme example. Suppose the wind has blown 21 inches of snow to 
one of the six squares and left only 3 inches at each of the remaining squares, as shown in Figure
10-3.

Figure 10-3: A more extreme case of wind shifting the snow

Now we have a very different distribution of snowfall. For starters, unlike the preceding example, 
none of the values we can sample from have the true level of snowfall. Also, our errors are no 
longer nicely distributed—we have a bunch of lower-than-anticipated measurements and one 
extremely high measurement. Table 10-2 shows the possible measurements, the difference from 
the true value, and the probability of each measurement. 


Table 10-2: Observations, Differences, and Probabilities for Our Extreme Example

Observation   Difference from truth   Probability
21             15                     1/6
3              -3                     5/6


We obviously can’t just match up one observation’s error value with another’s and have them 
cancel out. However, we can use probability to show that even in this extreme distribution, our 
errors still cancel each other out. We can do this by thinking of each error measurement as a value 
that’s being voted on by our data. The probability of each error observed is how strongly we believe 
in that error. When we want to combine our observations, we can consider the probability of the 
observation as a value representing the strength of its vote toward the final estimate. In this case, 
the error of -3 inches is five times more likely than the error of 15 inches, so -3 gets weighted more
heavily. So, if we were taking a vote, -3 would get five votes, whereas 15 would only get one vote. 
We combine all of the votes by multiplying each value by its probability and adding them together, 
giving us a weighted sum. In the extreme case where all the values are the same, we would just have 
1 multiplied by the value observed and the result would just be that value. In our example, we get: 

$$\frac{5}{6} \times (-3) + \frac{1}{6} \times 15 = 0$$

The errors in each observation cancel out to 0! So, once again, we find that it doesn’t matter if none 
of the possible values is a true measurement or if the distribution of errors is uneven. When we 
weight our observations by our belief in that observation, the errors tend to cancel each other out. 
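
To check the voting arithmetic for both scenarios, here is a small sketch that multiplies each error by its probability and sums the results. It uses exact fractions so no rounding sneaks in; the function name `expected_error` is an illustrative choice.

```python
from fractions import Fraction

def expected_error(table):
    """Weighted sum of errors: each error multiplied by its probability."""
    return sum(error * probability for error, probability in table)

# (difference from truth, probability) pairs from Table 10-1 and Table 10-2.
mild_case = [
    (2, Fraction(1, 6)), (1, Fraction(1, 6)), (0, Fraction(2, 6)),
    (-1, Fraction(1, 6)), (-2, Fraction(1, 6)),
]
extreme_case = [(15, Fraction(1, 6)), (-3, Fraction(5, 6))]

print("mild case:   ", expected_error(mild_case))
print("extreme case:", expected_error(extreme_case))
```

In both cases the weighted errors sum to exactly zero, even though the extreme case has no matching pair of positive and negative errors.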

Estimating the True Value with Weighted Probabilities 

We are now fairly confident that errors from our true measurements cancel out. But we still have a 
problem: we’ve been working with the errors from the true observation, but to use these we need 
to know the true value. When we don’t know the true value, all we have to work with are our 
observations, so we need to see if the errors still cancel out when we have the weighted sum of our 
original observations. 

To demonstrate that our method works, we need some "unknown" true values. Let’s start with the 
following errors: 

2, 1, -1, -2

Since the true measurement is unknown, we’ll represent it with the variable t, then add the error. 
Now we can weight each of these observations by its probability: 

$$\frac{1}{4}(2 + t) + \frac{1}{4}(1 + t) + \frac{1}{4}(-1 + t) + \frac{1}{4}(-2 + t)$$



All we’ve done here is add our error to our constant value t, which represents our true measure, 
then weight each of the results by its probability. We’re doing this to see if we can still get our 
errors to cancel out and leave us with just the value t. If so, we can expect errors to cancel out even 
when we’re just averaging raw observations. 

Our next step is to apply the probability weight to the values in our terms to get one long 
summation: 

$$\frac{2}{4} + \frac{1}{4}t + \frac{1}{4} + \frac{1}{4}t + \frac{-1}{4} + \frac{1}{4}t + \frac{-2}{4} + \frac{1}{4}t$$

Now if we reorder these terms so that all the errors are together, we can see that our errors will
still cancel out, and the weighted t value sums up to just t, our unknown true value:

$$\left(\frac{2}{4} + \frac{1}{4} + \frac{-1}{4} + \frac{-2}{4}\right) + \left(\frac{1}{4}t + \frac{1}{4}t + \frac{1}{4}t + \frac{1}{4}t\right) = 0 + t$$

This shows that even when we define our measurements as an unknown true value t and add some
error value, the errors still cancel out! We are left with just the t in the end. Even when we don’t
know what our true measurement or true error is, when we average our values the errors tend to
cancel out.

In practice, we typically can’t sample the entire space of possible measurements, but the more
samples we have, the more the errors are going to cancel out and, in general, the closer our
estimate will be to the true value.
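
One way to convince yourself that the algebra holds for any true value is to try several values of t numerically. The sketch below is only a spot check, not a proof; the function name and the trial values of t are arbitrary choices for illustration.

```python
from fractions import Fraction

errors = [2, 1, -1, -2]
weight = Fraction(1, len(errors))   # each observation is equally likely

def weighted_observations(t):
    """Weighted sum of the observations, where each observation is t plus an error."""
    return sum(weight * (t + error) for error in errors)

# Arbitrary trial values standing in for the unknown true measurement.
for t in (6, Fraction(25, 4), 100):
    print(f"t = {t}: weighted sum = {weighted_observations(t)}")
```

Whatever value t takes, the errors drop out and the weighted sum comes back as t.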



Defining Expectation, Mean, and Averaging 

What we’ve arrived at here is formally called the expectation or mean of our data. It is simply the
sum of each value weighted by its probability. If we denote each of our measurements as $x_i$, and the
probability of each measurement as $p_i$, we mathematically define the mean, which is generally
represented by $\mu$ (the lowercase Greek letter mu), as follows:

$$\mu = \sum_{i=1}^{n} p_i x_i$$

To be clear, this is the exact same calculation as the averaging we learned in grade school, just with 
notation to make the use of probability more explicit. As an example, to average four numbers, in 
school we wrote it as: 

$$\frac{x_1 + x_2 + x_3 + x_4}{4}$$

which is identical to writing:

$$\frac{1}{4}x_1 + \frac{1}{4}x_2 + \frac{1}{4}x_3 + \frac{1}{4}x_4$$

or we can just say $p_i = 1/4$ and write it as:

$$\mu = \sum_{i=1}^{4} p_i x_i$$


So even though the mean is really just the average nearly everyone is familiar with, by building it up
from the principles of probability, we see why averaging our data works. No matter how the errors
are distributed, the probability of errors at one extreme is canceled out by probabilities at the other
extreme. As we take more samples, the errors are more likely to cancel out and we start to
approach the true measurement we’re looking for.
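
Both claims can be tried out in a short sketch. The first part computes the expectation of the snowfall measurements with equal weights and compares it to the grade-school average. The second part is a simulation under assumed conditions: it draws random samples from the six blocks of the extreme example (true value 6 inches) with a fixed seed, so the sample sizes and the seed are illustrative choices rather than anything from the text.

```python
import random

def expectation(values, probabilities):
    """Sum of each value weighted by its probability."""
    return sum(p * x for x, p in zip(values, probabilities))

snow = [6.2, 4.5, 5.7, 7.6, 5.3, 8.0, 6.9]
n = len(snow)
mu = expectation(snow, [1 / n] * n)       # every measurement gets weight 1/n
school_average = sum(snow) / n
print(f"expectation: {mu:.4f}   grade-school average: {school_average:.4f}")

# Simulation: sample blocks from the extreme snowfall, whose true value is 6 inches.
blocks = [21, 3, 3, 3, 3, 3]
rng = random.Random(0)                    # fixed seed so the run is repeatable
for sample_size in (10, 100, 10_000):
    sample = [rng.choice(blocks) for _ in range(sample_size)]
    print(f"{sample_size:>6} samples: mean = {sum(sample) / sample_size:.2f}")
```

With only a handful of samples the mean can land well away from 6, while the largest sample should sit close to it.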

MEANS FOR MEASUREMENT VS. MEANS FOR SUMMARY 

We’ve been using our mean to estimate a true measurement from a distribution of observations 
with some added error. But the mean is often used as a way to summarize a set of data. For 
example, we might refer to things like: 

• The mean height of a person 

• The average price of a home 

• The average age of a student 

In all of these cases, we aren’t using mean as a parameter estimate for a single true measurement; 
instead, we’re summarizing the properties of a population. To be precise, we’re estimating a 
parameter of some abstract property of these populations that may not even be real. Even though 
mean is a very simple and well-known parameter estimate, it can be easily abused and lead to 
strange results. 

A fundamental question you should always ask yourself when averaging data is: "What exactly am I 
trying to measure and what does this value really mean?" For our snowfall example, the answer is 
easy: we’re trying to estimate how much snow actually fell last night before the wind blew it 
around. However, when we’re measuring the "average height," the answer is less clear. There is no 
such thing as an average person, and the differences in heights we observe aren’t errors—they’re 
truly different heights. A person isn’t 5’5" because part of their height drifted onto a 6’3" person! 

If you were building an amusement park and wanted to know what height restrictions to put on a 
roller coaster so that at least half of all visitors could ride it, then you have a real value you are 
trying to measure. However, in that case, the mean suddenly becomes less helpful. A better 
measurement to estimate is the probability that someone entering your park will be taller than x, 
where x is the minimum height to ride a roller coaster. 

All of the claims I’ve made in this chapter assume we are talking about trying to measure a specific 
value and using the average to cancel the errors out. That is, we’re using averaging as a form of 
parameter estimation, where our parameter is an actual value that we simply can never know. 
While averaging can also be useful to summarize large sets of data, we can no longer use the 
intuition of "errors canceling out" because the variation in the data is genuine, meaningful variation
and not error in a measurement. 

WRAPPING UP 

In this chapter, you learned that you can trust your intuition about averaging out your 
measurements in order to make a best estimate of an unknown value. This is true because errors 
tend to cancel out. We can formalize this notion of averaging into the idea of the expectation or 
mean. When we calculate the mean, we are weighting all of our observations by the probability of 
observing them. Finally, even though averaging is a simple tool to understand, we should always 
identify and understand what we’re trying to determine by averaging; otherwise, our results may 
end up being invalid. 



EXERCISES 

Try answering the following questions to see how well you understand averaging to estimate an 
unknown measurement. The solutions can be found at https://nostarch.com/learnbayes/.

1. It’s possible to get errors that don’t quite cancel out the way we want. In the 
Fahrenheit temperature scale, 98.6 degrees is the normal body temperature and 100.4 
degrees is the typical threshold for a fever. Say you are taking care of a child that feels warm 
and seems sick, but you take repeated readings from the thermometer and they all read 
between 99.5 and 100.0 degrees: warm, but not quite a fever. You try the thermometer 
yourself and get several readings between 97.5 and 98. What could be wrong with the 
thermometer? 

2. Given that you feel healthy and have traditionally had a very consistently normal 
temperature, how could you alter the measurements 100, 99.5, 99.6, and 100.2 to estimate if 
the child has a fever? 



11 

MEASURING THE SPREAD OF OUR DATA 



In this chapter, you'll learn three different methods—mean absolute deviation, variance, and 
standard deviation—for quantifying the spread, or the different extremes, of your observations. 

In the previous chapter, you learned that the mean is the best way to guess the value of an 
unknown measurement, and that the more spread out our observations, the more uncertain we are 
about our estimate of the mean. As an example, if we’re trying to figure out the location of a 
collision between two cars based only on the spread of the remaining debris after the cars have 
been towed away, then the more spread out the debris, the less sure we’d be of where precisely the 
two cars collided. 

Because the spread of our observations is related to the uncertainty in the measurement, we need 
to be able to quantify it so we can make probabilistic statements about our estimates (which you'll 
learn how to do in the next chapter). 

DROPPING COINS IN A WELL 

Say you and a friend are wandering around the woods and stumble across a strange-looking old 
well. You peer inside and see that it seems to have no bottom. To test it, you pull a coin from your 
pocket and drop it in, and sure enough, after a few seconds you hear a splash. From this, you 
conclude that the well is deep, but not bottomless. 

With the supernatural discounted, you and your friend are now equally curious as to how deep the 
well actually is. To gather more data, you grab five more coins from your pocket and drop them in, 
getting the following measurements in seconds: 

3.02, 2.95, 2.98, 3.08, 2.97

As expected, you find some variation in your results; this is primarily due to the challenge of
making sure you drop the coin from the same height and then time the splash correctly.
Next, your friend wants to try his hand at getting some measurements. Rather than picking five 
similarly sized coins, he grabs a wider assortment of objects, from small pebbles to twigs. Dropping 
them in the well, your friend gets the following measurements: 

3.31, 2.16, 3.02, 3.71, 2.80

Both of these samples have a mean ($\mu$) of about 3 seconds, but your measurements and your
friend’s measurements are spread to different degrees. Our aim in this chapter is to come up with a
way to quantify the difference between the spread of your measurements and the spread of your
friend’s. We'll use this result in the next chapter to determine the probability of certain ranges of
values for our estimate.

For the rest of this chapter we’ll indicate when we’re talking about the first group of values (your 
observations) with the variable a and the second group (your friend’s observations) with the 
variable b. For each group, each observation is denoted with a subscript; for example, $a_2$ is the
second observation from group a. 

FINDING THE MEAN ABSOLUTE DEVIATION 

We'll begin by measuring the spread of each observation from the mean ($\mu$). The mean for
both a and b is 3. Since $\mu$ is our best estimate for the true value, it makes sense to start quantifying
the difference in the two spreads by measuring the distance between the mean and each of the 
values. Table 11-1 displays each observation and its distance from the mean. 


Table 11-1: Your and Your Friend’s Observations and Their Distances from the Mean

Group   Observation   Difference from mean
a       3.02           0.02
a       2.95          -0.05
a       2.98          -0.02
a       3.08           0.08
a       2.97          -0.03
b       3.31           0.31
b       2.16          -0.84
b       3.02           0.02
b       3.71           0.71
b       2.80          -0.20




The distance from the mean is different than the error value, which is the distance from the true value
and is unknown in this case.






A first guess at how to quantify the difference between the two spreads might be to just sum up 
their differences from the mean. However, when we try this out, we find that the sum of the 
differences for both sets of observations is exactly the same, which is odd given the notable 
difference in the spread of the two data sets: 

$$\sum_{i=1}^{5} (a_i - \mu_a) = 0 \qquad \sum_{i=1}^{5} (b_i - \mu_b) = 0$$

The reason we can’t simply sum the differences from the mean is related to why the mean works in 
the first place: as we know from Chapter 10, the errors tend to cancel each other out. What we need
is a mathematical method that makes sure our differences don’t cancel out without affecting the 
validity of our measurements. 
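
A quick sketch of the problem: summing the raw differences from the mean gives (up to floating-point rounding) zero for both groups, so the sum tells us nothing about spread. The function name is illustrative.

```python
import math

a = [3.02, 2.95, 2.98, 3.08, 2.97]   # your coin drops, in seconds
b = [3.31, 2.16, 3.02, 3.71, 2.80]   # your friend's assorted objects, in seconds

def sum_of_differences(observations):
    """Sum of each observation's signed difference from the group's mean."""
    mu = sum(observations) / len(observations)
    return sum(x - mu for x in observations)

for name, group in (("a", a), ("b", b)):
    total = sum_of_differences(group)
    # Floating-point arithmetic may leave a tiny residue instead of an exact 0.
    print(f"group {name}: effectively zero? {math.isclose(total, 0, abs_tol=1e-9)}")
```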

The reason the differences cancel out is that some are negative and some are positive. So, if we 
convert all the differences to positives, we can eliminate this problem without invalidating the 
values. 

The most obvious way to do this is to take the absolute value of the differences; this is the number’s 
distance from 0, so the absolute value of 4 is 4, and the absolute value of -4 is also 4. This gives us 
the positive version of our negative numbers without actually changing them. To represent an 
absolute value, we enclose the value in vertical lines, as in |-6| = |6| = 6.

If we take the absolute value of the differences in Table 11-1 and use those in our calculation 
instead, we get a result we can work with:

$$\sum_{i=1}^{5} |a_i - \mu_a| = 0.2 \qquad \sum_{i=1}^{5} |b_i - \mu_b| = 2.08$$

Try working this out by hand, and you should get the same results. This is a more useful approach
for our particular situation, but it applies only when the two sample groups are the same size.
Imagine we had 40 more observations for group a: let’s say 20 observations of 2.9 and 20 of 3.1.
Even with these additional observations, the data in group a seems less spread out than the data in
group b, but the absolute sum of group a is now 4.2 simply because it has more observations!
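
The effect of sample size is easy to see in a sketch. The extended group below adds the 40 hypothetical observations described above (20 at 2.9 and 20 at 3.1) to group a; the function name is an illustrative choice.

```python
a = [3.02, 2.95, 2.98, 3.08, 2.97]
b = [3.31, 2.16, 3.02, 3.71, 2.80]

def sum_abs_differences(observations):
    """Sum of the absolute differences between each observation and the mean."""
    mu = sum(observations) / len(observations)
    return sum(abs(x - mu) for x in observations)

# Group a plus the 40 extra observations imagined in the text.
a_extended = a + [2.9] * 20 + [3.1] * 20

print(f"group a:          {sum_abs_differences(a):.2f}")
print(f"group b:          {sum_abs_differences(b):.2f}")
print(f"group a extended: {sum_abs_differences(a_extended):.2f}")
```

The extended group's sum overtakes group b's even though every added observation is within 0.1 seconds of the mean.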

To correct for this, we can normalize our values by dividing by the total number of observations. 
Rather than dividing, though, we’ll just multiply by 1 over the total, which is known as multiplying
by the reciprocal and looks like this:

$$\frac{1}{5} \times \sum_{i=1}^{5} |a_i - \mu_a| = 0.04 \qquad \frac{1}{5} \times \sum_{i=1}^{5} |b_i - \mu_b| = 0.416$$

Now we have a measurement of the spread that isn’t dependent on the sample size! The 
generalization of this approach is as follows: 

$$\text{MAD}(x) = \frac{1}{n} \times \sum_{i=1}^{n} |x_i - \mu|$$

Here we’ve calculated the mean of the absolute differences between our observations and the 
mean. This means that for group a the average observation is 0.04 seconds from the mean, and for 
group b it’s about 0.416 seconds from the mean. We call the result of this formula the mean absolute 
deviation (MAD). The MAD is a very useful and intuitive measure of how spread out your 
observations are. Given that group a has a MAD of 0.04 and group b around 0.4, we can now say 
that group b is about 10 times as spread out as group a. 
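The MAD calculation can be sketched in a few lines of R. Table 11-1 is not reproduced in this section, so the two vectors below are assumed values, chosen to be consistent with the sums of 0.2 and 2.08 quoted above; the helper name `mean_abs_dev` is hypothetical.

```r
# Assumed observations in seconds (Table 11-1 is not shown in this section).
group_a <- c(3.02, 2.95, 2.98, 3.08, 2.97)
group_b <- c(3.31, 2.16, 3.02, 3.71, 2.80)

# Mean absolute deviation: the average distance of each observation from the mean.
mean_abs_dev <- function(x) {
  mean(abs(x - mean(x)))
}

mad_a <- mean_abs_dev(group_a)
mad_b <- mean_abs_dev(group_b)
print(c(group_a = mad_a, group_b = mad_b))
```

Note that R's built-in mad() computes the median absolute deviation, a different statistic, which is why the sketch defines its own helper.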



FINDING THE VARIANCE 

Another way to mathematically make all of our differences positive without invalidating the data is 
to square them: $(x_i - \mu)^2$. This method has at least two benefits over using MAD. 

The first benefit is a bit academic: squaring values is much easier to work with mathematically than 
taking their absolute value. In this book, we won’t take advantage of this directly, but for 
mathematicians, the absolute value function can be a bit annoying in practice. 

The second, and more practical, reason is that squaring imposes a penalty that grows with the square 
of the distance, meaning measurements very far away from the mean are penalized much more. In other words, 
small differences aren’t nearly as important as big ones, as we would feel intuitively. If someone 
scheduled your meeting in the wrong room, for example, you wouldn’t be too upset if you ended up 
next door to the right room, but you’d almost certainly be upset if you were sent to an office on the 
other side of the country. 

If we replace the absolute value with the squared difference, we get the following: 

$$\mathrm{Var}(x) = \frac{1}{n} \times \sum_{i=1}^{n} (x_i - \mu)^2$$

This formula, which has a very special place in the study of probability, is called the variance. Notice 
that the equation for variance is exactly the same as MAD except that the absolute value function in 
MAD has been replaced with squaring. Because it has nicer mathematical properties, variance is 
used much more frequently in the study of probability than MAD. We can see how different our 
results look when we calculate their variance: 

Var(group a) = 0.002, Var(group b) = 0.269 

Because we’re squaring, however, we no longer have an intuitive understanding of what the results 
of variance mean. MAD gave us an intuitive definition: this is the average distance from the mean. 
Variance, on the other hand, says: this is the average squared difference. Recall that when we used 
MAD, group b was about 10 times more spread out than group a, but in the case of variance, 
group b is now 100 times more spread out! 
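A short R sketch of the variance formula, under the same assumption as before that the two vectors stand in for Table 11-1 (which is not shown in this section). The helper `pop_var` is a hypothetical name. It divides by n, as the formula above does, whereas R's built-in var() divides by n - 1, so the two give slightly different numbers.

```r
# Assumed observations in seconds (stand-ins for Table 11-1).
group_a <- c(3.02, 2.95, 2.98, 3.08, 2.97)
group_b <- c(3.31, 2.16, 3.02, 3.71, 2.80)

# Variance as defined in the text: the mean of the squared differences (divide by n).
pop_var <- function(x) {
  mean((x - mean(x))^2)
}

var_a <- pop_var(group_a)
var_b <- pop_var(group_b)

# R's var() uses n - 1 in the denominator, so it is larger by a factor of n / (n - 1).
sample_var_a <- var(group_a)

print(c(var_a = var_a, var_b = var_b, sample_var_a = sample_var_a))
print(var_b / var_a)  # ratio of the two variances
```

With only five observations the choice of denominator matters; with large samples the two versions converge.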

FINDING THE STANDARD DEVIATION 

While in theory variance has many properties that make it useful, in practice it can be hard to 
interpret the results. It’s difficult for humans to think about what a difference of 0.002 seconds 
squared means. As we’ve mentioned, the great thing about MAD is that the result maps quite well to 
our intuition. If the MAD of group b is 0.4, that means that the average distance between any given 
observation and the mean is literally 0.4 seconds. But averaging over squared differences doesn’t 
allow us to reason about a result as nicely. 

To fix this, we can take the square root of the variance in order to scale it back into a number that 
works with our intuition a bit better. The square root of a variance is called the standard 
deviation and is represented by the lowercase Greek letter sigma (σ). It is defined as follows: 

$$\sigma = \sqrt{\frac{1}{n} \times \sum_{i=1}^{n} (x_i - \mu)^2}$$

The formula for standard deviation isn’t as scary as it might seem at first. Looking at all of the 
different parts, given that our goal is to numerically represent how spread out our data is, we can 
see that: 




1. We want the difference between our data and the mean, $x_i - \mu$. 

2. We need to convert negative numbers to positives, so we take the square, $(x_i - \mu)^2$. 

3. We need to add up all the differences: $\sum_{i=1}^{n} (x_i - \mu)^2$. 

4. We don’t want the sum to be affected by the number of observations, so we normalize 
it with 1/n. 

5. Finally, we take the square root of everything so that the numbers are closer to what 
they would be if we used the more intuitive absolute distance. 

If we look at the standard deviation for our two groups, we can see that it’s very similar to the MAD: 

σ(group a) = 0.046, σ(group b) = 0.519 

The standard deviation is a happy medium between the intuitiveness of MAD and the mathematical 
ease of variance. Notice that, just like with MAD, the difference in the spread between b and a is 
roughly a factor of 10. The standard deviation is so useful and ubiquitous that, in most of the literature on 
probability and statistics, variance is defined simply as σ², or sigma squared! 
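The five steps listed above can be followed one at a time in R. As before, the vector is an assumed stand-in for group b in Table 11-1, and the intermediate variable names are illustrative only.

```r
# Assumed observations for group b, in seconds (stand-in for Table 11-1).
group_b <- c(3.31, 2.16, 3.02, 3.71, 2.80)

diffs      <- group_b - mean(group_b)     # step 1: difference from the mean
squared    <- diffs^2                     # step 2: square to remove the sign
total      <- sum(squared)                # step 3: add up the squared differences
normalized <- total / length(group_b)     # step 4: normalize by 1/n
sigma_b    <- sqrt(normalized)            # step 5: square root to return to seconds

print(sigma_b)
```

R's built-in sd() divides by n - 1 rather than n, so sd(group_b) would come out somewhat larger than the value computed here.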

So we now have three different ways of measuring the spread of our data. We can see the results 
in Table 11-2. 


Table 11-2: Measurements of Spread by Method 

Method of measuring spread    Group a    Group b
Mean absolute deviation       0.040      0.416
Variance                      0.002      0.269
Standard deviation            0.046      0.519


None of these methods for measuring spread is more correct than any other. By far the most 
commonly used value is the standard deviation, because we can use it, together with the mean, to 
define a normal distribution, which in turn allows us to assign explicit probabilities to possible true 
values of our measurements. In the next chapter, we’ll take a look at the normal distribution and 
see how it can help us understand our level of confidence in our measurements. 

WRAPPING UP 

In this chapter, you learned three methods for quantifying the spread of a group of observations. 
The most intuitive measurement of the spread of values is the mean absolute deviation (MAD), 
which is the average distance of each observation from the mean. While intuitive, MAD isn’t as 
useful mathematically as the other options. 

The mathematically preferred method is the variance, which is the mean of the squared differences 
between our observations and their mean. But when we calculate the variance, we lose the intuitive 
feel for what our calculation means. 

Our third option is to use the standard deviation, which is the square root of the variance. The 
standard deviation is mathematically useful and also gives us results that are reasonably intuitive. 



EXERCISES 

Try answering the following questions to see how well you understand these different methods of 
measuring the spread of data. The solutions can be found at https://nostarch.com/learnbayes/ . 

1. One of the benefits of variance is that squaring the differences makes the penalties 
grow with the square of the distance. Give some examples of when this would be a useful property. 

2. Calculate the mean, variance, and standard deviation for the following values: 1, 2, 3, 
4, 5, 6, 7, 8, 9, 10. 

THE NORMAL DISTRIBUTION 



In the previous two chapters, you learned about two very important concepts: mean (μ), which 
allows us to estimate a measurement from various observations, and standard deviation (σ), which 
allows us to measure the spread of our observations. 

On its own, each concept is useful, but together, they are even more powerful: we can use them as 
parameters for the most famous probability distribution of all, the normal distribution. 

In this chapter you'll learn how to use the normal distribution to determine an exact probability for 
your degree of certainty about one estimate proving true compared to others. The true goal of 
parameter estimation isn’t simply to estimate a value, but rather to assign a probability for 
a range of possible values. This allows us to perform more sophisticated reasoning with uncertain 
values. 

We established in the preceding chapters that the mean is a solid method of estimating an unknown 
value based on existing data, and that the standard deviation can be used to measure the spread of 
that data. By measuring the spread of our observations, we can determine how confidently we 
believe in our mean. It makes sense that the more spread out our observations, the less sure we are 
in our mean. The normal distribution allows us to precisely quantify how certain we are in various 
beliefs when taking our observations into account. 

MEASURING FUSES FOR DASTARDLY DEEDS 

Imagine a mustachioed cartoon villain wants to set off a bomb to blow a hole in a bank vault. 
Unfortunately, he has only one bomb, and it’s rather large. He knows that if he gets 200 feet away 
from the bomb, he can escape to safety. It takes him 18 seconds to make it that far. If he’s any closer 
to the bomb, he risks death. 

Although the villain has only one bomb, he has six fuses of equal size, so he decides to test out five 
of the six fuses, saving the last one for the bomb. Because the fuses are identical, they should all take the 
same amount of time to burn through. He sets off each fuse and measures how long it takes to burn 
through to make sure he has the 18 seconds he needs to get away. Of course, being in a rush leads 
to some inconsistent measurements. Here are the times he recorded (in seconds) for each fuse to 
burn through: 19, 22, 20, 19, 23. 

So far so good: none of the fuses takes less than 18 seconds to burn. Calculating the mean gives us μ 
= 20.6, and calculating the standard deviation gives us σ = 1.62. 
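These two numbers can be reproduced in R from the five recorded burn times. The sketch below uses the divide-by-n definition of standard deviation from the previous chapter; note that R's built-in sd() divides by n - 1 and therefore returns a larger value for the same data.

```r
# Burn times recorded by the villain, in seconds.
fuse_times <- c(19, 22, 20, 19, 23)

mu <- mean(fuse_times)

# Standard deviation with a 1/n normalization, as defined in the previous chapter.
sigma <- sqrt(mean((fuse_times - mu)^2))

# For comparison only: sd() uses n - 1 in the denominator.
sample_sd <- sd(fuse_times)

print(c(mu = mu, sigma = sigma, sample_sd = sample_sd))
```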

But now we want to determine a concrete probability for how likely it is that, given the data we 
have observed, a fuse will go off in less than 18 seconds. Since our villain values his life even more 
than the money, he wants to be 99.9 percent sure he’ll survive the blast, or he won’t attempt the 
heist. 

In Chapter 10, you learned that the mean is a good estimate for the true value given a set of 
measurements, but we haven’t yet come up with any way to express how strongly we believe this 
value to be true. 

In Chapter 11, you learned that you can quantify how spread out your observations are by 
calculating the standard deviation. It seems rational that this might also help us figure out how 
likely the alternatives to our mean might be. For example, suppose you drop a glass on the floor and 
it shatters. When you’re cleaning up, you might search adjacent rooms based on how dispersed the 
pieces of glass are. If, as shown in Figure 12-1, the pieces are very close together, you would feel 
more confident that you don’t need to check for glass in the next room. 



Figure 12-1: When the broken pieces are closer together, you’re more sure of where to clean up. 

However, if the glass pieces are widely dispersed, as in Figure 12-2, you’ll likely want to sweep 
around the entrance of the next room, even if you don’t immediately see broken glass there. 
Likewise, if the villain’s fuse timings are very spread out, even if he didn’t observe any fuses lasting 
less than 18 seconds, it’s possible that the real fuse could still burn through in less than 18 seconds. 



Figure 12-2: When the pieces are spread out, you're less sure of where they might be. 

When observations are scattered visually, we intuitively feel that there might be other observations 
at the extreme limits of what we can see. We are also less confident in exactly where the center is. 

In the glass example, it’s harder to be sure of where the glass fell if you weren’t there to witness the 
fall and the glass fragments are dispersed widely. 

We can quantify this intuition with the most studied and well-known probability distribution: the 
normal distribution. 

THE NORMAL DISTRIBUTION 

The normal distribution is a continuous probability distribution (like the beta distribution 
in Chapter 5) that best describes the strength of possible beliefs in the value of an uncertain 
measurement, given a known mean and standard deviation. It takes μ and σ (the mean and 
standard deviation, respectively) as its only two parameters. A normal distribution with μ = 0 and σ 
= 1 has a bell shape, as shown in Figure 12-3. 


Figure 12-3: The normal distribution with a mean of 0 and a standard deviation of 1 



As you can see, the center of the normal distribution is its mean. The width of a normal distribution 
is determined by its standard deviation. Figures 12-4 and 12-5 show normal distributions with μ = 
0 and σ = 0.5 and 2, respectively. 

















Figure 12-4: A normal distribution with μ = 0 and σ = 0.5 













Figure 12-5: A normal distribution with μ = 0 and σ = 2 

As the standard deviation shrinks, so does the width of the normal distribution. 
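One way to see this numerically, without plotting, is to evaluate the normal PDF with dnorm() for the three standard deviations used in the figures. This is only an illustrative sketch: it compares the height of the curve at the center (x = 0) with its height further out (x = 2, an arbitrary point chosen for the example).

```r
# The three standard deviations shown in Figures 12-3, 12-4, and 12-5.
sds <- c(0.5, 1, 2)

# Height of the density at the mean: a narrower curve is taller at its center.
peak_height <- dnorm(0, mean = 0, sd = sds)

# Height of the density at x = 2: a wider curve keeps more density far from the mean.
height_at_2 <- dnorm(2, mean = 0, sd = sds)

print(data.frame(sd = sds, peak_height = peak_height, height_at_2 = height_at_2))
```

Because the total area under every PDF is 1, a curve that is squeezed toward the mean has to rise higher there, and a curve that is stretched out has to flatten.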

The normal distribution, as we’ve discussed, reflects how strongly we believe in our mean. So, if our 
observations are more scattered, we believe in a wider range of possible values and have less 
confidence in the central mean. Conversely, if all of our observations are more or less the same 
(meaning a small σ), we believe our estimate is pretty accurate. 

When the only thing we know about a problem is the mean and standard deviation of the data we 
have observed, the normal distribution is the most honest representation of our state of beliefs. 

SOLVING THE FUSE PROBLEM 

Going back to our original problem, we have a normal distribution with μ = 20.6 and σ = 1.62. We 
don’t really know anything else about the properties of the fuses beyond the recorded burn times, 
so we can model the data with a normal distribution using the observed mean and standard 
deviation (see Figure 12-6). 













Figure 12-6: A normal distribution with μ = 20.6 and σ = 1.62 (plot title: Normal distribution representing our fuse measurements; x-axis: Value) 


The question we want to answer is: what is the probability, given the data observed, that the fuse 
will run for 18 seconds or less? To solve this problem, we need to use the probability density 
function (PDF), a concept you first learned about in Chapter 5 . The PDF for the normal distribution 
is: 


$$N(\mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \times e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

And to get the probability, we need to integrate this function over values less than 18: 

$$\int_{-\infty}^{18} N(\mu = 20.6, \sigma = 1.62)$$


You can imagine integration as simply taking the area under the curve for the region you’re 
interested in, as shown in Figure 12-7 . 





Figure 12-7: The area under the curve that we’re interested in (plot title: Area representing fuse lengths less than or equal to 18 seconds; x-axis: Value) 

The area of the shaded region represents the probability of the fuse lasting 18 seconds or less given 
the observations. Notice that even though none of the observed values was less than 18, because of 
the spread of the observations, the normal distribution in Figure 12-6 shows that a value of 18 or 
less is still possible. By integrating over all values less than 18, we can calculate the probability that 
the fuse will not last as long as our villain needs it to. 

Integrating this function by hand is not an easy task. Thankfully, we have R to do the integration for 
us. 

Before we do this, though, we need to determine what number to start integrating from. The 
normal distribution is defined on the range of all possible values from negative infinity (-∞) to 
infinity (∞). So in theory what we want is: 

$$P(\text{fuse time} < 18) = \int_{-\infty}^{18} N(\mu, \sigma)$$

But a computer cannot evaluate a function at every point on an infinite range, so it helps to pick a 
finite place to start. Luckily, as 
you can see in Figures 12-6 and 12-7, the probability density function becomes an incredibly small 
value very quickly. We can see that the line in the PDF is nearly flat at 10, meaning there is virtually 
no probability in this region, so we can just integrate from 10 to 18. We could also choose a lower 
value, like 0, but because there’s effectively no probability in this region, it won’t change our result 
in any meaningful way. In the next section, we’ll discuss a heuristic that makes choosing a lower or 
upper bound easier. 











We’ll integrate this function using R’s integrate() function and the dnorm() function (which is just R’s 
function for the normal distribution PDF), calculating the PDF of the normal distribution as follows: 


integrate(function(x) dnorm(x,mean=20.6,sd=1.62),10,18) 

0.05425369 with absolute error < 3e-11 


Rounding the value, we can see that P(fuse time < 18) = 0.05, telling us there is a 5 percent chance 
that the fuse will last 18 seconds or less. Even villains value their own lives, and in this case our 
villain will attempt the bank robbery only if he is 99.9 percent sure that he can safely escape the 
blast. For today then, the bank is safe! 
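As a cross-check on the choice of 10 as a lower bound, the sketch below compares three ways of getting the same probability in R. Two of them go beyond what the text uses: integrate() accepts -Inf as a limit, and pnorm() returns the cumulative probability of a normal distribution directly. The variable names are illustrative.

```r
# PDF of the fuse model from the text.
pdf_fuse <- function(x) dnorm(x, mean = 20.6, sd = 1.62)

# The approach in the text: integrate from a finite lower bound of 10.
p_finite <- integrate(pdf_fuse, 10, 18)$value

# Alternative 1: let integrate() handle the infinite lower limit.
p_infinite <- integrate(pdf_fuse, -Inf, 18)$value

# Alternative 2: the normal cumulative probability, with no explicit integration.
p_cdf <- pnorm(18, mean = 20.6, sd = 1.62)

print(c(finite = p_finite, infinite = p_infinite, cdf = p_cdf))
```

All three agree to several decimal places, which supports the claim that almost no probability lies below 10.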

The power of the normal distribution is that we can reason probabilistically about a wide range of 
possible alternatives to our mean, giving us an idea of how realistic our mean is. We can use the 
normal distribution any time we want to reason about data for which we know only the mean and 
standard deviation. 

However, this is also the danger of the normal distribution. In practice, if you have information 
about your problem besides the mean and standard deviation, it is usually best to make use of that. 
We’ll see an example of this in a later section. 

SOME TRICKS AND INTUITIONS 

While R makes integrating the normal distribution significantly easier than trying to solve the 
integral by hand, there’s a very useful trick that can simplify things even further when you’re 
working with the normal distribution. For any normal distribution with a known mean and 
standard deviation, you can estimate the area under the curve around μ in terms of σ. 

For example, the area under the curve for the range from μ - σ (one standard deviation less than 
the mean) to μ + σ (one standard deviation greater than the mean) holds 68 percent of the mass of 
the distribution. 

This means that 68 percent of the possible values fall within ± one standard deviation of the mean, 
as shown in Figure 12-8. 

Figure 12-8: Sixty-eight percent of the probability density (area under the curve) lies within one 
standard deviation of the mean in either direction. 

We can continue by increasing our distance from the mean by multiples of σ. Table 12-1 gives 
probabilities for these other areas. 

Table 12-1: Areas Under the Curve for Different Means 

Distance from the mean    Probability
σ                         68 percent
2σ                        95 percent
3σ                        99.7 percent
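The percentages in Table 12-1 can be checked against the standard normal distribution (μ = 0, σ = 1). The sketch below uses pnorm(), R's normal cumulative probability function, which the text has not introduced at this point; it is used here only as a convenient way to get the area between -k and +k.

```r
# Number of standard deviations on either side of the mean.
k <- 1:3

# Area under the standard normal curve between -k and +k.
within_k <- pnorm(k) - pnorm(-k)

print(round(within_k, 4))
```

The same areas hold for any normal distribution, because measuring distance in units of σ removes the dependence on the particular mean and standard deviation.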


This little trick is very useful for quickly assessing the likelihood of a value given even a small 
sample. All you need is a calculator to easily figure out the μ and σ, which means you can do some 
pretty accurate estimations even in the middle of a meeting! 

As an example, when measuring snowfall in Chapter 10 we had the following measurements: 6.2, 
4.5, 5.7, 7.6, 5.3, 8.0, 6.9. For these measurements, the mean is 6.31 and the standard deviation is 
1.17. This means that we can be 95 percent sure that the true value of the snowfall was somewhere 
between 3.97 inches (6.31 - 2 × 1.17) and 8.65 inches (6.31 + 2 × 1.17). No need to manually 
calculate an integral or boot up a computer to use R! 
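For readers who do have R at hand, the same back-of-the-envelope interval can be written out as follows. The sketch uses the divide-by-n standard deviation from Chapter 11; because it does not round the mean and standard deviation first, the bounds can differ from the hand calculation in the second decimal place.

```r
# Snowfall measurements from Chapter 10, in inches.
snowfall <- c(6.2, 4.5, 5.7, 7.6, 5.3, 8.0, 6.9)

mu <- mean(snowfall)
sigma <- sqrt(mean((snowfall - mu)^2))  # 1/n normalization, as in Chapter 11

# Two standard deviations on either side of the mean cover about 95 percent.
lower <- mu - 2 * sigma
upper <- mu + 2 * sigma

print(round(c(mu = mu, sigma = sigma, lower = lower, upper = upper), 2))
```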



















Even when we do want to use R to integrate, this trick can be useful for determining a minimum or 
maximum value to integrate from or to. For example, if we want to know the probability that the 
villain’s bomb fuse will last longer than 21 seconds, we don’t want to have to integrate from 21 to 
infinity. What can we use for our upper bound? We can integrate from 21 to 25.46 (which is 20.6 + 
3 × 1.62), which is 3 standard deviations from our mean. Being three standard deviations from the 
mean will account for 99.7 percent of our total probability. The remaining 0.3 percent lies on either 
side of the distribution, so only half of that, 0.15 percent of our probability density, lies in the region 
greater than 25.46. So if we integrate from 21 to 25.46, we’ll only be missing a tiny amount of 
probability in our result. Clearly, we could easily use R to integrate from 21 to something really safe 
such as 30, but this trick allows us to figure out what "really safe" means. 
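The size of the probability that is left out by stopping at three standard deviations can be measured directly. In this sketch the truncated integral follows the text; the comparison value comes from pnorm() with lower.tail = FALSE, which is an addition not used in the text and gives the full upper-tail probability.

```r
mu <- 20.6
sigma <- 1.62

# Upper bound three standard deviations above the mean.
upper <- mu + 3 * sigma

# Probability of lasting longer than 21 seconds, truncated at the upper bound.
p_truncated <- integrate(function(x) dnorm(x, mean = mu, sd = sigma), 21, upper)$value

# The same probability over the whole upper tail, for comparison.
p_full <- pnorm(21, mean = mu, sd = sigma, lower.tail = FALSE)

# What the truncation leaves out: roughly 0.15 percent.
missing_mass <- p_full - p_truncated

print(c(upper = upper, truncated = p_truncated, full = p_full, missing = missing_mass))
```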

"N SIGMA" EVENTS 

You may have heard an event being described in terms of sigma events, such as "the fall of the stock 
price was an eight-sigma event." What this expression means is that the observed data is eight 
standard deviations from the mean. We saw the progression of one, two, and three standard 
deviations from the mean in Table 12-1, which cover 68, 95, and 99.7 percent of the probability, 
respectively. You can easily intuit from this that an eight-sigma event must be extremely unlikely. In 
fact, if you ever observe data that is five standard deviations from the mean, it’s likely a good sign 
that your normal distribution is not modeling the underlying data accurately. 

To show the growing rarity of an event as it increases by n sigma, say you are looking at events you 
might observe on a given day. Some are very common, such as waking up to the sunrise. Others are 
less common, such as waking up and it being your birthday. Table 12-2 shows how often you 
would expect to see an event at each additional sigma from the mean. 


Table 12-2: Rarity of an Event as It Increases by n Sigma 

(-/+) Distance from the mean    Expected every...
σ                               3 days
2σ                              3 weeks
3σ                              1 year
4σ                              4 decades
5σ                              5 millennia
6σ                              1.4 million years


So a three-sigma event is like waking up and realizing it’s your birthday, but a six-sigma event is 
like waking up and realizing that a giant asteroid is crashing toward earth! 
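The orders of magnitude in Table 12-2 can be reproduced with a short calculation. The sketch rests on two assumptions that the text does not state explicitly: that an "n-sigma event" means landing at least n standard deviations from the mean in either direction (as the table's "(-/+)" heading suggests), and that there is one observation per day.

```r
# Number of standard deviations from the mean.
k <- 1:6

# Probability of landing at least k sigma away from the mean, in either direction.
p_outside <- 2 * pnorm(-k)

# With one observation per day, the expected wait is the reciprocal of that probability.
days_between <- 1 / p_outside
years_between <- days_between / 365.25

print(data.frame(sigma = k,
                 days = signif(days_between, 3),
                 years = signif(years_between, 3)))
```

Under these assumptions the first row comes out at about three days and the last at roughly 1.4 million years, in line with the table.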


THE BETA DISTRIBUTION AND THE NORMAL DISTRIBUTION 

You may remember from Chapter 5 that the beta distribution allows us to estimate the true 
probability given that we have observed α desired outcomes and β undesired outcomes, where the 
total number of outcomes is α + β. Based on that, you might take some issue with the notion that the 
normal distribution is truly the best method to model parameter estimation given that we know 
only the mean and standard deviation of any given data set. After all, we could describe a situation 
where α = 3 and β = 4 by simply observing three values of 1 and four values of 0. This would give us 
μ = 0.43 and σ = 0.53. We can then compare the beta distribution with α = 3 and β = 4 to a normal 
distribution with μ = 0.43 and σ = 0.53, as shown in Figure 12-9. 

Figure 12-9: Comparing the beta distribution to the normal distribution (legend: Normal, Beta) 


It’s clear that these distributions are quite different. We can see that for both distributions the 
center of mass appears in roughly the same place, but the bounds for the normal distribution 
extend way beyond the limits of our graph. This demonstrates a key point: only when you know 
nothing about the data other than its mean and variance is it safe to assume a normal distribution. 
For the beta distribution, we know that the value we’re looking for must lie in the range 0 to 1. The 
normal distribution is defined from -∞ to ∞, which often includes values that cannot possibly exist. 
However, in most cases this is not practically important because measurements out that far are 
essentially impossible in probabilistic terms. But for our example of measuring the probability of an 
event happening, this missing information is important for modeling our problem. 

So, while the normal distribution is a very powerful tool, it is no substitute for having more 
information about a problem. 
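To put a number on how much this matters for the example above, the sketch below measures how much probability the normal model places on values that a probability can never take (below 0 or above 1). It uses pnorm(), which is an addition to the functions used in the text, and it also checks the quoted μ and σ against the seven observations; the value 0.53 matches R's sd(), which divides by n - 1.

```r
# Three observations of 1 and four of 0, as described in the text.
obs <- c(1, 1, 1, 0, 0, 0, 0)
mu <- round(mean(obs), 2)    # 0.43
sigma <- round(sd(obs), 2)   # 0.53 (sd() divides by n - 1)

# Probability the normal model assigns to impossible values.
below_zero <- pnorm(0, mean = mu, sd = sigma)
above_one <- pnorm(1, mean = mu, sd = sigma, lower.tail = FALSE)
normal_outside <- below_zero + above_one

# The beta distribution keeps all of its probability between 0 and 1.
beta_inside <- integrate(function(x) dbeta(x, 3, 4), 0, 1)$value

print(c(normal_outside = normal_outside, beta_inside = beta_inside))
```

Roughly a third of the normal model's probability falls outside the range 0 to 1 here, which is the missing information the beta distribution supplies.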

WRAPPING UP 

The normal distribution is an extension of using the mean for estimating a value from observations. 
The normal distribution combines the mean and the standard deviation to model how spread out 
our observations are from the mean. This is important because it allows us to reason about the 
error in our measurements in a probabilistic way. Not only can we use the mean to make our best 
guess, but we can also make probabilistic statements about ranges of possible values for our 
estimate. 

EXERCISES 

Try answering the following questions to see how well you understand the normal distribution. The 
solutions can be found at https://nostarch.com/learnbayes/ . 

1. What is the probability of observing a value five sigma greater than the mean or 
more? 

2. A fever is any temperature greater than 100.4 degrees Fahrenheit. Given the 
following measurements, what is the probability that the patient has a fever? 

100.0, 99.8, 101.0, 100.5, 99.7 

3. Suppose in Chapter 11 we tried to measure the depth of a well by timing coin drops 
and got the following values: 

2.5, 3, 3.5, 4, 2 

The distance an object falls can be calculated (in meters) with the following formula: 

$$\text{distance} = \frac{1}{2} \times G \times \text{time}^2$$

where G is 9.8 m/s/s. What is the probability that the well is over 500 meters deep? 

4. What is the probability there is no well (i.e., the well is really 0 meters deep)? You’ll 
notice that probability is higher than you might expect, given your observation that there is a 
well. There are two good explanations for this probability being higher than it should. The 
first is that the normal distribution is a poor model for our measurements; the second is 
that, when making up numbers for an example, I chose values that you likely wouldn’t see in 
real life. Which seems more likely to you? 


TOOLS OF PARAMETER ESTIMATION: THE PDF, CDF, AND QUANTILE FUNCTION 



In this part so far, we've focused heavily on the building blocks of the normal distribution and its 
use in estimating parameters. In this chapter, we'll dig in a bit more, exploring some mathematical 
tools we can use to make better claims about our parameter estimates. We’ll walk through a 
real-world problem and see how to approach it in different ways using a variety of metrics, functions, 
and visualizations. 

This chapter will cover more on the probability density function (PDF); introduce the cumulative 
distribution function (CDF), which helps us more easily determine the probability of ranges of 
values; and introduce quantiles, which divide our probability distributions into parts with equal 
probabilities. For example, a percentile is a 100-quantile, meaning it divides the probability 
distribution into 100 equal pieces. 

ESTIMATING THE CONVERSION RATE FOR AN EMAIL SIGNUP 
LIST 

Say you run a blog and want to know the probability that a visitor to your blog will subscribe to 
your email list. In marketing terms, getting a user to perform a desired event is referred to as 
the conversion event, or simply a conversion, and the probability that a user will subscribe is 
the conversion rate. 

As discussed in Chapter 5, we would use the beta distribution to estimate p, the probability of 
subscribing, when we know k, the number of people subscribed, and n, the total number of visitors. 
The two parameters needed for the beta distribution are α, which in this case represents the total 
subscribed (k), and β, representing the total not subscribed (n - k). 

When the beta distribution was introduced, you learned only the basics of what it looked like and 
how it behaved. Now you'll see how to use it as the foundation for parameter estimation. We want 
to not only make a single estimate for our conversion rate, but also come up with a range of 
possible values within which we can be very confident the real conversion rate lies. 


THE PROBABILITY DENSITY FUNCTION 

The first tool we’ll use is the probability density function. We’ve seen the PDF several times so far in 
this book: in Chapter 5 where we talked about the beta distribution; in Chapter 9 when we used 
PDFs to combine Bayesian priors; and once again in Chapter 12, when we talked about the normal 
distribution. The PDF is a function that takes a value and returns the probability density at that value, 
which tells us how plausible that value is relative to others. 

In the case of estimating the true conversion rate for your email list, let’s say for the first 40,000 
visitors, you get 300 subscribers. The PDF for our problem is the beta distribution where α = 300 
and β = 39,700: 

$$\mathrm{Beta}(x; 300, 39700) = \frac{x^{300-1}(1-x)^{39700-1}}{\mathrm{beta}(300, 39700)}$$


We’ve spent a lot of time talking about using the mean as a good estimate for a measurement, given 
some uncertainty. Most PDFs have a mean, which we compute specifically for the beta distribution 
as follows: 


$$\mu_{\mathrm{Beta}} = \frac{\alpha}{\alpha + \beta}$$


This formula is relatively intuitive: simply divide the number of outcomes we care about (300) by 
the total number of outcomes (40,000). This is the same mean you’d get if you simply considered 
each visitor who subscribed an observation of 1 and every other visitor an observation of 0 and then 
averaged them out. 
The mean is our first stab at estimating a parameter for the true conversion rate. But we’d still like 
to know other possible values for our conversion rate. Let’s continue exploring the PDF to see what 
else we can learn. 
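The equivalence between the beta mean and a plain average of 1s and 0s can be confirmed with a few lines of R. The variable names below are illustrative; `beta_param` is used instead of `beta` to avoid shadowing R's built-in beta() function.

```r
alpha <- 300          # visitors who subscribed
beta_param <- 39700   # visitors who did not subscribe

# Mean of the beta distribution.
beta_mean <- alpha / (alpha + beta_param)

# The same data as individual observations: 1 for a subscriber, 0 otherwise.
visits <- c(rep(1, alpha), rep(0, beta_param))
observed_mean <- mean(visits)

print(c(beta_mean = beta_mean, observed_mean = observed_mean))
```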


Visualizing and Interpreting the PDF 

The PDF is usually the go-to function for understanding a distribution of probabilities. Figure 13-1 
illustrates the PDF for the blog conversion rate’s beta distribution. 

Figure 13-1: Visualizing the beta PDF for our beliefs in the true conversion rate (plot title: PDF Beta(300,39700)) 


What does this PDF represent? From the data we know that the blog’s average conversion rate is 
simply 

$$\frac{\text{subscribed}}{\text{visited}} = \frac{300}{40{,}000} = 0.0075$$

or the mean of our distribution. It seems unlikely that the conversion rate is exactly 0.0075 rather 
than, say, 0.00751. We know the total area under the curve of the PDF must add up to 1, since this 
PDF represents the probability of all possible estimates. We can estimate ranges of values for our 
true conversion rate by looking at the area under the curve for the ranges we care about. In 
calculus, this area under the curve is the integral, and it tells us how much of the total probability is 
in the region of the PDF we’re interested in. This is exactly like how we used integration with the 
normal distribution in the prior chapter. 

Given that we have uncertainty in our measurement, and we have a mean, it could be useful to 
investigate how much more likely it is that the true conversion rate is 0.001 higher or lower than 
the mean of 0.0075 we observed. Doing so would give us an acceptable margin of error (that is, 
we’d be happy with any values in this range). To do this, we can calculate the probability of the 
actual rate being lower than 0.0065, and the probability of the actual rate being higher than 0.0085, 
and then compare them. The probability that our conversion rate is actually much lower than our 
observations is calculated like so: 

$$P(\text{much lower}) = \int_{0}^{0.0065} \mathrm{Beta}(300, 39700) = 0.008$$

Remember that when we take the integral of a function, we are just summing all the little pieces of 
our function. So, if we take the integral from 0 to 0.0065 for the beta distribution with an a of 300 
and a (3 of 39,700, we are adding up all the probabilities for the values in this range and determining 
the probability that our true conversion rate is somewhere between 0 and 0.0065. 

We can ask questions about the other extreme as well, such as: how likely is it that we actually got 
an unusually bad sample and our true conversion rate is much higher, such as a value greater than, 
say, 0.0085 (meaning a better conversion rate than we had hoped)? 

$$P(\text{much higher}) = \int_{0.0085}^{1} \mathrm{Beta}(300, 39700) = 0.012$$

Here we are integrating from 0.0085 to the largest possible value, which is 1, to determine the 
probability that our true value lies somewhere in this range. So, in this example, the probability that 
our conversion rate is 0.001 higher or more than we observed is actually more likely than the 
probability that it is 0.001 less or worse than observed. This means that if we had to make a 
decision with the limited data we have, we could still calculate how much likelier one extreme is 
than the other: 

P(much higher) _ j oooffi Beta ( 300 > 39 7000) _ 0.012 _ r 
P(much lower) 0065 Beta (300,39700) 0008 

Thus, it’s 50 percent more likely that our true conversion rate is greater than 0.0085 than that it’s 
lower than 0.0065. 
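
The two integrals above can be approximated by literally "summing all the little pieces." The following is a minimal sketch, not the book's own code: the helper name riemann() and the slice counts are assumptions chosen for illustration, and only dbeta() comes from R itself.

```r
# Midpoint Riemann sum: split [a, b] into n slices and add up height * width.
# (riemann() is a hypothetical helper written for this sketch.)
riemann <- function(f, a, b, n) {
  h <- (b - a) / n
  mids <- a + (seq_len(n) - 0.5) * h
  sum(f(mids)) * h
}

beta_pdf <- function(x) dbeta(x, 300, 39700)

p_much_lower  <- riemann(beta_pdf, 0, 0.0065, 1e5)
p_much_higher <- riemann(beta_pdf, 0.0085, 1, 1e6)
ratio <- p_much_higher / p_much_lower

print(round(c(lower = p_much_lower, higher = p_much_higher), 3))
print(ratio)
```

The ratio of the unrounded probabilities comes out a little above 1.5, because 0.012 and 0.008 are themselves rounded values.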



Working with the PDF in R 

In this book we’ve already used two R functions for working with PDFs, dnorm() and dbeta(). For most 
well-known probability distributions, R supports an equivalent dfunction() function for calculating 
the PDF. 

Functions like dbeta() are also useful for approximating the continuous PDF—for example, when you 
want to quickly plot out values like these: 

# grid of conversion rates around the peak, each 0.00001 apart
xs <- seq(0.005,0.01,by=0.00001)
xs.all <- seq(0,1,by=0.0001)
# type='l' draws a line; lwd sets the line width
plot(xs,dbeta(xs,300,40000-300),type='l',lwd=3,
     ylab="density",
     xlab="probability of subscription",
     main="PDF Beta(300,39700)")

To understand the plotting code, see Appendix A.

In this example code, we’re creating a sequence of values that are each 0.00001 apart—small, but 
not infinitely small, as they would be in a truly continuous distribution. Nonetheless, when we plot 
these values, we see something that looks close enough to a truly continuous distribution (as shown
earlier in Figure 13-1).

INTRODUCING THE CUMULATIVE DISTRIBUTION FUNCTION 

The most common mathematical use of the PDF is in integration, to solve for probabilities 
associated with various ranges, just as we did in the previous section. However, we can save 
ourselves a lot of effort with the cumulative distribution function (CDF), which sums all parts of our 
distribution, replacing a lot of calculus work. 

The CDF takes in a value and returns the probability of getting that value or lower. For example, the 
CDF for Beta(300,39700) when x = 0.0065 is approximately 0.008. This means that the probability
of the true conversion rate being 0.0065 or less is 0.008. 

The CDF gets this probability by taking the cumulative area under the curve for the PDF (for those 
comfortable with calculus, the CDF is the anti-derivative of the PDF). We can summarize this 
process in two steps: (1) figure out the cumulative area under the curve for each value of the PDF, 
and (2) plot those values. That’s our CDF. The value of the curve at any given x-value is the 
probability of getting a value of x or lower. At 0.0065, the value of the curve would be 0.008, just as 
we calculated earlier. 

To understand how this works, let’s break the PDF for our problem into chunks of 0.0005 and focus 
on the region of our PDF that has the most probability density: the region between 0.006 and 0.009. 
Figure 13-2 shows the cumulative area under the curve for the PDF of Beta(300,39700). As you can 
see, our cumulative area under the curve takes into account all of the area in the pieces to its left. 





Visualizing the cumulative area under the curve 



Mathematically speaking, Figure 13-2 represents the following sequence of integrals: 

$$\int_{0}^{0.0065} \text{Beta}(300, 39700)$$

$$\int_{0}^{0.0065} \text{Beta}(300, 39700) + \int_{0.0065}^{0.007} \text{Beta}(300, 39700)$$

$$\int_{0}^{0.0065} \text{Beta}(300, 39700) + \int_{0.0065}^{0.007} \text{Beta}(300, 39700) + \int_{0.007}^{0.0075} \text{Beta}(300, 39700)$$

(And so on) 

Using this approach, as we move along the PDF, we take into account an increasingly higher 
probability until our total area is 1, or complete certainty. To turn this into the CDF, we can imagine 
a function that looks at only these areas under the curve. Figure 13-3 shows what happens if we 
plot the area under the curve for each of our points, which are 0.0005 apart. 
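
The running total described here can be sketched with a cumulative sum over a fine grid. This is only an approximation under assumed settings (the grid spacing dx and the variable names are choices made for this sketch); it uses dbeta() and cumsum() and nothing else.

```r
dx <- 1e-6
xs <- seq(0, 0.01, by = dx)
# Running total of (height * width) for every slice to the left of each x.
cumulative_area <- cumsum(dbeta(xs, 300, 39700)) * dx

# Read off the running total at the 0.0005-wide chunk boundaries.
chunk_edges <- seq(0.006, 0.009, by = 0.0005)
idx <- sapply(chunk_edges, function(e) which.min(abs(xs - e)))
print(round(data.frame(x = chunk_edges, area = cumulative_area[idx]), 4))
```

Each row's area includes everything to its left, which is why the values only ever increase as x grows.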

Now we have a way of visualizing just how the cumulative area under the curve changes as we 
move along the values for our PDF. Of course, the problem is that we’re using these discrete chunks. 
In reality, the CDF just uses infinitely small pieces of the PDF, so we get a nice smooth line 
(see Figure 13-4).

In our example, we derived the CDF visually and intuitively. Deriving the CDF mathematically is 
much more difficult, and often leads to very complicated equations. Luckily, we typically use code to 
work with the CDF, as we'll see in a few more sections. 

Figure 13-3: Plotting just the cumulative probability from Figure 13-2
(plot title: Visualizing just the cumulative probability)

[Plot: The cumulative distribution function]



Visualizing and Interpreting the CDF 

The PDF is most useful visually for quickly estimating where the peak of a distribution is, and for 
getting a rough sense of the width (variance) and shape of a distribution. However, with the PDF it 
is very difficult to reason about the probability of various ranges visually. The CDF is a much better 
tool for this. For example, we can use the CDF in Figure 13-4 to visually reason about a much wider 
range of probabilistic estimates for our problem than we can using the PDF alone. Let’s go over a 
few visual examples of how we can use this amazing mathematical tool. 

Finding the Median 

The median is the point in the data at which half the values fall on one side and half on the other—it 
is the exact middle value of our data. In other words, the probability of a value being greater than 
the median and the probability of it being less than the median are both 0.5. The median is 
particularly useful for summarizing the data in cases where it contains extreme values. 

Unlike the mean, computing the median can actually be pretty tricky. For small, discrete cases, it’s 
as simple as putting your observations in order and selecting the value in the middle. But for 
continuous distributions like our beta distribution, it’s a little more complicated. 















Thankfully, we can easily spot the median on a visualization of the CDF. We can simply draw a line 
from the point where the cumulative probability is 0.5, meaning 50 percent of the values are below 
this point and 50 percent are above. As Figure 13-5 illustrates, the point where this line intersects
the x-axis gives us our median! 

Estimating median 



We can see that the median for our data is somewhere between 0.007 and 0.008 (this happens to be 
very close to the mean of 0.0075, meaning the data isn’t particularly skewed).
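
The same "draw a line at 0.5" idea can be mimicked in code by scanning a grid of CDF values. This is a rough sketch, assuming the grid used for the PDF plot earlier; it relies on pbeta(), R's CDF function for the beta distribution, and the variable names are illustrative.

```r
xs <- seq(0.005, 0.01, by = 0.00001)
cdf <- pbeta(xs, 300, 39700)
# First grid point at which the cumulative probability reaches 0.5.
median_estimate <- xs[which(cdf >= 0.5)[1]]
print(median_estimate)
```

The answer is only as precise as the grid spacing, which is the code equivalent of eyeballing the plot.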

Approximating Integrals Visually 

When working with ranges of probabilities, we’ll often want to know the probability that the true 
value lies somewhere between some value y and some value x. 

We can solve this kind of problem using integration, but even if R makes solving integrals easier, it’s 
very time-consuming to make sense of the data and to constantly rely on R to compute integrals. 
Since all we want is a rough estimate that the probability of a visitor subscribing to the blog falls 
within a particular range, we don’t need to use integration. The CDF makes it very easy to eyeball 
whether or not a certain range of values has a very high probability or a very low probability of 
occurring. 

To estimate the probability that the conversion rate is between 0.0075 and 0.0085, we can trace 
lines from the x-axis at these points, then see where they meet up with the y-axis. The distance 
between the two points is the approximate integral, as shown in Figure 13-6 . 




Figure 13-6: Visually performing integration using the CDF
(plot title: Estimating P(x > 0.0075 and x < 0.0085); x-axis: Probability of subscription)


We can see that on the y-axis these values range from roughly 0.5 to 0.99, meaning that there is 
approximately a 49 percent chance that our true conversion rate lies somewhere between these 
two values. The best part is we didn’t have to do any integration! This is, of course, because the CDF
represents the integral from the minimum of our function up to each possible value.

So, since nearly all of the probabilistic questions about a parameter estimate involve knowing the 
probability associated with certain ranges of beliefs, the CDF is often a far more useful visual tool 
than the PDF. 
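
The visual estimate of the probability between 0.0075 and 0.0085 amounts to subtracting two CDF values. A minimal sketch using pbeta() (the variable name is illustrative):

```r
# Probability that the rate lies between 0.0075 and 0.0085:
# CDF at the upper point minus CDF at the lower point.
p_between <- pbeta(0.0085, 300, 39700) - pbeta(0.0075, 300, 39700)
print(p_between)
```

The result should land close to the 49 percent read off the plot.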

Estimating Confidence Intervals 

Looking at the probability of ranges of values leads us to a very important concept in probability: 
the confidence interval. A confidence interval is a lower and upper bound of values, typically 
centered on the mean, describing a range of high probability, usually 95, 99, or 99.9 percent. When 
we say something like "The 95 percent confidence interval is from 12 to 20," what we mean is that 
there is a 95 percent probability that our true measurement is somewhere between 12 and 20. 
Confidence intervals provide a good method of describing the range of possibilities when we’re 
dealing with uncertain information. 


NOTE

In Bayesian statistics what we are calling a “confidence interval” can go by a few other names, such as
“credible region” or “credible interval.” In some more traditional schools of statistics, “confidence
interval” has a slightly different meaning, which is beyond the scope of this book.

We can estimate confidence intervals using the CDF. Say we wanted to know the range that covers
80 percent of the possible values for the true conversion rate. We solve this problem by combining
our previous approaches: we draw lines from the y-axis at 0.1 and 0.9 to cover 80 percent, and then
simply see where on the x-axis these intersect with our CDF, as shown in Figure 13-7.

Figure 13-7: Estimating our confidence intervals visually using the CDF
(plot title: Estimating 80 percent confidence interval; x-axis: Probability of subscription)


As you can see, the x-axis is intersected at roughly 0.007 and 0.008, which means that there’s an 80 
percent chance that our true conversion rate falls somewhere between these two values. 
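
As a check on the visual estimate, the 0.1 and 0.9 cutoffs can be computed directly. This sketch uses qbeta(), the quantile function covered later in this chapter, and then confirms the coverage with pbeta(); the variable names are illustrative.

```r
# x-values where the CDF reaches 0.1 and 0.9.
bounds <- qbeta(c(0.1, 0.9), 300, 39700)
# Probability mass between the two bounds; should be 0.8.
coverage <- diff(pbeta(bounds, 300, 39700))
print(bounds)
print(coverage)
```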

Using the CDF in R 

Just as nearly all major PDFs have a function starting with d, like dnorm(), CDF functions start with p, 
such as pnorm(). In R, to calculate the probability that Beta(300,39700) is less than 0.0065, we can 
simply call pbeta() like this:


pbeta(0.0065,300,39700) 
> 0.007978686 


And to calculate the true probability that the conversion rate is greater than 0.0085, we can do the 
following: 


pbeta(1,300,39700) - pbeta(0.0085,300,39700)
> 0.01248151 


The great thing about CDFs is that it doesn’t matter if your distribution is discrete or continuous. If 
we wanted to determine the probability of getting three or fewer heads in five coin tosses, for 
example, we would use the CDF for the binomial distribution like this:


pbinom(3,5,0.5)
> 0.8125
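
For a discrete distribution, the CDF is just a sum of individual probabilities. A short sketch comparing pbinom() with the equivalent sum of dbinom() values (the variable names are illustrative):

```r
# CDF: probability of 3 or fewer heads in 5 fair tosses.
by_cdf <- pbinom(3, 5, 0.5)
# Same thing, adding up P(0 heads) + P(1 head) + P(2 heads) + P(3 heads).
by_sum <- sum(dbinom(0:3, 5, 0.5))
print(c(by_cdf, by_sum))
```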


THE QUANTILE FUNCTION 

You might have noticed that the median and confidence intervals we took visually with the CDF are 
not easy to do mathematically. With the visualizations, we simply drew lines from the y-axis and 
used those to find a point on the x-axis. 

Mathematically, the CDF is like any other function in that it takes an x value, often representing the 
value we’re trying to estimate, and gives us a y value, which represents the cumulative probability.
But there is no obvious way to do this in reverse; that is, we can’t give the same function a y to get 
an x. As an example, imagine we have a function that squares values. We know that square(3) = 9, 
but we need an entirely new function—the square root function—to know that the square root of 9 
is 3. 

However, reversing the function is exactly what we did in the previous section to estimate the 
median: we looked at the y-axis for 0.5, then traced it back to the x-axis. What we’ve done visually is 
compute the inverse of the CDF. 

While computing the inverse of the CDF visually is easy for estimates, we need a separate 
mathematical function to compute it for exact values. The inverse of the CDF is an incredibly 
common and useful tool called the quantile function. To compute an exact value for our median and 
confidence interval, we need to use the quantile function for the beta distribution. Just like the CDF, 
the quantile function is often very tricky to derive and use mathematically, so instead we rely on 
software to do the hard work for us. 
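
The inverse relationship can be checked with a round trip: push an x through the CDF, then push the result back through the quantile function. A minimal sketch using pbeta() and qbeta(), with illustrative variable names:

```r
x <- 0.0075
y <- pbeta(x, 300, 39700)        # CDF: x -> cumulative probability
x_back <- qbeta(y, 300, 39700)   # quantile function: cumulative probability -> x
print(c(x = x, y = y, x_back = x_back))

# The median is the x whose cumulative probability is exactly 0.5.
median_exact <- qbeta(0.5, 300, 39700)
print(median_exact)
```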

Visualizing and Understanding the Quantile Function 

Because the quantile function is simply the inverse of the CDF, it just looks like the CDF rotated 90 
degrees, as shown in Figure 13-8.




Quantile function Beta(300,39700) 



Whenever you hear phrases like:

"The top 10 percent of students ..."

"The bottom 20 percent of earners earn less than ..."

"The top quartile has notably better performance than ..."

you’re talking about values that are found using the quantile function. To look up a quantile 
visually, just find the quantity you’re interested in on the x-axis and see where it meets the y-axis. 
The value on the y-axis is the value for that quantile. Keep in mind that if you’re talking about the 
"top 10 percent," you really want the 0.9 quantile. 

Calculating Quantiles in R 

R also includes the function qnorm() for calculating quantiles. This function is very useful for quickly 
answering questions about what values are bounds of our probability distribution. For example, if 
we want to know the value that 99.9 percent of the distribution is less than, we can use qbeta() with
the quantile we’re interested in calculating as the first argument, and the alpha and beta 
parameters of our beta distribution as the second and third arguments, like so: 


qbeta(0.999,300,39700) 
> 0.008903462 


The result is 0.0089, meaning we can be 99.9 percent certain that the true conversion rate for our 
emails is less than 0.0089. We can then use the quantile function to quickly calculate exact values 
for confidence intervals for our estimates. To find the 95 percent confidence interval, we can find
the values greater than the 2.5 percent lower quantile and the values lower than the 97.5 percent
upper quantile, and the interval between them is the 95 percent confidence interval (the
unaccounted region totals 5 percent of the probability density, 2.5 percent at each extreme). We can
easily calculate these for our data with qbeta():

Our lower bound is qbeta(0.025,300,39700) = 0.0066781 
Our upper bound is qbeta(0.975,300,39700) = 0.0083686 

Now we can confidently say that we are 95 percent certain that the real conversion rate for blog 
visitors is somewhere between 0.67 percent and 0.84 percent. 

We can, of course, increase or decrease these thresholds depending on how certain we want to be. 
Now that we have all of the tools of parameter estimation, we can easily pin down an exact range 
for the conversion rate. The great news is that we can also use this to predict ranges of values for 
future events. 

Suppose an article on your blog goes viral and gets 100,000 visitors. Based on our calculations, we 
know that we should expect between 670 and 840 new email subscribers. 
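
The subscriber range comes from scaling the interval bounds by the number of visitors. A small sketch of that arithmetic (the variable names are illustrative; it scales only the uncertainty in the rate and does not model any extra randomness in the visitors themselves):

```r
visitors <- 100000
# 95 percent interval for the conversion rate.
rate_bounds <- qbeta(c(0.025, 0.975), 300, 39700)
# Scale the rate bounds up to a count of subscribers.
subscriber_bounds <- visitors * rate_bounds
print(round(subscriber_bounds))
```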

WRAPPING UP 

We’ve covered a lot of ground and touched on the interesting relationship between the probability 
density function (PDF), cumulative distribution function (CDF), and the quantile function. These 
tools form the basis of how we can estimate parameters and calculate our confidence in those 
estimations. That means we can not only make a good guess as to what an unknown value might be, 
but also determine confidence intervals that very strongly represent the possible values for a 
parameter. 

EXERCISES 

Try answering the following questions to see how well you understand the tools of parameter 
estimation. The solutions can be found at https://nostarch.com/learnbayes/.

1. Using the code example for plotting the PDF on page 127, plot the CDF and quantile
functions. 

2. Returning to the task of measuring snowfall from Chapter 10, say you have the
following measurements (in inches) of snowfall: 

7.8, 9.4, 10.0, 7.9, 9.4, 7.0, 7.0, 7.1, 8.9, 7.4 

What is your 99.9 percent confidence interval for the true value of snowfall? 

3. A child is going door to door selling candy bars. So far she has visited 30 houses and 
sold 10 candy bars. She will visit 40 more houses today. What is the 95 percent confidence 
interval for how many candy bars she will sell the rest of the day? 
PARAMETER ESTIMATION WITH PRIOR PROBABILITIES



In the previous chapter, we looked at using some important mathematical tools to estimate the 
conversion rate for blog visitors subscribing to an email list. However, we haven’t yet covered one 
of the most important parts of parameter estimation: using our existing beliefs about a problem. 

In this chapter, you'll see how we can use our prior probabilities, combined with observed data, to 
come up with a better estimate that blends existing knowledge with the data we've collected. 


PREDICTING EMAIL CONVERSION RATES 


To understand how the beta distribution changes as we gain information, let’s look at another 
conversion rate. In this example, we'll try to figure out the rate at which your subscribers click a 
given link once they’ve opened an email from you. Most companies that provide email list 
management services tell you, in real time, how many people have opened an email and clicked the
link.


Our data so far tells us that of the first five people that open an email, two of them click the 
link. Figure 14-1 shows our beta distribution for this data. 



Beta(2,3) likelihood for possible conversion rates 



Figure 14-1 shows Beta(2,3). We used these numbers because two people clicked and three did not 
click. Unlike in the previous chapter, where we had a pretty narrow spike in possible values, here 
we have a huge range of possible values for the true conversion rate because we have very little 
information to work with. Figure 14-2 shows the CDF for this data, to help us more easily reason 
about these probabilities. 

The 95 percent confidence interval (i.e., a 95 percent chance that our true conversion rate is 
somewhere in that range) is marked to make it easier to see. At this point our data tells us that the 
true conversion rate could be anything between 0.05 and 0.8! This is a reflection of how little 
information we’ve actually acquired so far. Given that we've had two conversions, we know the true 
rate can’t be 0, and since we’ve had three non-conversions, we also know it can’t be 1. Almost 
everything else is fair game. 









CDF for Beta(2,3) 



TAKING IN WIDER CONTEXT WITH PRIORS 

But wait a second—you may be new to email lists, but an 80 percent click-through rate sounds 
pretty unlikely. I subscribe to plenty of lists, but I definitely don’t click through to the content 80 
percent of the time that I open the email. Taking that 80 percent rate at face value seems naive 
when I consider my own behavior. 

As it turns out, your email service provider thinks it’s suspicious too. Let’s look at some wider 
context. For blogs listed in the same category as yours, the provider’s data claims that on average 
only 2.4 percent of people who open emails click through to the content. 

In Chapter 9, you learned how we could use past information to modify our belief that Han Solo can
successfully navigate an asteroid field. Our data tells us one thing, but our background information 
tells us another. As you know by now, in Bayesian terms the data we have observed is 
our likelihood, and the external context information—in this case from our personal experience and 
our email service—is our prior probability. Our challenge now is to figure out how to model our 
prior. Luckily, unlike the case with Han Solo, we actually have some data here to help us. 

The conversion rate of 2.4 percent from your email provider gives us a starting point: now we know 
we want a beta distribution whose mean is roughly 0.024. (The mean of a beta distribution is α / (α
+ β).) However, this still leaves us with a range of possible options: Beta(1,41), Beta(2,80),
Beta(5,200), Beta(24,976), and so on. So which should we use? Let’s plot some of these out and see
what they look like (Figure 14-3).
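
To see that these candidates share roughly the same mean but differ in spread, we can tabulate them. This sketch uses the standard formulas for the mean and standard deviation of a beta distribution; the data frame layout and column names are choices made for illustration.

```r
priors <- data.frame(alpha = c(1, 2, 5, 24), beta = c(41, 80, 200, 976))
total <- priors$alpha + priors$beta
# Mean of Beta(alpha, beta) is alpha / (alpha + beta).
priors$mean <- priors$alpha / total
# Standard deviation shrinks as alpha + beta grows.
priors$sd <- sqrt(priors$alpha * priors$beta / (total^2 * (total + 1)))
print(round(priors, 4))
```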




















[Figure 14-3: Possible priors for email conversion rates; distributions shown: Beta(1,41), Beta(2,80), Beta(5,200)]



As you can see, the lower the combined α + β, the wider our distribution. The problem now is that
even the most liberal option we have, Beta(1,41), seems a little too pessimistic, as it puts a lot of our
probability density in very low values. We'll stick with this distribution nonetheless, since it is
based on the 2.4 percent conversion rate in the data from the email provider, and is the weakest of
our priors. Being a "weak" prior means it will be more easily overridden by actual data as we collect
more of it. A stronger prior, like Beta(5,200), would take more evidence to change (we’ll see how
this happens next). Deciding whether or not to use a strong prior is a judgment call based on how
well you expect the prior data to describe what you’re currently doing. As we’ll see, even a weak
prior can help keep our estimates more realistic when we’re working with small amounts of data.

Remember that, when working with the beta distribution, we can calculate our posterior
distribution (the combination of our likelihood and our prior) by simply adding together the
parameters for the two beta distributions:

$$\text{Beta}(\alpha_{\text{posterior}}, \beta_{\text{posterior}}) = \text{Beta}(\alpha_{\text{likelihood}} + \alpha_{\text{prior}}, \beta_{\text{likelihood}} + \beta_{\text{prior}})$$
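
Applied to our example, the likelihood Beta(2,3) combined with the prior Beta(1,41) gives a posterior of Beta(3,44). A minimal sketch of that bookkeeping, where the named vectors and the helper beta_mean() are hypothetical conveniences for this illustration:

```r
likelihood <- c(alpha = 2, beta = 3)   # 2 clicks, 3 non-clicks
prior      <- c(alpha = 1, beta = 41)  # weak prior with mean near 0.024
# Posterior parameters are the element-wise sums.
posterior  <- likelihood + prior

# Mean of a beta distribution: alpha / (alpha + beta).
beta_mean <- function(p) unname(p["alpha"] / (p["alpha"] + p["beta"]))

print(posterior)
print(c(no_prior = beta_mean(likelihood), with_prior = beta_mean(posterior)))
```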



















Using this formula, we can compare our beliefs with and without priors, as shown in Figure 14-4.

Figure 14-4: Comparing our likelihood (no prior) to our posterior (with prior)
(plot title: Estimates of conversion rate with and without prior; legend: With Prior, No Prior)

Wow! That’s quite sobering. Even though we’re working with a relatively weak prior, we can see 
that it has made a huge impact on what we believe are realistic conversion rates. Notice that for the 
likelihood with no prior, we have some belief that our conversion rate could be as high as 80 
percent. As mentioned, this is highly suspicious; any experienced email marketer would tell you 
that an 80 percent conversion rate is unheard of. Adding a prior to our likelihood adjusts our
beliefs so that they become much more reasonable. But I still think our updated beliefs are a bit 
pessimistic. Maybe the email’s true conversion rate isn’t 40 percent, but it still might be better than 
this current posterior distribution suggests. 

How can we prove that our blog has a better conversion rate than the sites in the email provider’s 
data, which have a 2.4 percent conversion rate? The way any rational person does: with more data! 
We wait a few hours to gather more results and now find that out of 100 people who opened your 
email, 25 have clicked the link! Let’s look at the difference between our new posterior and 
likelihood, shown in Figure 14-5.

Figure 14-5: Updating our beliefs with more data
(plot title: Estimates of conversion rate after more observations with and without prior; legend: With Prior, No Prior)


As we continue to collect data, we see that our posterior distribution using a prior is starting to shift 
toward the one without the prior. Our prior is still keeping our ego in check, giving us a more 
conservative estimate for the true conversion rate. However, as we add evidence to our likelihood, 
it starts to have a bigger impact on what our posterior beliefs look like. In other words, the 
additional observed data is doing what it should: slowly swaying our beliefs to align with what it 
suggests. So let’s wait overnight and come back with even more data! 

In the morning we find that 300 subscribers have opened their email, and 86 of those have clicked 
through. Figure 14-6 shows our updated beliefs. 

What we’re witnessing here is the most important point about Bayesian statistics: the more data we 
gather, the more our prior beliefs become diminished by evidence. When we had almost no 
evidence, our likelihood proposed some rates we know are absurd (e.g., 80 percent click-through), 
both intuitively and from personal experience. In light of little evidence, our prior beliefs squashed 
any data we had. 

But as we continue to gather data that disagrees with our prior, our posterior beliefs shift toward 
what our own collected data tells us and away from our original prior. 
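
We can watch this shift numerically by comparing the mean estimate with and without the Beta(1,41) prior at each of the three stages of data collection described above. This is a sketch with illustrative variable names; it compares only the means, not the full distributions.

```r
prior_alpha <- 1
prior_beta  <- 41
clicks <- c(2, 25, 86)    # clicks observed at each stage
opens  <- c(5, 100, 300)  # emails opened at each stage

no_prior_mean   <- clicks / opens
with_prior_mean <- (clicks + prior_alpha) / (opens + prior_alpha + prior_beta)
# How far the prior pulls the estimate down at each stage.
gap <- no_prior_mean - with_prior_mean

print(round(data.frame(opens, no_prior_mean, with_prior_mean, gap), 4))
```

The gap shrinks at every stage, which is the prior being gradually outweighed by the data.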

















Another important takeaway is that we started with a pretty weak prior. Even then, after just a day 
of collecting a relatively small set of information, we were able to find a posterior that seems much, 
much more reasonable. 

[Figure 14-6: Estimates converging with more data, with and without prior; legend: With Prior, No Prior]



The prior probability distribution in this case helped tremendously with keeping our estimate 
much more realistic in the absence of data. This prior probability distribution was based on real 
data, so we could be fairly confident that it would help us get our estimate closer to reality. 
However, in many cases we simply don’t have any data to back up our prior. So what do we do 
then? 

PRIOR AS A MEANS OF QUANTIFYING EXPERIENCE 

Because we knew the idea of an 80 percent click-through rate for emails was laughable, we used 
data from our email provider to come up with a better estimate for our prior. However, even if we 
didn’t have data to help establish our prior, we could still ask someone with a marketing 
background to help us make a good estimate. A marketer might know from personal experience 
that you should expect about a 20 percent conversion rate, for example. 


















Given this information from an experienced professional, you might choose a relatively weak prior 
like Beta(2,8) to suggest that the expected conversion rate should be around 20 percent. This 
distribution is just a guess, but the important thing is that we can quantify this assumption. For 
nearly every business, experts can often provide powerful prior information based simply on 
previous experience and observation, even if they have no training in probability specifically. 

By quantifying this experience, we can get more accurate estimates and see how they can change 
from expert to expert. For example, if a marketer is certain that the true conversion rate should be 
20 percent, we might model this belief as Beta(200,800). As we gather data, we can compare 
models and create multiple confidence intervals that quantitatively model any expert beliefs. 
Additionally, as we gain more and more information, the difference due to these prior beliefs will 
decrease. 
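
To make the difference between the two experts concrete, we can compare the 95 percent intervals implied by each prior before any data arrives. A small sketch using qbeta(), with illustrative variable names:

```r
# Both priors have a mean of 0.2, but very different certainty.
weak_interval   <- qbeta(c(0.025, 0.975), 2, 8)
strong_interval <- qbeta(c(0.025, 0.975), 200, 800)

print(round(rbind(weak = weak_interval, strong = strong_interval), 3))
print(c(weak_width = diff(weak_interval), strong_width = diff(strong_interval)))
```

The confident marketer's interval is far narrower, so it would take much more data to move that belief.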

IS THERE A FAIR PRIOR TO USE WHEN WE KNOW NOTHING? 

There are certain schools of statistics that teach that you should always add 1 to both α and β when
estimating parameters with no other prior. This corresponds to using a very weak prior that holds
that each outcome is equally likely: Beta(1,1). The argument is that this is the "fairest" (i.e.,
weakest) prior we can come up with in the absence of information. The technical term for a fair
prior is a noninformative prior. Beta(1,1) is illustrated in Figure 14-7.

Figure 14-7: The noninformative prior Beta(1,1)
(plot title: Noninformative prior Beta(1,1); x-axis: Conversion rate)







As you can see, this is a perfectly straight line, so that all outcomes are then equally likely and the 
mean likelihood is 0.5. The idea of using a noninformative prior is that we can add a prior to help 
smooth out our estimate, but that prior isn’t biased toward any particular outcome. However, while 
this may initially seem like the fairest way to approach the problem, even this very weak prior can 
lead to some strange results when we test it out. 

Take, for example, the probability that the sun will rise tomorrow. Say you are 30 years old, and so 
you’ve experienced about 11,000 sunrises in your lifetime. Now suppose someone asks the 
probability that the sun will rise tomorrow. You want to be fair and use a noninformative prior,
Beta(1,1). The distribution that represents your belief that the sun will not rise tomorrow would be
Beta(1,11001), based on your experiences. While this gives a very low probability for the sun not
rising tomorrow, it also suggests that we would expect to see the sun not rise at least once by the 
time you reach 60 years old. The so-called "noninformative" prior is providing a pretty strong 
opinion about how the world works! 
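
The arithmetic behind this claim is short. The sketch below assumes the next 30 years contain about as many days as the first 30 and uses the posterior mean as the chance of a missed sunrise on any given day; the variable names are illustrative.

```r
sunrises <- 11000
# Posterior after adding the Beta(1,1) prior to 0 failures and 11,000 sunrises.
alpha_post <- 0 + 1
beta_post  <- sunrises + 1

# Mean probability that the sun does not rise on a given day.
p_no_sunrise <- alpha_post / (alpha_post + beta_post)
# Expected number of missed sunrises over roughly the next 11,000 days.
expected_misses <- sunrises * p_no_sunrise

print(p_no_sunrise)
print(expected_misses)
```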

You could argue that this is only a problem because we understand celestial mechanics, so we 
already have strong prior information we can’t forget. But the real problem is that we’ve never 
observed the case where the sun doesn’t rise. If we go back to our likelihood function without the 
noninformative prior, we get Beta(0,11000). 

However, when either α or β ≤ 0, the beta distribution is undefined, which means that the correct
answer to "What is the probability that the sun will rise tomorrow?" is that the question doesn’t 
make sense because we’ve never seen a counterexample. 

As another example, suppose you found a portal that transported you and a friend to a new world. 
An alien creature appears before you and fires a strange-looking gun at you that just misses. Your 
friend asks you, "What’s the probability that the gun will misfire?" This is a completely alien world 
and the gun looks strange and organic, so you know nothing about its mechanics at all. 

This is, in theory, the ideal scenario for using a noninformative prior, since you have absolutely no 
prior information about this world. If you add your noninformative prior, you get a posterior 
Beta(1,2) probability that the gun will misfire (we observed α = 0 misfires and β = 1 successful
fires). This distribution tells us the mean posterior probability of a misfire is 1/3, which seems
astoundingly high given that you don’t even know if the strange gun can misfire. Again, even though
Beta(0,1) is undefined, using it seems like the rational approach to this problem. In the absence of
sufficient data and any prior information, your only honest option is to throw your hands in the air 
and tell your friend, "I have no clue how to even reason about that question!" 

The best priors are backed by data, and there is never really a true "fair" prior when you have a 
total lack of data. Everyone brings to a problem their own experiences and perspective on the 
world. The value of Bayesian reasoning, even when you are subjectively assigning priors, is that you 
are quantifying your subjective belief. As we’ll see later in the book, this means you can compare 
your prior to other people’s and see how well it explains the world around you. A Beta(1,1) prior is
sometimes used in practice, but you should use it only when you earnestly believe that the two 
possible outcomes are, as far as you know, equally likely. Likewise, no amount of mathematics can 
make up for absolute ignorance. If you have no data and no prior understanding of a problem, the 
only honest answer is to say that you can’t conclude anything at all until you know more. 

All that said, it’s worth noting that this topic of whether to use Beta(1,1) or Beta(0,0) has a long
history, with many great minds arguing various positions. Thomas Bayes (namesake of Bayes’
theorem) hesitantly believed in Beta(1,1); the great mathematician Pierre-Simon Laplace was quite
certain Beta(1,1) was correct; and the famous economist John Maynard Keynes thought using
Beta(1,1) was so preposterous that it discredited all of Bayesian statistics!



WRAPPING UP 

In this chapter, you learned how to incorporate prior information about a problem to arrive at 
much more accurate estimates for unknown parameters. When we have only a little information 
about a problem, we can easily get probabilistic estimates that seem impossible. But we might have 
prior information that can help us make better inferences from that small amount of data. By 
adding this information to our estimates, we get much more realistic results. 

Whenever possible, it’s best to use a prior probability distribution based on actual data. However, 
often we won’t have data to support our problem, but we either have personal experience or can 
turn to experts who do. In these cases, it’s perfectly fine to estimate a probability distribution that 
corresponds to your intuition. Even if you’re wrong, you’ll be wrong in a way that is recorded 
quantitatively. Most important, even if your prior is wrong, it will eventually be overruled by data 
as you collect more observations. 
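
To make the last point concrete, here is a minimal sketch in R. The two priors, the assumption that exactly half of the observations are successes, and all variable names are hypothetical choices made only for illustration; they do not come from the chapter's examples.

```r
# Two hypothetical priors that disagree strongly about the rate
skeptic <- c(alpha = 2, beta = 8)   # prior mean 0.2
believer <- c(alpha = 8, beta = 2)  # prior mean 0.8

# Posterior mean after n observations, assuming half of them are successes
posterior.mean <- function(prior, n) {
  successes <- n / 2
  unname((prior["alpha"] + successes) / (prior["alpha"] + prior["beta"] + n))
}

sample.sizes <- c(0, 10, 100, 1000, 10000)
gap <- sapply(sample.sizes, function(n) {
  posterior.mean(believer, n) - posterior.mean(skeptic, n)
})

# The disagreement between the two posteriors shrinks as data accumulates
print(data.frame(n = sample.sizes, gap = gap))
```

With these particular priors the gap works out to 6 / (10 + n), so the two people end up with nearly identical estimates once n is large, no matter how far apart they started.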


EXERCISES 


Try answering the following questions to see how well you understand priors. The solutions can be 
found at https://nostarch.com/learnbayes/. 


1. Suppose you’re playing air hockey with some friends and flip a coin to see who starts 
with the puck. After playing 12 times, you realize that the friend who brings the coin almost 
always seems to go first: 9 out of 12 times. Some of your other friends start to get suspicious. 
Define prior probability distributions for the following beliefs: 


• One person who weakly believes that the friend is cheating and the true rate of 
coming up heads is closer to 70 percent. 


• One person who very strongly trusts that the coin is fair and has a 50 
percent chance of coming up heads. 


• One person who strongly believes the coin is biased to come up heads 70 
percent of the time. 


2. To test the coin, you flip it 20 more times and get 9 heads and 11 tails. Using the 
priors you calculated in the previous question, what are the updated posterior beliefs in the 
true rate of flipping a heads in terms of the 95 percent confidence interval? 



PART IV 

HYPOTHESIS TESTING: THE HEART OF STATISTICS 
FROM PARAMETER ESTIMATION TO HYPOTHESIS TESTING: 

BUILDING A BAYESIAN A/B TEST 



In this chapter, we’re going to build our first hypothesis test, an A/B test. Companies often use A/B 
tests to try out product web pages, emails, and other marketing materials to determine which will 
work best for customers. In this chapter, we'll test our belief that removing an image from an email 
will increase the click-through rate against the belief that removing it will hurt the click-through 
rate. 

Since we already know how to estimate a single unknown parameter, all we need to do for our test 
is estimate both parameters—that is, the conversion rates of each email. Then we’ll use R to run a 
Monte Carlo simulation and determine which hypothesis is likely to perform better—in other 
words, which variant, A or B, is superior. A/B tests can be performed using classical statistical 
techniques such as t-tests, but building our test the Bayesian way will help us understand each part 
of it intuitively and give us more useful results as well. 

We've covered the basics of parameter estimation pretty well at this point. We’ve seen how to use 
the PDF, CDF, and quantile functions to learn the likelihood of certain values, and we've seen how to 
add a Bayesian prior to our estimate. Now we want to use our estimates to compare two unknown 
parameters. 


SETTING UP A BAYESIAN A/B TEST 


Keeping with our email example from the previous chapter, imagine we want to see whether 
adding an image helps or hurts the conversion rate for our blog. Previously, the weekly email has 
included some image. For our test we’re going to send one variant with images like usual, and 
another without images. The test is called an A/B test because we are comparing variant A (with 
image) and variant B (without) to determine which one performs better. 

Let’s assume at this point we have 600 blog subscribers. Because we want to exploit the knowledge 
gained during this experiment, we’re only going to be running our test on 300 of them; that way, we 
can send the remaining 300 subscribers what we believe to be the most effective variant of the 
email. 

The 300 people we’re going to test will be split up into two groups, A and B. Group A will receive 
the usual email with a big picture at the top, and group B will receive an email with no picture. The 
hope is that a simpler email will feel less "spammy" and encourage users to click through to the 
content. 


Finding Our Prior Probability 

Next, we need to figure out what prior probability we’re going to use. We’ve run an email campaign 
every week, so from that data we have a reasonable expectation that the probability of clicking the 
link to the blog on any given email should be around 30 percent. To make things simple, we’ll use 
the same prior for both variants. We’ll also choose a pretty weak version of our prior distribution, 
meaning that it considers a wider range of conversion rates to be probable. We’re using a weak 
prior because we don’t really know how well we expect B to do, and this is a new email campaign, 
so other factors could cause a better or worse conversion. We'll settle on Beta(3,7) for our prior 
probability distribution. This distribution has a mean of 0.3 but still treats a wide range of 
alternative rates as plausible. We can see this distribution in Figure 15-1. 

[Plot: "Weak Prior Belief in Conversion Rate Beta(3,7)"; x-axis: Conversion rate] 

Figure 15-1: Visualizing our prior probability distribution 
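
As a rough numerical companion to the plot, the following sketch summarizes the Beta(3,7) prior with R's built-in beta functions. The choice of a central 95 percent interval is an assumption made here for illustration; the text only describes the prior as weak.

```r
# Prior chosen in the text: Beta(3, 7)
prior.alpha <- 3
prior.beta <- 7

# The mean of a beta distribution is alpha / (alpha + beta)
prior.mean <- prior.alpha / (prior.alpha + prior.beta)

# Central 95 percent interval of the prior, via the quantile function
prior.interval <- qbeta(c(0.025, 0.975), prior.alpha, prior.beta)

print(prior.mean)
print(prior.interval)
```

A wide interval around the mean is what makes this a weak prior: it centers on 0.3 without ruling out much lower or much higher conversion rates.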

All we need now is our likelihood, which means we need to collect data. 

Collecting Data 


We send out our emails and get the results in Table 15-1. 

Table 15-1: Email Click-through Rates 

            Clicked   Not clicked   Observed conversion rate 
Variant A   36        114           0.24 
Variant B   50        100           0.33 






We can treat each of these variants as a separate parameter we’re trying to estimate. In order to 
arrive at a posterior distribution for each, we need to combine both their likelihood distribution 
and prior distribution. We've already decided that the prior for these distributions should be 
Beta(3,7), representing a relatively weak belief in what possible values we expect the conversion 
rate to be, given no additional information. We say this is a weak belief because we don’t believe 
very strongly in a particular range of values, and consider all possible rates with a reasonably high 
probability. For the likelihood of each, we’ll again use the beta distribution, making α the number of 
times the link was clicked through and β the number of times it was not. 

Recall that: 


$$\mathrm{Beta}(\alpha_{\text{posterior}}, \beta_{\text{posterior}}) = \mathrm{Beta}(\alpha_{\text{prior}} + \alpha_{\text{likelihood}}, \beta_{\text{prior}} + \beta_{\text{likelihood}})$$ 


Variant A will be represented by Beta(36+3,114+7) and variant B by Beta(50+3,100+7). Figure 15-2 
shows the estimates for each parameter side by side. 

[Figure 15-2: Parameter estimation variants A and B] 
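
The update can be sketched in a few lines of R. The variable names and the 95 percent intervals are illustrative assumptions; the counts and the prior are the ones given above.

```r
# Prior from the text
prior.alpha <- 3
prior.beta <- 7

# Observed data from Table 15-1
clicked <- c(A = 36, B = 50)
not.clicked <- c(A = 114, B = 100)

# Conjugate update: add the observed counts to the prior parameters
posterior.alpha <- prior.alpha + clicked
posterior.beta <- prior.beta + not.clicked

# Posterior mean of each conversion rate
posterior.mean <- posterior.alpha / (posterior.alpha + posterior.beta)

# Central 95 percent interval for each variant
posterior.interval <- rbind(
  lower = qbeta(0.025, posterior.alpha, posterior.beta),
  upper = qbeta(0.975, posterior.alpha, posterior.beta)
)

print(posterior.mean)
print(posterior.interval)
```

Comparing the two intervals is a quick way to see how much the plausible ranges for A and B overlap before running any simulation.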



Clearly, our data suggests that variant B is superior, in that it garners a higher conversion rate. 
However, from our earlier discussion on parameter estimation, we know that the true conversion 
rate is one of a range of possible values. We can also see here that there’s an overlap between the 
possible true conversion rates for A and B. What if we were just unlucky in our A responses, and A’s 
true conversion rate is in fact much higher? What if we were also just lucky with B, and its 
conversion rate is in fact much lower? It's easy to see a possible world in which A is actually the 
better variant, even though it did worse on our test. So the real question is: how sure can we be that 
B is the better variant? This is where the Monte Carlo simulation comes in. 






MONTE CARLO SIMULATIONS 

The accurate answer to which email variant generates a higher click-through rate lies somewhere 
in the intersection of the distributions of A and B. Fortunately, we have a way to figure it out: a 
Monte Carlo simulation. A Monte Carlo simulation is any technique that makes use of random 
sampling to solve a problem. In this case, we’re going to randomly sample from the two 
distributions, where each sample is chosen based on its probability in the distribution so that 
samples in a high-probability region will appear more frequently. For example, as we can see 
in Figure 15-2, a value greater than 0.2 is far more likely to be sampled from A than a value less 
than 0.2. However, a random sample from distribution B is nearly certain to be above 0.2. In our 
random sampling, we might pick out a value of 0.2 for variant A and 0.35 for variant B. Each sample 
is random, and based on the relative probability of values in the A and B distributions. The values 
0.2 for A and 0.35 for B both could be the true conversion rate for each variant based on the 
evidence we’ve observed. This individual sampling from the two distributions confirms the belief 
that variant B is, in fact, superior to A, since 0.35 is larger than 0.2. 

However, we could also sample 0.3 for variant A and 0.27 for variant B, both of which are 
reasonably likely to be sampled from their respective distributions. These are also both realistic 
possible values for the true conversion rate of each variant, but in this case, they indicate that 
variant B is actually worse than variant A. 

We can imagine that the posterior distribution represents all the worlds that could exist based on 
our current state of beliefs regarding each conversion rate. Every time we sample from each 
distribution, we’re seeing what one possible world could look like. We can tell visually in Figure 15-2 
that we should expect more worlds where B is truly the better variant. The more frequently we 
sample, the more precisely we can tell in exactly how many worlds, of all the worlds we’ve sampled 
from, B is the better variant. Once we have our samples, we can look at the ratio of worlds where B 
is the better variant to the total number of worlds we’ve looked at and get an estimate of the 
probability that B is in fact greater than A. 

In How Many Worlds Is B the Better Variant? 

Now we just have to write the code that will perform this sampling. R’s rbeta() function allows us to 
automatically sample from a beta distribution. We can consider each comparison of two samples a 
single trial. The more trials we run, the more precise our result will be, so we’ll start with 100,000 
trials by assigning this value to the variable n.trials: 

n.trials <- 100000 

Next we’ll put our prior alpha and beta values into variables: 

prior.alpha <- 3 
prior.beta <- 7 

Then we need to collect samples from each variant. We’ll use rbeta() for this: 

a.samples <- rbeta(n.trials,36+prior.alpha,114+prior.beta) 
b.samples <- rbeta(n.trials,50+prior.alpha,100+prior.beta) 

We’re saving the results of the rbeta() samples into variables, too, so we can access them more easily. 
For each variant, we input the number of people who clicked through to the blog and the number of 
people who didn’t. 

Finally, we compare how many times the b.samples are greater than the a.samples and divide that 
number by n.trials, which will give us the percentage of the total trials where variant B was greater 
than variant A: 


p.b_superior <- sum(b.samples > a.samples)/n.trials 


The result we end up with is: 


p.b_superior 

>0.96 


What we see here is that in 96 percent of the 100,000 trials, variant B was superior. We can imagine 
this as looking at 100,000 possible worlds. Based on the distribution of possible conversion rates 
for each variant, in 96 percent of the worlds variant B was the better of the two. This result shows 
that, even with a relatively small number of observed samples, we have a pretty strong belief that B 
is the better variant. If you’ve ever done t-tests in classical statistics, this is roughly equivalent (if 
we used a Beta(1,1) prior) to getting a p-value of 0.04 from a single-tailed t-test (often considered 
"statistically significant"). However, the beauty of our approach is that we were able to build this 
test from scratch using just our knowledge of probability and a straightforward simulation. 
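
As a sanity check on the simulation, the same probability can be approximated without random sampling. The sketch below is an assumption-laden cross-check rather than part of the chapter's method: the seed, the variable names, and the use of integrate() are all choices made here for illustration.

```r
set.seed(1)  # arbitrary seed, used only so the run is repeatable
n.trials <- 100000

# Posterior parameters from the text: Beta(36+3, 114+7) and Beta(50+3, 100+7)
a.samples <- rbeta(n.trials, 39, 121)
b.samples <- rbeta(n.trials, 53, 107)

# Monte Carlo estimate of P(B > A)
p.mc <- mean(b.samples > a.samples)

# Numerical version: integrate density_A(x) * P(B > x) over all rates x
integrand <- function(x) {
  dbeta(x, 39, 121) * pbeta(x, 53, 107, lower.tail = FALSE)
}
p.numeric <- integrate(integrand, lower = 0, upper = 1)$value

# Approximate standard error of the Monte Carlo estimate
mc.se <- sqrt(p.mc * (1 - p.mc) / n.trials)

print(c(monte.carlo = p.mc, numeric = p.numeric, std.error = mc.se))
```

If the two estimates agree to within a few standard errors, the simulation is behaving as expected. The standard error also shows why more trials give a more precise answer: it shrinks with the square root of n.trials.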

How Much Better Is Each Variant B Than Each Variant A? 

Now we can say precisely how certain we are that B is the superior variant. However, if this email 
campaign were for a real business, simply saying "B is better" wouldn’t be a very satisfactory 
answer. Don’t you really want to know how much better? 

This is the real power of our Monte Carlo simulation. We can take the exact results from our last 
simulation and test how much better variant B is likely to be by looking at how many times greater 
the B samples are than the A samples. In other words, we can look at this ratio: 

$$\frac{\text{B samples}}{\text{A samples}}$$ 

In R, if we take the a.samples and b.samples from before, we can compute b.samples/a.samples. This will 
give us a distribution of the relative improvements from variant A to variant B. When we plot out 
this distribution as a histogram, as shown in Figure 15-3, we can see how much we expect variant B 
to improve our click-through rate. 

From this histogram we can see that variant B will most likely be about a 40 percent improvement 
(ratio of 1.4) over A, although there is an entire range of possible values. As we discussed in Chapter 
13, the cumulative distribution function (CDF) is much more useful than a histogram for reasoning 
about our results. Since we’re working with data rather than a mathematical function, we’ll 
compute the empirical cumulative distribution function with R’s ecdf() function. The eCDF is 
illustrated in Figure 15-4. 

Figure 15-3: A histogram of possible improvements we might see 

Figure 15-4: A distribution of possible improvements we might see 
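
A minimal sketch of how the ratio and its eCDF might be computed is shown here. The seed and the names improvement, p.a.better, p.big.gain, and median.gain are assumptions introduced for illustration. Plotting is left out, but hist(improvement) and plot(improvement.ecdf) would draw figures of the kind described.

```r
set.seed(1)  # arbitrary seed, used only so the run is repeatable
n.trials <- 100000

a.samples <- rbeta(n.trials, 36 + 3, 114 + 7)
b.samples <- rbeta(n.trials, 50 + 3, 100 + 7)

# Relative improvement of B over A in each simulated world
improvement <- b.samples / a.samples

# ecdf() returns a function giving the share of samples <= a given value
improvement.ecdf <- ecdf(improvement)

p.a.better <- improvement.ecdf(1)        # share of worlds with ratio <= 1
p.big.gain <- 1 - improvement.ecdf(1.5)  # share of worlds with ratio > 1.5
median.gain <- median(improvement)

print(c(p.a.better = p.a.better, p.big.gain = p.big.gain, median.gain = median.gain))
```

Because the eCDF is an ordinary R function, any threshold of interest can be read off directly instead of being estimated by eye from a plot.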

Now we can see our results more clearly. There is really just a small, small chance that A is better, 
and even if it is better, it’s not going to be by much. We can also see that there’s about a 25 percent 
chance that variant B is a 50 percent or more improvement over A, and even a reasonable chance it 
could be more than double the conversion rate! Now, in choosing B over A, we can actually reason 
about our risk by saying, "The chance that B is 20 percent worse is roughly the same as the chance 
that it’s 100 percent better." Sounds like a good bet to me, and a much better statement of our 
knowledge than, "There is a statistically significant difference between B and A." 

WRAPPING UP 

In this chapter we saw how parameter estimation naturally extends to a form of hypothesis testing. 
If the hypothesis we want to test is "variant B has a better conversion rate than variant A," we can 
start by first doing parameter estimation for the possible conversion rates of each variant. Once we 
know those estimates, we can use the Monte Carlo simulation in order to sample from them. By 
comparing these samples, we can come up with a probability that our hypothesis is true. Finally, we 
can take our test one step further by seeing how well our new variant performs in these possible 
worlds, estimating not only whether the hypothesis is true, but also how much improvement we are 
likely to see. 

EXERCISES 

Try answering the following questions to see how well you understand running A/B tests. The 
solutions can be found at https://nostarch.com/learnbayes/. 

1. Suppose a director of marketing with many years of experience tells you he believes 
very strongly that the variant without images (B) won’t perform any differently than the 
original variant. How could you account for this in our model? Implement this change and 
see how your final conclusions change as well. 

2. The lead designer sees your results and insists that there’s no way that variant B 
should perform better with no images. She feels that you should assume the conversion rate 
for variant B is closer to 20 percent than 30 percent. Implement a solution for this and again 
review the results of our analysis. 

3. Assume that being 95 percent certain means that you’re more or less "convinced" of a 
hypothesis. Also assume that there’s no longer any limit to the number of emails you can 
send in your test. If the true conversion for A is 0.25 and for B is 0.3, explore how many 
samples it would take to convince the director of marketing that B was in fact superior. 
Explore the same for the lead designer. You can generate samples of conversions with the 
following snippet of R: 


true.rate <- 0.25 
number.of.samples <- 100 
results <- runif(number.of.samples) <= true.rate 
INTRODUCTION TO THE BAYES FACTOR AND POSTERIOR 
ODDS: THE COMPETITION OF IDEAS 



In the previous chapter, we saw that we can view a hypothesis test as an extension of parameter 
estimation. In this chapter, we’ll think about hypothesis tests instead as a way to compare ideas 
with an important mathematical tool called the Bayes factor. The Bayes factor is a formula that tests 
the plausibility of one hypothesis by comparing it to another. The result tells us how many times 
more likely one hypothesis is than the other. 

We'll then see how to combine the Bayes factor with our prior beliefs to come up with the posterior 
odds, which tells us how much stronger one belief is than the other at explaining our data. 


REVISITING BAYES' THEOREM 

Chapter 6 introduced Bayes’ theorem, which takes the following form: 

$$P(H \mid D) = \frac{P(H) \times P(D \mid H)}{P(D)}$$ 

Recall that there are three parts of this formula that have special names: 


• P(H | D) is the posterior probability, which tells us how strongly we should believe in 
our hypothesis, given our data. 

• P(H) is the prior belief or the probability of our hypothesis prior to looking at the 
data. 

• P(D | H) is the likelihood of getting the existing data if our hypothesis were true. 


The last piece, P(D), is the probability of the data observed independent of the hypothesis. We 
need P(D) in order to make sure that our posterior probability is correctly placed somewhere 
between 0 and 1. If we have all of these pieces of information, we can calculate exactly how strongly 
we should believe in our hypothesis given the data we’ve observed. But as I mentioned in Chapter 
8, P(D) is often very hard to define. In many cases, it’s not obvious how we can figure out the 
probability of our data. P(D) is also totally unnecessary if all we care about is comparing the relative 
strength of two different hypotheses. 

For these reasons, we often use the proportional form of Bayes’ theorem, which allows us to analyze 
the strength of our hypotheses without knowing P(D). It looks like this: 


$$P(H \mid D) \propto P(H) \times P(D \mid H)$$ 





In plain English, the proportional form of Bayes’ theorem says that the posterior probability of our 
hypothesis is proportional to the prior multiplied by the likelihood. We can use this to compare two 
hypotheses by examining the ratio of the prior belief multiplied by the likelihood for each 
hypothesis using the ratio of posteriors formula: 

$$\frac{P(H_1) \times P(D \mid H_1)}{P(H_2) \times P(D \mid H_2)}$$ 

What we have now is a ratio of how well each of our hypotheses explains the data we’ve observed. 
That is, if the ratio is 2, then H₁ explains the observed data twice as well as H₂, and if the ratio is 1/2, 
then H₂ explains the data twice as well as H₁. 
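
To see why P(D) can be ignored when two hypotheses are compared, it may help to write out the ratio of the two full posteriors. This short derivation uses only the form of Bayes’ theorem recalled in this section; the same P(D) appears in both posteriors because both hypotheses are judged against the same data.

```latex
\frac{P(H_1 \mid D)}{P(H_2 \mid D)}
  = \frac{\dfrac{P(H_1) \times P(D \mid H_1)}{P(D)}}
         {\dfrac{P(H_2) \times P(D \mid H_2)}{P(D)}}
  = \frac{P(H_1) \times P(D \mid H_1)}{P(H_2) \times P(D \mid H_2)}
```

The P(D) terms cancel, which is why the ratio can be computed even when P(D) itself is hard to define.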

BUILDING A HYPOTHESIS TEST USING THE RATIO OF 
POSTERIORS 

The ratio of posteriors formula gives us the posterior odds, which allows us to test hypotheses or 
beliefs we have about data. Even when we do know P(D), the posterior odds is a useful tool because 
it allows us to compare ideas. To better understand the posterior odds, we’ll break down the ratio 
of posteriors formula into two parts: the likelihood ratio, or the Bayes factor, and the ratio of prior 
probabilities. This is a standard, and very helpful, practice that makes it much easier to reason 
about the likelihood and the prior probability separately. 

The Bayes Factor 

Using the ratio of posteriors formula, let’s assume that P(H₁) = P(H₂), that is, that our prior belief in 
each hypothesis is the same. In that case, the ratio of prior beliefs in the hypotheses is just 1, so all 
that’s left is: 

$$\frac{P(D \mid H_1)}{P(D \mid H_2)}$$ 

This is the Bayes factor, the ratio between the likelihoods of two hypotheses. 

Take a moment to really think about what this equation is saying. When we consider how we’re 
going to argue for our H₁ (that is, our belief about the world), we think about gathering evidence 
that supports our beliefs. A typical argument, therefore, involves building up a set of data, D₁, that 
supports H₁, and then arguing with a friend who has gathered a set of data, D₂, that supports their 
hypothesis, H₂. 

In Bayesian reasoning, though, we’re not gathering evidence to support our ideas; we’re looking to 
see how well our ideas explain the evidence in front of us. What this ratio tells us is the likelihood of 
what we’ve seen given what we believe to be true compared to what someone else believes to be 
true. Our hypothesis wins when it explains the world better than the competing hypothesis. 

If, however, the competing hypothesis explains the data much better than ours, it might be time to 
change our beliefs. The key here is that in Bayesian reasoning, we don’t worry about supporting our 
beliefs—we are focused on how well our beliefs support the data we observe. In the end, data can 
either confirm our ideas or lead us to change our minds. 

Prior Odds 

So far we have assumed that the prior probability of each hypothesis is the same. This is clearly not 
always the case: a hypothesis may explain the data well even if it is very unlikely. If you’ve lost your 



phone, for example, both the belief that you left it in the bathroom and the belief that aliens took it 
to examine human technology explain the data quite well. However, the bathroom hypothesis is 
clearly much more likely. This is why we need to consider the ratio of prior probabilities: 

$$\frac{P(H_1)}{P(H_2)}$$ 

This ratio compares the probability of two hypotheses before we look at the data. When used in 
relation to the Bayes factor, this ratio is called the prior odds in our H₁ and written as O(H₁). This 
representation is helpful because it lets us easily note how strongly (or weakly) we believe in the 
hypothesis we’re testing. When this number is greater than 1, it means the prior odds favor our 
hypothesis, and when it is a fraction less than 1, it means they’re against our hypothesis. For 
example, O(H₁) = 100 means that, without any other information, we believe H₁ is 100 times more 
likely than the alternative hypothesis. On the other hand, when O(H₁) = 1/100, the alternative 
hypothesis is 100 times more likely than ours. 


Posterior Odds 

If we put together the Bayes factor and the prior odds, we get the posterior odds: 


$$\text{posterior odds} = O(H_1) \times \frac{P(D \mid H_1)}{P(D \mid H_2)}$$ 


The posterior odds calculates how many times better our hypothesis explains the data than a 
competing hypothesis. 

Table 16-1 lists some guidelines for evaluating various posterior odds values. 


Table 16-1: Guidelines for Evaluating Posterior Odds 

Posterior odds   Strength of evidence 
1 to 3           Interesting, but nothing conclusive 
3 to 20          Looks like we’re on to something 
20 to 150        Strong evidence in favor of H₁ 
> 150            Overwhelming evidence 


We can look at the reciprocal of these odds to decide when to change our mind about an idea. 

While these values can serve as a useful guide, Bayesian reasoning is still a form of reasoning, which 
means you have to use some judgment. If you’re having a casual disagreement with a friend, a 
posterior odds of 2 might be enough to make you feel confident. If you’re trying to figure out if 
you’re drinking poison, a posterior odds of 100 still might not cut it. 
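
For readers who like to keep such guidelines close to their calculations, here is one way the table could be encoded in R. The function name evidence.label, the label for odds below 1, and the decision to place boundary values (3, 20, 150) in the higher band are all assumptions made for this sketch. The table itself does not say how to treat exact boundaries.

```r
# Map posterior odds to the rough guideline labels from Table 16-1
evidence.label <- function(posterior.odds) {
  labels <- c(
    "Favors the competing hypothesis (look at the reciprocal)",
    "Interesting, but nothing conclusive",
    "Looks like we're on to something",
    "Strong evidence in favor of H1",
    "Overwhelming evidence"
  )
  # findInterval() counts how many cut points are <= each value,
  # so boundary values fall into the higher band
  labels[findInterval(posterior.odds, c(1, 3, 20, 150)) + 1]
}

print(evidence.label(c(0.5, 2, 10, 100, 1000)))
```

The labels are only a starting point. As the surrounding text stresses, how much evidence is enough depends on what is at stake.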

Next, we'll look at two examples in which we use the Bayes factor to determine the strength of our 
beliefs. 



Testing for a Loaded Die 

We can use the Bayes factor and posterior odds as a form of hypothesis testing in which each test is 
a competition between two ideas. Suppose your friend has a bag with three six-sided dice in it, and 
one die is weighted so that it lands on 6 half the time. The other two are traditional dice whose 
probability of rolling a 6 is 1/6. Your friend pulls out a die and rolls 10 times, with the following 
results: 


6, 1, 3, 6, 4, 5, 6, 1, 2, 6 


We want to figure out if this is the loaded die or a regular die. We can call the loaded die H₁ and the 
regular die H₂. 

Let’s start by working out the Bayes factor: 


$$\frac{P(D \mid H_1)}{P(D \mid H_2)}$$ 

The first step is calculating P(D | H), or the likelihood of H₁ and H₂ given the data we’ve observed. In 
this example, your friend rolled four 6s and six non-6s. We know that if the die is loaded, the 
probability of rolling a 6 is 1/2 and the probability of rolling any non-6 is also 1/2. This means the 
likelihood of seeing this data given that we’ve used the loaded die is: 

$$P(D \mid H_1) = \left(\frac{1}{2}\right)^4 \times \left(\frac{1}{2}\right)^6 = 0.00098$$ 

In the case of the fair die, the probability of rolling a 6 is 1/6, while the probability of rolling 
anything else is 5/6. This means our likelihood of seeing this data for H₂, the hypothesis that the die 
is fair, is: 

$$P(D \mid H_2) = \left(\frac{1}{6}\right)^4 \times \left(\frac{5}{6}\right)^6 = 0.00026$$ 


Now we can compute our Bayes factor, which will tell us how much better H₁ is than H₂ at 
explaining our data, assuming each hypothesis was equally probable in the first place (meaning that 
the prior odds ratio is 1): 

$$\frac{P(D \mid H_1)}{P(D \mid H_2)} = \frac{0.00098}{0.00026} = 3.77$$ 

This means that H₁, the belief that the die is loaded, explains the data we observed almost four 
times better than H₂. 


However, this is true only if H₁ and H₂ are both just as likely to be true in the first place. But we 
know there are two fair dice in the bag and only one loaded die, which means that each hypothesis 
was not equally likely. Based on the distribution of the dice in the bag, we know that these are the 
prior probabilities for each hypothesis: 

$$P(H_1) = \frac{1}{3}; \quad P(H_2) = \frac{2}{3}$$ 

From these, we can calculate the prior odds for H₁: 

$$\text{prior odds} = O(H_1) = \frac{P(H_1)}{P(H_2)} = \frac{1/3}{2/3} = \frac{1}{2}$$ 

Because there is only one loaded die in the bag and two fair dice, we’re twice as likely to pull a fair 
die as a loaded one. With our prior odds for H₁, we can now compute our full posterior odds: 

$$\text{posterior odds} = O(H_1) \times \frac{P(D \mid H_1)}{P(D \mid H_2)} = \frac{1}{2} \times 3.77 = 1.89$$ 

While the initial likelihood ratio showed that H₁ explained the data almost four times as well as H₂, 
the posterior odds shows us that, because H₁ is only half as likely as H₂, H₁ is actually only about 
twice as strong of an explanation as H₂. 
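
The whole calculation fits in a few lines of R. The variable names are illustrative assumptions. Because the code works with unrounded likelihoods, its Bayes factor can differ in the last digit from a hand calculation that divides the rounded values 0.00098 and 0.00026.

```r
# Observed rolls: four 6s and six non-6s
sixes <- 4
non.sixes <- 6

# Likelihood of the data under each hypothesis
lik.loaded <- (1/2)^sixes * (1/2)^non.sixes  # H1: loaded die
lik.fair <- (1/6)^sixes * (5/6)^non.sixes    # H2: fair die

# Bayes factor: how much better H1 explains the data than H2
bayes.factor <- lik.loaded / lik.fair

# Prior odds: one loaded die versus two fair dice in the bag
prior.odds <- (1/3) / (2/3)

posterior.odds <- prior.odds * bayes.factor

print(c(bayes.factor = bayes.factor, posterior.odds = posterior.odds))
```

Changing sixes and non.sixes makes it easy to see how further rolls would move the posterior odds toward one hypothesis or the other.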

From this, if you absolutely had to draw a conclusion about whether the die was loaded or not, your 
best bet would be to say that it is indeed loaded. However, a posterior odds of less than 2 is not 
particularly strong evidence in favor of H₁. If you really wanted to know whether or not the die was 
loaded, you would need to roll it a few more times until the evidence in favor of one hypothesis or 
the other was great enough for you to make a stronger decision. 

Now let’s look at a second example of using the Bayes factor to determine the strength of our 
beliefs. 

Self-Diagnosing Rare Diseases Online 

Many people have made the mistake of looking up their symptoms and ailments online late at night, 
only to find themselves glued to the screen in terror, sure they are the victim of some strange and 
terrible disease! Unfortunately for them, their analysis almost always excludes Bayesian reasoning, 
which might help alleviate some unnecessary anxiety. In this example, let’s assume you’ve made the 
mistake of looking up your symptoms and have found two possible ailments that fit. Rather than 
panicking for no reason, you'll use posterior odds to weigh the odds of each. 

Suppose you wake up one day with difficulty hearing and a ringing (tinnitus) in one ear. It annoys 
you all day, and when you get home from work, you decide it’s high time to search the web for 
potential causes of your symptoms. You become increasingly concerned, and finally come to two 
possible hypotheses: 

Earwax impaction You have too much earwax in one ear. A quick visit to the doctor will clear up 
this condition. 

Vestibular schwannoma You have a brain tumor growing on the myelin sheath of the vestibular 
nerve, causing irreversible hearing loss and possibly requiring brain surgery. 

Of the two, the possibility of vestibular schwannoma is the more worrying. Sure, it could be just 
earwax, but what if it’s not? What if you do have a brain tumor? Since you’re most worried about 
the possibility of a brain tumor, you decide to make this your H₁. Your H₂ is the hypothesis that you 
have too much earwax in one ear. 

Let’s see if posterior odds can calm you down. 

As in our last example, we’ll start our exploration by looking at the likelihood of observing these 
symptoms if each hypothesis were true, and compute the Bayes factor. This means we need to 
compute P(D | H). You've observed two symptoms: hearing loss and tinnitus. 





For vestibular schwannoma, the probability of experiencing hearing loss is 94 percent, and the 
probability of experiencing tinnitus is 83 percent, which means the probability of having hearing 
loss and tinnitus if you have vestibular schwannoma (treating the two symptoms as independent) is: 

$$P(D \mid H_1) = 0.94 \times 0.83 = 0.78$$ 

Next, we’ll do the same for H₂. For earwax impaction, the probability of experiencing hearing loss is 
63 percent, and the probability of experiencing tinnitus is 55 percent. The likelihood of having your 
symptoms if you have impacted earwax is: 

$$P(D \mid H_2) = 0.63 \times 0.55 = 0.35$$ 


Now we have enough information to look at our Bayes factor: 

$$\frac{P(D \mid H_1)}{P(D \mid H_2)} = \frac{0.78}{0.35} = 2.23$$ 

Yikes! Looking at just the Bayes factor doesn’t do much to help alleviate your concerns of having a 
brain tumor. Taking only the likelihood ratio into account, it appears that you’re more than twice as 
likely to experience these symptoms if you have vestibular schwannoma than if you have earwax 
impaction! Luckily, we’re not done with our analysis yet. 

The next step is to determine the prior odds of each hypothesis. Symptoms aside, how likely is it for 
someone to have one issue versus the other? We can find epidemiological data for each of these 
diseases. It turns out that vestibular schwannoma is a rare condition. Only 11 in 1,000,000 people 
contract it each year. The prior probability looks like this: 

$$P(H_1) = \frac{11}{1{,}000{,}000}$$ 

Unsurprisingly, earwax impaction is much, much more common, with 37,000 cases per 1,000,000 
people in a year: 

$$P(H_2) = \frac{37{,}000}{1{,}000{,}000}$$ 




To get the prior odds for H₁, we need to look at the ratio of these two prior probabilities: 

$$O(H_1) = \frac{P(H_1)}{P(H_2)} = \frac{11 / 1{,}000{,}000}{37{,}000 / 1{,}000{,}000} = \frac{11}{37{,}000}$$ 


Based on prior information alone, a given person is about 3,400 times more likely to have an 
earwax impaction than vestibular schwannoma. But before you can breathe easy, we need to 
compute the full posterior odds. This just means multiplying our Bayes factor by our prior odds: 


$$O(H_1) \times \frac{P(D \mid H_1)}{P(D \mid H_2)} = \frac{11}{37{,}000} \times 2.23 = \frac{24.53}{37{,}000} \approx 0.00066$$ 

This result shows that H₂ is about 1,500 times more likely than H₁. Finally, you can relax, knowing 
that a visit to the doctor in the morning for a simple ear cleaning will likely clear all this up! 
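
The same steps can be reproduced in R as a check. The variable names are illustrative assumptions, and the symptom probabilities are multiplied as if the two symptoms were independent, as in the text. Because the code keeps unrounded intermediate values, its result can differ somewhat from a figure worked out by hand with rounded numbers.

```r
# Likelihood of hearing loss and tinnitus under each hypothesis
lik.schwannoma <- 0.94 * 0.83  # H1: vestibular schwannoma
lik.earwax <- 0.63 * 0.55      # H2: earwax impaction

bayes.factor <- lik.schwannoma / lik.earwax

# Prior odds from the annual incidence of each condition
prior.odds <- (11 / 1000000) / (37000 / 1000000)

posterior.odds <- prior.odds * bayes.factor

# The reciprocal says how many times more likely H2 is than H1
print(c(bayes.factor = bayes.factor,
        posterior.odds = posterior.odds,
        odds.for.earwax = 1 / posterior.odds))
```

Even though the Bayes factor favors the tumor, the tiny prior odds dominate the result, which is the point of the example.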



In everyday reasoning, it’s easy to overestimate the probability of scary situations, but by using 
Bayesian reasoning, we can break down the real risks and see how likely they actually are. 

WRAPPING UP 

In this chapter, you learned how to use the Bayes factor and posterior odds to compare two 
hypotheses. Rather than focusing on providing data to support our beliefs, the Bayes factor tests 
how well our beliefs support the data we've observed. The result is a ratio that reflects how many 
times better one hypothesis explains the data than the other. We can use it to strengthen our prior 
beliefs when they explain the data better than alternative beliefs. On the other hand, when the 
result is a fraction, we might want to consider changing our minds. 

EXERCISES 

Try answering the following questions to see how well you understand the Bayes factor and 
posterior odds. The solutions can be found at https://nostarch.com/learnbayes/. 

1. Returning to the dice problem, assume that your friend made a mistake and suddenly 
realized that there were, in fact, two loaded dice and only one fair die. How does this change 
the prior, and therefore the posterior odds, for our problem? Are you more willing to believe 
that the die being rolled is the loaded die? 

2. Returning to the rare diseases example, suppose you go to the doctor, and after 
having your ears cleaned you notice that your symptoms persist. Even worse, you have a 
new symptom: vertigo. The doctor proposes another possible explanation, labyrinthitis, 
which is a viral infection of the inner ear in which 98 percent of cases involve vertigo. 
However, hearing loss and tinnitus are less common in this disease; hearing loss occurs only 
30 percent of the time, and tinnitus occurs only 28 percent of the time. Vertigo is also a 
possible symptom of vestibular schwannoma, but occurs in only 49 percent of cases. In the 
general population, 35 people per million contract labyrinthitis annually. What are the
posterior odds when you compare the hypothesis that you have labyrinthitis against the
hypothesis that you have vestibular schwannoma?
BAYESIAN REASONING IN THE TWILIGHT ZONE



In Chapter 16, we used the Bayes factor and posterior odds to find out how many times better one 
hypothesis was than a competing one. But these tools of Bayesian reasoning can do even more than 
just compare ideas. In this chapter, we'll use the Bayes factor and posterior odds to quantify how 
much evidence it should take to convince someone of a hypothesis. We’ll also see how to estimate 
the strength of someone else’s prior belief in a certain hypothesis. We’ll do all of this using a famous 
episode of the classic TV series The Twilight Zone. 

BAYESIAN REASONING IN THE TWILIGHT ZONE 

One of my favorite episodes of The Twilight Zone is called "The Nick of Time." In this episode, a 
young, newly married couple, Don and Pat, wait in a small-town diner while a mechanic repairs 
their car. In the diner, they come across a fortune-telling machine called the Mystic Seer that 
accepts yes or no questions and, for a penny, spits out cards with answers to each question. 

Don, who is very superstitious, asks the Mystic Seer a series of questions. When the machine 
answers correctly, he begins to believe in its supernatural powers. However, Pat remains skeptical 
of the machine’s powers, even as the Seer continues to provide correct answers. 

Although Don and Pat are looking at the same data, they come to different conclusions. How can we 
explain why they reason differently when given the same evidence? We can use the Bayes factor to 
get deeper insight into how these two characters are thinking about the data. 

USING THE BAYES FACTOR TO UNDERSTAND THE MYSTIC 
SEER 

In the episode, we are faced with two competing hypotheses. Let’s call them $H$ and $\bar{H}$ (or "not $H$"),
since one hypothesis is the negation of the other:

$H$: The Mystic Seer truly can predict the future.

$\bar{H}$: The Mystic Seer just got lucky.

Our data, D, in this case is the sequence of n correct answers the Mystic Seer provides. The
greater n is, the stronger the evidence in favor of $H$. The major assumption in the Twilight
Zone episode is that the Mystic Seer is correct every time, so the question is: is this result
supernatural, or is it merely a coincidence? For us, D, our data, always represents a sequence
of n correct answers. Now we can assess our likelihoods, or the probability of getting our data given
each hypothesis.

$P(D_n \mid H)$ is the probability of getting n correct answers in a row given that the Mystic Seer can
predict the future. This likelihood will always be 1, no matter the number of questions asked. This is 
because, if the Mystic Seer is supernatural, it will always pick the right answer, whether it is asked 
one question or a thousand. Of course, this also means that if the Mystic Seer gets a single answer 
wrong, the probability for this hypothesis will drop to 0, because a psychic machine wouldn’t ever 
guess incorrectly. In that case, we might want to come up with a weaker hypothesis—for example, 
that the Mystic Seer is correct 90 percent of the time (we'll explore a similar problem in Chapter 
19). 

$P(D_n \mid \bar{H})$ is the probability of getting n correct answers in a row if the Mystic Seer is randomly
spitting out answers. Here, $P(D_n \mid \bar{H})$ is $0.5^n$. In other words, if the machine is just guessing, then
each answer has a 0.5 chance of being correct.

To compare these hypotheses, let’s look at the ratio of the two likelihoods: 

$$\frac{P(D_n \mid H)}{P(D_n \mid \bar{H})}$$

As a reminder, this ratio measures how many times more likely the data is given $H$ as opposed
to $\bar{H}$, when we assume both hypotheses are equally likely. Now let’s see how these ideas compare.

Measuring the Bayes Factor 

As we did in the preceding chapter, we'll temporarily ignore the ratio of our prior odds and 
concentrate on comparing the ratio of the likelihoods, or the Bayes factor. We’re assuming (for the 
time being) that the Mystic Seer has an equal chance of being supernatural as it does of being 
simply lucky. 

In this example, our numerator, $P(D_n \mid H)$, is always 1, so for any value of n we have:

$$BF = \frac{P(D_n \mid H)}{P(D_n \mid \bar{H})} = \frac{1}{0.5^n}$$

Let’s imagine the Mystic Seer has given three correct answers so far. At this point, $P(D_3 \mid H) = 1$,
and $P(D_3 \mid \bar{H}) = 0.5^3 = 0.125$. Clearly $H$ explains the data better, but certainly nobody, not even
superstitious Don, will be convinced by only three correct guesses. Assuming the prior odds are
the same, our Bayes factor for three questions is:

$$BF = \frac{1}{0.125} = 8$$

We can use the same guidelines we used for evaluating posterior odds in Table 16-1 to evaluate 
Bayes factors here (if we assume each hypothesis is equally likely), as shown in Table 17-1. As you 
can see, a Bayes factor (BF) of 8 is far from conclusive. 


Table 17-1: Guidelines for Evaluating Bayes Factors

BF           Strength of evidence
1 to 3       Interesting, but nothing conclusive
3 to 20      Looks like we’re on to something
20 to 150    Strong evidence in favor of $H_1$
> 150        Overwhelming evidence in favor of $H_1$


So, at three questions answered correctly and with BF = 8, we should at least be curious about the 
power of the Mystic Seer, though we shouldn’t be convinced yet. 
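
To see how quickly the Bayes factor grows with each correct answer, here is a small Python sketch. The function names are our own, and the treatment of values below 1 and of exact boundary values is an assumption, since Table 17-1 does not specify them.

```python
def bayes_factor(n: int) -> float:
    """Bayes factor for n correct answers in a row.

    Numerator: P(D_n | H) = 1, since a psychic machine is never wrong.
    Denominator: P(D_n | not H) = 0.5 ** n, i.e. n independent 50/50 guesses.
    """
    return 1.0 / (0.5 ** n)


def strength_of_evidence(bf: float) -> str:
    """Map a Bayes factor to the bands of Table 17-1.

    Assumption: the handling of values below 1 and of exact boundaries
    is illustrative; the table itself does not define them.
    """
    if bf < 1:
        return "Favors the competing hypothesis"
    if bf <= 3:
        return "Interesting, but nothing conclusive"
    if bf <= 20:
        return "Looks like we're on to something"
    if bf <= 150:
        return "Strong evidence"
    return "Overwhelming evidence"


for n in (3, 4, 14):
    bf = bayes_factor(n)
    print(f"n = {n:2d}  BF = {bf:>8,.0f}  {strength_of_evidence(bf)}")
```

Because each extra correct answer doubles the Bayes factor, the evidence moves from "nothing conclusive" to "overwhelming" within about eight questions when the prior odds are even.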

But by this point in the episode, Don already seems pretty sure that the Mystic Seer is psychic. It 
takes only four correct answers for him to feel certain of it. On the other hand, it takes 14 questions 
for Pat to even start considering the possibility seriously, resulting in a Bayes factor of 16,384—way 
more evidence than she should need. 

Calculating the Bayes factor doesn’t explain why Don and Pat form different beliefs about the 
evidence, though. What’s going on there? 


Accounting for Prior Beliefs 


The element missing in our model is each character’s prior belief in the hypotheses. Remember that 
Don is extremely superstitious, while Pat is a skeptic. Clearly, Don and Pat are using extra 
information in their mental models, because each of them arrives at a conclusion of a different 
strength, and at very different times. This is fairly common in everyday reasoning: two people often 
respond differently to the exact same facts. 


We can model this phenomenon by simply imagining the initial odds of $P(H)$ and $P(\bar{H})$ given no
additional information. We call this the prior odds ratio, as you saw in Chapter 16:

$$\text{prior odds} = O(H) = \frac{P(H)}{P(\bar{H})}$$


The concept of prior beliefs in relation to the Bayes factor is actually pretty intuitive. Say we walk 
into the diner from The Twilight Zone, and I ask you, "What are the odds that the Mystic Seer is 
psychic?" You might reply, "Uh, one in a million! There’s no way that thing is supernatural." 
Mathematically, we can express this as: 

$$O(H) = \frac{1}{1,000,000}$$



Now let’s combine this prior belief with our data. To do this, we’ll multiply our prior odds with the 
results of the likelihood ratio to get our posterior odds for the hypothesis, given the data we’ve 
observed: 


$$\text{posterior odds} = O(H \mid D) = O(H) \times \frac{P(D \mid H)}{P(D \mid \bar{H})}$$


Thinking there’s only a one in a million chance the Mystic Seer is psychic before looking at any
evidence is pretty strong skepticism. The Bayesian approach reflects this skepticism quite well. If
you think the hypothesis that the Mystic Seer is supernatural is extremely unlikely from the start,
then you’ll require significantly more data to be convinced otherwise. Suppose the Mystic Seer gets
five answers correct. Our Bayes factor then becomes:

$$BF = \frac{1}{0.5^5} = 32$$

A Bayes factor of 32 is reasonably strong evidence that the Mystic Seer is truly supernatural.

However, if we add in our very skeptical prior odds to calculate our posterior odds, we get the 
following results: 

$$\text{posterior odds} = O(H) \times \frac{P(D_5 \mid H)}{P(D_5 \mid \bar{H})} = \frac{1}{1,000,000} \times \frac{1}{0.5^5} = 0.000032$$

Now our posterior odds tell us it’s extremely unlikely that the machine is psychic. This result 
corresponds quite well with our intuition. Again, if you really don’t believe in a hypothesis from the 
start, it’s going to take a lot of evidence to convince you otherwise. 
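
The same calculation can be sketched in a few lines of Python. The function name `posterior_odds` is our own, and the comparison prior of 1 (even odds) is included only to illustrate how much the skeptical prior changes the picture.

```python
def posterior_odds(prior_odds: float, n: int) -> float:
    """Posterior odds that the Seer is psychic after n correct answers.

    posterior odds = prior odds * Bayes factor, where the Bayes factor
    is 1 / 0.5 ** n as derived in the text.
    """
    return prior_odds * (1.0 / 0.5 ** n)


even_prior = 1.0                    # both hypotheses equally likely
skeptical_prior = 1 / 1_000_000     # "one in a million"

print(f"even prior:      {posterior_odds(even_prior, 5):.6f}")
print(f"skeptical prior: {posterior_odds(skeptical_prior, 5):.6f}")
```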

In fact, if we work backward, posterior odds can help us figure out how much evidence we’d need to 
make you believe H. At a posterior odds of 2, you’d just be starting to consider the supernatural 
hypothesis. So, if we solve for a posterior odds of greater than 2, we can determine what it would 
take to convince you. 

$$\frac{1}{1,000,000} \times \frac{1}{0.5^n} > 2$$

Solving for n gives $n > \log_2(2,000,000) \approx 20.9$, so to the nearest whole number we need:

$$n \geq 21$$

At 21 correct answers in a row, even a strong skeptic should start to think that the Seer may, in fact, 
be psychic. 
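
Working backward like this is easy to automate. The following Python sketch (the helper name `answers_needed` is our own) finds the smallest number of consecutive correct answers that pushes the posterior odds past a target, and compares it with the closed-form logarithm.

```python
import math


def answers_needed(prior_odds: float, target_odds: float) -> int:
    """Smallest n such that prior_odds * 2 ** n exceeds target_odds.

    Each correct answer multiplies the odds by 1 / 0.5 = 2.
    """
    n = 0
    while prior_odds * 2 ** n <= target_odds:
        n += 1
    return n


prior = 1 / 1_000_000
target = 2

n = answers_needed(prior, target)
closed_form = math.log2(target / prior)   # n must exceed this value

print(f"n must exceed {closed_form:.2f}, so the smallest whole n is {n}")
```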

Thus, our prior odds can do much more than tell us how strongly we believe something given our
background. They can also help us quantify exactly how much evidence we would need to be convinced
of a hypothesis. The reverse is true, too; if, after 21 correct answers in a row, you find yourself
believing strongly in $H$, you might want to weaken your prior odds.


DEVELOPING OUR OWN PSYCHIC POWERS 


At this point, we’ve learned how to compare hypotheses and calculate how much favorable 
evidence it would take to convince us of H, given our prior belief in H. Now we'll look at one more 
trick we can do with posterior odds: quantifying Don and Pat’s prior beliefs based on their 
reactions to the evidence. 


We don’t know exactly how strongly Don and Pat believe in the possibility that the Mystic Seer is 
psychic when they first walk into the diner. But we do know it takes Don about seven correct 
questions to become essentially certain of the Mystic Seer’s supernatural abilities. We can estimate 
that at this point Don’s posterior odds are 150—the threshold for very strong beliefs, according 
to Table 17-1. Now we can write out everything we know, except for $O(H)$, which we’ll be solving
for:

$$150 = O(H) \times \frac{P(D_7 \mid H)}{P(D_7 \mid \bar{H})} = O(H) \times \frac{1}{0.5^7}$$

Solving this for $O(H)$ gives us:

$$O(H)_{\text{Don}} = 1.17$$


What we have now is a quantitative model for Don’s superstitious beliefs. Because his initial odds 
ratio is greater than 1, Don walks into the diner being slightly more willing than not to believe that 
the Mystic Seer is supernatural, before collecting any data at all. This makes sense, of course, given 
his superstitious nature. 

Now on to Pat. At around 14 correct answers, Pat grows nervous, calling the Mystic Seer "a stupid 
piece of junk!” Although she has begun to suspect that the Mystic Seer might be psychic, she’s not 
nearly as certain as Don. I would estimate that her posterior odds are 5—the point at which she 
might start thinking, "Maybe the Mystic Seer could have psychic powers ...” Now we can create the 
posterior odds for Pat’s beliefs in the same way: 


$$5 = O(H) \times \frac{P(D_{14} \mid H)}{P(D_{14} \mid \bar{H})} = O(H) \times \frac{1}{0.5^{14}}$$

When we solve for $O(H)$, we can model Pat’s skepticism as:

$$O(H)_{\text{Pat}} = 0.0003$$


In other words, Pat, walking into the diner, would claim that the Seer has about a 1 in 3,000 chance 
of being supernatural. Again, this corresponds to our intuition; Pat begins with the very strong 
belief that the fortune-telling machine is nothing more than a fun game to play while she and Don 
wait for food. 
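
Both estimates come from rearranging the posterior odds formula to isolate the prior. A minimal Python sketch, with a function name of our own choosing and using the posterior odds assumed in the text (150 for Don, 5 for Pat), looks like this:

```python
def implied_prior_odds(posterior_odds: float, n: int) -> float:
    """Solve posterior = prior * (1 / 0.5 ** n) for the prior odds."""
    return posterior_odds * 0.5 ** n


# Posterior odds and question counts are the estimates made in the text.
don = implied_prior_odds(posterior_odds=150, n=7)
pat = implied_prior_odds(posterior_odds=5, n=14)

print(f"Don's prior odds: {don:.2f}")
print(f"Pat's prior odds: {pat:.4f} (about 1 in {1 / pat:,.0f})")
```

The estimates are only as good as the assumed posterior odds, so they are best read as rough orders of magnitude rather than precise measurements of belief.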

What we’ve done here is remarkable. We’ve used our rules of probability to come up with a 
quantitative statement about what someone believes. In essence, we have become mind readers! 

WRAPPING UP 

In this chapter, we explored three ways of using Bayes factors and posterior odds in order to reason 
about problems probabilistically. We started by revisiting what we learned in the previous chapter: 
that we can use posterior odds as a way to compare two ideas. Then we saw that if we know our 
prior belief in the odds of one hypothesis versus another, we can calculate exactly how much 
evidence it will take to convince us that we should change our beliefs. Finally, we used posterior 
odds to assign a value for each person’s prior beliefs by looking at how much evidence it takes to 
convince them. In the end, posterior odds is far more than just a way to test ideas. It provides us 
with a framework for thinking about reasoning under uncertainty. 

You can now use your own "mystic" powers of Bayesian reasoning to answer the exercises below: 



EXERCISES 

Try answering the following questions to see how well you understand quantifying the amount of 
evidence it should take to convince someone of a hypothesis and estimating the strength of 
someone else’s prior belief. The solutions can be found at https://nostarch.com/learnbayes/. 

1. Every time you and your friend get together to watch movies, you flip a coin to 
determine who gets to choose the movie. Your friend always picks heads, and every Friday 
for 10 weeks, the coin lands on heads. You develop a hypothesis that the coin has two heads 
sides, rather than both a heads side and a tails side. Set up a Bayes factor for the hypothesis 
that the coin is a trick coin over the hypothesis that the coin is fair. What does this ratio 
alone suggest about whether or not your friend is cheating you? 

2. Now imagine three cases: that your friend is a bit of a prankster, that your friend is 
honest most of the time but can occasionally be sneaky, and that your friend is very 
trustworthy. In each case, estimate some prior odds ratios for your hypothesis and compute 
the posterior odds. 

3. Suppose you trust this friend deeply. Make the prior odds of them cheating 1/10,000. 
How many times would the coin have to land on heads before you feel unsure about their 
innocence—say, a posterior odds of 1? 

4. Another friend of yours also hangs out with this same friend and, after only four 
weeks of the coin landing on heads, feels certain you’re both being cheated. This confidence 
implies a posterior odds of about 100. What value would you assign to this other friend’s 
prior belief that the first friend is a cheater? 
WHEN DATA DOESN'T CONVINCE YOU



In the previous chapter, we used Bayesian reasoning to reason about two hypotheses from an
episode of The Twilight Zone:

$H$: The fortune-telling Mystic Seer is supernatural.

$\bar{H}$: The fortune-telling Mystic Seer isn’t supernatural, just lucky.


We also learned how to account for skepticism by changing the prior odds ratio. For example, if 
you, like me, believe that the Mystic Seer definitely isn’t psychic, then you might want to set the 
prior odds extremely low—something like 1/1,000,000. 

However, depending on your level of personal skepticism, you might feel that even a 1/1,000,000 
odds ratio wouldn’t be quite enough to convince you of the seer’s power. 

Maybe even after receiving 1,000 correct answers from the seer—which, despite your very 
skeptical prior odds, would suggest you were astronomically in favor of believing the seer is 
psychic—you still wouldn’t buy into its supernatural powers. We could represent this by simply 
making our prior odds even more extreme, but I personally don’t find this solution very satisfying 
because no amount of data would convince me that the Mystic Seer is, in fact, psychic. 

In this chapter, we'll take a deeper look at problems where the data doesn’t convince people in the 
way we expect it to. In the real world, these situations are fairly common. Anyone who has argued 
with a relative over a holiday dinner has likely noticed that oftentimes the more contradictory 
evidence you give, the more they seem to be convinced of their preexisting belief! In order to fully 
understand Bayesian reasoning, we need to be able to understand, mathematically, why situations 
like these arise. This will help us identify and avoid them in our statistical analysis. 


A PSYCHIC FRIEND ROLLING DICE 


Suppose your friend tells you they can predict the outcome of a six-sided die roll with 90 percent 
accuracy because they are psychic. You find this claim difficult to believe, so you set up a hypothesis 
test using the Bayes factor. As in the Mystic Seer example, you have two hypotheses you want to
compare:

$$H_1: P(\text{correct}) = \frac{1}{6} \qquad H_2: P(\text{correct}) = \frac{9}{10}$$

The first hypothesis, $H_1$, represents your belief that the die is fair, and that your friend is not
psychic. If the die is fair, there is a 1 in 6 chance of guessing the result correctly. The second
hypothesis, $H_2$, represents your friend’s belief that they can, in fact, predict the outcome of a die roll
90 percent of the time and is therefore given a 9/10 ratio. Next we need some data to start testing
their claim. Your friend rolls the die 10 times and correctly guesses the outcome of the roll 9 times.


Comparing Likelihoods 


As we often have in previous chapters, we’ll start by looking at the Bayes factor, assuming for now 
that the prior odds for each hypothesis are equal. We’ll formulate our likelihood ratio as: 


$$\frac{P(D \mid H_2)}{P(D \mid H_1)}$$

so that our results will tell us how many times better (or worse) your friend’s claim of being
psychic explains the data than your hypothesis does. For this example, we’ll use the variable BF for
"Bayes factor" in our equations for brevity. Here is our result, taking into account the fact that your
friend correctly predicted 9 out of 10 rolls:

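$$BF = \frac{\left(\frac{9}{10}\right)^{9} \times \left(1 - \frac{9}{10}\right)^{1}}{\left(\frac{1}{6}\right)^{9} \times \left(1 - \frac{1}{6}\right)^{1}} = 468,517$$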


Our likelihood ratio shows that the friend-being-psychic hypothesis explains the data 468,517
times better than the hypothesis that your friend is just lucky. This is a bit concerning. According to
the Bayes factor chart we saw in earlier chapters, this means we should be nearly certain that $H_2$ is
true and your friend is psychic. Unless you’re already a deep believer in the possibility of psychic
powers, something seems very wrong here.
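
This Bayes factor is simple to reproduce. In the Python sketch below, the helper `likelihood` is our own; it gives the probability of one specific sequence of hits and misses and leaves out the binomial coefficient, which is the same under both hypotheses and cancels in the ratio.

```python
def likelihood(p_success: float, hits: int, misses: int) -> float:
    """Probability of one particular sequence of hits and misses.

    The binomial coefficient is omitted because it cancels when two
    hypotheses are compared on the same data.
    """
    return p_success ** hits * (1 - p_success) ** misses


# D_10: 9 correct predictions and 1 miss.
psychic = likelihood(9 / 10, hits=9, misses=1)   # H2: right 90% of the time
lucky = likelihood(1 / 6, hits=9, misses=1)      # H1: fair die, pure guessing

bf = psychic / lucky
print(f"BF = {bf:,.0f}")
```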


Incorporating Prior Odds 

In most cases in this book where the likelihood alone gives us strange results, we can solve the 
problem by including our prior probabilities. Clearly, we don’t believe in our friend’s hypothesis 
nearly as strongly as we believe in our own, so it makes sense to create a strong prior odds in favor 
of our hypothesis. We can start by simply setting our odds ratio high enough that it cancels out the 
extreme result of the Bayes factor, and see if this fixes our problem: 

$$O(H_2) = \frac{1}{468,517}$$


Now, when we work out our full posterior odds, we find that we are, once again, unconvinced that 
your friend is psychic: 

$$\text{posterior odds} = O(H_2) \times \frac{P(D_{10} \mid H_2)}{P(D_{10} \mid H_1)} = 1$$

For now, it looks like prior odds have once again saved us from a problem that occurred when we 
looked only at the Bayes factor. 

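But suppose your friend rolls the die five more times and successfully predicts all five outcomes.
Now we have a new set of data, $D_{15}$, which represents 15 rolls of a die, 14 of which your friend
guessed accurately. Now when we calculate our posterior odds, we see that even our extreme prior
is of little help:

$$\text{posterior odds} = O(H_2) \times \frac{P(D_{15} \mid H_2)}{P(D_{15} \mid H_1)} = \frac{1}{468,517} \times \frac{\left(\frac{9}{10}\right)^{14} \times \left(1 - \frac{9}{10}\right)^{1}}{\left(\frac{1}{6}\right)^{14} \times \left(1 - \frac{1}{6}\right)^{1}} = 4,592$$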


Using our existing prior, with just five more rolls of the die, we have posterior odds of 4,592— 
which means we’re back to being nearly certain that your friend is truly psychic! 
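
The following Python sketch shows why five more correct guesses overwhelm the prior so quickly. The names are our own, and the likelihood again omits the binomial coefficient because it cancels in the ratio. Each additional correct guess multiplies the odds by (9/10) / (1/6) = 5.4, so five of them multiply the odds by 5.4 to the fifth power.

```python
def likelihood(p_success: float, hits: int, misses: int) -> float:
    """Probability of one particular sequence of hits and misses."""
    return p_success ** hits * (1 - p_success) ** misses


prior_odds = 1 / 468_517   # the prior chosen to cancel the first Bayes factor

# D_15: 14 correct predictions and 1 miss.
bf_15 = likelihood(9 / 10, 14, 1) / likelihood(1 / 6, 14, 1)
posterior = prior_odds * bf_15

per_guess = (9 / 10) / (1 / 6)   # odds multiplier for each extra correct guess

print(f"posterior odds after 15 rolls: {posterior:,.0f}")
print(f"multiplier from five extra correct guesses: {per_guess ** 5:,.0f}")
```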

In most of our previous problems, we’ve corrected nonintuitive posterior results by adding a sane 
prior. We’ve added a pretty extreme prior against your friend being psychic, but our posterior odds 
are still strongly in favor of the hypothesis that they’re psychic. 

This is a major problem, because Bayesian reasoning should align with our everyday sense of logic. 
Clearly, 15 rolls of a die with 14 successful guesses is highly unusual, but it’s unlikely to convince 
many people that the guesser truly possesses psychic powers! However, if we can’t explain what’s 
going on here with our hypothesis test, it means that we really can’t rely on our test to solve our 
everyday statistical problems. 


Considering Alternative Hypotheses 


The issue here is that we don't want to believe your friend is psychic. If you found yourself in this 
situation in real life, it’s likely you would quickly come to some alternative conclusion. You might 
come to believe that your friend is using a loaded die that rolls a certain value about 90 percent of 
the time, for example. This represents a third hypothesis. Our Bayes factor is looking at only two
possible hypotheses: $H_1$, the hypothesis that the die is fair, and $H_2$, the hypothesis that your friend is
psychic.

Our Bayes factor so far tells us that it’s far more likely that our friend is psychic than that they are
guessing the rolls of a fair die correctly. When we think of the conclusion in those terms, it makes
more sense: with these results, it’s extremely unlikely that the die is fair. We don’t feel comfortable
accepting the $H_2$ alternative, because our own beliefs about the world don’t support the idea
that $H_2$ is a realistic explanation.

It’s important to understand that a hypothesis test compares only two explanations for an event, 
but very often there are countless possible explanations. If the winning hypothesis doesn’t convince 
you, you could always consider a third one. 


Let’s look at what happens when we compare $H_2$, our winning hypothesis, with a new
hypothesis, $H_3$: that the die is rigged so it has a certain outcome 90 percent of the time.

We’ll start with a new prior odds about $H_2$, which we'll call $O(H_2)'$ (the tick mark is a common
notation in mathematics meaning "like but not the same as”). This will represent the odds of $H_2/H_3$.
For now, we’ll just say that we believe it’s 1,000 times more likely that your friend is using a loaded
die than that your friend is really psychic (though our real prior might be much more extreme).
That means the prior odds of your friend being psychic is 1/1,000. If we reexamine our new
posterior odds, we get the following interesting result:
$$\text{posterior odds} = O(H_2)' \times \frac{P(D_{15} \mid H_2)}{P(D_{15} \mid H_3)} = \frac{1}{1,000} \times \frac{\left(\frac{9}{10}\right)^{14} \times \left(1 - \frac{9}{10}\right)^{1}}{\left(\frac{9}{10}\right)^{14} \times \left(1 - \frac{9}{10}\right)^{1}} = \frac{1}{1,000}$$

According to this calculation, our posterior odds are the same as our prior odds, $O(H_2)'$. This
happens because our two likelihoods are the same. In other words, $P(D_{15} \mid H_2) = P(D_{15} \mid H_3)$. Under both
hypotheses, the likelihood of your friend correctly guessing the outcome of the die roll is exactly the
same, because each hypothesis assigns the same probability to success. This means that
our Bayes factor will always be 1.

These results correspond quite well to our everyday intuition; after all, prior odds aside, each 
hypothesis explains the data we’ve seen equally well. That means that if, before considering the 
data, we believe one explanation is far more likely than the other, then no amount of new evidence 
will change our minds. So we no longer have a problem with the data we observed; we’ve simply 
found a better explanation for it. 
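
A short Python sketch makes the point concrete. The data sets beyond the 15 rolls in the text are hypothetical, chosen only to show that the result does not depend on how much data we collect; the function name is our own.

```python
def likelihood(p_success: float, hits: int, misses: int) -> float:
    """Probability of one particular sequence of hits and misses."""
    return p_success ** hits * (1 - p_success) ** misses


prior_odds = 1 / 1_000   # O(H2)': psychic friend versus loaded die

results = []
# (hits, misses) pairs; only (14, 1) comes from the text, the rest are illustrative.
for hits, misses in [(9, 1), (14, 1), (90, 10)]:
    psychic = likelihood(0.9, hits, misses)      # H2: friend is right 90% of the time
    loaded_die = likelihood(0.9, hits, misses)   # H3: die shows the call 90% of the time
    bf = psychic / loaded_die
    posterior = prior_odds * bf
    results.append((bf, posterior))
    print(f"{hits + misses:3d} rolls: BF = {bf}, posterior odds = {posterior}")
```

Since both hypotheses assign the same probability to every possible outcome, the ratio is 1 regardless of the data, and the posterior odds never move away from the prior.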

In this scenario, no amount of data will change our mind about believing $H_3$ over $H_2$ because both
explain what we’ve observed equally well, and we already think that $H_3$ is a far more likely
explanation than $H_2$. What’s interesting here is that we can find ourselves in this situation even if
our prior beliefs are entirely irrational. Maybe you’re a strong believer in psychic phenomena and
think that your friend is the most honest person on earth. In this case, you might make the prior
odds $O(H_2)' = 1,000$. If you believed this, no amount of data could convince you that your friend is
using a loaded die.

In cases like this, it’s important to realize that if you want to solve a problem, you need to be willing 
to change your prior beliefs. If you’re unwilling to let go of unjustifiable prior beliefs, then, at the 
very least, you must acknowledge that you’re no longer reasoning in a Bayesian—or logical—way 
at all. We all hold irrational beliefs, and that’s perfectly okay, so long as we don’t attempt to use 
Bayesian reasoning to justify them. 

ARGUING WITH RELATIVES AND CONSPIRACY THEORISTS 

Anyone who has argued with relatives over a holiday dinner about politics, climate change, or their 
favorite movies has experienced firsthand a situation in which they are comparing two hypotheses 
that both explain the data equally well (to the person arguing), and only the prior remains. How can 
we change someone else’s (or our own) beliefs even when more data doesn’t change anything? 
We’ve already seen that if you compare the belief that your friend has a loaded die and the belief 
that they are psychic, more data will do nothing to change your beliefs about your friend’s claim. 
This is because both your hypothesis and your friend’s hypothesis explain the data equally well. In 
order for your friend to convince you that they are psychic, they have to alter your prior beliefs. For 
example, since you’re suspicious that the die might be loaded, your friend could then offer to let you 
choose the die they roll. If you bought a new die and gave it to your friend, and they continued to 
accurately predict their rolls, you might start to be convinced. This same logic holds anytime you 
run into a problem where two hypotheses equally explain the data. In these cases, you must then 
see if there’s anything you can change in your prior. 

Suppose after you purchase the new die for your friend and they continue to succeed, you still don't
believe them; you now claim that they must have a secret way of rolling. In response, your friend
lets you roll the die for them, and they continue to successfully predict the rolls, yet you still don’t
believe them. In this scenario, something else is happening beyond just a hidden hypothesis. You
now have an $H_4$ (that your friend is completely cheating) and you won’t change your mind. This
means that for any $D_n$, $P(D_n \mid H_4) = 1$. Clearly we’re out of Bayesian territory since you’ve essentially
conceded that you won’t change your mind, but let’s see what happens mathematically if your
friend persists in trying to convince you.

Let’s look at how these two explanations, $H_2$ and $H_4$, compete using our data $D_{10}$ with 9 correct
predictions and 1 missed prediction. The Bayes factor for this is:

$$BF = \frac{P(D_{10} \mid H_4)}{P(D_{10} \mid H_2)} = \frac{1}{\left(\frac{9}{10}\right)^{9} \times \left(1 - \frac{9}{10}\right)^{1}} = 26$$

Because you refuse to believe anything other than that your friend is cheating, the probability of
what you observe is, and will always be, 1. Even though the data is exactly as we would expect in
the case of your friend being psychic, we find our beliefs explain the data 26 times as well. Your
friend, deeply determined to change your stubborn mind, persists and rolls 100 times, getting 90
guesses right and 10 wrong. Our Bayes factor shows something very strange:

$$BF = \frac{P(D_{100} \mid H_4)}{P(D_{100} \mid H_2)} = \frac{1}{\left(\frac{9}{10}\right)^{90} \times \left(1 - \frac{9}{10}\right)^{10}} \approx 1.3 \times 10^{14}$$

Even though the data seems to strongly support your friend’s hypothesis, because you refuse to
budge in your beliefs, you’re now even more wildly convinced that you’re right! When we don’t
allow our minds to be changed at all, more data only further convinces us we are correct.
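
We can watch this runaway effect numerically. The sketch below is illustrative: the function names are our own, the unfalsifiable cheating hypothesis is modeled as assigning probability 1 to any data whatsoever, and the psychic hypothesis uses the same sequence likelihood as before.

```python
def likelihood_psychic(hits: int, misses: int) -> float:
    """P(D | friend is right 90% of the time) for one specific sequence."""
    return 0.9 ** hits * (1 - 0.9) ** misses


def likelihood_cheating(hits: int, misses: int) -> float:
    """P(D | friend is cheating), held at 1 no matter what is observed."""
    return 1.0


bf_10 = likelihood_cheating(9, 1) / likelihood_psychic(9, 1)
bf_100 = likelihood_cheating(90, 10) / likelihood_psychic(90, 10)

print(f"BF after  10 rolls: {bf_10:.1f}")
print(f"BF after 100 rolls: {bf_100:.3e}")
```

Any hypothesis that spreads its probability over more than one possible outcome must assign less than 1 to the data actually seen, so it loses ground to the unfalsifiable hypothesis with every new observation.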

This pattern may seem familiar to anyone who has argued with a politically radical relative or 
someone who adamantly believes in a conspiracy theory. In Bayesian reasoning, it is vital that our 
beliefs are at least falsifiable. In traditional science, falsifiability means that something can be 
disproved, but in our case it just means there has to be some way to reduce our belief in a 
hypothesis. 

The danger of nonfalsifiable beliefs in Bayesian reasoning isn’t just that they can’t be proved 
wrong—it’s that they are strengthened even by evidence that seems to contradict them. Rather 
than persisting in trying to convince you, your friend should have first asked, "What can I show you 
that would change your mind?” If your reply had been that nothing could change your mind, then 
your friend would be better off not presenting you with more evidence. 

So, the next time you argue with a relative over politics or conspiracy theories, you should ask 
them: "What evidence would change your mind?" If they have no answer to this, you’re better off 
not trying to defend your views with more evidence, as it will only increase your relative’s certainty 
in their belief. 


WRAPPING UP 

In this chapter, you learned about a few ways hypothesis tests can go wrong. Although the Bayes 
factor is a competition between two ideas, it’s quite possible that there are other, equally valid, 
hypotheses worth testing out. 

Other times, we find that two hypotheses explain the data equally well; you’re just as likely to see 
your friend’s correct predictions if they were caused by your friend’s psychic ability or a trick in the 
die. When this is the case, only the prior odds ratio for each hypothesis matters. This also means 
that acquiring more data in those situations will never change our beliefs, because it will never give 
either hypothesis an edge over the other. In these cases, it’s best to consider how you can alter the 
prior beliefs that are affecting the results. 

In more extreme cases, we might have a hypothesis that simply refuses to be changed. This is like 
having a conspiracy theory about the data. When this is the case, not only will more data never 
convince us to change our beliefs, but it will actually have the opposite effect. If a hypothesis is not 
falsifiable, more data will only serve to make us more certain of the conspiracy. 



EXERCISES 

Try answering the following questions to see how well you understand how to deal with extreme 
cases in Bayesian reasoning. The solutions can be found at https://nostarch.com/learnbayes/.

1. When two hypotheses explain the data equally well, one way to change our minds is 
to see if we can attack the prior probability. What are some factors that might increase your 
prior belief in your friend’s psychic powers? 

2. An experiment claims that when people hear the word Florida, they think of the 
elderly and this has an impact on their walking speed. To test this, we have two groups of 15 
students walk across a room; one group hears the word Florida and one does not. 

Assume $H_1$ = the groups don’t move at different speeds, and $H_2$ = the Florida group is slower
because of hearing the word Florida. Also assume:

$$BF = \frac{P(D \mid H_2)}{P(D \mid H_1)}$$

The experiment shows that $H_2$ has a Bayes factor of 19. Suppose someone is unconvinced by
this experiment because $H_2$ had a lower prior odds. What prior odds would explain someone
being unconvinced and what would the BF need to be to bring the posterior odds to 50 for
this unconvinced person?

Now suppose the prior odds do not change the skeptic’s mind. Think of an alternate $H_3$ that
explains the observation that the Florida group is slower. Remember if $H_2$ and $H_3$ both
explain the data equally well, only prior odds in favor of $H_3$ would lead someone to claim $H_3$ is
true over $H_2$, so we need to rethink the experiment so that these odds are decreased. Come
up with an experiment that could change the prior odds in $H_3$ over $H_2$.
FROM HYPOTHESIS TESTING TO PARAMETER ESTIMATION



So far, we’ve used posterior odds to compare only two hypotheses. That’s fine for simple problems; 
even if we have three or four hypotheses, we can test them all by conducting multiple hypothesis 
tests, as we did in the previous chapter. But sometimes we want to search a really large space of 
possible hypotheses to explain our data. For example, you might want to guess how many jelly 
beans are in a jar, the height of a faraway building, or the exact number of minutes it will take for a 
flight to arrive. In all these cases, there are many, many possible hypotheses—too many to conduct 
hypothesis tests for all of them. 

Luckily, there’s a technique for handling this scenario. In Chapter 15, we learned how to turn a 
parameter estimation problem into a hypothesis test. In this chapter, we’re going to do the 
opposite: by looking at a virtually continuous range of possible hypotheses, we can use the Bayes 
factor and posterior odds (a hypothesis test) as a form of parameter estimation! This approach 
allows us to evaluate more than just two hypotheses and provides us with a simple framework for 
estimating any parameter. 


IS THE CARNIVAL GAME REALLY FAIR? 


Suppose you’re at a carnival. While walking through the games, you notice someone arguing with a 
carnival attendant near a pool of little plastic ducks. Curious, you get closer and hear the player 
yelling, "This game is rigged! You said there was a 1 in 2 chance of getting a prize and I've picked up 
20 ducks and only received one prize! It looks to me like the chance of getting a prize is only 1 in 
20!" 


Now that you have a strong understanding of probability, you decide to settle this argument 
yourself. You explain to the attendant and the angry customer that if you observe some more games 
that day, you'll be able to use the Bayes factor to determine who’s right. You decide to break up the 
results into two hypotheses: H₁, which represents the attendant’s claim that the probability of a 
prize is 1/2, and H₂, the angry customer’s claim that the probability of a prize is just 1/20: 

$$H_1: P(\text{prize}) = \frac{1}{2}$$

$$H_2: P(\text{prize}) = \frac{1}{20}$$



The attendant argues that because he didn’t watch the customer pick up ducks, you shouldn’t 
use the customer’s reported data, since no one else can verify it. This seems fair to you. You decide to 
watch the next 100 games and use that as your data instead. After the customer has picked up 100 
ducks, you observe that 24 of them came with prizes. 

Now, on to the Bayes factor! Since we don’t have a strong opinion about the claim from either the 
customer or the attendant, we won’t worry about the prior odds or calculating our full posterior 
odds yet. 

To get our Bayes factor, we need to compute P(D | H) for each hypothesis: 

$$P(D \mid H_1) = (0.5)^{24} \times (1 - 0.5)^{76}$$

$$P(D \mid H_2) = (0.05)^{24} \times (1 - 0.05)^{76}$$

Now, individually, both of these probabilities are quite small, but all we care about is the ratio. We’ll 
look at our ratio in terms of H₂/H₁ so that our result will tell us how many times better the 
customer’s hypothesis explains the data than the attendant’s: 

$$\frac{P(D \mid H_2)}{P(D \mid H_1)} = \frac{1}{653}$$

Our Bayes factor tells us that H₁, the attendant’s hypothesis, explains the data 653 times as well 
as H₂, which means that the attendant’s hypothesis (that the probability of getting a prize when 
picking up a duck is 0.5) is the more likely one. 
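
The ratio above can be checked directly in R. The following is a minimal sketch rather than the book’s own code; the names n_prizes, n_misses, likelihood, and bf_h2_over_h1 are introduced here purely for illustration. 

```r
# Observed data: 24 prizes and 76 misses out of 100 ducks.
n_prizes <- 24
n_misses <- 76

# Likelihood of the data for a given prize probability p.
# (Hypothetical helper; the binomial coefficient is omitted because
# it cancels when we take the ratio of two likelihoods.)
likelihood <- function(p) {
  p^n_prizes * (1 - p)^n_misses
}

# Bayes factor for H2 (p = 0.05) over H1 (p = 0.5)
bf_h2_over_h1 <- likelihood(0.05) / likelihood(0.5)

# The reciprocal says how many times better H1 explains the data
1 / bf_h2_over_h1   # roughly 653
```

Writing the likelihood as a function of p makes it easy to swap in other hypotheses later without retyping the exponents. 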

This should immediately seem strange. Getting only 24 prizes out of a total of 100 ducks seems 
really unlikely if the true probability of a prize is 0.5. We can use R’s pbinom() function 
(introduced earlier in the book) to evaluate the cumulative binomial distribution, which will 
tell us the probability of seeing 24 or fewer prizes, assuming that the probability of getting a prize is 
really 0.5: 


> pbinom(24,100,0.5) 

9.050013e-08 


As you can see, the probability of getting 24 or fewer prizes if the true probability of a prize is 0.5 is 
extremely low; expanding it out to the full decimal values, we get a probability of 
0.00000009050013! Something is definitely up with H₁. Even though we don’t believe the 
attendant’s hypothesis, it still explains the data much better than the customer’s. 


So what’s missing? In the past, we’ve often found that the prior probability matters a lot 
when the Bayes factor alone doesn’t give us an answer that makes sense. But as we saw in Chapter 
18, there are cases in which the prior isn’t the root cause of our problem. In this case, setting the 
prior odds to 1 seems reasonable, since we don’t have a strong opinion either way: 

$$\frac{P(H_2)}{P(H_1)} = 1$$

But maybe the problem here is that you have a preexisting mistrust in carnival games. Because the 
result of the Bayes factor favors the attendant’s hypothesis so strongly, we’d need our prior odds in 
favor of the customer’s hypothesis to be at least 653 to get posterior odds that favor the customer’s 
hypothesis: 

$$\frac{P(H_2)}{P(H_1)} = 653$$

That’s a really deep distrust of the fairness of the game! There must be some problem here other 
than the prior. 
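
To see how much prior distrust it takes to overturn this Bayes factor, we can multiply a few candidate prior odds by it. This is only an illustrative sketch: the vector of prior odds below is made up for the example, and the Bayes factor is rounded to 1/653 as in the text. 

```r
# Bayes factor for the customer's hypothesis (H2) over the attendant's (H1),
# rounded as in the text
bayes_factor <- 1 / 653

# A few hypothetical prior odds in favor of H2 (illustrative values only)
prior_odds <- c(1, 10, 100, 653, 1000)

# Posterior odds = prior odds * Bayes factor
posterior_odds <- prior_odds * bayes_factor

# Show which prior odds leave the posterior odds above 1 (favoring H2)
data.frame(prior_odds, posterior_odds, favors_customer = posterior_odds > 1)
```

Only prior odds beyond 653 push the posterior odds above 1, which is why the prior alone is an implausible explanation here. 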


Considering Multiple Hypotheses 


One obvious problem is that, while it seems intuitively clear that the attendant is wrong in his 
hypothesis, the customer’s alternative hypothesis is too extreme to be right either, so we have 
two wrong hypotheses. What if the customer thought the probability of winning was 0.2, rather 
than 0.05? We’ll call this hypothesis H₃. Testing H₃ against the attendant’s hypothesis radically 
changes the results of our likelihood ratio: 

$$BF = \frac{P(D \mid H_3)}{P(D \mid H_1)} = \frac{(0.2)^{24} \times (1 - 0.2)^{76}}{(0.5)^{24} \times (1 - 0.5)^{76}} = 917{,}399$$


Here we see that H₃ explains the data wildly better than H₁. With a Bayes factor of 917,399, we can 
be certain that H₁ is far from the best hypothesis for explaining the data we’ve observed, 
because H₃ blows it out of the water. The trouble we had in our first hypothesis test was that the 
customer’s belief was a far worse description of the event than the attendant’s belief. As we can see, 
though, that doesn’t mean the attendant was right. When we came up with an alternative 
hypothesis, we saw that it was a much better guess than either the attendant’s or the customer’s. 
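
The same comparison can be reproduced in a couple of lines of R. This is a small sketch under the assumptions in the text (24 prizes, 76 misses); the helper name likelihood is chosen here for illustration. 

```r
# Likelihood of 24 prizes and 76 misses for prize probability p
# (hypothetical helper name)
likelihood <- function(p) p^24 * (1 - p)^76

# H3 (p = 0.2) compared with the attendant's H1 (p = 0.5)
bf_h3_over_h1 <- likelihood(0.2) / likelihood(0.5)
bf_h3_over_h1   # roughly 917,399

# For contrast, the customer's original H2 (p = 0.05) against H1
bf_h2_over_h1 <- likelihood(0.05) / likelihood(0.5)
bf_h2_over_h1   # well below 1, so H1 beats H2
```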

Of course, we haven’t really solved our problem. What if there’s an even better hypothesis out 
there? 


Searching for More Hypotheses with R 

We want a more general solution that searches all of our possible hypotheses and picks out the best 
one. To do this, we can use R’s seq() function to create a sequence of hypotheses we want to 
compare to our H₁. 

We’ll consider every increment of 0.01 between 0 and 1 as a possible hypothesis. That means we’ll 
consider 0.01, 0.02, 0.03, and so on. We’ll call this increment of 0.01 dx (a common notation 
from calculus representing the "smallest change") and use it to define 
a hypotheses variable, which represents all of the possible hypotheses we want to consider. Here we 
use R’s seq() function to generate a range of values for each hypothesis between 0 and 1 by 
incrementing the values by our dx: 


dx <- 0.01 

hypotheses <- seq(0,1,by=dx) 


Next, we need a function that can calculate our likelihood ratio for any two hypotheses. 

Our bayes.factor() function will take two arguments: h_top, which is the probability of getting a prize 
for the hypothesis on the top (the numerator) and h_bottom, which is the hypothesis we’re competing 
against (the attendant’s hypothesis). We set this up like so: 





bayes.factor <- function(h_top,h_bottom){ 
  ((h_top)^24*(1-h_top)^76)/((h_bottom)^24*(1-h_bottom)^76) 
} 


Finally, we compute the likelihood ratio for all of these possible hypotheses: 


bfs <- bayes.factor(hypotheses,0.5) 


Then, we use R’s base plotting functionality to see what these likelihood ratios look like: 


plot(hypotheses,bfs, type='l') 


Figure 19-1 shows the resulting plot. 



Figure 19-1: Plotting the Bayes factor for each of our hypotheses 

Now we can see a clear distribution of different explanations for the data we’ve observed. Using R, 
we can look at a wide range of possible hypotheses, where each point in our line represents the 
Bayes factor for the corresponding hypothesis on the x-axis. 

We can also see how high the largest Bayes factor is by using the max() function with our vector 
of bfs: 


> max(bfs) 

1.478776e+06 


Then we can check which hypothesis corresponds to the highest likelihood ratio, telling us which 
hypothesis we should believe in the most. To do this, enter: 











> hypotheses[which.max(bfs)] 

0.24 


Now we know that a probability of 0.24 is our best guess, since this hypothesis produces the 
highest likelihood ratio when compared with the attendant’s. In Chapter 10, you learned that using 
the mean or expectation of our data is often a good way to come up with a parameter estimate. 

Here we’ve simply chosen the hypothesis that individually explains the data the best, because we 
don’t currently have a way to weigh our estimates by their probability of occurring. 
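
It is worth noticing that the winning hypothesis is simply the observed proportion of prizes, and that it does not depend on which hypothesis sits in the denominator, because dividing every likelihood by the same constant does not move the maximum. The sketch below checks this on the same grid; the names likelihood and best are introduced here for illustration. 

```r
dx <- 0.01
hypotheses <- seq(0, 1, by = dx)

# Likelihood of 24 prizes and 76 misses (hypothetical helper name)
likelihood <- function(p) p^24 * (1 - p)^76

# Best hypothesis on the grid, using the raw likelihood
# (no denominator hypothesis at all)
best <- hypotheses[which.max(likelihood(hypotheses))]
best

# Compare with the observed proportion of prizes
isTRUE(all.equal(best, 24 / 100))
```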

Adding Priors to Our Likelihood Ratios 

Now suppose you present your findings to the customer and the attendant. Both agree that your 
findings are pretty convincing, but then another person walks up to you and says, "I used to make 
games like these, and I can tell you that for some strange industry reason, the people who design 
these duck games never put the prize rate between 0.2 and 0.3. I’d bet you the odds are 1,000 to 1 
that the real prize rate is not in this range. Other than that, I have no clue." 

Now we have some prior odds that we’d like to use. Since the former game maker has given us 
some solid odds about his prior beliefs in the probability of getting a prize, we can try to multiply 
this by our current list of Bayes factors and compute the posterior odds. To do this, we create a list 
of prior odds ratios for every hypothesis we have. As the former game maker told us, the prior odds 
ratio for all probabilities between 0.2 and 0.3 should be 1/1,000. Since the maker has no opinion 
about other hypotheses, the odds ratio for these will just be 1. We can use a simple ifelse statement, 
using our vector of hypotheses, to create a vector of our odds ratios: 


priors <- ifelse(hypotheses >= 0.2 & hypotheses <= 0.3,1/1000,1) 


Then we can once again use plot() to display this distribution of priors: 


plot(hypotheses, priors, type='l') 

Figure 19-2 shows our distribution of prior odds. 

Because R is a vector-based language (for more information on this, see Appendix A), we can simply 
multiply our priors by our bfs and get a new vector of posteriors representing our posterior odds: 


posteriors <- priors*bfs 


Finally, we can plot a chart of the posterior odds of each of our many hypotheses: 


plot(hypotheses, posteriors, type='l') 


Figure 19-3 shows the plot. 
Figure 19-2: Visualizing our prior odds ratios 

Figure 19-3: Plotting our distribution of Bayes factors 

As we can see, we get a very strange distribution of possible beliefs. We have reasonable confidence 
in the values between 0.15 and 0.2 and between 0.3 and 0.35, but find the range between 0.2 and 
0.3 to be extremely unlikely. But this distribution is an honest representation of the strength of 
belief in each hypothesis, given what we’ve learned about the duck game manufacturing process. 
While this visualization is helpful, we really want to be able to treat this data like a true probability 
distribution. That way, we can ask questions about how much we believe in ranges of possible 
hypotheses and calculate the expectation of our distribution to get a single estimate for what we 
believe the hypothesis to be. 

BUILDING A PROBABILITY DISTRIBUTION 

A true probability distribution is one where the sum of all possible beliefs equals 1. Having a 
probability distribution would allow us to calculate the expectation (or mean) of our data to make a 
better estimate about the true rate of getting a prize. It would also allow us to easily sum ranges of 
values so we could come up with confidence intervals and other similar estimates. 

The problem is that if we add up all the posterior odds for our hypotheses, they don’t equal 1, as 
shown in this calculation: 


> sum(posteriors) 

3.1406875e+06 


This means we need to normalize our posterior odds so that they do sum to 1. To do so, we simply 
divide each value in our posteriors vector by the sum of all the values: 


p.posteriors <- posteriors/sum(posteriors) 


Now we can see that our p.posteriors values add up to 1: 


> sum(p.posteriors) 

1 


Finally, let’s plot our new p.posteriors: 


plot(hypotheses,p.posteriors, type='l') 


Figure 19-4 shows the plot. 
Figure 19-4: Our normalized posterior odds (note the scale on the y-axis) 

We can also use our p.posteriors to answer some common questions we might have about our data. 
For example, we can now calculate the probability that the true rate of getting a prize is less than 
what the attendant claims. We just add up all the probabilities for values less than 0.5: 


> sum(p.posteriors[which(hypotheses < 0.5)]) 
0.9999995 


As we can see, the probability that the prize rate is lower than the attendant’s hypothesis is nearly 
1. That is, we can be almost certain that the attendant is overstating the true prize rate. 
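
Summing over other ranges works the same way. The sketch below rebuilds the normalized posterior from scratch so it can be run on its own, then computes the probability of an arbitrary range and an approximate central 95 percent interval from the cumulative sum. The range 0.15 to 0.35 and the names in_range, cdf, lower, and upper are choices made for this example, not part of the original analysis. 

```r
dx <- 0.01
hypotheses <- seq(0, 1, by = dx)

bayes.factor <- function(h_top, h_bottom) {
  (h_top^24 * (1 - h_top)^76) / (h_bottom^24 * (1 - h_bottom)^76)
}

bfs <- bayes.factor(hypotheses, 0.5)
priors <- ifelse(hypotheses >= 0.2 & hypotheses <= 0.3, 1/1000, 1)
posteriors <- priors * bfs
p.posteriors <- posteriors / sum(posteriors)

# Probability that the prize rate lies in an example range
in_range <- hypotheses >= 0.15 & hypotheses <= 0.35
prob_in_range <- sum(p.posteriors[in_range])
prob_in_range

# Approximate central 95% interval: the first grid points at which the
# cumulative probability reaches 2.5% and 97.5%
cdf <- cumsum(p.posteriors)
lower <- hypotheses[which(cdf >= 0.025)[1]]
upper <- hypotheses[which(cdf >= 0.975)[1]]
c(lower, upper)
```

Because the grid is coarse (steps of 0.01), the interval endpoints are only accurate to the nearest grid point. 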

We can also calculate the expectation of our distribution and use this result as our estimate for the 
true probability. Recall that the expectation is just the sum of the estimates, each weighted by its probability: 


> sum(p.posteriors*hypotheses) 
0.2402704 


Of course, we can see our distribution is a bit atypical, with a big gap in the middle, so we might 
want to simply choose the most likely estimate, as follows: 


> hypotheses[which.max(p.posteriors)] 
0.19 


Now we’ve used the Bayes factor to come up with a range of probabilistic estimates for the true 
possible rate of winning a prize in the duck game. This means that we’ve used the Bayes factor as a 
form of parameter estimation! 










FROM THE BAYES FACTOR TO PARAMETER ESTIMATION 

Let’s take a moment to look at our likelihood ratios alone again. When we weren’t using a prior 
probability for any of the hypotheses, you might have felt that we already had a perfectly good 
approach to solving this problem without needing the Bayes factor. We observed 24 ducks with 
prizes and 76 ducks without prizes. Couldn't we just use our good old beta distribution to solve this 
problem? As we’ve discussed many times since Chapter 5, if we want to estimate the rate of some 
event, we can always use the beta distribution. Figure 19-5 shows a plot of a beta distribution with 
an alpha of 24 and a beta of 76. 

[Figure 19-5; plot title: Beta(24,76) for our hypotheses] 



Except for the scale of the y-axis, the plot looks nearly identical to the original plot of our likelihood 
ratios! In fact, if we do a few simple tricks, we can get these two plots to line up perfectly. If we scale 
our beta distribution by the size of our dx and normalize our bfs, we can see that these two 
distributions get quite close (Figure 19-6). 

[Figure 19-6; plot title: Beta(24,76) scaled compared to our likelihood ratios normalized] 



There seems to be only a slight difference now. We can fix it by using the weakest prior that 
indicates that getting a prize and not getting a prize are equally likely—that is, by adding 1 to both 
the alpha and beta parameters, as shown in Figure 19-7 . 


























[Figure 19-7; plot title: Beta(24+1,76+1) scaled compared to our likelihood ratios normalized] 



Now we can see that the two distributions are perfectly aligned. Chapter 5 mentioned that the beta 
distribution was difficult to derive from our basic rules of probability. However, by using the Bayes 
factor, we’ve been able to empirically re-create a modified version of it that assumes a prior of 
Beta(1,1). And we did it without any fancy mathematics! All we had to do was: 

1. Define the probability of the evidence given a hypothesis. 

2. Consider all possible hypotheses. 

3. Normalize these values to create a probability distribution. 
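
The three steps above can be sketched in R and compared numerically against the beta distribution. This is an illustrative sketch, not the code used to produce the book’s figures; the names likelihood, raw, normalized, and beta_scaled are assumptions made here. It uses the raw likelihood rather than bfs because dividing by the attendant’s likelihood is a constant that disappears when we normalize. 

```r
dx <- 0.01
hypotheses <- seq(0, 1, by = dx)

# Step 1: probability of the evidence given a hypothesis
# (up to a constant factor that cancels in step 3)
likelihood <- function(p) p^24 * (1 - p)^76

# Step 2: evaluate it for every hypothesis on the grid
raw <- likelihood(hypotheses)

# Step 3: normalize so the values sum to 1
normalized <- raw / sum(raw)

# Beta(24 + 1, 76 + 1) density, scaled by the grid spacing dx
beta_scaled <- dbeta(hypotheses, 25, 77) * dx

# Largest pointwise difference between the two curves
max(abs(normalized - beta_scaled))
```

The difference should be negligible, which is the numerical counterpart of the two curves lining up in the plot. 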

Every time we’ve used the beta distribution in this book, we've used a beta-distributed prior. This 
made the math easier, since we can arrive at the posterior by combining the alpha and beta 
parameters from the likelihood and prior beta distributions. In other words: 

$$\mathrm{Beta}(\alpha_{\text{posterior}}, \beta_{\text{posterior}}) = \mathrm{Beta}(\alpha_{\text{prior}} + \alpha_{\text{likelihood}}, \beta_{\text{prior}} + \beta_{\text{likelihood}})$$
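
As a quick worked instance of this rule, here is the update for the duck game with the weak Beta(1,1) prior mentioned earlier. The variable names are chosen for this sketch only. 

```r
# Weak prior: Beta(1, 1)
alpha_prior <- 1
beta_prior  <- 1

# Data: 24 prizes, 76 misses
alpha_likelihood <- 24
beta_likelihood  <- 76

# Conjugate update: add the corresponding parameters
alpha_posterior <- alpha_prior + alpha_likelihood   # 25
beta_posterior  <- beta_prior + beta_likelihood     # 77

# Mean of the resulting Beta(25, 77) distribution
posterior_mean <- alpha_posterior / (alpha_posterior + beta_posterior)
posterior_mean
```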

However, by building our distribution from the Bayes factor, we were able to easily use a unique 
prior distribution. Not only is the Bayes factor a great tool for setting up hypothesis tests, but, as it 
turns out, it’s also all we need to create any probability distribution we might want to use to solve 
our problem, whether that’s hypothesis testing or parameter estimation. We just need to be able to 
define the basic comparison between two hypotheses, and we’re on our way. 

When we built our A/B test in Chapter 15, we figured out how to reduce many hypothesis tests to a 
parameter estimation problem. Now you’ve seen how the most common form of hypothesis testing 
can also be used to perform parameter estimation. Given these two related insights, there is 
virtually no limit to the type of probability problems we can solve using only the most basic rules of 
probability. 

WRAPPING UP 

Now that you’ve finished your journey into Bayesian statistics, you can appreciate the true beauty 
of what you've been learning. From the basic rules of probability, we can derive Bayes’ theorem, 
which lets us convert evidence into a statement expressing the strength of our beliefs. From Bayes’ 
theorem, we can derive the Bayes factor, a tool for comparing how well two hypotheses explain the 
data we’ve observed. By iterating through possible hypotheses and normalizing the results, we can 
use the Bayes factor to create a parameter estimate for an unknown value. This, in turn, allows us to 
perform countless other hypothesis tests by comparing our estimates. And all we need to do to 
unlock all this power is use the basic rules of probability to define our likelihood, P(D | H)! 

EXERCISES 

Try answering the following questions to see how well you understand using the Bayes factor and 
posterior odds to do parameter estimation. The solutions can be found 
at https://nostarch.com/learnbayes/ . 

1. Our Bayes factor assumed that we were looking at H₁: P(prize) = 0.5. This allowed us 
to derive a version of the beta distribution with an alpha of 1 and a beta of 1. Would it matter 
if we chose a different probability for H₁? Assume H₁: P(prize) = 0.24, then see if the resulting 
distribution, once normalized to sum to 1, is any different than the original hypothesis. 

2. Write a prior for the distribution in which each hypothesis is 1.05 times more likely 
than the previous hypothesis (assume our dx remains the same). 

3. Suppose you observed another duck game that included 34 ducks with prizes and 66 
ducks without prizes. How would you set up a test to answer "What is the probability that 
you have a better chance of winning a prize in this game than in the game we used in our 
example?” Implementing this requires a bit more sophistication than the R used in this book, 
but see if you can learn this on your own to kick off your adventures in more advanced 
Bayesian statistics! 
A QUICK INTRODUCTION TO R 



In this book, we use the R programming language to do some tricky mathematical work for us. R is 
a programming language that specializes in statistics and data science. If you don’t have experience 
with R, or with programming in general, don’t worry—this appendix will get you started. 

R AND RSTUDIO 

To run the code examples in this book, you'll need to have R installed on your computer. To install 
R, visit https://cran.rstudio.com/ and follow the installation steps for the operating system you’re 
using. 

Once you’ve installed R, you should also install RStudio, an integrated development environment 
(IDE) that makes it extremely easy to run R projects. Download and install RStudio 
from www.rstudio.com/products/rstudio/download/ . 

When you open RStudio, you should be greeted with several panels (Figure A-1). 






Figure A-1: Viewing the console in RStudio 

The most important panel is the big one in the middle, called the console. In the console, you can 
enter any of the code examples from the book and run them simply by pressing ENTER. The console 
runs all the code you enter immediately, which makes it hard to keep track of the code you’ve 
written so far. 

To write programs that you can save and come back to, you can place your code in an R script, 
which is a text file that you can load into the console later. R is an extremely interactive 
programming language, so rather than thinking of the console as a place you can test out code, think 
of R scripts as a way to quickly load tools you can use in the console. 

CREATING AN R SCRIPT 

To create an R script, go to File ► New File ► R Script in RStudio. This should create a new blank 
panel in the top left (Figure A-2). 






Figure A-2: Creating an R script 

In this panel, you can enter code and save it as a file. To run the code, simply click the Source button 
at the top right of the panel, or run individual lines by clicking the Run button. The Source button 
will automatically load your file into the console as though you had typed it there yourself. 

BASIC CONCEPTS IN R 

We'll be using R as an advanced calculator in this book, which means you’ll only need to understand 
a few basics to work through the problems and extend the examples in the book on your own. 

Data Types 

All programming languages have different types of data, which you can use for different purposes 
and manipulate in different ways. R has a rich variety of types and data structures, but we’ll only be 
using a very small number of them in this book. 

Doubles 

The numbers we use in R will all be of the type double (short for "double-precision floating-point," 
which is the most common way to represent decimal numbers on a computer). The double is the 
default type for representing decimal numbers. Unless otherwise specified, all numbers you enter 
into the console are of the double type. 



We can manipulate numbers in the double type using standard mathematical operations. For 
example, we can add two numbers with the + operator. Try this out in the console: 


> 5 + 2 
[1] 7 


We can also divide any numbers that give us decimal results using the / operator: 


> 5/2 
[1] 2.5 


We can multiply values with the * operator like so: 


> 5*2 
[1] 10 


and raise a value to a power using the ^ operator. For example, 5² is: 


> 5^2 
[1] 25 


We can also add - in front of a number to make it negative: 


> 5 - -2 
[1] 7 


And we can also use scientific notation with e+. So 5 × 10² is just: 


> 5e+2 
[1] 500 


If we use e-, we get the same result as 5 × 10⁻²: 


> 5e-2 

[1] 0.05 


This is useful to know because sometimes R will return the result in scientific notation if it is too 
large to easily fit on the screen, like so: 


> 5*10^20 

[1] 5e+20 


Strings 

Another important type in R is the string, which is just a group of characters used to represent text. 
In R, we surround a string with quotation marks, like this: 



















> "hello" 
[1] "hello" 


Note that if you put a number inside a string, you can’t use that number in regular numeric 
operations because strings and numbers are different types. For example: 


> "2" + 2 

Error in "2" + 2 : non-numeric argument to binary operator 


We won’t be making much use of strings in this book. We'll primarily use them to pass arguments 
to functions and to give labels to plots. But it’s important to remember them if you’re using text. 

Logicals 

Logical or binary types are true or false values represented by the codes TRUE and FALSE. Note 
that TRUE and FALSE aren’t strings: they’re not surrounded by quotes, and they’re written in all 
uppercase. (R also allows you to simply use T or F instead of writing out the full words.) 

We can combine logical types with the symbols & ("and") and | ("or") to perform basic logical 
operations. For example, if we wanted to know whether it’s possible for something to be both 
TRUE and FALSE at the same time, we might enter: 


> TRUE & FALSE 


R would return: 


[1] FALSE 


telling us that a value can’t be both TRUE and FALSE. 
But what about TRUE or FALSE? 


> TRUE | FALSE 

[1] TRUE 


Like strings, in this book logical values will primarily be used to provide arguments to functions 
we’ll be using, or as the results of comparing two different values. 
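
As a brief illustration of logical values arising from comparisons, consider the following sketch. The variable names and numbers are arbitrary examples. 

```r
x <- 5

# Comparing two values produces a logical
is_big   <- x > 3              # TRUE
is_two   <- x == 2             # FALSE

# Logicals from comparisons can be combined with & and |
in_range <- (x > 3) & (x < 10) # TRUE

# Comparisons also work element by element on a vector
below_six <- c(1, 5, 10) < 6   # TRUE TRUE FALSE
below_six
```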

Missing Values 

In practical statistics and data science, data is often missing some values. For example, say you have 
temperature data for the morning and afternoon of every day for a month, but something 
malfunctioned one day and you’re missing a morning temperature. Because missing values are so 
common, R has a special way of representing them: using the value NA. It’s important to have a way 
to handle missing values because they can mean very different things in different contexts. For 
example, when you’re measuring rainfall a missing value might mean there was no rain in the 
gauge, or it might mean that there was plenty of rain but temperatures were freezing that night, 
cracking the gauge and causing all the water to leak out. In the first case, we might consider missing 
values to mean 0, but in the latter case it’s not clear what the value should be. Keeping missing 
values separate from other values forces us to consider these differences. 












To prompt us to make sense of what our missing values are whenever we try to use one, R will 
output NA for any operation using a missing value: 


> NA + 2 

[1] NA 


As we’ll see in a bit, various functions in R can handle missing values in different ways, but you 
shouldn’t have to worry about missing values for the R you’ll use in this book. 
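
For readers who do run into missing values, here is a small sketch using the temperature scenario from above. The numbers are invented for the example; is.na() and the na.rm argument of mean() are standard parts of base R. 

```r
# Three morning temperatures, with the second one missing
# (made-up values for illustration)
temps <- c(61, NA, 64)

# is.na() flags which entries are missing
missing <- is.na(temps)               # FALSE TRUE FALSE

# By default, a summary of data containing NA is itself NA
mean(temps)                           # NA

# Setting na.rm = TRUE drops the missing values first
avg <- mean(temps, na.rm = TRUE)      # 62.5
avg
```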

Vectors 

Nearly every programming language contains certain features that make it unique and especially 
suited to solving problems in its domain. R’s special feature is that it is a vector language. A vector is 
a list of values, and everything R does is an operation on a vector. We use the code c(...) to define 
vectors (but even if we put in just a single value, R does this for us!). 

To understand how vectors work, let’s consider an example. Enter the next example in a script, 
rather than the console. We first create a new vector by assigning the vector c(1,2,3) to the 
variable x using the assignment operator <- like so: 


x <- c(1,2,3) 


Now that we have a vector, we can use it in our calculations. When we perform a simple operation 
in the console, like adding 3 to x, we get a rather unexpected result (especially if 
you’re used to another programming language): 


> x + 3 
[1] 4 5 6 


The result of x + 3 tells us what happens if we add 3 to each value in our x vector. (In many other 
programming languages, we’d need to use a for loop or some other iterator to perform this 
operation.) 

We can also add vectors to each other. Here, we’ll create a new vector containing three elements, 
each with a value of 2. We’ll name this vector y, then add y to x: 


> y <- c(2,2,2) 

> x + y 
[1] 3 4 5 


As you can see, this operation added each element in x to its corresponding element in y. 
What if we multiply our two vectors? 


> x * y 
[1] 2 4 6 


Each value in x was multiplied by its corresponding value in y. If the vectors weren’t the same size, 
and the longer one’s length weren’t a multiple of the shorter one’s, R would issue a warning. If the 
longer vector’s length is a multiple of the shorter one’s, R will just 
repeatedly apply the smaller vector to the larger one. However, we won’t be making use of this 
feature in this book. 
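
Although the book doesn’t rely on it, a short sketch makes this recycling behavior concrete. The vectors below are arbitrary examples. 

```r
long  <- c(1, 2, 3, 4)
short <- c(10, 100)

# The length of long is a multiple of the length of short, so short is
# reused as 10, 100, 10, 100
recycled <- long * short
recycled      # 10 200 30 400

# When the lengths don't divide evenly, R still recycles but warns that
# the longer length is not a multiple of the shorter one. The warning is
# suppressed here only to keep the output clean.
mismatched <- suppressWarnings(c(1, 2, 3) * short)
mismatched    # 10 200 30
```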

We can quite easily combine vectors in R by defining another vector based on the existing ones. 
Here, we’ll create the vector z by combining x and y: 


> z <- c(x,y) 

> z 

[1] 1 2 3 2 2 2 


Notice that this operation didn’t give us a vector of vectors; instead, we got a single vector that 
contains the values from both, in the order you set x and y when you defined z. 

Learning to use vectors efficiently in R can be a bit tricky for beginners. Ironically, programmers 
who are experienced in a non-vector-based language often have the most difficulty. Don’t worry, 
though: in this book, we’ll use vectors to make reading code easier. 

FUNCTIONS 

Functions are blocks of code that perform a particular operation on a value, and we’ll use them in R 
to solve problems. 

In R and RStudio, all functions come equipped with documentation. If you enter ? followed by a 
function name into the R console, you’ll get the full documentation for that function. For example, if 
you enter ?sum into the RStudio console, you should see the documentation shown in Figure A-3 in 
the bottom-right screen. 






Figure A-3: Viewing the documentation for the sum() function 

This documentation gives us the definition of the sum() function and some of its uses. 

The sum() function takes a vector’s values and adds them all together. The documentation says it 
takes ... as an argument, which means it can accept any number of values. Usually these values will 
be a vector of numbers, but they can consist of multiple vectors, too. 

The documentation also lists an optional argument: na.rm = FALSE. Optional arguments are arguments 
that you don’t have to pass in to the function for it to work; if you don’t pass an optional argument 
in, R will use the argument’s default value. In the case of na.rm, which controls whether 
missing values are removed, the default value, after the equal sign, is FALSE. That means that, by 
default, sum() won’t remove missing values. 

Basic Functions 

Here are some of R’s most important functions. 




The length() and nchar() Functions 

The length() function will return the length of a vector: 


> length(c(1,2,3)) 
[1] 3 

Since there are three elements in this vector, the length() function returns 3. 

Because everything in R is a vector, you can use the length() function to find the length of anything, 
even a string, like "doggies": 


> length("doggies") 
[1] 1 

R tells us that "doggies" is a vector containing one string. 
Now, if we had two strings, "doggies" and "cats", we’d get: 


> length(c("doggies","cats")) 
[1] 2 


To find the number of characters in a string, we use the nchar() function: 


> nchar("doggies") 
[1] 7 


Note that if we use nchar() on the c("doggies","cats") vector, R returns a new vector containing the 
number of characters in each string: 


> nchar(c("doggies","cats")) 
[1] 7 4 


The sum(), cumsum(), and diff() Functions 

The sum() function takes a vector of numbers and adds all those numbers together: 


> sum(c(1,1,1,1,1)) 
[1] 5 


As we saw in the documentation in the previous section, sum() takes ... as its argument, which means 
it can accept any number of values: 


> sum(2,3,1) 
[1] 6 

> sum(c(2,3),1) 
[1] 6 

> sum(c(2,3,1)) 
[1] 6 
















As you can see, no matter how many vectors we provide, sum() adds them up as though they were a 
single vector of integers. If you wanted to sum up multiple vectors, you’d call sum() on them each 
separately. 

Remember, also, that the sum() function takes the optional argument na.rm, which by default is set
to FALSE. The na.rm argument determines whether sum() removes NA values.

If we leave na.rm set to FALSE, here’s what happens if we try to use sum() on a vector with a missing
value:


> sum(c(1,NA,3))

[1] NA


As we saw when NA was introduced, adding a value to an NA value results in NA. If we’d like R to give
us a number as an answer instead, we can tell sum() to remove NA values by setting na.rm = TRUE:


> sum(c(1,NA,3),na.rm = TRUE)

[1] 4


The cumsum() function takes a vector and calculates its cumulative sum: a vector of the same length
as the input that replaces each number with the sum of the numbers that come before it (including
that number). Here’s an example in code to make this clearer:


> cumsum(c(1,1,1,1,1))

[1] 1 2 3 4 5 

> cumsum(c(2,10,20)) 

[1] 2 12 32 


The diff() function takes a vector and subtracts from each number the number that precedes it in the
vector:


> diff(c(1,2,3,4,5))

[1] 1 1 1 1

> diff(c(2,10,3))

[1]  8 -7


Notice that the result of the diff() function contains one fewer element than the original vector did. 
That’s because nothing gets subtracted from the first value in the vector. 
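One way to see how cumsum() and diff() relate is to apply one after the other. The short sketch below uses made-up values (the names changes, totals, and recovered are only illustrative) and shows that diff() undoes cumsum() for every element except the first.

```r
# Hypothetical step sizes; the values are made up for illustration.
changes <- c(2, 10, 20)

# cumsum() turns the steps into running totals.
totals <- cumsum(changes)
print(totals)

# diff() recovers every step except the first one,
# because the first total has nothing before it to subtract.
recovered <- diff(totals)
print(recovered)
```

The recovered vector is one element shorter than changes, which matches the note above about diff() returning one fewer element than its input.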

The : operator and the seq() Function

Often, rather than manually listing each element of a vector, we’d prefer to generate vectors
automatically. To automatically create a vector of whole numbers in a certain range, we can use
the : operator to separate the start and end of the range. R can even figure out if you want to count
up or down (the c() wrapping this operator is not strictly necessary):


> c(1:5)

[1] 1 2 3 4 5 












> c(5:1)

[1] 5 4 3 2 1


When you use :, R will count from the first value to the last.

Sometimes we’ll want to count by something other than increments of one. The seq() function allows 
us to create vectors of a sequence of values that increment by a specified amount. The arguments 
to seq() are, in order: 

1. The start of the sequence 

2. The end of the sequence 

3. The amount to increment the sequence by 

Here are some examples of using seq():


> seq(1,1.1,0.05)

[1] 1.00 1.05 1.10

> seq(0,15,5)

[1] 0 5 10 15

> seq(1,2,0.3)

[1] 1.0 1.3 1.6 1.9 


If we want to count down to a certain value using the seq() function, we use a negative value as our
increment, like this:


> seq(10,5,-1)

[1] 10 9 8 7 6 5 
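When the number of values matters more than the step size, seq() also accepts a named length.out argument, a standard base R argument that the examples above don't use. The sketch below asks for five evenly spaced values and lets R work out the increment; the variable name grid is only illustrative.

```r
# Ask for exactly five evenly spaced values between 0 and 1;
# R computes the increment (here 0.25) for us.
grid <- seq(0, 1, length.out = 5)
print(grid)

# The spacing between neighbouring values is constant.
print(diff(grid))
```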


The ifelse() Function

The ifelse() function tells R to take one of two actions based on some condition. This function can be 
a bit confusing if you’re used to the normal if... else control structure in other languages. In R, it takes 
the following three arguments (in order): 

1. A statement about a vector that may be either true or false of its values 

2. What happens in the case that the statement is true 

3. What happens in the case that the statement is false 

The ifelse() function operates on entire vectors at once. When it comes to vectors containing a single 
value, its use is pretty intuitive: 


> ifelse(2 < 3,"small","too big") 

[1] "small" 


Here the statement is that 2 is smaller than 3, and we ask R to output "small" if it is, and "too big" if it
isn’t.

Suppose we have a vector x that contains multiple values:


> x <- c(1,2,3)











The ifelse() function will return a value for each element in the vector:


> ifelse(x < 3,"small","too big") 

[1] "small" "small" "too big" 


We can also use vectors in the results arguments for ifelse(). Suppose that, in addition to
our x vector, we had another vector, y:


> y <- c(2,1,6)


We want to generate a new vector that contains the greatest value from x and y for each element in the
vector. We could use ifelse() to solve this very simply:


> ifelse(x > y,x,y)

[1] 2 2 6


You can see R has compared the values in x to the respective value in y and outputs the largest of the 
two for each element. 
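For this particular task, base R also provides pmax(), which returns the element-wise maximum of its arguments. The sketch below is only a cross-check that the ifelse() approach and pmax() agree on the same x and y; the names larger_ifelse and larger_pmax are illustrative.

```r
x <- c(1, 2, 3)
y <- c(2, 1, 6)

# Element-wise maximum built by hand with ifelse()...
larger_ifelse <- ifelse(x > y, x, y)

# ...and the same result from base R's pmax().
larger_pmax <- pmax(x, y)

print(larger_ifelse)
print(larger_pmax)
```

The ifelse() version is more general, since the two result arguments can be anything, while pmax() is simply shorter when all you need is the larger value.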

RANDOM SAMPLING 

We’ll often use R to randomly sample values. This allows us to have the computer pick a random 
number or value for us. We use this sample to simulate activities like flipping a coin, playing "rock, 
paper, scissors," or picking a number between 1 and 100. 

The runif() Function

One way to randomly sample values is with the function runif(), short for "random uniform," which
takes a required argument n and gives that many samples in the range 0 to 1:


> runif(5) 

[1] 0.8688236 0.1078877 0.6814762 0.9152730 0.8702736 


We can use this function with ifelse() to generate the value "A" 20 percent of the time. In this case we’ll
use runif(5) to create five random values between 0 and 1. Then if the value is less than 0.2, we’ll
return "A"; otherwise, we’ll return "B":


> ifelse(runif(5) < 0.2,"A","B") 

[1] "B" "B" "B" "B" "A" 


Since the numbers we’re generating are random, we’ll get a different result each time we run 
the ifelse() function. Here are some possible outcomes:


> ifelse(runif(5) < 0.2,"A","B") 
[1] "B" "B" "B" "B" "B" 

> ifelse(runif(5) < 0.2,"A","B") 

[1] "A" "A" "B" "B" "B" 














The runif() function can take optional second and third arguments, which are the minimum and
maximum values of the range to be uniformly sampled from. By default, the function uses the range
between 0 and 1, but you can set the range to be whatever you’d like:


> runif(5,0,2) 

[1] 1.4875132 0.9368703 0.4759267 1.8924910 1.6925406 
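With only five draws it is hard to see that "A" really comes up about 20 percent of the time. The sketch below repeats the same ifelse(runif(n) < 0.2, ...) idea with many more draws and measures the proportion of "A" values. The sample size, the seed, and the variable names are arbitrary choices for illustration; set.seed() is used only so the run is repeatable.

```r
set.seed(42)   # arbitrary seed so the result is repeatable
n <- 10000     # arbitrary, but large enough to see the pattern

# Label each uniform draw "A" if it falls below 0.2, otherwise "B".
draws <- ifelse(runif(n) < 0.2, "A", "B")

# mean() of a TRUE/FALSE vector gives the proportion of TRUE values.
prop_a <- mean(draws == "A")
print(prop_a)
```

The printed proportion should land close to 0.2, and it gets closer on average as n grows.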


The rnorm() Function

We can also sample from a normal distribution using the rnorm() function, which we’ll discuss in
more depth in the book (the normal distribution is covered in Chapter 12):


> rnorm(3) 

[1] 0.28352476 0.03482336 -0.20195303 


By default, rnorm() samples from a normal distribution with a mean of 0 and standard deviation of 1,
as is the case in this example. For readers unfamiliar with the normal distribution, this means that
samples will have a "bell-shaped" distribution around 0, with most samples being close to 0 and
very few being less than -3 or greater than 3.

The rnorm() function has two optional arguments, mean and sd, which allow you to set a different
mean and standard deviation, respectively:


> rnorm(4,mean=2,sd=10)

[1] -12.801407 -9.648737 1.707625 -8.232063


In statistics, sampling from a normal distribution is often more common than sampling from a
uniform distribution, so rnorm() comes in quite handy.
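The earlier claim that default rnorm() samples cluster near 0 and rarely fall below -3 or above 3 can be checked by simulation. This is a rough sketch with an arbitrary seed and sample size; the variable names are illustrative only.

```r
set.seed(1)                 # arbitrary seed for repeatability
samples <- rnorm(100000)    # mean 0, standard deviation 1 by default

# Proportion of samples within one standard deviation of 0.
prop_within_1 <- mean(abs(samples) < 1)

# Proportion of samples more than three standard deviations from 0.
prop_beyond_3 <- mean(abs(samples) > 3)

print(prop_within_1)
print(prop_beyond_3)
```

Roughly two-thirds of the samples should fall within 1 of the mean, and only a fraction of a percent should fall beyond 3.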

The sample() Function

Sometimes, we want to sample from something other than just a well-studied distribution. Suppose 
you have a drawer containing socks of many colors: 


socks <- c("red","grey","white","red","black") 


If you wanted to simulate the act of randomly picking any two socks, you could use
R’s sample() function, which takes as arguments a vector of values and the number of elements to
sample:


> sample(socks,2)

[1] "grey" "red"


The sample() function behaves as though we’ve picked two random socks out of the drawer
without putting any back. If we sample five socks, we’ll get all of the socks we originally had in the
drawer: 













> sample(socks,5) 

[1] "grey" "red" "red" "black" "white" 


That means that if we try to take six socks from the drawer where there are only five available 
socks, we’ll get an error: 


> sample(socks,6) 

Error in sample.int(length(x), size, replace, prob): 
cannot take a sample larger than the population when 'replace = FALSE' 


If we want to both sample and "put the socks back," we can set the optional argument replace to TRUE.
Now, each time we sample a sock, we put it back in the drawer. This allows us to sample more socks 
than are in the drawer. It also means the distribution of socks in the drawer never changes. 


> sample(socks,6,replace=TRUE) 

[1] "black" "red" "black" "red" "black" "black" 


With these simple sampling tools, you can run surprisingly sophisticated simulations in R that save 
you from doing a lot of math. 
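As a small illustration of simulation replacing math, the sketch below estimates the chance of drawing two red socks from the drawer. It relies on replicate(), a base R function not covered in this appendix, to repeat the draw many times; the seed, the number of trials, and the variable names are arbitrary choices. For comparison, the exact answer by hand is 2/5 × 1/4 = 0.1.

```r
set.seed(2024)   # arbitrary seed for repeatability
socks <- c("red", "grey", "white", "red", "black")
n_trials <- 10000

# Repeat the two-sock draw many times and record whether both were red.
both_red <- replicate(n_trials, all(sample(socks, 2) == "red"))

# The proportion of TRUE values estimates the probability.
estimate <- mean(both_red)
print(estimate)
```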

Using set.seed() for Predictable Random Results

The "random numbers" generated by R aren’t truly random numbers. As in all programming
languages, random numbers are generated by a pseudorandom number generator, which takes
a seed value and uses that to create a sequence of numbers that are random enough for most
purposes. The seed value sets the initial state of the random number generator and determines
which numbers will come next in the sequence. In R, we can manually set this seed using
the set.seed() function. Setting the seed is extremely useful for cases when we want to use the same
random results again:


> set.seed(1337)

> ifelse(runif(5) < 0.2,"A","B")

[1] "B" "B" "A" "B" "B" 

> set.seed(1337) 

> ifelse(runif(5) < 0.2,"A","B") 

[1] "B" "B" "A" "B" "B" 


As you can see, when we used the same seed twice with the runif() function, it generated the same 
set of supposedly random values. The main benefit of using set.seed() is making the results 
reproducible. This can make tracking down bugs in programs that involve sampling much easier, 
since the results don’t change each time the program is run. 

DEFINING YOUR OWN FUNCTIONS 

Sometimes it’s helpful to write our own functions for specific operations we’ll have to perform 
repeatedly. In R, we can define functions using the keyword function (a keyword in a programming 
language is simply a special word reserved by the programming language for a specific use). 










Here’s the definition of a function that takes a single argument, val (which here stands for the value
the user will input to the function), and then doubles val and cubes the result.


double_then_cube <- function(val){
  (val*2)^3
}


Once we’ve defined our function, we can use it, just like R’s built-in functions. Here’s 
our double_then_cube() function applied to the number 8: 


> double_then_cube(8) 

[1] 4096 


Also, because everything we did to define our function is vectorized (that is, all operations work on
vectors of values), our function will work on vectors as well as single values:


> double_then_cube(c(1,2,3))

[1] 8 64 216 


We can define functions that take more than one argument as well. The sum_then_square() function,
defined here, adds two arguments together, then squares the result:


sum_then_square <- function(x,y){
  (x+y)^2
}


By including the two arguments (x,y) in the function definition, we’re telling R that
the sum_then_square() function expects two arguments. Now we can use our new function, like this:


> sum_then_square(2,3) 

[1] 25 

> sum_then_square(c(1,2),c(5,3))

[1] 36 25 


We can also define functions that require multiple lines. In R, when a function is called it will always 
return the result of the calculation on the final line of the function definition. That means we could 
have rewritten sum_then_square() like this:


sum_then_square <- function(x,y){
  sum_of_args <- x+y
  square_of_result <- sum_of_args^2
  square_of_result
}


Typically, when you write functions, you’ll want to write them in an R script file so you can save 
them and reuse them later. 
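Our own functions can have optional arguments too, just like na.rm in sum(). The sketch below is a hypothetical variation on sum_then_square() (the name sum_then_power and its power argument are invented for illustration) that gives one argument a default value.

```r
# power is optional: if the caller leaves it out, R uses the default of 2.
sum_then_power <- function(x, y, power = 2) {
  (x + y)^power
}

# Uses the default, so this behaves like sum_then_square().
squared <- sum_then_power(2, 3)
print(squared)

# Overrides the default by naming the optional argument.
cubed <- sum_then_power(2, 3, power = 3)
print(cubed)
```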















CREATING BASIC PLOTS 

In R, we can quickly generate plots of data very easily. Though R has an extraordinary plotting 
library called ggplot2, which contains many useful functions for generating beautiful plots, we’ll 
restrict ourselves to R’s base plotting functions for now, which are plenty useful on their own. 

To show how plotting works, we’ll create two vectors of values, our xs and our ys: 

> xs <- c(1,2,3,4,5)

> ys <- c(2,3,2,4,6) 

Next, we can use these vectors as arguments to the plot() function, which will plot our data for us. 
The plot() function takes two arguments: the values of the plot’s points on the x-axis and the values
of those points on the y-axis, in that order: 

> plot(xs,ys) 

This function should generate the plot shown in Figure A-4 in the bottom-left window of RStudio. 
Figure A-4: A simple plot created with R’s plot() function

This plot shows the relationship between our xs values and their corresponding ys values. If we 
return to the function, we can give this plot a title using the optional main argument. We can also 
change the x- and y-axis labels with the xlab and ylab arguments, like this: 

plot(xs,ys,
  main="example plot",
  xlab="x values",
  ylab="y values"
)


The new labels should show up as they appear in Figure A-5.

Figure A-5: Changing the plot title and labels with the plot() function

We can also change the plot’s type using the type argument. The first kind of plot we generated is 
called a point plot, but if we wanted to make a line plot, which draws a line through each value, we 
could set type="l":



plot(xs,ys,

type="l",

main="example plot", 
xlab="x values", 
ylab="y values" 

) 


It would then look like Figure A-6 . 








[Figure A-6: line plot titled "example plot"]



Or we can do both! An R function called lines() can add lines to an existing plot. It takes most of the 
same arguments as plot():


plot(xs,ys, 

main="example plot", 
xlab="x values", 
ylab="y values" 

) 

lines(xs,ys) 


Figure A-7 shows the plot this function would generate. 







[Figure A-7: plot titled "example plot"]



There are many more amazing ways to use R’s basic plots, and you can consult ?plot for more 
information on them. However, if you want to create truly beautiful plots in R, you should research 
the ggplot2 library (https://ggplot2.tidyverse.org/).

EXERCISE: SIMULATING A STOCK PRICE 

Now let’s put everything we’ve learned together to create a simulated stock ticker! People often 
model stock prices using the cumulative sum of normally distributed random values. To start, we’ll 
simulate stock movement for a period of time by generating a sequence of values from 1 to 20, 
incrementing by 1 each time using the seq() function. We’ll call the vector representing the period of 
time t.vals. 


t.vals <- seq(1,20,by=1)


Now t.vals is a vector containing the sequence of numbers from 1 to 20 incremented by 1. Next, we'll 
create some simulated prices by taking the cumulative sum of a normally distributed value for each 
time in your t.vals. To do this we’ll use rnorm() to sample the number of values equal to the length
of t.vals. Then we'll use cumsum() to calculate the cumulative sum of this vector of values. This will 
represent the idea of a price moving up or down based on random motion; less extreme movements 
are more common than more extreme ones. 


price.vals <- cumsum(rnorm(length(t.vals),mean=5,sd=10)) 








Finally, we can plot all these values to see how they look! We’ll use both
the plot() and lines() functions, and label the axes according to what they represent.


plot(t.vals,price.vals, 

main="Simulated stock ticker", 
xlab="time", 
ylab="price") 
lines(t.vals,price.vals) 


The plot() and lines() functions should generate the plot shown in Figure A-8.

[Figure A-8: plot titled "Simulated stock ticker"]



SUMMARY 

This appendix should cover enough R to give you a grasp of the examples in this book. I recommend 
following along with the book’s chapters, then playing around by modifying the code examples to 
learn more. R also has some great online documentation if you want to take your experimentation 
further. 






B 

ENOUGH CALCULUS TO GET BY 



In this book, we’ll occasionally use ideas from calculus, though no actual manual solving of calculus 
problems will be required! What will be required is an understanding of some of the basics of 
calculus, such as the derivative and (especially) the integral. This appendix is by no means an 
attempt to teach these concepts deeply or show you how to solve them; instead, it offers a brief 
overview of these ideas and how they're represented in mathematical notation. 

FUNCTIONS 

A function is just a mathematical "machine" that takes one value, does something with it, and
returns another value. This is very similar to how functions in R work (see Appendix A): they take
in a value and return a result. For example, in calculus we might have a function called f defined like
this:


$f(x) = x^2$


In this example, f takes a value, x, and squares it. If we input the value 3 into f, for example, we get:

$f(3) = 9$

This is a little different than how you might have seen it in high school algebra, where you’d usually
have a value y and some equation involving x.


$y = x^2$


One reason why functions are important is that they allow us to abstract away the actual 
calculations we’re doing. That means we can say something like y = f(x), and just concern ourselves
with the abstract behavior of the function itself, not necessarily how it’s defined. That’s the 
approach we’ll take for this appendix. 

As an example, say you're training to run a 5 km race and you’re using a smartwatch to keep track 
of your distance, speed, time, and other factors. You went out for a run today and ran for half an 
hour. However, your smartwatch malfunctioned and recorded only your speed in miles per hour 
(mph) throughout your 30-minute run. Figure B-1 shows the data you were able to recover.

For this appendix, think of your running speed as being created by a function, s, that takes an 
argument t, the time in hours. A function is typically written in terms of the argument it takes, so we 
would write s(t), which results in a value that gives your current speed at time t. You can think of 




the function s as a machine that takes the current time and returns your speed at that time. In
calculus, we’d usually have a specific definition of s(t), such as $s(t) = t^2 + 3t + 2$, but here we’re just
talking about general concepts, so we won’t worry about the exact definition of s.


NOTE


Throughout the book we’ll be using R to handle all our calculus needs, so it’s really only important that 
you understand the fundamental ideas behind it, rather than the mechanics of solving calculus 
problems. 

From this function alone, we can learn a few things. It’s clear that your pace was a little uneven 
during this run, going up and down from a high of nearly 8 mph near the end and a low of just 
under 4.5 mph in the beginning. 

[Figure B-1: Running speed (mph) recovered from watch]



However, there are still a lot of interesting questions you might want to answer, such as: 

• How far did you run? 

• When did you lose the most speed? 

• When did you gain the most speed? 

• During what times was your speed relatively consistent? 

We can make a fairly accurate estimate of the last question from this plot, but the others seem 
impossible to answer from what we have. However, it turns out that we can answer all of these 
questions with the power of calculus! Let’s see how. 




























Determining How Far You've Run 

So far our chart just shows your running speed at a certain time, so how do we find out how far 
you’ve run? 

This doesn’t sound too difficult in theory. Suppose, for example, you ran 5 mph consistently for the 
whole run. In that case, you ran 5 mph for 0.5 hour, so your total distance was 2.5 miles. This 
intuitively makes sense, since you would have run 5 miles each hour, but you ran for only half an 
hour, so you ran half the distance you would have run in an hour. 

But our problem involves a different speed at nearly every moment that you were running. Let’s 
look at the problem another way. Figure B-2 shows the plotted data for a constant running speed. 

Running at a constant speed 



Time (hours) 

Figure B-2: Visualizing distance as the area of the speed/time plot 

You can see that this data creates a straight line. If we think about the space under this line, we can 
see that it’s a big block that actually represents the distance you’ve run! The block is 5 high and 0.5 
long, so the area of this block is 5 × 0.5 = 2.5, which gives us the 2.5 miles result!

Now let’s look at a simplified problem with varying speeds, where you ran 4.5 mph from 0.0 to 0.3
hours, 6 mph from 0.3 to 0.4 hours, and 3 mph the rest of the way to 0.5 hours. If we visualize these
results as blocks, or towers, as in Figure B-3, we can solve our problem the same way.

The first tower is 4.5 × 0.3, the second is 6 × 0.1, and the third is 3 × 0.1, so that:


4.5 × 0.3 + 6 × 0.1 + 3 × 0.1 = 2.25
























By looking at the total area under the towers, then, we get the total distance you traveled: 2.25 miles.

Determining distance as the area of multiple blocks 



Time (hours) 

Figure B-3: We can easily calculate your total distance traveled by adding together these towers. 
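The tower calculation is easy to reproduce in R because multiplication and sum() both work on whole vectors. The sketch below uses the three towers from this example; the variable names are only illustrative.

```r
# Width of each tower in hours and its height in mph.
widths <- c(0.3, 0.1, 0.1)
speeds <- c(4.5, 6, 3)

# Area of each tower is width times height; the total area is the distance in miles.
tower_areas <- widths * speeds
distance <- sum(tower_areas)

print(tower_areas)
print(distance)
```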

Measuring the Area Under the Curve: The Integral 

You’ve now seen that we can figure out the area under the line to tell us how far you traveled. 
Unfortunately, the line for our original data is curved, which makes our problem a bit difficult: how 
can we calculate the towers under our curvy line? 

We can start this process by imagining some large towers that are fairly close to the pattern of our 
curve. If we start with just three towers, as we can see in Figure B-4, it isn’t a bad estimate.



























Estimating our curve with three towers 
Figure B-4: Approximating the curve with three towers

By calculating the area under each of these towers, we get a value of 3.055 miles for your estimated 
total miles traveled. But we could clearly do better by making more, smaller towers, as shown 
in Figure B-5 . 













Estimating our curve with 10 towers 
Figure B-5: Approximating the curve better by using 10 towers instead of 3

Adding up the areas of these towers, we get 3.054 miles, which is a more accurate estimate. 
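The effect of using more, thinner towers can be tried out in R. Because the appendix never gives a formula for s(t), the sketch below uses a made-up speed curve, s_example, purely as a stand-in, so its numbers will not match the 3.055 and 3.054 quoted here. The helper tower_estimate() is also hypothetical: it places each tower's height at the midpoint of its slice. Base R's integrate() supplies a reference value for comparison.

```r
# Made-up speed curve (mph as a function of hours); NOT the data from the figures.
s_example <- function(t) 5 + 2 * sin(8 * t)

# Approximate the area under f between from and to using n_towers equal-width towers,
# each as tall as the curve at the middle of its slice.
tower_estimate <- function(f, from, to, n_towers) {
  width <- (to - from) / n_towers
  midpoints <- from + width * (seq_len(n_towers) - 0.5)
  sum(f(midpoints) * width)
}

est_3 <- tower_estimate(s_example, 0, 0.5, 3)
est_10 <- tower_estimate(s_example, 0, 0.5, 10)
est_1000 <- tower_estimate(s_example, 0, 0.5, 1000)

# Reference value from R's built-in numerical integration.
reference <- integrate(s_example, 0, 0.5)$value

print(c(est_3, est_10, est_1000, reference))
```

For this curve, the 10-tower estimate is closer to the reference than the 3-tower one, and the 1,000-tower estimate is closer still, which is the same pattern the figures illustrate.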

If we imagine repeating this process forever, using more and thinner towers, eventually we would 
get the full area under the curve, as in Figure B-6 . 
























Using an infinite number of towers 
Figure B-6: Completely capturing the area under the curve

This represents the exact distance traveled for your half-hour run. If we could add up infinitely many
towers, we would get a total of 3.053 miles. Our estimates were pretty close, and as we use more
and smaller towers, our estimate gets closer. The power of calculus is that it allows us to calculate
this exact area under the curve, or the integral. In calculus, we’d represent the integral for our s(t)
from 0 to 0.5 in mathematical notation as:

$\int_0^{0.5} s(t)\,dt$

That ∫ is just a fancy S, meaning the sum (or total) of the area of all the little towers in s(t).

The dt notation reminds us that we’re talking about little bits of the variable t; the d is a
mathematical way to refer to these little towers. Of course, in this bit of notation, there’s only one
variable, t, so we aren’t likely to get confused. Likewise, in this book, we typically drop the dt (or its
equivalent for the variable being used) since it’s obvious in the examples.

In our last notation we set the beginning and end of our integral, which means we can find the 
distance not just for the whole run but also for a section of it. Suppose we wanted to know how far 
you ran between 0.1 and 0.2 hours. We would note this as:

$\int_{0.1}^{0.2} s(t)\,dt$

We can visualize this integral as shown in Figure B-7 . 




The integral of the region 0.1 to 0.2 
Figure B-7: Visualizing the area under the curve for the region from 0.1 to 0.2

The area of just this shaded region is 0.556 miles.

We can even think of the integral of our function as another function. Suppose we define a new 
function, dist(T), where T is our "total time run": 

$\mathrm{dist}(T) = \int_0^T s(t)\,dt$

This gives us a function that tells us the distance you’ve traveled at time T. We can also see why we
want to use dt: it shows that our integral is being applied to the lowercase t argument
rather than the capital T argument. Figure B-8 plots the total distance you’ve run at any
given time T during your run.
































Distance traveled over time as the integral of speed over time 
Figure B-8: Plotting out the integral transforms a time and speed plot to a time and distance plot.



In this way, the integral has transformed our function s, which was "speed at a time," to a 
function dist, "distance covered at a time." As shown earlier, the integral of our function between 
two points represents the distance traveled between two different times. Now we’re looking at the 
total distance traveled at any given time t from the beginning time of 0. 

The integral is important because it allows us to calculate the area under curves, which is much 
trickier to calculate than if we have straight lines. In this book, we’ll use the concept of the integral 
to determine the probability that a value falls within a given range.

Measuring the Rate of Change: The Derivative 

You’ve seen how we can use the integral to figure out the distance traveled when all we have is a 
recording of your speed at various times. But with our varying speed measurements, we might also 
be interested in figuring out the rate of change for your speed at various times. When we talk about 
the rate at which speed is changing, we’re referring to acceleration. In our chart, there are a few 
interesting points regarding the rate of change: the points when you’re losing speed the fastest, 
when you’re gaining speed the fastest, and when the speed is the most steady (i.e., the rate of 
change is near 0). 

Just as with integration, the main challenge of figuring out your acceleration is that it seems to 
always be changing. If we had a constant rate of change, calculating the acceleration wouldn’t be that
difficult, as shown in Figure B-9 . 






















A constant rate of increase in speed 
Figure B-9: Visualizing a constant rate of change (compared with your actual changing rate)

You might remember from basic algebra that we can draw any line using this formula: 
$y = mx + b$

where b is the point at which the line crosses the y-axis and m is the slope of the line.

The slope represents the rate of change of a straight line. For the line in Figure B-9, the full formula
is:

$y = 5x + 4.8$

The slope of 5 means that every time x grows by 1, y grows by 5; 4.8 is the point at which the
line crosses the y-axis. In this example, we’d interpret this formula as s(t) = 5t + 4.8, meaning that
for every hour you run your speed increases by 5 mph, and that you started off at 4.8 mph. Since you’ve
run for half an hour, using this simple formula, we can figure out:

$s(0.5) = 5 \times 0.5 + 4.8 = 7.3$

which means at the end of your run, you would be traveling 7.3 mph. We could similarly determine
your exact speed at any point in the run, as long as the acceleration is constant!
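The constant-acceleration line translates directly into an R function. In the sketch below, the name s_linear is only illustrative; it encodes a slope of 5 and a starting speed of 4.8 mph, with t measured in hours.

```r
# Speed in mph after t hours, assuming a constant acceleration of 5 mph per hour
# and a starting speed of 4.8 mph.
s_linear <- function(t) {
  5 * t + 4.8
}

# Speed at the end of the half-hour run.
end_speed <- s_linear(0.5)
print(end_speed)

# Because the arithmetic is vectorized, we can ask for several times at once.
checkpoints <- s_linear(c(0, 0.25, 0.5))
print(checkpoints)
```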
































For our actual data, because the line is curvy it’s not easy to determine the slope at a single point in 
time. Instead, we can figure out the slopes of parts of the line. If we divide our data into three 
subsections, we could draw lines between each part as in Figure B-10 . 

Approximating the change in speed at different times 
Figure B-10: Using multiple slopes to get a better estimate of your rate of change



Now, clearly these lines aren’t a perfect fit to our curvy line, but they allow us to see the parts 
where you accelerated the fastest, slowed down the most, and were relatively stable. 

If we split our function up into even more pieces we can get even better estimates, as in Figure B-11.






























Approximating the change in speed at different times 
Figure B-11: Adding more slopes allows us to better approximate your curve.


Here we have a similar pattern to when we found the integral, where we split the area under the 
curve into smaller and smaller towers until we were adding up infinitely many small towers. Now 
we want to break up our line into infinitely many small line segments. Eventually, rather than a 
single m representing our slope, we have a new function representing the rate of change at each 
point in our original function. This is called the derivative, represented in mathematical notation 
like this: 

$\frac{d}{dx} f(x)$



Again, the dx just reminds us that we’re looking at very small pieces of our argument x. Figure B-12
shows the plot of the derivative for our s(t) function, which allows us to see the exact rate of
speed change at each moment in your run. In other words, this is a plot of your acceleration during 
your run. Looking at the y-axis, you can see that you rapidly lost speed in the beginning, and at 
around 0.3 hours you had a period of 0 acceleration, meaning your pace did not change (this is 
usually a good thing when practicing for a race!). We can also see exactly when you gained the most 
speed. Looking at the original plot, we couldn’t easily tell if you were gaining speed faster around 
0.1 hours (just after your first speedup) or at the end of your run. With the derivative, though, it’s 
clear that the final burst of speed at the end was indeed faster than at the beginning. 































The derivative of speed: acceleration 
Figure B-12: The derivative is another function that describes the slope of s(t) at each point.


The derivative works just like the slope of a straight line, only it tells us how much a curvy line is 
sloping at a certain point. 
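The idea of many tiny line segments can be sketched in R with diff(), which gives the change between neighbouring values. As before, s_example is a made-up speed curve standing in for the unrecorded s(t), and the grid spacing is an arbitrary choice. For this particular curve the exact derivative is known, 16 × cos(8t), so the estimate can be checked against it.

```r
# Made-up speed curve (mph as a function of hours); NOT the data from the figures.
s_example <- function(t) 5 + 2 * sin(8 * t)

# A fine grid of times and the speed at each one.
t_grid <- seq(0, 0.5, by = 0.001)
speed <- s_example(t_grid)

# Slope of each tiny segment: change in speed divided by change in time.
accel_est <- diff(speed) / diff(t_grid)

# Exact derivative of the made-up curve, evaluated at the middle of each segment.
t_mid <- (t_grid[-1] + t_grid[-length(t_grid)]) / 2
accel_exact <- 16 * cos(8 * t_mid)

max_error <- max(abs(accel_est - accel_exact))
print(max_error)
```

With segments this short, the segment slopes agree with the exact derivative to within a small fraction of a mph per hour.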


THE FUNDAMENTAL THEOREM OF CALCULUS 

We’ll look at one last truly remarkable calculus concept. There’s a very interesting relationship 
between the integral and the derivative. (Proving this relationship is far beyond the scope of this 
book, so we'll focus only on the relationship itself here.) Suppose we have a function F(x), with a 
capital F. What makes this function special is that its derivative is f(x). For example, the derivative of
our dist function is our s function; that is, your change in distance at each point in time is your speed. 
The derivative of speed is acceleration. We can describe this mathematically as: 

$\frac{d}{dx} F(x) = f(x)$

In calculus terms we call F the antiderivative of f because f is F’s derivative. Given our examples, the
antiderivative of acceleration would be speed, and the antiderivative of speed would be distance.
Now suppose for any function f we want to take its integral between 10 and 50; that is, we want:

$\int_{10}^{50} f(x)\,dx$


































We can get this simply by subtracting F(10) from F(50), so that: 

$\int_{10}^{50} f(x)\,dx = F(50) - F(10)$

The relationship between the integral and the derivative is called the fundamental theorem of 
calculus. It’s a pretty amazing tool, because it allows us to solve integrals mathematically, which is 
often much more difficult than finding derivatives. Using the fundamental theorem, if we can find 
the antiderivative of the function we want to find the integral of, we can easily perform integration. 
Figuring this out is the heart of performing integration by hand. 
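The theorem can be checked numerically for a simple case. The sketch below picks f(x) = 2x, whose antiderivative F(x) = x² is easy to verify, and compares F(50) − F(10) with the value from base R's integrate(). The choice of function and the names f, F_anti, by_theorem, and by_integration are illustrative assumptions.

```r
# A simple function and an antiderivative of it (the derivative of x^2 is 2x).
f <- function(x) 2 * x
F_anti <- function(x) x^2

# The fundamental theorem: the integral from 10 to 50 is F(50) - F(10).
by_theorem <- F_anti(50) - F_anti(10)

# The same integral computed numerically by R.
by_integration <- integrate(f, 10, 50)$value

print(by_theorem)
print(by_integration)
```

Both approaches give 2400, since 50² − 10² = 2500 − 100.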

A full course on calculus (or two) typically explores the topics of integrals and derivatives in much 
greater depth. However, as mentioned, in this book we’ll only be making occasional use of calculus, 
and we'll be using R for all of the calculations. Still, it’s helpful to have a rough understanding of 
what calculus and those unfamiliar ∫ symbols are all about! 



MAKE SENSE OF YOUR DATA — THE FUN WAY! 



With any given problem, traditional statistical analysis often just generates another pile of data. But 
how do you make real-world sense of these cold, hard numbers? Bayesian Statistics the Fun 
Way shows you how to make better probabilistic decisions using your natural intuition and some 
simple math. 

This accessible primer shows you how to apply Bayesian methods through clear explanations and 
fun examples. You’ll go UFO hunting to explore everyday reasoning, calculate whether Han Solo will 
survive an asteroid field using probability distributions, and quantify the probability that you have 
a serious brain tumor and not just too much ear wax. 

These eclectic exercises will help you build a flexible and robust framework for working through a 
wide range of challenges, from truly grokking current events to handling the daily surprises of the 
business world. 

You'll learn how to: 

• Calculate distributions to see the range of your beliefs 

• Compare hypotheses and draw reliable conclusions 

• Calculate Bayes’ theorem and understand what it’s useful for 

• Find the posterior, likelihood, and prior to check the accuracy of your conclusions 

• Use the R programming language to perform data analysis 

Make better choices with more confidence—and enjoy doing it! Crack open Bayesian Statistics the 
Fun Way to get the most value from your data. 
PREFACE 



Karma Police, arrest this man, he talks in maths, he buzzes like a fridge, he’s like 
a detuned radio. 

Radiohead, ‘Karma Police’, OK Computer (1997) 


Introduction 


Many social science students (and researchers for that matter) despise statistics. For one 
thing, most of us have a non-mathematical background, which makes understanding complex 
statistical equations very difficult. Nevertheless, the evil goat-warriors of Satan force our 
non-mathematical brains to apply themselves to what is, essentially, the very complex task of 
becoming a statistics expert. The end result, as you might expect, can be quite messy. The one 
weapon that we have is the computer, which allows us to neatly circumvent the considerable 
disability that is not understanding mathematics. The advent of computer programs such as 
SAS, SPSS, R and the like provides a unique opportunity to teach statistics at a conceptual 
level without getting too bogged down in equations. The computer to a goat-warrior of Satan 
is like catnip to a cat: it makes them rub their heads along the ground and purr and dribble 
ceaselessly. The only downside of the computer is that it makes it really easy to make a complete 
idiot of yourself if you don’t really understand what you’re doing. Using a computer 
without any statistical knowledge at all can be a dangerous thing. Hence this book. Well, 
actually, hence a book called Discovering Statistics Using SPSS. 

I wrote Discovering Statistics Using SPSS just as I was finishing off my Ph.D. in Psychology. 
My main aim was to write a book that attempted to strike a good balance between theory and 
practice: I wanted to use the computer as a tool for teaching statistical concepts in the hope 
that you will gain a better understanding of both theory and practice. If you want theory 
and you like equations then there are certainly better books: Howell (2006), Stevens (2002) 
and Tabachnick and Fidell (2007) are peerless as far as I am concerned and have taught me 
(and continue to teach me) more about statistics than you could possibly imagine. (I have an 
ambition to be cited in one of these books but I don’t think that will ever happen.) However, 
if you want a book that incorporates digital rectal stimulation then you have just spent your 
money wisely. (I should probably clarify that the stimulation is in the context of an example, 
you will not find any devices attached to the inside cover for you to stimulate your rectum 
while you read. Please feel free to get your own device if you think it will help you to learn.) 

A second, not in any way ridiculously ambitious, aim was to make this the only statistics 
textbook that anyone ever needs to buy. As such, it’s a book that I hope will become your 
friend from first year right through to your professorship. I’ve tried to write a book that can 
be read at several levels (see the next section for more guidance). There are chapters for first- 
year undergraduates (1, 2, 3, 4, 5, 6, 9 and 15), chapters for second-year undergraduates (5, 
7, 10, 11, 12, 13 and 14) and chapters on more advanced topics that postgraduates might use 
(8, 16, 17, 18 and 19). All of these chapters should be accessible to everyone, and I hope to 
achieve this by flagging the level of each section (see the next section). 
My third, final and most important aim is to make the learning process fun. I have a sticky 
history with maths because I used to be terrible at it: 


[School report extract: MATHEMATICS, ADDL. MATHS.; exam mark 43] 
Above is an extract of my school report at the age of 11. The ‘27=’ in the report is to say 
that I came equal 27th with another student out of a class of 29. That’s almost bottom of 
the class. The 43 is my exam mark as a percentage. Oh dear. Four years later (at 15) this 
was my school report: 
What led to this remarkable change? It was having a good teacher: my brother, Paul. In 
fact I owe my life as an academic to Paul’s ability to do what my maths teachers couldn’t: 
teach me stuff in an engaging way. To this day he still pops up in times of need to teach 
me things (many tutorials in computer programming spring to mind). Anyway, the reason 
he’s a great teacher is because he’s able to make things interesting and relevant to me. He 
got the ‘good teaching’ genes in the family, but they’re wasted because he doesn’t teach for 
a living; they’re a little less wasted though because his approach inspires my lectures and 
books. One thing that I have learnt is that people appreciate the human touch, and so I 
tried to inject a lot of my own personality and sense of humour (or lack of) into Discovering 
Statistics Using ... books. Many of the examples in this book, although inspired by some of 
the craziness that you find in the real world, are designed to reflect topics that play on the 
minds of the average student (i.e., sex, drugs, rock and roll, celebrity, people doing crazy 
stuff). There are also some examples that are there just because they made me laugh. So, 
the examples are light-hearted (some have said ‘smutty’ but I prefer ‘light-hearted’) and by 
the end, for better or worse, I think you will have some idea of what goes on in my head 
on a daily basis. I apologize to those who think it’s crass, hate it, or think that I’m undermining 
the seriousness of science, but, come on, what’s not funny about a man putting an 
eel up his anus? 

Did I succeed in these aims? Maybe I did, maybe I didn’t, but the SPSS book on which 
this R book is based has certainly been popular and I enjoy the rare luxury of having many 
complete strangers emailing me to tell me how wonderful I am. (Admittedly, occasionally 
people email to tell me that they think I’m a pile of gibbon excrement but you have to take 
the rough with the smooth.) It also won the British Psychological Society book award in 
2007. I must have done something right. However, Discovering Statistics Using SPSS has 
one very large flaw: not everybody uses SPSS. Some people use R. R has one fairly big 
advantage over other statistical packages in that it is free. That’s right, it’s free. Completely 
and utterly free. People say that there’s no such thing as a free lunch, but they’re wrong: 
R is a feast of succulent delights topped off with a baked cheesecake and nothing to pay at 
the end of it. 

It occurred to me that it would be great to have a version of the book that used all of 
the same theory and examples from the SPSS book but written about R. Genius. Genius 
except that I knew very little about R. Six months and quite a few late nights later and I 
know a lot more about R than I did when I started this insane venture. Along the way I have 
been helped by a very nice guy called Jeremy (a man who likes to put eels in his CD player 
rather than anywhere else), and an even nicer wife. Both of their contributions have been 
concealed somewhat by our desire to keep the voice of the book mine, but they have both 
contributed enormously. (Jeremy’s contributions are particularly easy to spot: if it reads 
like a statistics genius struggling manfully to coerce the words of a moron into something 
approximating factual accuracy, then Jeremy wrote it.) 

What are you getting for your money? 


This book takes you on a journey (possibly through a very narrow passage lined with 
barbed wire) not just of statistics but of the weird and wonderful contents of the world and 
my brain. In short, it’s full of stupid examples, bad jokes, smut and filth. Aside from the 
smut, I have been forced reluctantly to include some academic content. Over many editions 
of the SPSS book many people have emailed me with suggestions, so, in theory, what you 
currently have in your hands should answer any question anyone has asked me over the 
past ten years. It won’t, but it should, and I’m sure you can find some new questions to ask. 
It has some other unusual features: 

• Everything you’ll ever need to know: I want this to be good value for money so the 
book guides you from complete ignorance (Chapter 1 tells you the basics of doing 
research) to being an expert on multilevel modelling (Chapter 19). Of course no 
book that you can actually lift off the floor will contain everything, but I think this 
one has a fair crack at taking you from novice to postgraduate level expertise. It’s 
pretty good for developing your biceps also. 

• Stupid faces: You’ll notice that the book is riddled with stupid faces, some of them 
my own. You can find out more about the pedagogic function of these ‘characters’ 
in the next section, but even without any useful function they’re still nice to look at. 

• Data sets: There are about 100 data files associated with this book on the companion 
website. Not unusual in itself for a statistics book, but my data sets contain more 
sperm (not literally) than other books. I’ll let you judge for yourself whether this is 
a good thing. 

• My life story: Each chapter is book-ended by a chronological story from my life. 
Does this help you to learn about statistics? Probably not, but hopefully it provides 
some light relief between chapters. 

• R tips: R does weird things sometimes. In each chapter, there are boxes containing 
tips, hints and pitfalls related to R. 

• Self-test questions: Given how much students hate tests, I thought the best way to 
commit commercial suicide was to liberally scatter tests throughout each chapter. 
These range from simple questions to test what you have just learned to going back 
to a technique that you read about several chapters before and applying it in a new 
context. All of these questions have answers to them on the companion website. They 
are there so that you can check on your progress. 
The book also has some more conventional features: 

• Reporting your analysis: Every single chapter has a guide to writing up your 
analysis. Obviously, how one writes up an analysis varies a bit from one discipline to 
another and, because I’m a psychologist, these sections are quite psychology-based. 
Nevertheless, they should get you heading in the right direction. 

• Glossary: Writing the glossary was so horribly painful that it made me stick a vacuum 
cleaner into my ear to suck out my own brain. You can find my brain in the bottom 
of the vacuum cleaner in my house. 

• Real-world data: Students like to have ‘real data’ to play with. The trouble is that real 
research can be quite boring. However, just for you, I trawled the world for examples 
of research on really fascinating topics (in my opinion). I then stalked the authors of 
the research until they gave me their data. Every chapter has a real research example. 


Goodbye 


The SPSS version of this book has literally consumed the last 13 years or so of my life, 
and this R version has consumed the last 6 months. I am literally typing this as a withered 
husk. I have no idea whether people use R, and whether this version will sell, but I think 
they should (use R, that is, not necessarily buy the book). The more I have learnt about R 
through writing this book, the more I like it. 

This book in its various forms has been a huge part of my adult life; it began as and continues 
to be a labour of love. The book isn’t perfect, and I still love to have feedback (good 
or bad) from the people who matter most: you. 


Andy 

• Contact details: http://www.discoveringstatistics.com/html/email.html 

• Twitter: @ProfAndyField 

• Blog: http://www.methodspace.com/profile/ProfessorAndyField 




HOW TO USE THIS BOOK 



When the publishers asked me to write a section on ‘How to use this book’ it was obviously 
tempting to write ‘Buy a large bottle of Olay anti-wrinkle cream (which you’ll need 
to fend off the effects of ageing while you read), find a comfy chair, sit down, fold back the 
front cover, begin reading and stop when you reach the back cover.’ However, I think they 
wanted something more useful. 


What background knowledge do I need? 


In essence, I assume you know nothing about statistics, but I do assume you have some very 
basic grasp of computers (I won’t be telling you how to switch them on, for example) and 
maths (although I have included a quick revision of some very basic concepts so I really 
don’t assume anything). 

Do the chapters get more difficult as I go through 
the book? 


In a sense they do (Chapter 16 on MANOVA is more difficult than Chapter 1), but in other 
ways they don’t (Chapter 15 on non-parametric statistics is arguably less complex than Chapter 
14, and Chapter 9 on the t-test is definitely less complex than Chapter 8 on logistic regression). 
Why have I done this? Well, I’ve ordered the chapters to make statistical sense (to me, at least). 
Many books teach different tests in isolation and never really give you a grip of the similarities 
between them; this, I think, creates an unnecessary mystery. Most of the tests in this book 
are the same thing expressed in slightly different ways. So, I wanted the book to tell this story. 
To do this I have to do certain things such as explain regression fairly early on because it’s the 
foundation on which nearly everything else is built. 

However, to help you through I’ve coded each section with an icon. These icons are 
designed to give you an idea of the difficulty of the section. It doesn’t necessarily mean 
you can skip the sections (but see Smart Alex in the next section), but it will let you know 
whether a section is at about your level, or whether it’s going to push you. I’ve based the 
icons on my own teaching so they may not be entirely accurate for everyone (especially as 
systems vary in different countries!): 

① This means ‘level 1’ and I equate this to first-year undergraduate in the UK. These are 
sections that everyone should be able to understand. 

② This is the next level and I equate this to second-year undergraduate in the UK. These 
are topics that I teach my second years and so anyone with a bit of background in statistics 
should be able to get to grips with them. However, some of these sections will 
be quite challenging even for second years. These are intermediate sections. 
③ This is ‘level 3’ and represents difficult topics. I’d expect third-year (final-year) UK 
undergraduates and recent postgraduate students to be able to tackle these sections. 

④ This is the highest level and represents very difficult topics. I would expect these sections 
to be very challenging to undergraduates and recent postgraduates, but postgraduates 
with a reasonable background in research methods shouldn’t find them too 
much of a problem. 


Why do I keep seeing stupid faces everywhere? 



Brian Haemorrhage: Brian’s job is to pop up to ask questions and look permanently 
confused. It’s no surprise to note, therefore, that he doesn’t look entirely different from 
the author (he has more hair though). As the book progresses he becomes increasingly 
despondent. Read into that what you will. 


Curious Cat: He also pops up and asks questions (because he’s curious). Actually the only 
reason he’s here is because I wanted a cat in the book ... and preferably one that looks like 
mine. Of course the educational specialists think he needs a specific role, and so his role is 
to look cute and make bad cat-related jokes. 



Cramming Sam: Samantha hates statistics. In fact, she thinks it’s all a boring waste of time 
and she just wants to pass her exam and forget that she ever had to know anything about 
normal distributions. So, she appears and gives you a summary of the key points that you 
need to know. If, like Samantha, you’re cramming for an exam, she will tell you the essential 
information to save you having to trawl through hundreds of pages of my drivel. 



Jane Superbrain: Jane is the cleverest person in the whole universe (she makes Smart Alex 
look like a bit of an imbecile). The reason she is so clever is that she steals the brains of 
statisticians and eats them. Apparently they taste of sweaty tank tops, but nevertheless she 
likes them. As it happens she is also able to absorb the contents of brains while she eats 
them. Having devoured some top statistics brains she knows all the really hard stuff and 
appears in boxes to tell you really advanced things that are a bit tangential to the main text. 
(Readers should note that Jane wasn’t interested in eating my brain. That tells you all that 
you need to know about my statistics ability.) 



Labcoat Leni: Leni is a budding young scientist and he’s fascinated by real research. He says, 
‘Andy, man, I like an example about using an eel as a cure for constipation as much as the 
next man, but all of your examples are made up. Real data aren’t like that, we need some real 
examples, dude!’ So off Leni went; he walked the globe, a lone data warrior in a thankless quest 
for real data. He turned up at universities, cornered academics, kidnapped their families and 
threatened to put them in a bath of crayfish unless he was given real data. The generous ones 
relented, but others? Well, let’s just say their families are sore. So, when you see Leni you know 
that you will get some real data, from a real research study to analyse. Keep it real. 
Oliver Twisted: With apologies to Charles Dickens, Oliver, like the more famous fictional 
London urchin, is always asking ‘Please Sir, can I have some more?’ Unlike Master Twist 
though, our young Master Twisted always wants more statistics information. Of course he 
does, who wouldn’t? Let us not be the ones to disappoint a young, dirty, slightly smelly 
boy who dines on gruel, so when Oliver appears you can be certain of one thing: there is 
additional information to be found on the companion website. (Don’t be shy; download it 
and bathe in the warm asp’s milk of knowledge.) 



R’s Souls: People who love statistics are damned to hell for all eternity, people who like R even 
more so. However, R and statistics are secretly so much fun that Satan is inundated with new 
lost souls, converted to the evil of statistical methods. Satan needs a helper to collect up all the 
souls of those who have been converted to the joy of R. While collecting the souls of the statistical 
undead, they often cry out useful tips to him. He’s collected these nuggets of information 
and spread them through the book like a demonic plague of beetles. When Satan’s busy spanking 
a goat, his helper pops up in a box to tell you some of R’s Souls’ Tips. 



Smart Alex: Alex is a very important character because he appears when things get particularly 
difficult. He’s basically a bit of a smart alec and so whenever you see his face you 
know that something scary is about to be explained. When the hard stuff is over he reappears 
to let you know that it’s safe to continue. Now, this is not to say that all of the rest 
of the material in the book is easy, he just lets you know the bits of the book that you can 
skip if you’ve got better things to do with your life than read all 1000 pages! So, if you 
see Smart Alex then you can skip the section entirely and still understand what’s going on. 
You’ll also find that Alex pops up at the end of each chapter to give you some tasks to do 
to see whether you’re as smart as he is. 



What is on the companion website? 


In this age of downloading, CD-ROMs are for losers (at least that’s what the ‘kids’ tell me) 
so I’ve put my cornucopia of additional funk on that worldwide interweb thing. This has 
two benefits: 1) the book is slightly lighter than it would have been, and 2) rather than 
being restricted to the size of a CD-ROM, there is no limit to the amount of fascinating 
extra material that I can give you (although Sage have had to purchase a new server to fit 
it all on). To enter my world of delights, go to www.sagepub.co.uk/dsur. 

How will you know when there are extra goodies on this website? Easy-peasy, Oliver 
Twisted appears in the book to indicate that there’s something you need (or something 
extra) on the website. The website contains resources for students and lecturers alike: 

• Data files: You need data files to work through the examples in the book and they 
are all on the companion website. We did this so that you’re forced to go there and 
once you’re there Sage will flash up subliminal messages that make you buy more of 
their books. 

• R script files: if you put all of the R commands in this book next to each other and printed 
them out you’d have a piece of paper that stretched from here to the Tarantula Nebula 
(which actually exists and sounds like a very scary place). If you type all of these commands 
into R you will wear away your fingers to small stumps. I would never forgive 
myself if you all got stumpy fingers so the website has script files containing every single 
R command in the book (including within chapter questions and activities). 
• Webcasts: My publisher thinks that watching a film of me explaining what this book 
is all about will get people flocking to the bookshop. I think it will have people flocking 
to the medicine cabinet. Either way, if you want to see how truly uncharismatic I 
am, watch and cringe. There are also a few webcasts of lectures given by me relevant 
to the content of the book. 

• Self-Assessment Multiple-Choice Questions: Organized by chapter, these will allow 
you to test whether wasting your life reading this book has paid off so that you can 
walk confidently into an examination much to the annoyance of your friends. If you 
fail said exam, you can employ a good lawyer and sue. 

• Additional material: Enough trees have died in the name of this book, but still it 
gets longer and still people want to know more. Therefore, we’ve written nearly 300 
pages, yes, three hundred, of additional material for the book. So for some more 
technical topics and help with tasks in the book the material has been provided electronically 
so that (1) the planet suffers a little less, and (2) you won’t die when the 
book falls off of your bookshelf onto your head. 

• Answers: each chapter ends with a set of tasks for you to test your newly acquired 
expertise. The chapters are also littered with self-test questions and Labcoat Leni’s 
assignments. How will you know if you get these correct? Well, the companion website 
contains around 300 pages (that’s a different 300 pages to the 300 above) of 
detailed answers. Will we ever stop writing? 

• Powerpoint slides: I can’t come and personally teach you all. Instead I rely on a crack 
team of highly skilled and super intelligent pan-dimensional beings called ‘lecturers’. 
I have personally grown each and every one of them in a greenhouse in my garden. 
To assist in their mission to spread the joy of statistics I have provided them with 
powerpoint slides for each chapter. 

• Links: every website has to have links to other useful websites and the companion 
website is no exception. 

• Cyberworms of knowledge: I have used nanotechnology to create cyberworms that 
crawl down your broadband connection, pop out of the USB port of your computer 
then fly through space into your brain. They re-arrange your neurons so that you 
understand statistics. You don’t believe me? Well, you’ll never know for sure unless 
you visit the companion website ... 

Happy reading, and don’t get sidetracked by Facebook and Twitter. 
SYMBOLS USED IN THIS BOOK 


Mathematical operators 


Σ     This symbol (called sigma) means ‘add everything up’. So, if you see something 
      like Σx_i it just means ‘add up all of the scores you’ve collected’. 

Π     This symbol means ‘multiply everything’. So, if you see something like Πx_i it just 
      means ‘multiply all of the scores you’ve collected’. 

√x    This means ‘take the square root of x’. 
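
As a rough illustration of these three operators, here is a minimal Python sketch (the book itself works in R). The list `scores` and its values are made up for the example:

```python
import math

# Hypothetical scores, invented purely for illustration
scores = [2, 4, 6]

total = sum(scores)          # sigma: add up all of the scores
product = math.prod(scores)  # pi: multiply all of the scores together
root = math.sqrt(16)         # square root of a single value

print(total, product, root)
```

Here the sum is 2 + 4 + 6 = 12, the product is 2 × 4 × 6 = 48, and the square root of 16 is 4.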


Greek symbols 


α       The probability of making a Type I error 

β       The probability of making a Type II error 

β_i     Standardized regression coefficient 

χ²      Chi-square test statistic 

χ²_F    Friedman’s ANOVA test statistic 

ε       Usually stands for ‘error’ 

η²      Eta-squared 

μ       The mean of a population of scores 

ρ       The correlation in the population 

σ²      The variance in a population of data 

σ       The standard deviation in a population of data 

σ_x̄     The standard error of the mean 

τ       Kendall’s tau (non-parametric correlation coefficient) 

ω²      Omega squared (an effect size measure). This symbol also means ‘expel the 
        contents of your intestine immediately into your trousers’; you will understand why in 
        due course. 


English symbols 


b_i          The regression coefficient (unstandardized) 

df           Degrees of freedom 

e_i          The error associated with the ith person 

F            F-ratio (test statistic used in ANOVA) 

H            Kruskal-Wallis test statistic 

k            The number of levels of a variable (i.e. the number of treatment conditions), or the 
             number of predictors in a regression model 

ln           Natural logarithm 

MS           The mean squared error. The average variability in the data 

N, n, n_i    The sample size. N usually denotes the total sample size, whereas n usually 
             denotes the size of a particular group 

p            Probability (the probability value, p-value or significance of a test are usually 
             denoted by p) 

r            Pearson’s correlation coefficient 

r_s          Spearman’s rank correlation coefficient 

r_b, r_pb    Biserial correlation coefficient and point-biserial correlation coefficient respectively 

R            The multiple correlation coefficient 

R²           The coefficient of determination (i.e. the proportion of data explained by the model) 

s²           The variance of a sample of data 

s            The standard deviation of a sample of data 

SS           The sum of squares, or sum of squared errors to give it its full title 

SS_A         The sum of squares for variable A 

SS_M         The model sum of squares (i.e. the variability explained by the model fitted to the data) 

SS_R         The residual sum of squares (i.e. the variability that the model can’t explain - the 
             error in the model) 

SS_T         The total sum of squares (i.e. the total variability within the data) 

t            Test statistic for Student’s t-test 

T            Test statistic for Wilcoxon’s matched-pairs signed-rank test 

U            Test statistic for the Mann-Whitney test 

W_s          Test statistic for the Shapiro-Wilk test and the Wilcoxon’s rank-sum test 

X̄ or x̄       The mean of a sample of scores 

z            A data point expressed in standard deviation units 



SOME MATHS REVISION 


1 Two negatives make a positive: Although in life two wrongs don’t make a right, in 
mathematics they do! When we multiply a negative number by another negative 
number, the result is a positive number. For example, -2 × -4 = 8. 

2 A negative number multiplied by a positive one makes a negative number: If you 
multiply a positive number by a negative number then the result is another negative 
number. For example, 2 × -4 = -8, or -2 × 6 = -12. 

3 BODMAS: This is an acronym for the order in which mathematical operations 
are performed. It stands for Brackets, Order, Division, Multiplication, Addition, 
Subtraction and this is the order in which you should carry out operations within an 
equation. Mostly these operations are self-explanatory (e.g., always calculate things 
within brackets first) except for order, which actually refers to power terms such as 
squares. Four squared, or 4², used to be called four raised to the order of 2, hence the 
reason why these terms are called ‘order’ in BODMAS (also, if we called it power, 
we’d end up with BPDMAS, which doesn’t roll off the tongue quite so nicely). Let’s 
look at an example of BODMAS: what would be the result of 1 + 3 × 5²? The answer 
is 76 (not 100 as some of you might have thought). There are no brackets so the 
first thing is to deal with the order term: 5² is 25, so the equation becomes 1 + 3 × 
25. There is no division, so we can move on to multiplication: 3 × 25, which gives 
us 75. BODMAS tells us to deal with addition next: 1 + 75, which gives us 76 and 
the equation is solved. If I’d written the original equation as (1 + 3) × 5², then the 
answer would have been 100 because we deal with the brackets first: (1 + 3) = 4, 
so the equation becomes 4 × 5². We then deal with the order term, so the equation 
becomes 4 × 25 = 100! 

4 www.bbc.co.uk/schools/gcsebitesize/maths is a good site for revising basic maths. 
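
If you want to check these rules for yourself, most programming languages follow the same order of operations. The lines below are a small Python sketch (the book uses R, and the variable names here are made up for the example); `**` is Python’s way of writing a power term:

```python
# Rules 1 and 2: signs when multiplying
two_negatives = -2 * -4        # negative times negative is positive
mixed_signs = 2 * -4           # positive times negative is negative

# Rule 3 (BODMAS): the order term first, then multiplication, then addition
without_brackets = 1 + 3 * 5 ** 2
# Brackets are dealt with before anything else
with_brackets = (1 + 3) * 5 ** 2

print(two_negatives, mixed_signs, without_brackets, with_brackets)
```

The four values match the worked examples above: 8, -8, 76 and 100.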



Why is my evil lecturer forcing me to learn statistics? 

FIGURE 1.1 When I grow up, please don’t let me be a statistics lecturer 

1.1. What will this chapter tell me? ① 


I was born on 21 June 1973. Like most people, I don’t remember anything about the first 
few years of life and like most children I did go through a phase of driving my parents 
mad by asking ‘Why?’ every five seconds. ‘Dad, why is the sky blue?’, ‘Dad, why doesn’t 
mummy have a willy?’, etc. Children are naturally curious about the world. I remember 
at the age of 3 being at a party of my friend Obe (this was just before he left England 
to return to Nigeria, much to my distress). It was a hot day, and there was an electric 
fan blowing cold air around the room. As I said, children are natural scientists and my 
little scientific brain was working through what seemed like a particularly pressing question: 
‘What happens when you stick your finger in a fan?’ The answer, as it turned out, 
was that it hurts - a lot. 1 My point is this: my curiosity to explain the world never went 
away, and that’s why I’m a scientist, and that’s also why your evil lecturer is forcing you 
to learn statistics. It’s because you have a curious mind too and you want to answer new 
and exciting questions. To answer these questions we need statistics. Statistics is a bit like 
sticking your finger into a revolving fan blade: sometimes it’s very painful, but it does 
give you the power to answer interesting questions. This chapter is going to attempt 
to explain why statistics are an important part of doing research. We will overview the 
whole research process, from why we conduct research in the first place, through how 
theories are generated, to why we need data to test these theories. If that doesn’t convince 
you to read on then maybe the fact that we discover whether Coca-Cola kills sperm 
will. Or perhaps not. 


1.2. What the hell am I doing here? I don’t belong here 

You’re probably wondering why you have bought this book. Maybe you liked the pictures, 
maybe you fancied doing some weight training (it is heavy), or perhaps you need 
to reach something in a high place (it is thick). The chances are, though, that given the 
choice of spending your hard-earned cash on a statistics book or something more entertaining 
(a nice novel, a trip to the cinema, etc.) you’d choose the latter. So, why have you 
bought the book (or downloaded an illegal pdf of it from someone who has way too much 
time on their hands if they can scan a 1000-page textbook)? It’s likely that you obtained 
it because you’re doing a course on statistics, or you’re doing some research, and you 
need to know how to analyse data. It’s possible that you didn’t realize when you started 
your course or research that you’d have to know this much about statistics but now find 
yourself inexplicably wading, neck high, through the Victorian sewer that is data analysis. 
The reason you’re in the mess that you find yourself in is because you have a curious 
mind. You might have asked yourself questions like why people behave the way they 
do (psychology), why behaviours differ across cultures (anthropology), how businesses 
maximize their profit (business), how the dinosaurs died (palaeontology), does eating 
tomatoes protect you against cancer (medicine, biology), is it possible to build a quantum 
computer (physics, chemistry), is the planet hotter than it used to be and in what regions 
(geography, environmental studies)? Whatever it is you’re studying or researching, the 
reason you’re studying it is probably because you’re interested in answering questions. 
Scientists are curious people, and you probably are too. However, you might not have 
bargained on the fact that to answer interesting questions, you need two things: data and 
an explanation of those data. 

The answer to ‘what the hell are you doing here?’ is, therefore, simple: to answer 
interesting questions you need data. Therefore, one of the reasons why your evil statistics 
lecturer is forcing you to learn about numbers is because they are a form of data 
and are vital to the research process. Of course there are forms of data other than 
numbers that can be used to test and generate theories. When numbers are involved 
the research involves quantitative methods, but you can also generate and test theories 
by analysing language (such as conversations, magazine articles, media broadcasts and so on). 


1 In the 1970s fans didn’t have helpful protective cages around them to prevent idiotic 3-year-olds sticking their 
fingers into the blades. 

This involves qualitative methods and it is a topic for another book not written by me. 
People can get quite passionate about which of these methods is best, which is a bit 
silly because they are complementary, not competing, approaches and there are much 
more important issues in the world to get upset about. Having said that, all qualitative 
research is rubbish. 2 

How do you go about answering an interesting question? The research process 
is broadly summarized in Figure 1.2. You begin with an observation that you 
want to understand, and this observation could be anecdotal (you’ve noticed 
that your cat watches birds when they’re on TV but not when jellyfish are on) 3 
or could be based on some data (you’ve got several cat owners to keep diaries 
of their cat’s TV habits and have noticed that lots of them watch birds on TV). 

From your initial observation you generate explanations, or theories, of those 
observations, from which you can make predictions (hypotheses). Here’s where 
the data come into the process because to test your predictions you need data. 

First you collect some relevant data (and to do that you need to identify things 
that can be measured) and then you analyse those data. The analysis of the data 
may support your theory or give you cause to modify the theory. As such, the processes of 
data collection and analysis and generating theories are intrinsically linked: theories lead to 
data collection/analysis and data collection/analysis informs theories! This chapter explains 
this research process in more detail. 




FIGURE 1.2 The research process 


2 This is a joke. I thought long and hard about whether to include it because, like many of my jokes, there are 
people who won’t find it remotely funny. Its inclusion is also making me fear being hunted down and forced to eat 
my own entrails by a horde of rabid qualitative researchers. However, it made me laugh, a lot, and despite being 
vegetarian I’m sure my entrails will taste lovely. 

3 My cat does actually climb up and stare at the TV when it’s showing birds flying about. 

1.3. Initial observation: finding something that needs explaining 


The first step in Figure 1.2 was to come up with a question that needs an answer. I spend 
rather more time than I should watching reality TV. Every year I swear that I won’t get 
hooked on Big Brother, and yet every year I find myself glued to the TV screen waiting 
for the next contestant’s meltdown (I am a psychologist, so really this is just research - 
honestly). One question I am constantly perplexed by is why every year there are so many 
contestants with really unpleasant personalities (my money is on narcissistic personality 
disorder 4 ) on the show. A lot of scientific endeavour starts this way: not by watching Big 
Brother, but by observing something in the world and wondering why it happens. 

Having made a casual observation about the world (Big Brother contestants on the whole 
have profound personality defects), I need to collect some data to see whether this observation 
is true (and not just a biased observation). To do this, I need to define one or more 
variables that I would like to measure. There’s one variable in this example: the personality 
of the contestant. I could measure this variable by giving them one of the many well-established 
questionnaires that measure personality characteristics. Let’s say that I did this 
and I found that 75% of contestants did have narcissistic personality disorder. These data 
support my observation: a lot of Big Brother contestants have extreme personalities. 


1.4. Generating theories and testing them 

The next logical thing to do is to explain these data (Figure 1.2). One explanation could be 
that people with narcissistic personality disorder are more likely to audition for Big Brother 
than those without. This is a theory. Another possibility is that the producers of Big Brother 
are more likely to select people who have narcissistic personality disorder to be contestants 
than those with less extreme personalities. This is another theory. We verified our original 
observation by collecting data, and we can collect more data to test our theories. We can 
make two predictions from these two theories. The first is that the number of people turning 
up for an audition that have narcissistic personality disorder will be higher than the 
general level in the population (which is about 1%). A prediction from a theory, like this 
one, is known as a hypothesis (see Jane Superbrain Box 1.1). We could test this hypothesis 
by getting a team of clinical psychologists to interview each person at the Big Brother audition 
and diagnose them as having narcissistic personality disorder or not. The prediction 
from our second theory is that if the Big Brother selection panel are more likely to choose 
people with narcissistic personality disorder then the rate of this disorder in the final contestants 
will be even higher than the rate in the group of people going for auditions. This is 
another hypothesis. Imagine we collected these data; they are in Table 1.1. 

In total, 7662 people turned up for the audition. Our first hypothesis is that the percentage 
of people with narcissistic personality disorder will be higher at the audition than the 
general level in the population. We can see in the table that of the 7662 people at the audition, 
854 were diagnosed with the disorder; this is about 11% (854/7662 × 100), which is 
much higher than the 1% we’d expect. Therefore, hypothesis 1 is supported by the data. 
The second hypothesis was that the Big Brother selection panel have a bias to choose people 
with narcissistic personality disorder. If we look at the 12 contestants that they selected, 9 
of them had the disorder (a massive 75%). If the producers did not have a bias we would 
have expected only 11% of the contestants to have the disorder. The data again support 
our hypothesis. Therefore, my initial observation that contestants have personality disorders 
was verified by data, then my theory was tested using specific hypotheses that were 
also verified using data. Data are very important! 

4 This disorder is characterized by (among other things) a grandiose sense of self-importance, arrogance, lack of 
empathy for others, envy of others and belief that others envy them, excessive fantasies of brilliance or beauty, the 
need for excessive admiration and exploitation of others. 

Table 1.1 A table of the number of people at the Big Brother audition split by whether they 
had narcissistic personality disorder and whether they were selected as contestants by the 
producers 
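
The percentages in this example can be recomputed from the four counts quoted in the text (7662 people at the audition, 854 with the disorder, 12 selected, 9 of those with the disorder). The sketch below assumes Python; the variable names are invented for illustration, and the remaining cells are derived here by subtraction rather than copied from Table 1.1.

```python
# Counts quoted in the text
total_auditioned = 7662     # everyone who turned up for the audition
total_disorder = 854        # auditionees diagnosed with the disorder
selected = 12               # contestants chosen by the producers
selected_disorder = 9       # chosen contestants with the disorder

# Remaining cells, derived by subtraction (not quoted from the table)
selected_no_disorder = selected - selected_disorder
rejected = total_auditioned - selected
rejected_disorder = total_disorder - selected_disorder
rejected_no_disorder = rejected - rejected_disorder

# Hypothesis 1: rate at the audition versus about 1% in the population
pct_audition = 100 * total_disorder / total_auditioned
# Hypothesis 2: rate among selected contestants versus rate at the audition
pct_selected = 100 * selected_disorder / selected

print(f"Disorder rate at audition: {pct_audition:.1f}%")
print(f"Disorder rate among contestants: {pct_selected:.1f}%")
```

Both comparisons point the same way: the audition rate is well above the population rate of about 1%, and the rate among selected contestants is well above the audition rate.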



JANE SUPERBRAIN 1.1 

When is a hypothesis not a hypothesis? 

A good theory should allow us to make statements about 
the state of the world. Statements about the world are 
good things: they allow us to make sense of our world, 
and to make decisions that affect our future. One current 
example is global warming. Being able to make a definitive 
statement that global warming is happening, and 
that it is caused by certain practices in society, allows 
us to change these practices and, hopefully, avert catastrophe. 
However, not all statements are ones that can 
be tested using science. Scientific statements are ones 
that can be verified with reference to empirical evidence, 
whereas non-scientific statements are ones that cannot 
be empirically tested. So, statements such as ‘The Led 
Zeppelin reunion concert in London in 2007 was the best 
gig ever’, 5 ‘Lindt chocolate is the best food’ and ‘This is 
the worst statistics book in the world’ are all non-scientific; 
they cannot be proved or disproved. Scientific statements 
can be confirmed or disconfirmed empirically. ‘Watching 
Curb Your Enthusiasm makes you happy’, ‘having sex 
increases levels of the neurotransmitter dopamine’ and 
‘velociraptors ate meat’ are all things that can be tested 
empirically (provided you can quantify and measure the 
variables concerned). Non-scientific statements can 
sometimes be altered to become scientific statements, 
so ‘The Beatles were the most influential band ever’ is 
non-scientific (because it is probably impossible to quantify 
‘influence’ in any meaningful way) but by changing the 
statement to ‘The Beatles were the best-selling band ever’ 
it becomes testable (we can collect data about worldwide 
record sales and establish whether The Beatles have, in 
fact, sold more records than any other music artist). Karl 
Popper, the famous philosopher of science, believed that 
non-scientific statements were nonsense, and had no 
place in science. Good theories should, therefore, produce 
hypotheses that are scientific statements. 


I would now be smugly sitting in my office with a contented grin on my face about how 
my theories and observations were well supported by the data. Perhaps I would quit while 
I was ahead and retire. It’s more likely, though, that having solved one great mystery, my 
excited mind would turn to another. After another few hours (well, days probably) locked 
up at home watching Big Brother I would emerge triumphant with another profound 


5 It was pretty awesome actually. 

observation, which is that these personality-disordered contestants, despite their obvious 
character flaws, enter the house convinced that the public will love them and that they will 
win. 6 My hypothesis would, therefore, be that if I asked the contestants if they thought that 
they would win, the people with a personality disorder would say yes. 

Let’s imagine I tested my hypothesis by measuring their expectations of success in the 
show, by just asking them, ‘Do you think you will win Big Brother?’. Let’s say that 7 of 
9 contestants with personality disorders said that they thought that they would win, which 
confirms my observation. Next, I would come up with another theory: these contestants 
think that they will win because they don’t realize that they have a personality disorder. 
My hypothesis would be that if I asked these people about whether their personalities were 
different from other people they would say ‘no’. As before, I would collect some more data 
and perhaps ask those who thought that they would win whether they thought that their 
personalities were different from the norm. All 7 contestants said that they thought their 
personalities were different from the norm. These data seem to contradict my theory. This 
is known as falsification, which is the act of disproving a hypothesis or theory. 

It’s unlikely that we would be the only people interested in why individuals 
who go on Big Brother have extreme personalities and think that they will win. 
Imagine that these other researchers discovered that: (1) people with narcissistic personality 
disorder think that they are more interesting than others; (2) they also think 
that they deserve success more than others; and (3) they also think that others 
like them because they have ‘special’ personalities. 

This additional research is even worse news for my theory: if they didn’t realize 
that they had a personality different from the norm then you wouldn’t expect 
them to think that they were more interesting than others, and you certainly 
wouldn’t expect them to think that others will like their unusual personalities. 
In general, this means that my theory sucks: it cannot explain all of the data, 
predictions from the theory are not supported by subsequent data, and it cannot 
explain other research findings. At this point I would start to feel intellectually inadequate 
and people would find me curled up on my desk in floods of tears wailing and moaning 
about my failing career (no change there then). 

At this point, a rival scientist, Fester Ingpant-Stain, appears on the scene with a rival 
theory to mine. In his new theory, he suggests that the problem is not that personality-disordered 
contestants don’t realize that they have a personality disorder (or at least a personality 
that is unusual), but that they falsely believe that this special personality is perceived 
positively by other people (put another way, they believe that their personality makes them 
likeable, not dislikeable). One hypothesis from this model is that if personality-disordered 
contestants are asked to evaluate what other people think of them, then they will overestimate 
other people’s positive perceptions. To test this hypothesis, Fester Ingpant-Stain 
collected yet more data. When each contestant came to the diary room 7 they had to fill out 
a questionnaire evaluating all of the other contestants’ personalities, and also answer each 
question as if they were each of the contestants responding about them. (So, for every contestant 
there is a measure of what they thought of every other contestant, and also a measure 
of what they believed every other contestant thought of them.) He found out that the 
contestants with personality disorders did overestimate their housemates’ view of them; in 
comparison the contestants without personality disorders had relatively accurate impressions 
of what others thought of them. These data, irritating as it would be for me, support 
the rival theory that the contestants with personality disorders know they have unusual 
personalities but believe that these characteristics are ones that others would feel positive 
about. Fester Ingpant-Stain’s theory is quite good: it explains the initial observations and 



6 One of the things I like about Big Brother in the UK is that year upon year the winner tends to be a nice person, 
which does give me faith that humanity favours the nice. 

7 The diary room is a private room in the house where contestants can talk to ‘big brother’ about whatever is on 
their mind. 

brings together a range of research findings. The end result of this whole process (and my 
career) is that we should be able to make a general statement about the state of the world. 
In this case we could state: ‘Big Brother contestants who have personality disorders overestimate 
how much other people like their personality characteristics’. 



SELF-TEST 

Based on what you have read in this section, 
what qualities do you think a scientific theory 
should have? 


1.5. Data collection 1: what to measure 


We have seen already that data collection is vital for testing theories. When we collect data 
we need to decide on two things: (1) what to measure, (2) how to measure it. This section 
looks at the first of these issues. 


1.5.1. Variables 


1.5.1.1. Independent and dependent variables 

To test hypotheses we need to measure variables. Variables are just things that can change 
(or vary); they might vary between people (e.g., IQ, behaviour) or locations (e.g., unemployment) 
or even time (e.g., mood, profit, number of cancerous cells). Most hypotheses 
can be expressed in terms of two variables: a proposed cause and a proposed outcome. For 
example, if we take the scientific statement ‘Coca-Cola is an effective spermicide’ 8 then the 
proposed cause is Coca-Cola and the proposed effect is dead sperm. Both the cause and the 
outcome are variables: for the cause we could vary the type of drink, and for the outcome 
these drinks will kill different amounts of sperm. The key to testing such statements is to 
measure these two variables. 

A variable that we think is a cause is known as an independent variable (because its value 
does not depend on any other variables). A variable that we think is an effect is called a 
dependent variable because the value of this variable depends on the cause (independent 
variable). These terms are very closely tied to experimental methods in which the cause is 
actually manipulated by the experimenter (as we will see in section 1.6.2). In cross-sectional 
research we don’t manipulate any variables and we cannot make causal statements about the 
relationships between variables, so it doesn’t make sense to talk of dependent and independent 
variables because all variables are dependent variables in a sense. One possibility is to 
abandon the terms dependent and independent variable and use the terms predictor variable 
and outcome variable. In experimental work the cause, or independent variable, is a predictor, 
and the effect, or dependent variable, is simply an outcome. This terminology also suits 
cross-sectional work where, statistically at least, we can use one or more variables to make 
predictions about the other(s) without needing to imply causality. 


8 Actually, there is a long-standing urban myth that a post-coital douche with the contents of a bottle of Coke is 
an effective contraceptive. Unbelievably, this hypothesis has been tested and Coke does affect sperm motility, and 
different types of Coke are more or less effective - Diet Coke is best apparently (Umpierre, Hill, & Anderson, 
1985). Nevertheless, a Coke douche is ineffective at preventing pregnancy. 

CRAMMING SAM’S TIPS 


Some important terms 


When doing research there are some important generic terms for variables that you will encounter: 


• Independent variable: A variable thought to be the cause of some effect. This term is usually used in experimental research 
to denote a variable that the experimenter has manipulated. 

• Dependent variable: A variable thought to be affected by changes in an independent variable. You can think of this variable 
as an outcome. 

• Predictor variable: A variable thought to predict an outcome variable. This is basically another term for independent variable 
(although some people won’t like me saying that; I think life would be easier if we talked only about predictors and 
outcomes). 

• Outcome variable: A variable thought to change as a function of changes in a predictor variable. This term could be synonymous 
with ‘dependent variable’ for the sake of an easy life. 


1.5.1.2. Levels of measurement 


As we have seen in the examples so far, variables can take on many different forms and levels 
of sophistication. The relationship between what is being measured and the numbers that 
represent what is being measured is known as the level of measurement. Broadly speaking, 
variables can be categorical or continuous, and can have different levels of measurement. 

A categorical variable is made up of categories. A categorical variable that you should be 
familiar with already is your species (e.g., human, domestic cat, fruit bat, etc.). You are a 
human or a cat or a fruit bat: you cannot be a bit of a cat and a bit of a bat, and neither a 
batman nor (despite many fantasies to the contrary) a catwoman (not even one in a nice 
PVC suit) exist. A categorical variable is one that names distinct entities. In its simplest 
form it names just two distinct types of things, for example male or female. This is known 
as a binary variable. Other examples of binary variables are being alive or dead, pregnant 
or not, and responding ‘yes’ or ‘no’ to a question. In all cases there are just two categories 
and an entity can be placed into only one of the two categories. 

When two things that are equivalent in some sense are given the same name (or number), 
but there are more than two possibilities, the variable is said to be a nominal variable. It 
should be obvious that if the variable is made up of names it is pointless to do arithmetic 
on them (if you multiply a human by a cat, you do not get a hat). However, sometimes 
numbers are used to denote categories. For example, the numbers worn by players in a 
rugby team. In rugby, the numbers of shirts denote specific field positions, so the number 
10 is always worn by the fly-half (e.g., England’s Jonny Wilkinson), 9 and the number 2 is 
always the hooker (the ugly-looking player at the front of the scrum). These numbers do 
not tell us anything other than what position the player plays. We could equally have shirts 
with FH and H instead of 10 and 2. A number 10 player is not necessarily better than a 
number 2 (most managers would not want their fly-half stuck in the front of the scrum!). 
It is equally as daft to try to do arithmetic with nominal scales where the categories are 
denoted by numbers: the number 10 takes penalty kicks, and if the England coach found 
that Jonny Wilkinson (his number 10) was injured he would not get his number 4 to give 
number 6 a piggy-back and then take the kick. The only way that nominal data can be used 
is to consider frequencies. For example, we could look at how frequently number 10s score 
tries compared to number 4s. 
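
As a rough sketch of what ‘considering frequencies’ looks like in practice, the snippet below counts tries by shirt number. It assumes Python, and the list of try scorers is entirely made up for illustration.

```python
from collections import Counter

# Hypothetical data: the shirt number of the player who scored each try
try_scorers = [10, 4, 10, 14, 10, 4, 11, 14, 10]

# Counting how often each category occurs is meaningful for nominal data
tries_by_shirt = Counter(try_scorers)
print(tries_by_shirt[10])  # 4
print(tries_by_shirt[4])   # 2

# Arithmetic on the labels still runs, but the result means nothing:
# there is no player position corresponding to the 'average shirt'.
mean_shirt = sum(try_scorers) / len(try_scorers)
```

The counts answer a sensible question (which position scores most often), whereas the mean shirt number is just arithmetic performed on names.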


9 Unlike, for example, NFL American football where a quarterback could wear any number from 1 to 19. 

JANE SUPERBRAIN 1.2 

Self-report data 

A lot of self-report data are ordinal. Imagine if two judges 
on The X Factor were asked to rate Billie’s singing on 
a 10-point scale. We might be confident that a judge 
who gives a rating of 10 found Billie more talented than 
one who gave a rating of 2, but can we be certain that 
the first judge found her five times more talented than 
the second? What about if both judges gave a rating 
of 8: could we be sure they found her equally talented? 
Probably not: their ratings will depend on their subjective 
feelings about what constitutes talent (the quality of 
singing? showmanship? dancing?). For these reasons, 
in any situation in which we ask people to rate something 
subjective (e.g., rate their preference for a product, 
their confidence about an answer, how much they have 
understood some medical instructions) we should probably 
regard these data as ordinal although many scientists 
do not. 


So far the categorical variables we have considered have been unordered (e.g., different 
brands of Coke with which you’re trying to kill sperm), but they can be ordered too 
(e.g., increasing concentrations of Coke with which you’re trying to kill sperm). When 
categories are ordered, the variable is known as an ordinal variable. Ordinal data tell us 
not only that things have occurred, but also the order in which they occurred. However, 
these data tell us nothing about the differences between values. The X Factor is a TV 
show that is broadcast across the globe in which hopeful singers compete to win a recording 
contract. It is a hugely popular show, which could (if you take a depressing view) 
reflect the fact that Western society values ‘luck’ more than hard work. (This comment 
in no way reflects my bitterness at spending years learning musical instruments and trying 
to create original music, only to be beaten to musical fame and fortune by a 15-year-old 
who can sing other people’s songs, a bit.) Anyway, imagine the three winners of a 
particular X Factor series were Billie, Freema and Elizabeth. The names of the winners 
don’t provide any information about where they came in the contest; however, labelling 
them according to their performance does - first, second and third. These categories are 
ordered. In using ordered categories we now know that the woman who won was better 
than the women who came second and third. We still know nothing about the differences 
between categories, though. We don’t, for example, know how much better the winner 
was than the runners-up: Billie might have been an easy victor, getting many more votes 
than Freema and Elizabeth, or it might have been a very close contest that she won by 
only a single vote. Ordinal data, therefore, tell us more than nominal data (they tell us 
the order in which things happened) but they still do not tell us about the differences 
between points on a scale. 
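
To see how much information an ordinal scale throws away, consider the sketch below. It assumes Python; the vote tallies and the helper finishing_order are hypothetical, chosen only to contrast an easy win with a one-vote win.

```python
def finishing_order(votes):
    """Return contestant names ordered from most to fewest votes."""
    return sorted(votes, key=votes.get, reverse=True)

# Two hypothetical vote tallies: an easy victory and a one-vote victory
landslide = {"Billie": 900_000, "Freema": 60_000, "Elizabeth": 40_000}
close_call = {"Billie": 333_334, "Freema": 333_333, "Elizabeth": 333_332}

print(finishing_order(landslide))   # ['Billie', 'Freema', 'Elizabeth']
print(finishing_order(close_call))  # ['Billie', 'Freema', 'Elizabeth']
```

Both contests produce exactly the same ordinal data (first, second, third), even though the gaps between the contestants could hardly be more different.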

The next level of measurement moves us away from categorical variables and into continuous 
variables. A continuous variable is one that gives us a score for each entity and can 
take on any value on the measurement scale that we are using. The first type of continuous 
variable that you might encounter is an interval variable. Interval data are considerably 
more useful than ordinal data and most of the statistical tests in this book rely on 
having data measured at this level. To say that data are interval, we must be certain that 
equal intervals on the scale represent equal differences in the property being measured. For 
example, on www.ratemyprofessors.com students are encouraged to rate their lecturers on 
several dimensions (some of the lecturers’ rebuttals of their negative evaluations are worth 
a look). Each dimension (i.e., helpfulness, clarity, etc.) is evaluated using a 5-point scale. 
For this scale to be interval it must be the case that the difference between helpfulness ratings 
of 1 and 2 is the same as the difference between say 3 and 4, or 4 and 5. Similarly, the 
difference in helpfulness between ratings of 1 and 3 should be identical to the difference 
between ratings of 3 and 5. Variables like this that look interval (and are treated as interval) 
are often ordinal - see Jane Superbrain Box 1.2. 

Ratio variables go a step further than interval data by requiring that in addition to the 
measurement scale meeting the requirements of an interval variable, the ratios of values 
along the scale should be meaningful. For this to be true, the scale must have a true and 
meaningful zero point. In our lecturer ratings this would mean that a lecturer rated as 4 
would be twice as helpful as a lecturer rated with a 2 (who would also be twice as helpful as 
a lecturer rated as 1!). The time to respond to something is a good example of a ratio variable. 
When we measure a reaction time, not only is it true that, say, the difference between 
300 and 350 ms (a difference of 50 ms) is the same as the difference between 210 and 
260 ms or 422 and 472 ms, but also it is true that distances along the scale are divisible: a 
reaction time of 200 ms is twice as long as a reaction time of 100 ms and twice as short as 
a reaction time of 400 ms. 
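
The two properties of a ratio scale can be checked directly with the reaction times used above. This is a minimal sketch in Python (an assumption of this example, as are the variable names), using only the numbers from the text.

```python
# Reaction times in milliseconds, taken from the text
pairs = [(300, 350), (210, 260), (422, 472)]

# Interval property: equal distances anywhere along the scale
differences = [later - earlier for earlier, later in pairs]
print(differences)  # [50, 50, 50]

# Ratio property: zero is a true zero, so ratios of values are meaningful
ratio_200_to_100 = 200 / 100   # 200 ms is twice as long as 100 ms
ratio_400_to_200 = 400 / 200   # 400 ms is twice as long as 200 ms
print(ratio_200_to_100, ratio_400_to_200)  # 2.0 2.0
```

The same division could be carried out on the 5-point helpfulness ratings, but the result would only be meaningful if a rating of 4 really did reflect twice the helpfulness of a rating of 2.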



JANE SUPERBRAIN 1.3 

Continuous and discrete variables 

The distinction between discrete and continuous variables 
can be very blurred. For one thing, continuous variables 
can be measured in discrete terms; for example, when we 
measure age we rarely use nanoseconds but use years (or 
possibly years and months). In doing so we turn a continuous 
variable into a discrete one (the only acceptable values 
are years). Also, we often treat discrete variables as if they 
were continuous. For example, the number of boyfriends/ 
girlfriends that you have had is a discrete variable (it will be, 
in all but the very weird cases, a whole number). However, 
you might read a magazine that says ‘the average number 
of boyfriends that women in their 20s have has increased 
from 4.6 to 8.9’. This assumes that the variable is continuous, 
and of course these averages are meaningless: no 
one in their sample actually had 8.9 boyfriends. 
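
A small sketch shows how an average of whole-number counts ends up as a value nobody actually has. It assumes Python, and the ten counts are invented so that their mean comes out at the 8.9 quoted by the magazine.

```python
from statistics import mean

# Hypothetical sample: number of boyfriends reported by ten women.
# Every individual value is a whole number (a discrete variable).
counts = [5, 7, 8, 8, 9, 9, 10, 10, 11, 12]

average = mean(counts)
print(average)            # 8.9
print(average in counts)  # False: no one in the sample has this value
```

Computing the mean treats the counts as if they lay on a continuous scale, which is why the result is not a value that any person could actually report.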


Continuous variables can be, well, continuous (obviously) but also discrete. This is quite 
a tricky distinction (Jane Superbrain Box 1.3). A truly continuous variable can be measured 
to any level of precision, whereas a discrete variable can take on only certain values (usually 
whole numbers) on the scale. What does this actually mean? Well, our example in the 
text of rating lecturers on a 5-point scale is an example of a discrete variable. The range of 
the scale is 1-5, but you can enter only values of 1, 2, 3, 4 or 5; you cannot enter a value 
of 4.32 or 2.18. Although a continuum exists underneath the scale (i.e., a rating of 3.24 
makes sense), the actual values that the variable takes on are limited. A continuous variable 
would be something like age, which can be measured at an infinite level of precision (you 
could be 34 years, 7 months, 21 days, 10 hours, 55 minutes, 10 seconds, 100 milliseconds, 
63 microseconds, 1 nanosecond old). 

CRAMMING SAM’S TIPS 

Levels of measurement 

Variables can be split into categorical and continuous, and within these types there are different levels of 
measurement: 

• Categorical (entities are divided into distinct categories): 

o Binary variable: There are only two categories (e.g., dead or alive). 

o Nominal variable: There are more than two categories (e.g., whether someone is an omnivore, vegetarian, vegan, or 
fruitarian). 

o Ordinal variable: The same as a nominal variable but the categories have a logical order (e.g., whether people got a fail, a 
pass, a merit or a distinction in their exam). 

• Continuous (entities get a distinct score): 

o Interval variable: Equal intervals on the variable represent equal differences in the property being measured (e.g., the 
difference between 6 and 8 is equivalent to the difference between 13 and 15). 

o Ratio variable: The same as an interval variable, but the ratios of scores on the scale must also make sense (e.g., a score 
of 16 on an anxiety scale means that the person is, in reality, twice as anxious as someone scoring 8). 


Measurement error 


We have seen that to test hypotheses we need to measure variables. Obviously, it’s also 
important that we measure these variables accurately. Ideally we want our measure to be 
calibrated such that values have the same meaning over time and across situations. Weight 
is one example: we would expect to weigh the same amount regardless of who weighs 
us, or where we take the measurement (assuming it’s on Earth and not in an anti-gravity 
chamber). Sometimes variables can be directly measured (profit, weight, height) but in 
other cases we are forced to use indirect measures such as self-report, questionnaires and 
computerized tasks (to name but a few). 

Let’s go back to our Coke as a spermicide example. Imagine we took some Coke and 
some water and added them to two test tubes of sperm. After several minutes, we measured 
the motility (movement) of the sperm in the two samples and discovered no difference. A 
few years passed and another scientist, Dr Jack Q. Late, replicated the study but found that 
sperm motility was worse in the Coke sample. There are two measurement-related issues 
that could explain his success and our failure: (1) Dr Late might have used more Coke in 
the test tubes (sperm might need a critical mass of Coke before they are affected); (2) Dr 
Late measured the outcome (motility) differently than us. 

The former point explains why chemists and physicists have devoted many hours to 
developing standard units of measurement. If you had reported that you’d used 100 ml 
of Coke and 5 ml of sperm, then Dr Late could have ensured that he had used the same 
amount - because millilitres are a standard unit of measurement we would know that Dr 
Late used exactly the same amount of Coke that we used. Direct measurements such as the 
millilitre provide an objective standard: 100 ml of a liquid is known to be twice as much 
as only 50 ml. 

The second reason for the difference in results between the studies could have been to 
do with how sperm motility was measured. Perhaps in our original study we measured 
motility using absorption spectrophotometry, whereas Dr Late used laser light-scattering 
techniques. 10 Perhaps his measure is more sensitive than ours. 

There will often be a discrepancy between the numbers we use to represent the thing 
we’re measuring and the actual value of the thing we’re measuring (i.e., the value we would 
get if we could measure it directly). This discrepancy is known as measurement error. For 
example, imagine that you know as an absolute truth that you weigh 80 kg. One day 
you step on the bathroom scales and it says 83 kg. There is a difference of 3 kg between 
your actual weight and the weight given by your measurement tool (the scales): there is a 
measurement error of 3 kg. Although properly calibrated bathroom scales should produce 
only very small measurement errors (despite what we might want to believe when it says 
we have gained 3 kg), self-report measures do produce measurement error because factors 
other than the one you’re trying to measure will influence how people respond to our 
measures. Imagine you were completing a questionnaire that asked you whether you had 
stolen from a shop. If you had, would you admit it, or might you be tempted to conceal 
this fact? 


Validity and reliability 


One way to try to ensure that measurement error is kept to a minimum is to determine 
properties of the measure that give us confidence that it is doing its job properly. The first 
property is validity, which is whether an instrument actually measures what it sets out to 
measure. The second is reliability, which is whether an instrument can be interpreted 
consistently across different situations. 

Validity refers to whether an instrument measures what it was designed to measure; 
a device for measuring sperm motility that actually measures sperm count is not valid. 
Things like reaction times and physiological measures are valid in the sense that a reaction 
time does in fact measure the time taken to react and skin conductance does measure the 
conductivity of your skin. However, if we’re using these things to infer other things (e.g., 
using skin conductance to measure anxiety) then they will be valid only if no factors 
other than the one we’re interested in can influence them. 

Criterion validity is whether the instrument is measuring what it claims to measure (does 
your lecturer helpfulness rating scale actually measure lecturers’ helpfulness?). In an ideal 
world, you could assess this by relating scores on your measure to real-world observations. 
For example, we could take an objective measure of how helpful lecturers were and compare 
these observations to students’ ratings on ratemyprofessor.com. This is often impractical 
and, of course, with attitudes you might not be interested in the reality so much as 
the person’s perception of reality (you might not care whether they are a psychopath but 
whether they think they are a psychopath). With self-report measures/questionnaires we 
can also assess the degree to which individual items represent the construct being measured, 
and cover the full range of the construct (content validity). 

Validity is a necessary but not sufficient condition of a measure. A second consideration 
is reliability, which is the ability of the measure to produce the same results under the same 
conditions. To be valid the instrument must first be reliable. The easiest way to assess reliability 
is to test the same group of people twice: a reliable instrument will produce similar 
scores at both points in time (test-retest reliability). Sometimes, however, you will want to 
measure something that does vary over time (e.g., moods, blood-sugar levels, productivity). 
Statistical methods can also be used to determine reliability (we will discover these in 
Chapter 17). 
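
As a rough illustration of test-retest reliability, the sketch below correlates two sets of scores from the same people measured on two occasions. The scores are made up, and pearson_r is a hypothetical helper written out by hand rather than taken from any particular package. A correlation close to 1 would suggest that people keep roughly the same relative standing on both occasions, which is what we would hope for from a reliable measure of something stable.

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-products and sums of squares of the deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical questionnaire scores for the same six people, tested twice
time1 = [12, 15, 9, 20, 17, 11]
time2 = [13, 14, 10, 19, 18, 10]

test_retest_r = pearson_r(time1, time2)
print(f"test-retest correlation = {test_retest_r:.2f}")
```

If the thing being measured genuinely changes over time (moods, for example), a low correlation here would not by itself show that the instrument is unreliable.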

10 In the course of writing this chapter I have discovered more than I think is healthy about the measurement of 
sperm motility. 
SELF-TEST 

• What is the difference between reliability and validity? 


1.6. Data collection 2: how to measure 


Correlational research methods 


So far we’ve learnt that scientists want to answer questions, and that to do this they have 
to generate data (be they numbers or words), and to generate good data they need to use 
accurate measures. We move on now to look briefly at how the data are collected. If we 
simplify things quite a lot then there are two ways to test a hypothesis: either by observing 
what naturally happens, or by manipulating some aspect of the environment and observing 
the effect it has on the variable that interests us. 

The main distinction between what we could call correlational or cross-sectional research 
(where we observe what naturally goes on in the world without directly interfering with it) 
and experimental research (where we manipulate one variable to see its effect on another) 
is that experimentation involves the direct manipulation of variables. In correlational 
research we do things like observe natural events or we take a snapshot of many variables 
at a single point in time. As some examples, we might measure pollution levels in a 
stream and the numbers of certain types of fish living there; lifestyle variables (smoking, 
exercise, food intake) and disease (cancer, diabetes); workers’ job satisfaction under different 
managers; or children’s school performance across regions with different demographics. 
Correlational research provides a very natural view of the question we’re researching 
because we are not influencing what happens and the measures of the variables should not 
be biased by the researcher being there (this is an important aspect of ecological validity). 

At the risk of sounding like I’m absolutely obsessed with using Coke as a contraceptive 
(I’m not, but my discovery that people in the 1950s and 1960s actually tried this has, I 
admit, intrigued me), let’s return to that example. If we wanted to answer the question ‘Is 
Coke an effective contraceptive?’ we could administer questionnaires about sexual practices 
(quantity of sexual activity, use of contraceptives, use of fizzy drinks as contraceptives, 
pregnancy, etc.). By looking at these variables we could see which variables predict 
pregnancy, and in particular whether those reliant on Coca-Cola as a form of contraceptive 
were more likely to end up pregnant than those using other contraceptives, and less likely 
than those using no contraceptives at all. This is the only way to answer a question like this 
because we cannot manipulate any of these variables particularly easily. Even if we could, 
it would be totally unethical to insist on some people using Coke as a contraceptive (or 
indeed to do anything that would make a person likely to produce a child that they didn’t 
intend to produce). However, there is a price to pay, which relates to causality. 


Experimental research methods 


Most scientific questions imply a causal link between variables; we have seen already that 
dependent and independent variables are named such that a causal connection is implied 
(the dependent variable depends on the independent variable). Sometimes the causal link 
is very obvious, as in the research question ‘Does low self-esteem cause dating anxiety?’. 
Sometimes the implication might be subtler - for example, in the question ‘Is dating anxiety 
all in the mind?’ the implication is that a person’s mental outlook causes them to be 
anxious when dating. Even when the cause-effect relationship is not explicitly stated, most 
research questions can be broken down into a proposed cause (in this case mental outlook) 
and a proposed outcome (dating anxiety). Both the cause and the outcome are variables: 
for the cause some people will perceive themselves in a negative way (so it is something 
that varies); and for the outcome, some people will get anxious on dates and others won’t 
(again, this is something that varies). The key to answering the research question is to 
uncover how the proposed cause and the proposed outcome relate to each other; is it 
the case that the people who have a low opinion of themselves are the same people that 
get anxious on dates? 

David Hume (see Hume, 1739-40, 1748, for more detail), 11 an influential 
philosopher, said that to infer cause and effect: (1) cause and 
effect must occur close together in time (contiguity); (2) the cause must 
occur before an effect does; and (3) the effect should never occur without 
the presence of the cause. These conditions imply that causality can 
be inferred through corroborating evidence: cause is equated to high 
degrees of correlation between contiguous events. In our dating example, 
to infer that low self-esteem caused dating anxiety, it would be sufficient 
to find that whenever someone had low self-esteem they would feel anxious 
when on a date, that the low self-esteem emerged before the dating 
anxiety did, and that the person should never have dating anxiety if they 
haven’t been suffering from low self-esteem. 

In the previous section on correlational research, we saw that variables are often measured 
simultaneously. The first problem with doing this is that it provides no information 
about the contiguity between different variables: we might find from a questionnaire study 
that people with low self-esteem also have dating anxiety but we wouldn’t know whether 
the low self-esteem or the dating anxiety came first! 

Let’s imagine that we find that there are people who have low self-esteem but do not get 
dating anxiety. This finding doesn’t violate Hume’s rules: he doesn’t say anything about 
the cause happening without the effect. It could be that both low self-esteem and dating 
anxiety are caused by a third variable (e.g., poor social skills which might make you feel 
generally worthless but also put pressure on you in dating situations). This illustrates a second 
problem with correlational evidence: the tertium quid (‘a third person or thing of indeterminate 
character’). For example, a correlation has been found between having breast 
implants and suicide (Koot, Peeters, Granath, Grobbee, & Nyren, 2003). However, it is 
unlikely that having breast implants causes you to commit suicide - presumably, there is an 
external factor (or factors) that causes both; for example, low self-esteem might lead you 
to have breast implants and also attempt suicide. These extraneous factors are sometimes 
called confounding variables or confounds for short. 
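
To see how a third variable can manufacture a correlation, here is a small simulation sketch. Everything in it is an assumption made for illustration: the variable names, the effect sizes, the normally distributed scores and the hand-written pearson_r helper. Neither simulated outcome influences the other; both depend only on the confound plus random noise.

```python
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(1)  # fixed seed so the sketch is repeatable
n = 5000

# The confound: a hypothetical 'poor social skills' score for each person
poor_social_skills = [rng.gauss(0, 1) for _ in range(n)]

# Each outcome is the confound plus its own independent noise
low_self_esteem = [s + rng.gauss(0, 1) for s in poor_social_skills]
dating_anxiety = [s + rng.gauss(0, 1) for s in poor_social_skills]

# The two outcomes correlate even though neither causes the other
r_raw = pearson_r(low_self_esteem, dating_anxiety)

# Subtracting the confound's contribution (known here only because we
# simulated the data ourselves) leaves just the two independent noise terms
esteem_left_over = [e - s for e, s in zip(low_self_esteem, poor_social_skills)]
anxiety_left_over = [a - s for a, s in zip(dating_anxiety, poor_social_skills)]
r_adjusted = pearson_r(esteem_left_over, anxiety_left_over)

print(f"raw correlation: {r_raw:.2f}, after removing the confound: {r_adjusted:.2f}")
```

With real data we rarely know the confound, let alone its exact contribution, which is why correlational evidence alone cannot rule it out.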

The shortcomings of Hume’s criteria led John Stuart Mill (1865) to add a further criterion: 
that all other explanations of the cause-effect relationship be ruled out. Put simply, 
Mill proposed that, to rule out confounding variables, an effect should be present when the 
cause is present and that when the cause is absent the effect should be absent also. Mill’s 
ideas can be summed up by saying that the only way to infer causality is through comparison 
of two controlled situations: one in which the cause is present and one in which the 
cause is absent. This is what experimental methods strive to do: to provide a comparison of 
situations (usually called treatments or conditions) in which the proposed cause is present 
or absent. 

11 Both of these can be read online at http://www.utilitarian.net/hume/ or by doing a Google search 
for David Hume. 
As a simple case, we might want to see what effect motivators have on learning about 
statistics. I might, therefore, randomly split some students into three different groups in 
which I change my style of teaching in the seminars on the course: 

• Group 1 (positive reinforcement): During seminars I congratulate all students in this 
group on their hard work and success. Even when they get things wrong, I am supportive 
and say things like ‘that was very nearly the right answer, you’re coming 
along really well’ and then give them a nice piece of chocolate. 

• Group 2 (punishment): This group receives seminars in which I give relentless verbal 
abuse to all of the students even when they give the correct answer. I demean their 
contributions and am patronizing and dismissive of everything they say. I tell students 
that they are stupid, worthless and shouldn’t be doing the course at all. 

• Group 3 (no motivator): This group receives normal university style seminars (some 
might argue that this is the same as group 2!). Students are not praised or punished 
and instead I give them no feedback at all. 

The thing that I have manipulated is the teaching method (positive reinforcement, punishment 
or no motivator). As we have seen earlier in this chapter, this variable is known 
as the independent variable and in this situation it is said to have three levels, because it 
has been manipulated in three ways (i.e., motivator has been split into three types: positive 
reinforcement, punishment and none). Once I have carried out this manipulation I must 
have some kind of outcome that I am interested in measuring. In this case it is statistical 
ability, and I could measure this variable using a statistics exam after the last seminar. We 
have also already discovered that this outcome variable is known as the dependent variable 
because we assume that these scores will depend upon the type of teaching method 
used (the independent variable). The critical thing here is the inclusion of the no-motivator 
group because this is a group in which our proposed cause (motivator) is absent, and we 
can compare the outcome in this group against the two situations where the proposed 
cause is present. If the statistics scores are different in each of the motivation groups (cause 
is present) compared to the group for which no motivator was given (cause is absent) then 
this difference can be attributed to the type of motivator used. In other words, the motivator 
used caused a difference in statistics scores (Jane Superbrain Box 1.4). 



JANE SUPERBRAIN 1.4 

Causality and statistics 

People sometimes get confused and think that certain 
statistical procedures allow causal inferences and others 
don't. This isn't true: what allows causal inference is the 
fact that in experiments we manipulate the causal variable 
systematically to see its effect on an outcome (the effect). 
In correlational research 
we observe the co-occurrence of variables; we do not 
manipulate the causal variable first and then measure the 
effect, therefore we cannot compare the effect when the 
causal variable is present against when it is absent. In 
short, we cannot say which variable causes a change in 
the other; we can merely say that the variables co-occur 
in a certain way. The reason why some people think that 
certain statistical tests allow causal inferences is because 
historically certain tests (e.g., ANOVA, t-tests) have been 
used to analyse experimental research, whereas others 
(e.g., regression, correlation) have been used to analyse 
correlational research (Cronbach, 1957). As you’ll 
discover, these statistical procedures are, in fact, 
mathematically identical. 
1.6.2.1. Two methods of data collection 


When we collect data in an experiment, we can choose between two methods of data collection. 
The first is to manipulate the independent variable using different participants. 
This method is the one described above, in which different groups of people take part in 
each experimental condition (a between-groups, between-subjects, or independent design). 
The second method is to manipulate the independent variable using the same participants. 
Simplistically, this method means that we give a group of students positive reinforcement 
for a few weeks and test their statistical abilities and then begin to give this same group 
punishment for a few weeks before testing them again, and then finally giving them no 
motivator and testing them for a third time (a within-subject or repeated-measures design). 
As you will discover, the way in which the data are collected determines the type of test 
that is used to analyse the data. 


1.6.2.2. Two types of variation 

Imagine we were trying to see whether you could train chimpanzees to run the economy. 
In one training phase they are sat in front of a chimp-friendly computer and press buttons 
which change various parameters of the economy; once these parameters have been 
changed a figure appears on the screen indicating the economic growth resulting from 
those parameters. Now, chimps can’t read (I don’t think) so this feedback is meaningless. 
A second training phase is the same except that if the economic growth is good, they get a 
banana (if growth is bad they do not) - this feedback is valuable to the average chimp. This 
is a repeated-measures design with two conditions: the same chimps participate in 
condition 1 and in condition 2. 

Let’s take a step back and think what would happen if we did not introduce an experimental 
manipulation (i.e., there were no bananas in the second training phase so condition 
1 and condition 2 were identical). If there is no experimental manipulation then we expect 
a chimp’s behaviour to be similar in both conditions. We expect this because external factors 
such as age, gender, IQ, motivation and arousal will be the same for both conditions 
(a chimp’s gender etc. will not change from when they are tested in condition 1 to when 
they are tested in condition 2). If the performance measure is reliable (i.e., our test of how 
well they run the economy), and the variable or characteristic that we are measuring (in 
this case ability to run an economy) remains stable over time, then a participant’s performance 
in condition 1 should be very highly related to their performance in condition 2. So, 
chimps who score highly in condition 1 will also score highly in condition 2, and those who 
have low scores for condition 1 will have low scores in condition 2. However, performance 
won’t be identical: there will be small differences in performance created by unknown factors. 
This variation in performance is known as unsystematic variation. 

If we introduce an experimental manipulation (i.e., provide bananas as feedback in one 
of the training sessions), then we do something different to participants in condition 1 than 
what we do to them in condition 2. So, the only difference between conditions 1 and 2 is 
the manipulation that the experimenter has made (in this case that the chimps get bananas 
as a positive reward in one condition but not in the other). Therefore, any difference 
between the means of the two conditions is probably due to the experimental manipulation. 
So, if the chimps perform better in one training phase than the other then this has to 
be due to the fact that bananas were used to provide feedback in one training phase but not 
the other. Differences in performance created by a specific experimental manipulation are 
known as systematic variation. 

Now let’s think about what happens when we use different participants - an independent 
design. In this design we still have two conditions, but this time different participants 
participate in each condition. Going back to our example, one group of chimps receives 
training without feedback, whereas a second group of different chimps does receive feedback 
on their performance via bananas. 12 Imagine again that we didn’t have an experimental 
manipulation. If we did nothing to the groups, then we would still find some variation 
in behaviour between the groups because they contain different chimps who will vary in 
their ability, motivation, IQ and other factors. In short, the type of factors that were held 
constant in the repeated-measures design are free to vary in the independent-measures 
design. So, the unsystematic variation will be bigger than for a repeated-measures design. 
As before, if we introduce a manipulation (i.e., bananas) then we will see additional variation 
created by this manipulation. As such, in both the repeated-measures design and the 
independent-measures design there are always two sources of variation: 


• Systematic variation: This variation is due to the experimenter doing something to all 
of the participants in one condition but not in the other condition. 

• Unsystematic variation: This variation results from random factors that exist between 
the experimental conditions (natural differences in ability, the time of day, etc.). 

The role of statistics is to discover how much variation there is in performance, and then 
to work out how much of this is systematic and how much is unsystematic. 

In a repeated-measures design, differences between two conditions can be caused by 
only two things: (1) the manipulation that was carried out on the participants, or (2) any 
other factor that might affect the way in which a participant performs from one time to 
the next. The latter factor is likely to be fairly minor compared to the influence of the 
experimental manipulation. In an independent design, differences between the two condi¬ 
tions can also be caused by one of two things: (1) the manipulation that was carried out on 
the participants, or (2) differences between the characteristics of the participants allocated 
to each of the groups. The latter factor in this instance is likely to create considerable 
random variation both within each condition and between them. Therefore, the effect 
of our experimental manipulation is likely to be more apparent in a repeated-measures 
design than in a between-group design because in the former unsystematic variation can 
be caused only by differences in the way in which someone behaves at different times. In 
independent designs we have differences in innate ability contributing to the unsystematic 
variation. Therefore, this error variation will almost always be much larger than if the same 
participants had been used. When we look at the effect of our experimental manipulation, 
it is always against a background of ‘noise’ caused by random, uncontrollable differences 
between our conditions. In a repeated-measures design this ‘noise’ is kept to a minimum 
and so the effect of the experiment is more likely to show up. This means that, other things 
being equal, repeated-measures designs have more power to detect effects than independent 
designs. 
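
The difference in ‘noise’ between the two designs can be sketched with a small simulation. All of the numbers are assumptions chosen for illustration: each simulated chimp has a stable ability, bananas add a fixed 5 points, and every test adds a little moment-to-moment noise. The sketch compares how precisely each design estimates the difference between conditions (the standard error of the mean difference, a quantity used here only as a convenient yardstick).

```python
import random
from statistics import mean, stdev

rng = random.Random(42)  # fixed seed so the sketch is repeatable
n = 30                   # chimps per condition (hypothetical)
EFFECT = 5.0             # assumed systematic effect of bananas


def score(ability, banana):
    """Score = stable ability + systematic effect (if any) + unsystematic noise."""
    return ability + (EFFECT if banana else 0.0) + rng.gauss(0, 2)


# Repeated-measures design: the same chimps are tested in both conditions,
# so each chimp's stable ability cancels out of its difference score.
abilities = [rng.gauss(50, 10) for _ in range(n)]
without_banana = [score(a, False) for a in abilities]
with_banana = [score(a, True) for a in abilities]
differences = [b - a for a, b in zip(without_banana, with_banana)]
se_repeated = stdev(differences) / n ** 0.5

# Independent design: different chimps in each condition, so differences in
# ability between individuals add to the unsystematic variation.
group_without = [score(rng.gauss(50, 10), False) for _ in range(n)]
group_with = [score(rng.gauss(50, 10), True) for _ in range(n)]
se_independent = (stdev(group_without) ** 2 / n + stdev(group_with) ** 2 / n) ** 0.5

print(f"repeated measures: effect ~ {mean(differences):.1f}, SE = {se_repeated:.2f}")
print(f"independent: effect ~ {mean(group_with) - mean(group_without):.1f}, "
      f"SE = {se_independent:.2f}")
```

Under these assumptions the repeated-measures estimate has the smaller standard error, because individual differences in ability contribute to the noise only in the independent design.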


Randomization © 


In both repeated-measures and independent-measures designs it is important to try to keep 
the unsystematic variation to a minimum. By keeping the unsystematic variation as small 
as possible we get a more sensitive measure of the experimental manipulation. Generally, 
scientists use the randomization of participants to treatment conditions to achieve this goal. 


12 When I say ‘via’ I don’t mean that the bananas developed little banana mouths that opened up and said ‘well 
done old chap, the economy grew that time’ in chimp language. I mean that when they got something right they 
received a banana as a reward for their correct response. 





DISCOVERING STATISTICS USING R 


Many statistical tests work by identifying the systematic and unsystematic sources of variation 
and then comparing them. This comparison allows us to see whether the experiment 
has generated considerably more variation than we would have got had we just tested 
participants without the experimental manipulation. Randomization is important because 
it eliminates most other sources of systematic variation, which allows us to be sure that 
any systematic variation between experimental conditions is due to the manipulation of 
the independent variable. We can use randomization in two different ways depending on 
whether we have an independent- or repeated-measures design. 

Let’s look at a repeated-measures design first. When the same people participate in more 
than one experimental condition they are naive during the first experimental condition but 
they come to the second experimental condition with prior experience of what is expected 
of them. At the very least they will be familiar with the dependent measure (e.g., the task 
they’re performing). The two most important sources of systematic variation in this type 
of design are: 


• Practice effects: Participants may perform differently in the second condition because 
of familiarity with the experimental situation and/or the measures being used. 

• Boredom effects: Participants may perform differently in the second condition because 
they are tired or bored from having completed the first condition. 

Although these effects are impossible to eliminate completely, we can ensure that they 
produce no systematic variation between our conditions by counterbalancing the order in 
which a person participates in a condition. 

We can use randomization to determine in which order the conditions are completed. 
That is, we randomly determine whether a participant completes condition 1 before 
condition 2, or condition 2 before condition 1. Let’s look at the teaching method example and 
imagine that there were just two conditions: no motivator and punishment. If the same 
participants were used in all conditions, then we might find that statistical ability was 
higher after the punishment condition. However, if every student experienced the punishment 
after the no-motivator seminars then they would enter the punishment condition 
already having a better knowledge of statistics than when they began the no-motivator 
condition. So, the apparent improvement after punishment would not be due to the experimental 
manipulation (i.e., it’s not because punishment works), but because participants 
had attended more statistics seminars by the end of the punishment condition compared 
to the no-motivator one. We can use randomization to ensure that the number of statistics 
seminars does not introduce a systematic bias by randomly assigning students to have the 
punishment seminars first or the no-motivator seminars first. 
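
A minimal sketch of this idea, assuming eight hypothetical students and the two conditions from the example: the students are shuffled into a random sequence and then alternately given one order or the other, so that who gets which order is random but each order is used equally often. The names and the seed are invented for illustration.

```python
import random

rng = random.Random(7)  # fixed seed so the sketch is repeatable

students = [f"student_{i}" for i in range(1, 9)]  # hypothetical participants
orders = [
    ("no motivator", "punishment"),
    ("punishment", "no motivator"),
]

# Shuffle a copy of the list, then alternate the two orders down the
# shuffled list: random assignment with equal numbers in each order.
shuffled = students[:]
rng.shuffle(shuffled)
schedule = {student: orders[i % 2] for i, student in enumerate(shuffled)}

for student in students:
    first, second = schedule[student]
    print(f"{student}: {first} first, then {second}")
```

Flipping a coin separately for each student would also be random, but in a small sample it could easily leave most people in the same order, which is why this sketch balances the two orders.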

If we turn our attention to independent designs, a similar argument can be applied. We 
know that different participants participate in different experimental conditions and that 
these participants will differ in many respects (their IQ, attention span, etc.). Although we 
know that these confounding variables contribute to the variation between conditions, 
we need to make sure that these variables contribute to the unsystematic variation and 
not the systematic variation. The way to ensure that confounding variables are unlikely to 
contribute systematically to the variation between experimental conditions is to randomly 
allocate participants to a particular experimental condition. This should ensure that these 
confounding variables are evenly distributed across conditions. 
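
Random allocation itself is easy to sketch. The function name randomly_allocate, the participant numbers and the condition labels below are all hypothetical; the point is only that a shuffle, not the researcher, decides who ends up in which condition.

```python
import random


def randomly_allocate(participants, conditions, seed=None):
    """Shuffle the participants, then deal them out to the conditions in turn."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    groups = {condition: [] for condition in conditions}
    for i, person in enumerate(shuffled):
        groups[conditions[i % len(conditions)]].append(person)
    return groups


# Twenty hypothetical participants, identified by number, and two conditions
groups = randomly_allocate(range(1, 21), ["condition 1", "condition 2"], seed=3)
for condition, members in groups.items():
    print(condition, sorted(members))
```

Because the shuffle ignores IQ, attention span and everything else about the participants, those characteristics have no systematic route into one condition rather than the other, although in any single small sample the groups can still differ by chance.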

A good example is the effects of alcohol on personality. You might give one group of 
people 5 pints of beer, and keep a second group sober, and then count how many fights 
each person gets into. The effect that alcohol has on people can be very variable because 
of different tolerance levels: teetotal people can become very drunk on a small amount, 
while alcoholics need to consume vast quantities before the alcohol affects them. Now, 
if you allocated a bunch of teetotal participants to the condition that consumed alcohol, 
then you might find no difference between them and the sober group (because the teetotal 
participants are all unconscious after the first glass and so can’t become involved in any 
fights). As such, the person’s prior experiences with alcohol will create systematic variation 
that cannot be dissociated from the effect of the experimental manipulation. The best way 
to reduce this eventuality is to randomly allocate participants to conditions. 



SELF-TEST 

• Why is randomization important? 


1.7. Analysing data 


The final stage of the research process is to analyse the data you have collected. When the 
data are quantitative this involves both looking at your data graphically to see what the 
general trends in the data are, and also fitting statistical models to the data. 


Frequency distributions 


Once you’ve collected some data a very useful thing to do is to plot a graph of how many 
times each score occurs. This is known as a frequency distribution, or histogram, which is a 
graph plotting values of observations on the horizontal axis, with a bar showing how many 
times each value occurred in the data set. Frequency distributions can be very useful for 
assessing properties of the distribution of scores. We will find out how to create these types 
of charts in Chapter 4. 
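
Before any proper charts, the idea can be sketched in a few lines of Python. The ratings below are invented, and the ‘bars’ are just rows of # characters: the code counts how many times each score occurs and prints one bar per value.

```python
from collections import Counter

# Hypothetical ratings of a lecturer on a 5-point scale
scores = [3, 4, 2, 5, 3, 3, 4, 1, 3, 4, 2, 3, 5, 4, 3]

# Count how many times each score occurs
frequencies = Counter(scores)

# Print a crude sideways histogram: one row per score, one # per occurrence
for value in sorted(frequencies):
    print(f"{value} | {'#' * frequencies[value]}")
```

In this made-up sample the longest bar sits at the middle score and the bars shorten towards both ends of the scale.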

Frequency distributions come in many different shapes and sizes. It is 
quite important, therefore, to have some general descriptions for common 
types of distributions. In an ideal world our data would be distributed symmetrically 
around the centre of all scores. As such, if we drew a vertical 
line through the centre of the distribution then it should look the same on 
both sides. This is known as a normal distribution and is characterized by 
the bell-shaped curve with which you might already be familiar. This shape 
basically implies that the majority of scores lie around the centre of the 
distribution (so the largest bars on the histogram are all around the central 
value). Also, as we get further away from the centre the bars get smaller, 
implying that as scores start to deviate from the centre their frequency 
is decreasing. As we move still further away from the centre our scores 
become very infrequent (the bars are very short). Many naturally occurring 
things have this shape of distribution. For example, most men in the UK are about 175 cm 
tall,¹³ some are a bit taller or shorter but most cluster around this value. There will be very 
few men who are really tall (i.e., above 205 cm) or really short (i.e., under 145 cm). An 
example of a normal distribution is shown in Figure 1.3. 



13 I am exactly 180 cm tall. In my home country this makes me smugly above average. However, I’m writing this 
in the Netherlands where the average male height is 185 cm (a massive 10 cm higher than the UK), and where I 
feel like a bit of a dwarf. 

FIGURE 1.3 A ‘normal’ distribution (the curve shows the idealized shape) 

There are two main ways in which a distribution can deviate from normal: (1) lack of 
symmetry (called skew) and (2) pointiness (called kurtosis). Skewed distributions are not 
symmetrical and instead the most frequent scores (the tall bars on the graph) are clustered 
at one end of the scale. So, the typical pattern is a cluster of frequent scores at one end 
of the scale and the frequency of scores tailing off towards the other end of the scale. A 
skewed distribution can be either positively skewed (the frequent scores are clustered at 
the lower end and the tail points towards the higher or more positive scores) or negatively 
skewed (the frequent scores are clustered at the higher end and the tail points towards the 
lower or more negative scores). Figure 1.4 shows examples of these distributions. 

FIGURE 1.4 A positively (left-hand figure) and negatively (right-hand figure) skewed distribution 

Distributions also vary in their kurtosis. Kurtosis, despite sounding like some kind of 
exotic disease, refers to the degree to which scores cluster at the ends of the distribution 
(known as the tails) and how pointy a distribution is (but there are other factors that can 
affect how pointy the distribution looks - see Jane Superbrain Box 2.3). A distribution with 
positive kurtosis has many scores in the tails (a so-called heavy-tailed distribution) and is 
pointy. This is known as a leptokurtic distribution. In contrast, a distribution with negative 
kurtosis is relatively thin in the tails (has light tails) and tends to be flatter than normal. 
This distribution is called platykurtic. Ideally, we want our data to be normally distributed 
(i.e., not too skewed, and not too many or too few scores at the extremes!). For everything 
there is to know about kurtosis read DeCarlo (1997). 

In a normal distribution the values of skew and kurtosis are 0 (i.e., the tails of the distribution 
are as they should be). If a distribution has values of skew or kurtosis above or 
below 0 then this indicates a deviation from normal: Figure 1.5 shows distributions with 
kurtosis values of +4 (left panel) and −1 (right panel). 




FIGURE 1.5 Distributions with positive kurtosis (leptokurtic, left) and negative kurtosis (platykurtic, right) 
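
To make the 'values of skew and kurtosis are 0' idea a little more concrete, here is a minimal Python sketch. It assumes the simple moment-based formulas (skew as the third standardized moment, kurtosis as the fourth standardized moment minus 3); statistics packages often apply small-sample corrections, so their output can differ slightly. The function name and the data are invented for illustration.

```python
def skew_and_excess_kurtosis(scores):
    """Moment-based skew and excess kurtosis (no small-sample correction)."""
    n = len(scores)
    mean = sum(scores) / n
    # Second, third and fourth central moments.
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m3 = sum((x - mean) ** 3 for x in scores) / n
    m4 = sum((x - mean) ** 4 for x in scores) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3  # subtract 3 so a normal curve scores 0
    return skew, excess_kurtosis

# Hypothetical data: frequent scores at the low end, tail towards high scores.
right_tailed = [1, 1, 1, 2, 2, 3, 4, 8]
# Mirror image: frequent scores at the high end, tail towards low scores.
left_tailed = [9 - x for x in right_tailed]

positive_skew, _ = skew_and_excess_kurtosis(right_tailed)
negative_skew, _ = skew_and_excess_kurtosis(left_tailed)
print(positive_skew > 0, negative_skew < 0)
```

The sign of the skew value matches the direction in which the tail points, which is the naming convention described above.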



1.7.2. The centre of a distribution 


We can also calculate where the centre of a frequency distribution lies (known as the 
central tendency). There are three measures commonly used: the mean, the mode and the 
median. 


1.7.2.1. The mode 


The mode is simply the score that occurs most frequently in the data set. This is easy to spot in 
a frequency distribution because it will be the tallest bar! To calculate the mode, simply place 
the data in ascending order (to make life easier), count how many times each score occurs, 
and the score that occurs the most is the mode! One problem with the mode is that it can 
often take on several values. For example, Figure 1.6 shows an example of a distribution with 
two modes (there are two bars that are the highest), which is said to be bimodal. It’s also possible 
to find data sets with more than two modes (multimodal). Also, if the frequencies of certain 
scores are very similar, then the mode can be influenced by only a small number of cases. 
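
A small Python sketch of the counting procedure, using invented scores. It relies on `statistics.multimode` from the standard library (Python 3.8 or later), which returns every value that ties for the highest frequency, so a bimodal data set gives back two modes.

```python
from statistics import multimode

# Hypothetical scores, made up for illustration: 2 and 4 both occur three times.
scores = [1, 2, 2, 2, 3, 4, 4, 4, 5]

# multimode returns all of the most frequent values.
modes = multimode(scores)
print(modes)

# One mode = unimodal, two = bimodal, more than two = multimodal.
print(len(modes))
```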


FIGURE 1.6 A bimodal distribution 


1.7.2.2. The median 



Another way to quantify the centre of a distribution is to look for the middle 
score when scores are ranked in order of magnitude. This is called the median. 
For example, Facebook is a popular social networking website, in which users 
can sign up to be ‘friends’ of other users. Imagine we looked at the number 
of friends that a selection (actually, some of my friends) of 11 Facebook users 
had. Number of friends: 108, 103, 252, 121, 93, 57, 40, 53, 22, 116, 98. 

To calculate the median, we first arrange these scores into ascending order: 
22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252. 

Next, we find the position of the middle score by counting the number of 
scores we have collected (n), adding 1 to this value, and then dividing by 2. 
With 11 scores, this gives us (n + 1)/2 = (11 + 1)/2 = 12/2 = 6. Then, we 
find the score that is positioned at the location we have just calculated. So, in 
this example we find the sixth score: 


22, 40, 53, 57, 93, (98), 103, 108, 116, 121, 252 

Median: 98 

This works very nicely when we have an odd number of scores (as in this example) but 
when we have an even number of scores there won’t be a middle value. Let’s imagine that 
we decided that because the highest score was so big (more than twice as large as the next 
biggest number), we would ignore it. (For one thing, this person is far too popular and we 
hate them.) We have only 10 scores now. As before, we should rank-order these scores: 
22, 40, 53, 57, 93, 98, 103, 108, 116, 121. We then calculate the position of the middle 
score, but this time it is (n + 1)/2 = 11/2 = 5.5. This means that the median is halfway 
between the fifth and sixth scores. To get the median we add these two scores and divide 
by 2. In this example, the fifth score in the ordered list was 93 and the sixth score was 98. 
We add these together (93 + 98 = 191) and then divide this value by 2 (191/2 = 95.5). 
The median number of friends was, therefore, 95.5. 
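
The procedure just described can be sketched in a few lines of Python. The function name `median_by_position` is invented for this example; the data are the Facebook scores from the text.

```python
import math

def median_by_position(scores):
    """Median via the (n + 1)/2 position rule described in the text."""
    ordered = sorted(scores)
    n = len(ordered)
    position = (n + 1) / 2  # 1-based position of the middle score
    # For odd n both indices point at the same score; for even n they
    # point at the two scores either side of the halfway position.
    lower = ordered[math.floor(position) - 1]
    upper = ordered[math.ceil(position) - 1]
    return (lower + upper) / 2

friends = [108, 103, 252, 121, 93, 57, 40, 53, 22, 116, 98]
without_extreme = [x for x in friends if x != 252]

print(median_by_position(friends))          # 11 scores: the sixth score
print(median_by_position(without_extreme))  # 10 scores: halfway between fifth and sixth
```

Python's built-in `statistics.median` applies the same rule, so it should give the same answers.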

The median is relatively unaffected by extreme scores at either end of the distribution: 
the median changed only from 98 to 95.5 when we removed the extreme score of 252. The 
median is also relatively unaffected by skewed distributions and can be used with ordinal, 
interval and ratio data (it cannot, however, be used with nominal data because these data 
have no numerical order). 


1.7.2.3. The mean 

The mean is the measure of central tendency that you are most likely to have heard of 
because it is simply the average score and the media are full of average scores.¹⁴ To calculate 
the mean we simply add up all of the scores and then divide by the total number of scores 
we have. We can write this in equation form as: 


$$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} \tag{1.1}$$


This may look complicated, but the top half of the equation simply means ‘add up all of 
the scores’ (the x_i just means ‘the score of a particular person’; we could replace the letter i 
with each person’s name instead), and the bottom bit means divide this total by the number 
of scores you have got (n). Let’s calculate the mean for the Facebook data. First, we add 
up all of the scores: 


$$\sum_{i=1}^{n} x_i = 22 + 40 + 53 + 57 + 93 + 98 + 103 + 108 + 116 + 121 + 252 = 1063$$


We then divide by the number of scores (in this case 11): 


$$\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{1063}{11} = 96.64$$


The mean is 96.64 friends, which is not a value we observed in our actual data (it would 
be ridiculous to talk of having 0.64 of a friend). In this sense the mean is a statistical model - 
more on this in the next chapter. 


14 I’m writing this on 15 February 2008, and to prove my point the BBC website is running a headline about how 
PayPal estimates that Britons will spend an average of £71.25 each on Valentine’s Day gifts, but uSwitch.com said 
that the average spend would be £22.69! 


SELF-TEST 

• Compute the mean but excluding the score of 252. 


If you calculate the mean without our extremely popular person (i.e., excluding the 
value 252), the mean drops to 81.1 friends. One disadvantage of the mean is that it can 
be influenced by extreme scores. In this case, the person with 252 friends on Facebook 
increased the mean by about 15 friends! Compare this difference with that of the median. 
Remember that the median hardly changed if we included or excluded 252, which illustrates 
how the median is less affected by extreme scores than the mean. While we’re being 
negative about the mean, it is also affected by skewed distributions and can be used only 
with interval or ratio data. 
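
To see the effect of the extreme score for yourself, here is a short Python sketch using the Facebook data from the text. It computes the mean by hand (sum divided by n) and compares how far the mean and the median each move when 252 is dropped.

```python
from statistics import median

friends = [108, 103, 252, 121, 93, 57, 40, 53, 22, 116, 98]
without_extreme = [x for x in friends if x != 252]

# Mean = sum of the scores divided by the number of scores.
mean_all = sum(friends) / len(friends)
mean_trimmed = sum(without_extreme) / len(without_extreme)

# How much does each measure change when the extreme score is removed?
mean_shift = mean_all - mean_trimmed
median_shift = median(friends) - median(without_extreme)

print(round(mean_all, 2), round(mean_trimmed, 2), round(mean_shift, 2))
print(median(friends), median(without_extreme), median_shift)
```

The mean moves by roughly 15 friends, whereas the median moves by only 2.5.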

If the mean is so lousy then why do we use it all of the time? One very important reason 
is that it uses every score (the mode and median ignore most of the scores in a data set). 
Also, the mean tends to be stable in different samples. 



1.7.3. The dispersion in a distribution 


It can also be interesting to try to quantify the spread, or dispersion, of scores in the data. 
The easiest way to look at dispersion is to take the largest score and subtract from it the 
smallest score. This is known as the range of scores. For our Facebook friends data, if we 
order these scores we get 22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252. The highest 
score is 252 and the lowest is 22; therefore, the range is 252 − 22 = 230. One problem 
with the range is that because it uses only the highest and lowest score it is affected dramatically 
by extreme scores. 


SELF-TEST 

• Compute the range but excluding the score of 252. 


If you have done the self-test task you’ll see that without the extreme score the range 
drops dramatically from 230 to 99 - less than half the size! 
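
The same comparison takes only a couple of lines of Python (the data are the Facebook scores from the text; the variable names are just for illustration):

```python
friends = [108, 103, 252, 121, 93, 57, 40, 53, 22, 116, 98]
without_extreme = [x for x in friends if x != 252]

# Range = largest score minus smallest score.
range_all = max(friends) - min(friends)
range_trimmed = max(without_extreme) - min(without_extreme)

print(range_all, range_trimmed)
```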

One way around this problem is to calculate the range when we exclude values at the 
extremes of the distribution. One convention is to cut off the top and bottom 25% of 
scores and calculate the range of the middle 50% of scores - known as the interquartile 
range. Let’s do this with the Facebook data. First we need to calculate what are called 
quartiles. Quartiles are the three values that split the sorted data into four equal parts. First we 
calculate the median, which is also called the second quartile, which splits our data into two 
equal parts. We already know that the median for these data is 98. The lower quartile is the 
median of the lower half of the data and the upper quartile is the median of the upper half 
of the data. One rule of thumb is that the median is not included in the two halves when 
they are split (this is convenient if you have an odd number of values), but you can include 
it (although which half you put it in is another question). Figure 1.7 shows how we would 
calculate these values for the Facebook data. Like the median, the upper and lower quartiles 
need not be values that actually appear in the data (like the median, if each half of the data 
had an even number of values in it then the upper and lower quartiles would be the average 
of two values in the data set). Once we have worked out the values of the quartiles, we 
can calculate the interquartile range, which is the difference between the upper and lower 
quartile. For the Facebook data this value would be 116 − 53 = 63. The advantage of the 
interquartile range is that it isn’t affected by extreme scores at either end of the distribution. 
However, the problem with it is that you lose a lot of data (half of it in fact!). 

FIGURE 1.7 Calculating quartiles and the interquartile range 
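
Here is a minimal Python sketch of the 'median of each half' rule of thumb, with the median left out of both halves when there is an odd number of scores. The function name `quartiles_by_halves` is invented for this example. Be aware that statistical software offers several other quartile definitions, which can give slightly different values for the same data.

```python
from statistics import median

def quartiles_by_halves(scores):
    """Lower quartile, median and upper quartile via the median-of-halves rule."""
    ordered = sorted(scores)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # With an odd number of scores, skip the middle score itself.
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower_half), median(ordered), median(upper_half)

friends = [108, 103, 252, 121, 93, 57, 40, 53, 22, 116, 98]
lower_q, second_q, upper_q = quartiles_by_halves(friends)
interquartile_range = upper_q - lower_q

print(lower_q, second_q, upper_q, interquartile_range)
```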



SELF-TEST 

• Twenty-one heavy smokers were put on a treadmill at the fastest setting. The time in 
seconds was measured until they fell off from exhaustion: 18, 16, 18, 24, 23, 22, 22, 23, 26, 29, 
32, 34, 34, 36, 36, 43, 42, 49, 46, 46, 57. Compute the mode, median, mean, upper and lower 
quartiles, range and interquartile range. 


1.7.4. Using a frequency distribution to go beyond the data 

Another way to think about frequency distributions is not in terms of how often scores 
actually occurred, but how likely it is that a score would occur (i.e., probability). The 
word ‘probability’ induces suicidal ideation in most people (myself included) so it seems 
fitting that we use an example about throwing ourselves off a cliff. Beachy Head is a large, 
windy cliff on the Sussex coast (not far from where I live) that has something of a reputation 
for attracting suicidal people, who seem to like throwing themselves off it (and after 
several months of rewriting this book I find my thoughts drawn towards that peaceful 
chalky cliff top more and more often). Figure 1.8 shows a frequency distribution of some 
completely made-up data of the number of suicides at Beachy Head in a year by people of 
different ages (although I made these data up, they are roughly based on general suicide 
statistics such as those in Williams, 2001). There were 172 suicides in total and you can 
see that the suicides were most frequently aged between about 30 and 35 (the highest 
bar). The graph also tells us that, for example, very few people aged above 70 committed 
suicide at Beachy Head. 

FIGURE 1.8 Frequency distribution showing the number of suicides at Beachy Head in a year by age 

I said earlier that we could think of frequency distributions in terms of probability. To 
explain this, imagine that someone asked you ‘How likely is it that a person who committed 
suicide at Beachy Head is 70 years old?’ What would your answer be? The chances are 
that if you looked at the frequency distribution you might respond ‘not very likely’ because 
you can see that only 3 people out of the 172 suicides were aged around 70. What about 
if someone asked you ‘how likely is it that a 30-year-old committed suicide?’ Again, by 
looking at the graph, you might say ‘it’s actually quite likely’ because 33 out of the 172 
suicides were by people aged around 30 (that’s almost 1 in every 5 people who committed 
suicide). So based on the frequencies of different scores it should start to become clear 
that we could use this information to estimate the probability that a particular score will 
occur. We could ask, based on our data, ‘what’s the probability of a suicide victim being 
aged 16-20?’ A probability value can range from 0 (there’s no chance whatsoever of the 
event happening) to 1 (the event will definitely happen). So, for example, when I talk to my 
publishers I tell them there’s a probability of 1 that I will have completed the revisions to 
this book by April 2011. However, when I talk to anyone else, I might, more realistically, 
tell them that there’s a .10 probability of me finishing the revisions on time (or put another 
way, a 10% chance, or 1 in 10 chance that I’ll complete the book in time). In reality, the 
probability of my meeting the deadline is 0 (not a chance in hell) because I never manage 
to meet publisher’s deadlines! If probabilities don’t make sense to you then just ignore the 
decimal point and think of them as percentages instead (i.e., .10 probability that something 
will happen = 10% chance that something will happen). 
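
Turning a frequency into a rough probability is just a matter of dividing by the total. A short Python sketch using the (made-up) Beachy Head counts quoted above:

```python
total_suicides = 172
aged_about_70 = 3
aged_about_30 = 33

# Relative frequency = count for that age / total count.
p_about_70 = aged_about_70 / total_suicides
p_about_30 = aged_about_30 / total_suicides

# Show each as a probability and as a percentage.
print(f"{p_about_70:.3f} ({p_about_70:.1%})")
print(f"{p_about_30:.3f} ({p_about_30:.1%})")
```

Both values sit between 0 and 1, as any probability must, with the second coming out at about 19%.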

I’ve talked in vague terms about how frequency distributions can be used to get a rough 
idea of the probability of a score occurring. However, we can be precise. For any distribution 
of scores we could, in theory, calculate the probability of obtaining a score of a certain 
size - it would be incredibly tedious and complex to do it, but we could. To spare our 
sanity, statisticians have identified several common distributions. For each one they have 
worked out mathematical formulae that specify idealized versions of these distributions 
(they are specified in terms of a curved line). These idealized distributions are known as 
probability distributions and from these distributions it is possible to calculate the probability 
of getting particular scores based on the frequencies with which a particular score 
occurs in a distribution with these common shapes. One of these ‘common’ distributions is 
the normal distribution, which I’ve already mentioned in section 1.7.1. Statisticians have 
calculated the probability of certain scores occurring in a normal distribution with a mean 
of 0 and a standard deviation of 1. Therefore, if we have any data that are 
shaped like a normal distribution, then if the mean and standard deviation 
are 0 and 1 respectively we can use the tables of probabilities for the normal 
distribution to see how likely it is that a particular score will occur in the data 
(I’ve produced such a table in the Appendix to this book). 

The obvious problem is that not all of the data we collect will have a mean 
of 0 and standard deviation of 1. For example, we might have a data set that 
has a mean of 567 and a standard deviation of 52.98. Luckily any data set 
can be converted into a data set that has a mean of 0 and a standard deviation 
of 1. First, to centre the data around zero, we take each score (X) and subtract 
from it the mean of all scores (X̄). Then, we divide the resulting score 
by the standard deviation (s) to ensure the data have a standard deviation of 
1. The resulting scores are known as z-scores and, in equation form, the conversion that 
I’ve just described is: 






$$z = \frac{X - \bar{X}}{s} \tag{1.2}$$
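
As a quick check that this conversion really does produce a mean of 0 and a standard deviation of 1, here is a minimal Python sketch applied to the Facebook data from earlier. It assumes that s is the sample standard deviation (the n − 1 version computed by `statistics.stdev`); the text has not yet said which version is meant.

```python
from statistics import mean, stdev

friends = [108, 103, 252, 121, 93, 57, 40, 53, 22, 116, 98]

x_bar = mean(friends)  # mean of all scores
s = stdev(friends)     # sample standard deviation (assumption: n - 1 version)

# Subtract the mean from every score, then divide by the standard deviation.
z_scores = [(x - x_bar) / s for x in friends]

# After conversion the mean should be (near enough) 0 and the sd 1.
print(round(mean(z_scores), 6), round(stdev(z_scores), 6))
```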


The table of probability values that have been calculated for the standard normal distribution 
is shown in the Appendix. Why is this table important? Well, if we look at our 
suicide data, we can answer the question ‘What’s the probability that someone who threw 
themselves off Beachy Head was 70 or older?’ First we convert 70 into a z-score. Suppose 
the mean of the suicide scores was 36, and the standard deviation 13; then 70 will become 
(70 − 36)/13 = 2.62. We then look up this value in the column labelled ‘Smaller Portion’ 
(i.e., the area above the value 2.62). You should find that the probability is .0044, or, put 
another way, only a 0.44% chance that a suicide victim would be 70 years old or more. By 
looking at the column labelled ‘Bigger Portion’ we can also see the probability that a suicide 
victim was aged 70 or less. This probability is .9956, or, put another way, there’s a 99.56% 
chance that a suicide victim was less than 70 years old. 
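
If you don't have the table to hand, the same lookup can be sketched in Python with `statistics.NormalDist` (standard library, Python 3.8 or later), whose default is the standard normal distribution. The z-score is rounded to two decimal places here only to mimic reading a printed table; the variable names are invented for the example.

```python
from statistics import NormalDist

mean_age = 36
sd_age = 13

# Convert an age of 70 into a z-score, rounded as it would be for a table lookup.
z = round((70 - mean_age) / sd_age, 2)

standard_normal = NormalDist()               # mean 0, standard deviation 1
bigger_portion = standard_normal.cdf(z)      # area below z
smaller_portion = 1 - bigger_portion         # area above z

print(z, round(smaller_portion, 4), round(bigger_portion, 4))
```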

Hopefully you can see from these examples that the normal distribution and z-scores 
allow us to go a first step beyond our data in that from a set of scores we can calculate 
the probability that a particular score will occur. So, we can see whether scores of a certain 
size are likely or unlikely to occur in a distribution of a particular kind. You’ll see 
just how useful this is in due course, but it is worth mentioning at this stage that certain 
z-scores are particularly important. This is because their value cuts off certain important 
percentages of the distribution. The first important value of z is 1.96 because this cuts 
off the top 2.5% of the distribution, and its counterpart at the opposite end (−1.96) cuts 
off the bottom 2.5% of the distribution. As such, taken together, this value cuts off 5% 
of scores, or, put another way, 95% of z-scores lie between −1.96 and 1.96. The other 
two important benchmarks are ±2.58 and ±3.29, which cut off 1% and 0.1% of scores 
respectively. Put another way, 99% of z-scores lie between −2.58 and 2.58, and 99.9% 
of them lie between −3.29 and 3.29. Remember these values because they’ll crop up 
time and time again. 
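
These benchmark values can be recovered rather than memorized. The Python sketch below uses `statistics.NormalDist().inv_cdf` (standard library, Python 3.8 or later) to ask which z-score leaves a given proportion in the two tails combined; the helper name `two_tailed_cutoff` is invented for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def two_tailed_cutoff(total_tail_proportion):
    """Positive z that cuts off the given proportion, split equally between both tails."""
    return standard_normal.inv_cdf(1 - total_tail_proportion / 2)

z_95 = two_tailed_cutoff(0.05)    # 95% of z-scores lie between -z and +z
z_99 = two_tailed_cutoff(0.01)    # 99%
z_999 = two_tailed_cutoff(0.001)  # 99.9%

print(round(z_95, 2), round(z_99, 2), round(z_999, 2))
```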



SELF-TEST 

• Assuming the same mean and standard deviation for the Beachy Head example above, 
what’s the probability that someone who threw themselves off Beachy Head was 30 or younger? 


1.7.5. Fitting statistical models to the data 


Having looked at your data (and there is a lot more information on different ways to do 
this in Chapter 4), the next step is to fit a statistical model to the data. I should really just 
write ‘insert the rest of the book here’, because most of the remaining chapters discuss the 
various models that you can fit to the data. However, I do want to talk here briefly about 
two very important types of hypotheses that are used when analysing the data. Scientific 
statements, as we have seen, can be split into testable hypotheses. The hypothesis or prediction 
that comes from your theory is usually saying that an effect will be present. This 
hypothesis is called the alternative hypothesis and is denoted by H₁. (It is sometimes also 
called the experimental hypothesis but because this term relates to a specific type of methodology 
it’s probably best to use ‘alternative hypothesis’.) There is another type of hypothesis, 
though, and this is called the null hypothesis and is denoted by H₀. This hypothesis is 
the opposite of the alternative hypothesis and so would usually state that an effect is absent. 
Taking our Big Brother example from earlier in the chapter we might generate the following 
hypotheses: 

• Alternative hypothesis: Big Brother contestants will score higher on personality disorder 
questionnaires than members of the public. 

• Null hypothesis: Big Brother contestants and members of the public will not differ in 
their scores on personality disorder questionnaires. 

The reason that we need the null hypothesis is because we cannot prove the experimental 
hypothesis using statistics, but we can reject the null hypothesis. If our data give us 
confidence to reject the null hypothesis then this provides support for our experimental 
hypothesis. However, be aware that even if we can reject the null hypothesis, this doesn’t 
prove the experimental hypothesis - it merely supports it. So, rather than talking about 
accepting or rejecting a hypothesis (which some textbooks tell you to do) we should be 
talking about ‘the chances of obtaining the data we’ve collected assuming that the null 
hypothesis is true’. 

Using our Big Brother example, when we collected data from the auditions about the 
contestants’ personalities we found that 75% of them had a disorder. When we analyse our 
data, we are really asking, ‘Assuming that contestants are no more likely to have personality 
disorders than members of the public, is it likely that 75% or more of the contestants 
would have personality disorders?’ Intuitively the answer is that the chances are very low: 
if the null hypothesis is true, then most contestants would not have personality disorders 
because they are relatively rare. Therefore, we are very unlikely to have got the data that 
we did if the null hypothesis were true. 

What if we found that only 1 contestant reported having a personality disorder (about 
8%)? If the null hypothesis is true, and contestants are no different in personality than the 
general population, then only a small number of contestants would be expected to have 
a personality disorder. The chances of getting these data if the null hypothesis is true are, 
therefore, higher than before. 
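
To put some illustrative numbers on this reasoning, here is a Python sketch that treats each contestant as an independent yes/no case (a binomial model, which is my simplification and not something the text specifies). It assumes 12 contestants (inferred from 1 contestant being 'about 8%') and a purely hypothetical 10% rate of personality disorders in the general public; the function name is invented too.

```python
from math import comb

def prob_at_least(k, n, p):
    """Probability of k or more 'successes' out of n when each has probability p."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

n_contestants = 12  # assumption: inferred from '1 contestant (about 8%)'
base_rate = 0.10    # hypothetical population rate, chosen only for illustration

# If the null hypothesis is true, how likely is it that 75% (9 of 12) or more
# of the contestants have a disorder?
p_nine_or_more = prob_at_least(9, n_contestants, base_rate)

# And how likely is it that at least 1 contestant has one?
p_one_or_more = prob_at_least(1, n_contestants, base_rate)

print(p_nine_or_more, p_one_or_more)
```

Under these assumptions the first probability is vanishingly small and the second is large, which matches the intuition in the two paragraphs above.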

When we collect data to test theories we have to work in these terms: we cannot talk 
about the null hypothesis being true or the experimental hypothesis being true, we can 
only talk in terms of the probability of obtaining a particular set of data if, hypothetically 
speaking, the null hypothesis was true. We will elaborate on this idea in the next chapter. 

Finally, hypotheses can also be directional or non-directional. A directional hypothesis 
states that an effect will occur, but it also states the direction of the effect. For example, 
‘readers will know more about research methods after reading this chapter’ is a one-tailed 
hypothesis because it states the direction of the effect (readers will know more). A 
non-directional hypothesis states that an effect will occur, but it doesn’t state the direction 
of the effect. For example, ‘readers’ knowledge of research methods will change after they 
have read this chapter’ does not tell us whether their knowledge will improve or get worse. 



What have I discovered about statistics? 


Actually, not a lot because we haven’t really got to the statistics bit yet. However, we have 
discovered some stuff about the process of doing research. We began by looking at how 
research questions are formulated through observing phenomena or collecting data about 
a ‘hunch’. Once the observation has been confirmed, theories can be generated about why 
something happens. From these theories we formulate hypotheses that we can test. To test 
hypotheses we need to measure things and this leads us to think about the variables that 
we need to measure and how to measure them. Then we can collect some data. The final 
stage is to analyse these data. In this chapter we saw that we can begin by just looking at 
the shape of the data but that ultimately we should end up fitting some kind of statistical 
model to the data (more on that in the rest of the book). In short, the reason that your 
evil statistics lecturer is forcing you to learn statistics is because it is an intrinsic part of the 
research process and it gives you enormous power to answer questions that are interesting; 
or it could be that they are a sadist who spends their spare time spanking politicians 
while wearing knee-high PVC boots, a diamond-encrusted leather thong and a gimp mask 
(that’ll be a nice mental image to keep with you throughout your course). We also discovered 
that I was a curious child (you can interpret that either way). As I got older I became 
more curious, but you will have to read on to discover what I was curious about. 


Key terms that I’ve discovered 


Alternative hypothesis 
Between-group design 
Between-subject design 
Bimodal 
Binary variable 
Boredom effect 
Categorical variable 
Central tendency 
Confounding variable 
Content validity 
Continuous variable 
Correlational research 
Counterbalancing 
Criterion validity 
Cross-sectional research 
Dependent variable 
Discrete variable 
Ecological validity 
Experimental hypothesis 
Experimental research 
Falsification 
Frequency distribution 
Histogram 
Hypothesis 
Independent design 
Independent variable 
Interquartile range 
Interval variable 
Kurtosis 
Leptokurtic 
Level of measurement 
Lower quartile 
Mean 
Measurement error 
Median 
Mode 
Multimodal 
Negative skew 
Nominal variable 
Normal distribution 
Null hypothesis 
Ordinal variable 
Outcome variable 
Platykurtic 
Positive skew 
Practice effect 
Predictor variable 
Probability distribution 
Qualitative methods 
Quantitative methods 
Quartile 
Randomization 
Range 
Ratio variable 
Reliability 
Repeated-measures design 
Second quartile 
Skew 
Systematic variation 
Tertium quid 
Test-retest reliability 
Theory 
Unsystematic variation 
Upper quartile 
Validity 
Variables 
Within-subject design 
z-scores 


Smart Alex’s tasks 



Smart Alex knows everything there is to know about statistics and R. He also likes nothing 
more than to ask people stats questions just so that he can be smug about how much he 
knows. So, why not really annoy him and get all of the answers right! 

• Task 1: What are (broadly speaking) the five stages of the research process? 

• Task 2: What is the fundamental difference between experimental and correlational 
research? 

• Task 3: What is the level of measurement of the following variables? 

a. The number of downloads of different bands’ songs on iTunes. 

b. The names of the bands that were downloaded. 

c. The position in the iTunes download chart. 

d. The money earned by the bands from the downloads. 

e. The weight of drugs bought by the bands with their royalties. 

f. The type of drugs bought by the bands with their royalties. 

g. The phone numbers that the bands obtained because of their fame. 

h. The gender of the people giving the bands their phone numbers. 

i. The instruments played by the band members. 

j. The time they had spent learning to play their instruments. 

• Task 4: Say I own 857 CDs. My friend has written a computer program that uses 
a webcam to scan the shelves in my house where I keep my CDs and measure how 
many I have. His program says that I have 863 CDs. Define measurement error. What 
is the measurement error in my friend’s CD-counting device? 

• Task 5: Sketch the shape of a normal distribution, a positively skewed distribution 
and a negatively skewed distribution. 


Answers can be found on the companion website. 


Further reading 


Field, A. P., & Hole, G. J. (2003). How to design and report experiments. London: Sage. (I am rather 
biased, but I think this is a good overview of basic statistical theory and research methods.) 

Miles, J. N. V., & Banyard, P. (2007). Understanding and using statistics in psychology: a practical 
introduction. London: Sage. (A fantastic and amusing introduction to statistical theory.) 

Wright, D. B., & London, K. (2009). First steps in statistics (2nd ed.). London: Sage. (This book is a 
very gentle introduction to statistical theory.) 


Interesting real research 


Umpierre, S. A., Hill, J. A., & Anderson, D. J. (1985). Effect of Coke on sperm motility. New 
England Journal of Medicine, 313(21), 1351. 





Everything you ever wanted to know about statistics (well, sort of) 



FIGURE 2.1 The face of innocence... but what are the hands doing? 



2.1. What will this chapter tell me? 


As a child grows, it becomes important for them to fit models to the world: to be able to 
reliably predict what will happen in certain situations. This need to build models that accurately 
reflect reality is an essential part of survival. According to my parents (conveniently 
I have no memory of this at all), while at nursery school one model of the world that I was 
particularly enthusiastic to try out was ‘If I get my penis out, it will be really funny’. No 
doubt to my considerable disappointment, this model turned out to be a poor predictor 
of positive outcomes. Thankfully for all concerned, I soon learnt that the model ‘If I get 
my penis out at nursery school the teachers and mummy and daddy are going to be quite 
annoyed’ was a better ‘fit’ of the observed data. Fitting models that accurately reflect the 
observed data is important to establish whether a theory is true. You’ll be delighted to 
know that this chapter is all about fitting statistical models (and not about my penis). We 
edge sneakily away from the frying pan of research methods and trip accidentally into the 
fires of statistics hell. We begin by discovering what a statistical model is by using the mean 
as a straightforward example. We then see how we can use the properties of data to go 
beyond the data we have collected and to draw inferences about the world at large. In a 
nutshell, then, this chapter lays the foundation for the whole of the rest of the book, so it’s 
quite important that you read it or nothing that comes later will make any sense. Actually, 
a lot of what comes later probably won’t make much sense anyway because I’ve written it, 
but there you go. 


2.2. Building statistical models


We saw in the previous chapter that scientists are interested in discovering something about 
a phenomenon that we assume actually exists (a ‘real-world’ phenomenon). These real- 
world phenomena can be anything from the behaviour of interest rates in the economic 
market to the behaviour of undergraduates at the end-of-exam party. Whatever the phenomenon
we desire to explain, we collect data from the real world to test our hypotheses
about the phenomenon. Testing these hypotheses involves building statistical 
models of the phenomenon of interest. 

The reason for building statistical models of real-world data is best 
explained by an analogy. Imagine an engineer wishes to build a bridge across 
a river. That engineer would be pretty daft if she just built any old bridge, 
because the chances are that it would fall down. Instead, an engineer collects 
data from the real world: she looks at bridges in the real world and sees what 
materials they are made from, what structures they use and so on (she might 
even collect data about whether these bridges are damaged!). She then uses 
this information to construct a model. She builds a scaled-down version of 
the real-world bridge because it is impractical, not to mention expensive, to 
build the actual bridge itself. The model may differ from reality in several 
ways - it will be smaller for a start - but the engineer will try to build a model 
that best fits the situation of interest based on the data available. Once the 
model has been built, it can be used to predict things about the real world: for example, 
the engineer might test whether the bridge can withstand strong winds by placing the 
model in a wind tunnel. It seems obvious that it is important that the model is an 
accurate representation of the real world. Social scientists do much the same thing as 
engineers: they build models of real-world processes in an attempt to predict how these 
processes operate under certain conditions (see Jane Superbrain Box 2.1 below). We 
don’t have direct access to the processes, so we collect data that represent the processes 
and then use these data to build statistical models (we reduce the process to a statistical
model). We then use this statistical model to make predictions about the real-world
phenomenon. Just like the engineer, we want our models to be as accurate as possible 
so that we can be confident that the predictions we make are also accurate. However, 
unlike engineers we don’t have access to the real-world situation and so we can only 
ever infer things about psychological, societal, biological or economic processes based 
upon the models we build. If we want our inferences to be accurate then the statistical
model we build must represent the data collected (the observed data) as closely as







DISCOVERING STATISTICS USING R 


possible. The degree to which a statistical model represents the data collected is known 
as the fit of the model. 

Figure 2.2 illustrates the kinds of models that an engineer might build to represent the 
real-world bridge that she wants to create. The first model (a) is an excellent representation 
of the real-world situation and is said to be a good fit (i.e., there are a few small differences
but the model is basically a very good replica of reality). If this model is used to make
predictions about the real world, then the engineer can be confident that these predictions
will be very accurate, because the model so closely resembles reality. So, if the model collapses
in a strong wind, then there is a good chance that the real bridge would collapse also.
The second model (b) has some similarities to the real world: the model includes some of
the basic structural features, but there are some big differences from the real-world bridge
(namely the absence of one of the supporting towers). This is what we might term a moderate
fit (i.e., there are some differences between the model and the data but there are also
some great similarities). If the engineer uses this model to make predictions about the real
world then these predictions may be inaccurate and possibly catastrophic (e.g., the model
predicts that the bridge will collapse in a strong wind, causing the real bridge to be closed
down, creating 100-mile tailbacks with everyone stranded in the snow; all of which was
unnecessary because the real bridge was perfectly safe - the model was a bad representation
of reality). We can have some confidence, but not complete confidence, in predictions
from this model. The final model (c) is completely different from the real-world situation; 
it bears no structural similarities to the real bridge and is a poor fit (in fact, it might more 
accurately be described as an abysmal fit!). As such, any predictions based on this model 
are likely to be completely inaccurate. Extending this analogy to science, we can say that 
it is important when we fit a statistical model to a set of data that this model fits the data 
well. If our model is a poor fit of the observed data then the predictions we make from it 
will be equally poor. 


FIGURE 2.2 Fitting models to real-world data (see text for details)
JANE SUPERBRAIN 2.1

Types of statistical models

As behavioural and social scientists, most of the models 
that we use to describe data tend to be linear models. 
For example, analysis of variance (ANOVA) and regression
are identical systems based on linear models
(Cohen, 1968), yet they have different names and, in 
psychology at least, are used largely in different contexts 
due to historical divisions in methodology (Cronbach, 
1957). 

A linear model is simply a model that is based upon 
a straight line; this means that we are usually trying to 
summarize our observed data in terms of a straight 
line. Suppose we measured how many chapters of this 
book a person had read, and then measured their spiritual
enrichment. We could represent these hypothetical
data in the form of a scatterplot in which each dot
represents an individual’s score on both variables (see 
section 4.5). Figure 2.3 shows two versions of such a 
graph summarizing the pattern of these data with either
a straight (left) or curved (right) line. These graphs illustrate
how we can fit different types of models to the
same data. In this case we can use a straight line to
represent our data and it shows that the more chapters
a person reads, the less their spiritual enrichment.
However, we can also use a curved line to summarize 
the data and this shows that when most, or all, of the 
chapters have been read, spiritual enrichment seems 
to increase slightly (presumably because once the 
book is read everything suddenly makes sense - yeah, 
as if!). Neither of the two types of model is necessarily 
correct, but it will be the case that one model fits the 
data better than another and this is why when we use 
statistical models it is important for us to assess how 
well a given model fits the data. 

It’s possible that many scientific disciplines are progressing
in a biased way because most of the models
that we tend to fit are linear (mainly because books like 
this tend to ignore more complex curvilinear models). This 
could create a bias because most published scientific 
studies are ones with statistically significant results and 
there may be cases where a linear model has been a 
poor fit to the data (and hence the paper was not published),
yet a non-linear model would have fitted the data
well. This is why it is useful to plot your data first: plots tell 
you a great deal about what models should be applied 
to data. If your plot seems to suggest a non-linear model 
then investigate this possibility (which is easy for me to 
say when I don’t include such techniques in this book!). 




FIGURE 2.3 A scatterplot of the same data with a linear model fitted (left), and with a non-linear model fitted (right)
2.3. Populations and samples


As researchers, we are interested in finding results that apply to an entire population of 
people or things. For example, psychologists want to discover processes that occur in all 
humans, biologists might be interested in processes that occur in all cells, economists want 
to build models that apply to all salaries, and so on. A population can be very general (all 
human beings) or very narrow (all male ginger cats called Bob). Usually, scientists strive to 
infer things about general populations rather than narrow ones. For example, it’s not very 
interesting to conclude that psychology students with brown hair who own a pet hamster 
named George recover more quickly from sports injuries if the injury is massaged (unless, 
like Rene Koning, 1 you happen to be a psychology student with brown hair who has a pet 
hamster named George). However, if we can conclude that everyone’s sports injuries are 
aided by massage this finding has a much wider impact. 

Scientists rarely, if ever, have access to every member of a population. Psychologists cannot
collect data from every human being and ecologists cannot observe every male ginger
cat called Bob. Therefore, we collect data from a small subset of the population (known as
a sample) and use these data to infer things about the population as a whole. The bridge-building
engineer cannot make a full-size model of the bridge she wants to build and so
she builds a small-scale model and tests this model under various conditions. From the 
results obtained from the small-scale model the engineer infers things about how the full- 
sized bridge will respond. The small-scale model may respond differently than a full-sized 
version of the bridge, but the larger the model, the more likely it is to behave in the same 
way as the full-size bridge. This metaphor can be extended to scientists. We never have 
access to the entire population (the real-size bridge) and so we collect smaller samples 
(the scaled-down bridge) and use the behaviour within the sample to infer things about 
the behaviour in the population. The bigger the sample, the more likely it is to reflect the 
whole population. If we take several random samples from the population, each of these 
samples will give us slightly different results. However, on average, large samples should 
be fairly similar. 
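The claim that bigger samples tend to reflect the population better can be tried out with a small simulation. What follows is only a sketch in Python under stated assumptions: the 'population' is made up (100,000 normally distributed scores with a mean of about 100 and a standard deviation of about 15), and the function name `typical_miss` is hypothetical.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# A made-up "population" of 100,000 scores (mean about 100, SD about 15).
population = [random.gauss(100, 15) for _ in range(100_000)]
population_mean = statistics.mean(population)


def typical_miss(sample_size, n_samples=500):
    """Average distance between a sample mean and the population mean."""
    misses = []
    for _ in range(n_samples):
        sample = random.sample(population, sample_size)
        misses.append(abs(statistics.mean(sample) - population_mean))
    return statistics.mean(misses)


small = typical_miss(sample_size=10)
large = typical_miss(sample_size=200)
print(f"n = 10:  sample means miss the population mean by about {small:.2f}")
print(f"n = 200: sample means miss the population mean by about {large:.2f}")
```

Under these assumptions the means of the larger samples should land noticeably closer to the population mean, although any single sample can still be unlucky.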


2.4. Simple statistical models

The mean: a very simple statistical model


One of the simplest models used in statistics is the mean, which we encountered in section
1.7.2.3. In Chapter 1 we briefly mentioned that the mean was a statistical model of
the data because it is a hypothetical value that doesn’t have to be a value that is actually 
observed in the data. For example, if we took five statistics lecturers and measured the 
number of friends that they had, we might find the following data: 1, 2, 3, 3 and 4. If we 
take the mean number of friends, this can be calculated by adding the values we obtained, 
and dividing by the number of values measured: (1 + 2 + 3 + 3 + 4)/5 = 2.6. Now, we 
know that it is impossible to have 2.6 friends (unless you chop someone up with a chainsaw
and befriend their arm, which frankly is probably not beyond your average statistics
lecturer) so the mean value is a hypothetical value. As such, the mean is a model created to 
summarize our data. 
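The arithmetic is easy to check by machine. Here is a minimal Python sketch using the five scores from the text; the variable names are made up for illustration.

```python
# Number of friends reported by the five statistics lecturers (from the text).
friends = [1, 2, 3, 3, 4]

# The mean: add the scores up and divide by how many there are.
mean_friends = sum(friends) / len(friends)

print(mean_friends)             # 2.6
print(mean_friends in friends)  # False: nobody actually has 2.6 friends
```

The second line of output is the point: the model's value need not be a score that anyone actually produced.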


1 A brown-haired psychology student with a hamster called Sjors (Dutch for George, apparently) who, after 
reading one of my web resources, emailed me to weaken my foolish belief that this is an obscure combination of 
possibilities. 
Assessing the fit of the mean: sums of squares,
variance and standard deviations


With any statistical model we have to assess the fit (to return to our bridge analogy we need 
to know how closely our model bridge resembles the real bridge that we want to build). 
With most statistical models we can determine whether the model is accurate by looking 
at how different our real data are from the model that we have created. The easiest way 
to do this is to look at the difference between the data we observed and the model fitted. 
Figure 2.4 shows the number of friends that each statistics lecturer had, and also the mean 
number that we calculated earlier on. The line representing the mean can be thought of as 
our model, and the circles are the observed data. The diagram also has a series of vertical 
lines that connect each observed value to the mean value. These lines represent the deviance
between the observed data and our model and can be thought of as the error in the
model. We can calculate the magnitude of these deviances by simply subtracting the mean
value ($\bar{x}$) from each of the observed values ($x_i$). 2 For example, lecturer 1 had only 1 friend
(a glove puppet of an ostrich called Kevin) and so the difference is $x_1 - \bar{x} = 1 - 2.6 = -1.6$.
You might notice that the deviance is a negative number, and this represents the fact that 
our model overestimates this lecturer’s popularity: it predicts that he will have 2.6 friends 
yet in reality he has only 1 friend (bless him!). Now, how can we use these deviances to 
estimate the accuracy of the model? One possibility is to add up the deviances (this would 
give us an estimate of the total error). If we were to do this we would find that (don’t be 
scared of the equations, we will work through them step by step - if you need reminding 
of what the symbols mean there is a guide at the beginning of the book): 

$$\begin{aligned}
\text{total error} &= \text{sum of deviances} \\
&= \sum (x_i - \bar{x}) = (-1.6) + (-0.6) + (0.4) + (0.4) + (1.4) = 0
\end{aligned}$$



So, in effect the result tells us that there is no total error between our model and the 
observed data, so the mean is a perfect representation of the data. Now, this clearly isn’t 
true: there were errors but some of them were positive, some were negative and they have 


2 The $x_i$ simply refers to the observed score for the ith person (so the i can be replaced with a number that represents
a particular individual). For these data: for lecturer 1, $x_i = x_1 = 1$; for lecturer 3, $x_i = x_3 = 3$; for lecturer 5,
$x_i = x_5 = 4$.


FIGURE 2.4 Graph showing the difference between the observed number of friends that each statistics lecturer had, and the mean number of friends
simply cancelled each other out. It is clear that we need to avoid the problem of which
direction the error is in and one mathematical way to do this is to square each error, 3 that 
is multiply each error by itself. So, rather than calculating the sum of errors, we calculate 
the sum of squared errors. In this example: 

$$\begin{aligned}
\text{sum of squared errors (SS)} &= \sum (x_i - \bar{x})(x_i - \bar{x}) \\
&= (-1.6)^2 + (-0.6)^2 + (0.4)^2 + (0.4)^2 + (1.4)^2 \\
&= 2.56 + 0.36 + 0.16 + 0.16 + 1.96 \\
&= 5.20
\end{aligned}$$
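Both calculations, the deviances that cancel out and the squared deviances that don't, can be reproduced in a few lines. This is a minimal Python sketch using the lecturer data from the text; the variable names are illustrative only.

```python
import math

friends = [1, 2, 3, 3, 4]
mean_friends = sum(friends) / len(friends)  # the model: 2.6

# Deviance (error) for each lecturer: observed score minus the model.
deviances = [x - mean_friends for x in friends]

# Positive and negative deviances cancel, so their sum is zero
# (to within floating-point rounding).
total_error = sum(deviances)

# Squaring first removes the sign, so nothing cancels.
sum_of_squares = sum(d ** 2 for d in deviances)

print(math.isclose(total_error, 0.0, abs_tol=1e-9))  # True
print(round(sum_of_squares, 2))                       # 5.2
```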

The sum of squared errors (SS) is a good measure of the accuracy of our model. However, it 
is fairly obvious that the sum of squared errors is dependent upon the amount of data that 
has been collected - the more data points, the higher the SS. To overcome this problem 
we calculate the average error by dividing the SS by the number of observations (N). If 
we are interested only in the average error for the sample, then we can divide by N alone. 
However, we are generally interested in using the error in the sample to estimate the error 
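in the population and so we divide the SS by the number of observations minus 1 (the reason
why is explained in Jane Superbrain Box 2.2). This measure is known as the variance
and is a measure that we will come across a great deal:

$$\text{variance } (s^2) = \frac{SS}{N - 1} = \frac{\sum (x_i - \bar{x})^2}{N - 1} = \frac{5.20}{4} = 1.3 \qquad (2.1)$$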



JANE SUPERBRAIN 2.2 

Degrees of freedom


Degrees of freedom (df) are a very difficult concept to 
explain. I’ll begin with an analogy. Imagine you’re the manager
of a rugby team and you have a team sheet with 15
empty slots relating to the positions on the playing field. 
There is a standard formation in rugby and so each team 
has 15 specific positions that must be held constant for 
the game to be played. When the first player arrives, you 
have the choice of 15 positions in which to place him. You 
place his name in one of the slots and allocate him to a 
position (e.g., scrum-half) and, therefore, one position on 
the pitch is now occupied. When the next player arrives, 
you have the choice of 14 positions but you still have the 
freedom to choose which position this player is allocated. 
However, as more players arrive, you will reach the point 
at which 14 positions have been filled and the final player 
arrives. With this player you have no freedom to choose
where he plays - there is only one position left. Therefore
there are 14 degrees of freedom; that is, for 14 players 
you have some degree of choice over where they play, but 
for 1 player you have no choice. The degrees of freedom 
are one less than the number of players. 

In statistical terms the degrees of freedom relate to the 
number of observations that are free to vary. If we take 
a sample of four observations from a population, then 
these four scores are free to vary in any way (they can be 
any value). However, if we then use this sample of four 
observations to calculate the standard deviation of the 
population, we have to use the mean of the sample as 
an estimate of the population's mean. Thus we hold one 
parameter constant. Say that the mean of the sample was 
10; then we assume that the population mean is 10 also 
and we keep this value constant. With this parameter fixed, 
can all four scores from our sample vary? The answer is 
no, because to keep the mean constant only three values 
are free to vary. For example, if the values in the sample 
were 8, 9, 11, 12 (mean = 10) and we changed three of 
these values to 7, 15 and 8, then the final value must be 
10 to keep the mean constant. Therefore, if we hold one 
parameter constant then the degrees of freedom must 
be one less than the sample size. This fact explains why 
when we use a sample to estimate the standard deviation 
of a population, we have to divide the sums of squares by 
N - 1 rather than N alone. 
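The 'last value has no freedom' idea can be made concrete. The sketch below, in Python, uses the numbers from the box (mean held at 10, four scores, three of them changed to 7, 15 and 8); the function name `forced_last_value` is hypothetical.

```python
def forced_last_value(fixed_mean, n, free_values):
    """Return the one score that is no longer free once the mean is fixed."""
    if len(free_values) != n - 1:
        raise ValueError("exactly n - 1 values can be chosen freely")
    # The scores must add up to n * mean, so the last one is whatever is left.
    return n * fixed_mean - sum(free_values)


# The example from the box: mean held at 10, four scores, three changed freely.
last = forced_last_value(fixed_mean=10, n=4, free_values=[7, 15, 8])
print(last)  # 10
```

Whatever three values you pick, the fourth is dictated by the fixed mean, which is why only N - 1 scores count as free.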


3 When you multiply a negative number by itself it becomes positive. 
The variance is, therefore, the average error between the mean and the observations
made (and so is a measure of how well the model fits the actual data). There is one 
problem with the variance as a measure: it gives us a measure in units squared (because 
we squared each error in the calculation). In our example we would have to say that the 
average error in our data (the variance) was 1.3 friends squared. It makes little enough 
sense to talk about 1.3 friends, but it makes even less to talk about friends squared! For 
this reason, we often take the square root of the variance (which ensures that the measure 
of average error is in the same units as the original measure). This measure is known as 
the standard deviation and is simply the square root of the variance. In this example the 
standard deviation is: 


$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N - 1}} = \sqrt{1.3} = 1.14 \qquad (2.2)$$
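To see what dividing by N - 1 rather than N does, here is a minimal Python sketch using the lecturer data from the text. The comparison with the standard library's `statistics.variance` and `statistics.stdev` (which also divide by N - 1) is an addition for illustration, and the variable names are made up.

```python
import math
import statistics

friends = [1, 2, 3, 3, 4]
n = len(friends)
mean_friends = sum(friends) / n
sum_of_squares = sum((x - mean_friends) ** 2 for x in friends)

variance_sample_only = sum_of_squares / n       # describes this sample only
variance_estimate = sum_of_squares / (n - 1)    # estimates the population
sd_estimate = math.sqrt(variance_estimate)

print(round(variance_sample_only, 2))  # 1.04
print(round(variance_estimate, 2))     # 1.3
print(round(sd_estimate, 2))           # 1.14

# The standard library uses the same N - 1 definitions.
print(math.isclose(variance_estimate, statistics.variance(friends)))  # True
print(math.isclose(sd_estimate, statistics.stdev(friends)))           # True
```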


The sum of squares, variance and standard deviation are all, therefore, measures of the 
‘fit’ (i.e., how well the mean represents the data). Small standard deviations (relative to the 
value of the mean itself) indicate that data points are close to the mean. A large standard 
deviation (relative to the mean) indicates that the data points are distant from the mean 
(i.e., the mean is not an accurate representation of the data). A standard deviation of 0 
would mean that all of the scores were the same. Figure 2.5 shows the overall ratings (on 
a 5-point scale) of two lecturers after each of five different lectures. Both lecturers had an 
average rating of 2.6 out of 5 across the lectures. However, the first lecturer had a standard
deviation of 0.55 (relatively small compared to the mean). It should be clear from the
graph that ratings for this lecturer were consistently close to the mean rating. There was a 
small fluctuation, but generally his lectures did not vary in popularity. As such, the mean 
is an accurate representation of his ratings. The mean is a good fit to the data. The second 
lecturer, however, had a standard deviation of 1.82 (relatively high compared to the mean). 
The ratings for this lecturer are clearly more spread from the mean; that is, for some lectures
he received very high ratings, and for others his ratings were appalling. Therefore,
the mean is not such an accurate representation of his performance because there was a 
lot of variability in the popularity of his lectures. The mean is a poor fit to the data. This 
illustration should hopefully make clear why the standard deviation is a measure of how 
well the mean represents the data. 
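The individual ratings are not listed in the text, so the Python sketch below uses hypothetical ratings, chosen only because they reproduce the quoted summary figures (a mean of 2.6 for both lecturers, and standard deviations of about 0.55 and 1.82). Treat the raw numbers as an assumption, not as the real data.

```python
import statistics

# Hypothetical ratings (1-5 scale) for five lectures each. These are NOT taken
# from the text; they are values chosen to match the quoted means and SDs.
lecturer_1 = [2, 3, 3, 2, 3]
lecturer_2 = [5, 1, 4, 1, 2]

for name, ratings in [("Lecturer 1", lecturer_1), ("Lecturer 2", lecturer_2)]:
    mean_rating = statistics.mean(ratings)
    sd_rating = statistics.stdev(ratings)  # divides by N - 1
    print(name, mean_rating, round(sd_rating, 2))
# Lecturer 1 2.6 0.55
# Lecturer 2 2.6 1.82
```

Same mean, very different spread: the mean alone cannot tell you which lecturer is the consistent one.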


FIGURE 2.5 Graphs illustrating data that have the same mean but different standard deviations (panels: Lecturer 1, standard deviation = 0.55; Lecturer 2, standard deviation = 1.82; x-axis: Lecture)
SELF-TEST

✓ In section 1.7.2.2 we came across some data
about the number of friends that 11 people had on
Facebook (22, 40, 53, 57, 93, 98, 103, 108, 116,
121, 252). We calculated the mean for these data as
96.64. Now calculate the sums of squares, variance
and standard deviation.

✓ Calculate these values again but excluding the
extreme score (252).



JANE SUPERBRAIN 2.3 

The standard deviation and the shape of the 
distribution

As well as telling us about the accuracy of the mean 
as a model of our data set, the variance and standard 
deviation also tell us about the shape of the distribution
of scores. As such, they are measures of dispersion
like those we encountered in section 1.7.3. If the mean
represents the data well then most of the scores will cluster
close to the mean and the resulting standard deviation
is small relative to the mean. When the mean is a
worse representation of the data, the scores cluster more 
widely around the mean (think back to Figure 2.5) and 
the standard deviation is larger. Figure 2.6 shows two 
distributions that have the same mean (50) but different 
standard deviations. One has a large standard deviation 
relative to the mean (SD = 25) and this results in a flatter 
distribution that is more spread out, whereas the other 
has a small standard deviation relative to the mean (SD = 
15) resulting in a more pointy distribution in which scores 
close to the mean are very frequent but scores further 
from the mean become increasingly infrequent. The main 
message is that as the standard deviation gets larger, the 
distribution gets fatter. This can make distributions look 
platykurtic or leptokurtic when, in fact, they are not. 
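One way to put numbers on 'pointy' versus 'flat' is to ask what share of scores falls close to the mean. The Python sketch below assumes that both distributions are normal, which the box does not actually state, and uses the quoted mean (50) and standard deviations (15 and 25). The helper name `share_between` is hypothetical.

```python
from statistics import NormalDist

# Assumption: both distributions are normal, with the mean and SDs quoted above.
narrow = NormalDist(mu=50, sigma=15)
wide = NormalDist(mu=50, sigma=25)


def share_between(dist, low, high):
    """Proportion of scores expected to fall between low and high."""
    return dist.cdf(high) - dist.cdf(low)


# Share of scores within 15 points of the mean (35 to 65).
print(round(share_between(narrow, 35, 65), 2))  # 0.68
print(round(share_between(wide, 35, 65), 2))    # 0.45
```

With the smaller standard deviation about two-thirds of scores sit within 15 points of the mean; with the larger one fewer than half do.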


FIGURE 2.6 Two distributions with the same mean, but large and small standard deviations
Expressing the mean as a model


The discussion of means, sums of squares and variance may seem a sidetrack from the initial
point about fitting statistical models, but it’s not: the mean is a simple statistical model
that can be fitted to data. What do I mean by this? Well, everything in statistics essentially
boils down to one equation: 


$$\text{outcome}_i = (\text{model}) + \text{error}_i \qquad (2.3)$$


This just means that the data we observe can be predicted from the model we choose to 
fit to the data plus some amount of error. When I say that the mean is a simple statistical 
model, then all I mean is that we can replace the word ‘model’ with the word ‘mean’ in 
that equation. If we return to our example involving the number of friends that statistics 
lecturers have and look at lecturer 1, for example, we observed that they had one friend 
and the mean of all lecturers was 2.6. So, the equation becomes: 

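$$\begin{aligned}
\text{outcome}_{\text{lecturer1}} &= \bar{X} + \varepsilon_{\text{lecturer1}} \\
1 &= 2.6 + \varepsilon_{\text{lecturer1}}
\end{aligned}$$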


From this we can work out that the error is 1 − 2.6, or −1.6. If we replace this value
in the equation we get 1 = 2.6 − 1.6 or 1 = 1. Although it probably seems like I’m
stating the obvious, it is worth bearing this general equation in mind throughout this 
book because if you do you’ll discover that most things ultimately boil down to this one 
simple idea! 
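The same bookkeeping works for every lecturer, not just the first. A minimal Python sketch, using the friends data from earlier in the chapter (the variable names are illustrative):

```python
friends = [1, 2, 3, 3, 4]
model = sum(friends) / len(friends)  # the mean, 2.6

# error = outcome - model, so outcome = model + error for every lecturer.
errors = [outcome - model for outcome in friends]

for lecturer, (outcome, error) in enumerate(zip(friends, errors), start=1):
    print(f"lecturer {lecturer}: {outcome} = {model} + ({error:.1f})")
```

Each printed line is just the general equation with one lecturer's numbers dropped in.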

Likewise, the variance and standard deviation illustrate another fundamental concept: 
how the goodness of fit of a model can be measured. If we’re looking at how well a 
model fits the data (in this case our model is the mean) then we generally look at deviation
from the model, we look at the sum of squared error, and in general terms we can
write this as: 


$$\text{deviation} = \sum (\text{observed} - \text{model})^2 \qquad (2.4)$$


Put another way, we assess models by comparing the data we observe to the model we’ve 
fitted to the data, and then square these differences. Again, you’ll come across this fundamental
idea time and time again throughout this book.
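Because the recipe is 'observed minus model, squared, summed', it works for any model, not only the mean. The Python sketch below compares two one-number models for the friends data: the mean, and an arbitrary guess of 2 friends for everyone. The guess and the function name `sum_squared_error` are purely illustrative.

```python
def sum_squared_error(observed, predicted):
    """Sum of (observed - model)^2 over all observations."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))


friends = [1, 2, 3, 3, 4]
mean_friends = sum(friends) / len(friends)

# Two rival one-number models: predict the mean for everyone, or predict
# a guess of 2 for everyone (the guess is purely illustrative).
ss_mean = sum_squared_error(friends, [mean_friends] * len(friends))
ss_guess = sum_squared_error(friends, [2] * len(friends))

print(round(ss_mean, 2))   # 5.2
print(round(ss_guess, 2))  # 7
```

The mean produces the smaller sum of squared errors, so by this yardstick it is the better fitting of the two models.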


2.5. Going beyond the data


Using the example of the mean, we have looked at how we can fit a statistical model to 
a set of observations to summarize those data. It’s one thing to summarize the data that 
you have actually collected, but usually we want to go beyond our data and say something 
general about the world (remember in Chapter 1 that I talked about how good theories 
should say something about the world). It’s one thing to be able to say that people in our 
sample responded well to medication, or that a sample of high-street stores in Brighton 
had increased profits leading up to Christmas, but it’s more useful to be able to say, based 
on our sample, that all people will respond to medication, or that all high-street stores in 
the UK will show increased profits. To begin to understand how we can make these general 
inferences from a sample of data we can first look not at whether our model is a good fit to 
the sample from which it came, but whether it is a good fit to the population from which 
the sample came. 
The standard error


We’ve seen that the standard deviation tells us something about how well the mean represents
the sample data, but I mentioned earlier on that usually we collect data from samples
because we don’t have access to the entire population. If you take several samples from a 
population, then these samples will differ slightly; therefore, it’s also important to know 
how well a particular sample represents the population. This is where we use the standard 
error. Many students get confused about the difference between the standard deviation and 
the standard error (usually because the difference is never explained clearly). However, 
the standard error is an important concept to grasp, so I’ll do my best to explain it to you. 

We have already learnt that social scientists use samples as a way of estimating the behaviour
in a population. Imagine that we were interested in the ratings of all lecturers (so lecturers
in general were the population). We could take a sample from this population. When
someone takes a sample from a population, they are taking one of many possible samples. 
If we were to take several samples from the same population, then each sample has its own 
mean, and some of these sample means will be different. 

Figure 2.7 illustrates the process of taking samples from a population. Imagine that we 
could get ratings of all lecturers on the planet and that, on average, the rating is 3 (this is the 
population mean, μ). Of course, we can’t collect ratings of all lecturers, so we use a sample.
For each of these samples we can calculate the average, or sample mean. Let’s imagine we 
took nine different samples (as in the diagram); you can see that some of the samples have 
the same mean as the population but some have different means: the first sample of lecturers
were rated, on average, as 3, but the second sample were, on average, rated as only 2.
This illustrates sampling variation: that is, samples will vary because they contain different 
members of the population; a sample that by chance includes some very good lecturers 
will have a higher average than a sample that, by chance, includes some awful lecturers! 
We can actually plot the sample means as a frequency distribution, or histogram, 4 just like 
I have done in the diagram. This distribution shows that there were three samples that 
had a mean of 3, means of 2 and 4 occurred in two samples each, and means of 1 and 5 
occurred in only one sample each. The end result is a nice symmetrical distribution known 
as a sampling distribution. A sampling distribution is simply the frequency distribution of 
sample means 5 from the same population. In theory you need to imagine that we’re taking 
hundreds or thousands of samples to construct a sampling distribution, but I’m just using 
nine to keep the diagram simple. 6 The sampling distribution tells us about the behaviour 
of samples from the population, and you’ll notice that it is centred at the same value as the 
mean of the population (i.e., 3). This means that if we took the average of all sample means 
we’d get the value of the population mean. Now, if the average of the sample means is the 
same value as the population mean, then if we knew the accuracy of that average we’d 
know something about how likely it is that a given sample is representative of the population.
So how do we determine the accuracy of the population mean?

Think back to the discussion of the standard deviation. We used the standard deviation 
as a measure of how representative the mean was of the observed data. Small standard 
deviations represented a scenario in which most data points were close to the mean, a large 
standard deviation represented a situation in which data points were widely spread from 
the mean. If you were to calculate the standard deviation between sample means then this 
too would give you a measure of how much variability there was between the means of 


4 This is just a graph of each sample mean plotted against the number of samples that has that mean - see section 
1.7.1 for more details. 

5 It doesn’t have to be means, it can be any statistic that you’re trying to estimate, but I’m using the mean to keep 
things simple. 

6 It’s worth pointing out that I’m talking hypothetically. We don’t need to actually collect these samples because 
clever statisticians have worked out what these sampling distributions would look like and how they behave. 
different samples. The standard deviation of sample means is known as the standard error
of the mean (SE). Therefore, the standard error could be calculated by taking the difference 
between each sample mean and the overall mean, squaring these differences, adding them 
up, and then dividing by the number of samples. Finally, the square root of this value would 
need to be taken to get the standard deviation of sample means, the standard error. 
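The recipe just described can be applied to the nine sample means in Figure 2.7. In this Python sketch the means come from the frequencies given in the text (three samples with a mean of 3, two each of 2 and 4, one each of 1 and 5); the order of the samples after the first two is an assumption, and it does not affect the result.

```python
import math
from collections import Counter

# Sample means read off the frequency distribution described in the text:
# three 3s, two 2s, two 4s, one 1 and one 5. The order beyond the first two
# samples is an assumption.
sample_means = [3, 2, 4, 3, 1, 5, 2, 4, 3]

frequencies = Counter(sample_means)
overall_mean = sum(sample_means) / len(sample_means)

# Standard deviation of the sample means, following the recipe above
# (divide by the number of samples, then take the square root).
squared_diffs = [(m - overall_mean) ** 2 for m in sample_means]
standard_error = math.sqrt(sum(squared_diffs) / len(sample_means))

print(sorted(frequencies.items()))  # [(1, 1), (2, 2), (3, 3), (4, 2), (5, 1)]
print(overall_mean)                 # 3.0
print(round(standard_error, 2))     # 1.15
```

The average of the sample means matches the population mean of 3, and the spread of those means is the quantity the text calls the standard error.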

Of course, in reality we cannot collect hundreds of samples and so we rely on approximations
of the standard error. Luckily for us some exceptionally clever statisticians have
demonstrated that as samples get large (usually defined as greater than 30), the sampling 
distribution has a normal distribution with a mean equal to the population mean, and a 
standard deviation of: 

$$\sigma_{\bar{X}} = \frac{s}{\sqrt{N}} \qquad (2.5)$$

This is known as the central limit theorem and it is useful in this context because it means 
that if our sample is large we can use the above equation to approximate the standard error 
(because, remember, it is the standard deviation of the sampling distribution). 7 When the 
sample is relatively small (fewer than 30) the sampling distribution has a different shape, 
known as a t-distribution, which we’ll come back to later.
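
To see the approximation at work, here is a small simulation sketch. It assumes a normally distributed population with a mean of 3 and a standard deviation of 2 (values chosen purely for illustration), draws many samples of 50 scores, and computes the standard deviation of their means. The function names are my own.

```python
import random
import statistics

def simulate_standard_error(pop_mean=3.0, pop_sd=2.0, n=50,
                            n_samples=5000, seed=42):
    """Standard deviation of the means of many simulated samples."""
    rng = random.Random(seed)
    sample_means = []
    for _ in range(n_samples):
        sample = [rng.gauss(pop_mean, pop_sd) for _ in range(n)]
        sample_means.append(statistics.fmean(sample))
    return statistics.pstdev(sample_means)

def approximate_standard_error(sample):
    """Estimate from a single sample: s divided by the square root of N."""
    return statistics.stdev(sample) / len(sample) ** 0.5

# Compare the simulated value with the population SD over root N.
print(simulate_standard_error(), 2.0 / 50 ** 0.5)
```

The two printed numbers should be close, which is the point of the approximation: one sample's standard deviation and size are enough to estimate what would otherwise take thousands of samples.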



sample is likely to be of the population. A large standard error (relative to the sample mean) means that there is a 
lot of variability between the means of different samples and so the sample we have might not be representative of 
the population. A small standard error indicates that most sample means are similar to the population mean and 
so our sample is likely to be an accurate reflection of the population. 


2.5.2. Confidence intervals


2.5.2.1. Calculating confidence intervals

Remember that usually we’re interested in using the sample mean as an estimate of the 
value in the population. We’ve just seen that different samples will give rise to different values
of the mean, and we can use the standard error to get some idea of the extent to which
sample means differ. A different approach to assessing the accuracy of the sample mean 
as an estimate of the mean in the population is to calculate boundaries within which we 
believe the true value of the mean will fall. Such boundaries are called confidence intervals. 
The basic idea behind confidence intervals is to construct a range of values within which 
we think the population value falls. 

Let’s imagine an example: Domjan, Blesbois, and Williams (1998) examined the learnt 
release of sperm in Japanese quail. The basic idea is that if a quail is allowed to copulate 
with a female quail in a certain context (an experimental chamber) then this context will 
serve as a cue to copulation and this in turn will affect semen release (although during the 


7 In fact it should be the population standard deviation (σ) that is divided by the square root of the sample size;
however, for large samples this is a reasonable approximation. 
FIGURE 2.7 Illustration of the standard error (see text for details). [Plot labels: population, μ = 3; axis: Sample Mean]


test phase the poor quail were tricked into copulating with a terry cloth with an embalmed 
female quail head stuck on top). 8 Anyway, if we look at the mean amount of sperm released 
in the experimental chamber, there is a true mean (the mean in the population); let’s 
imagine it’s 15 million sperm. Now, in our actual sample, we might find the mean amount 
of sperm released was 17 million. Because we don’t know the true mean, we don’t really 
know whether our sample value of 17 million is a good or bad estimate of this value. What 
we can do instead is use an interval estimate: we use our sample value as the mid-point, but 
set a lower and upper limit as well. So, we might say, we think the true value of the mean 
sperm release is somewhere between 12 million and 22 million spermatozoa (note that 17 
million falls exactly between these values). Of course, in this case the true value (15 million) 

8 This may seem a bit sick, but the male quails didn’t appear to mind too much, which probably tells us all we 
need to know about male mating behaviour. 











does fall within these limits. However, what if we’d set smaller limits, what if we’d said we
think the true value falls between 16 and 18 million (again, note that 17 million is in the 
middle)? In this case the interval does not contain the true value of the mean. Let’s now 
imagine that you were particularly fixated with Japanese quail sperm, and you repeated the 
experiment 50 times using different samples. Each time you did the experiment again you 
constructed an interval around the sample mean as I’ve just described. Figure 2.8 shows 
this scenario: the circles represent the mean for each sample with the lines sticking out of 
them representing the intervals for these means. The true value of the mean (the mean in 
the population) is 15 million and is shown by a vertical line. The first thing to note is that 
the sample means are different from the true mean (this is because of sampling variation as 
described in the previous section). Second, although most of the intervals do contain the 
true mean (they cross the vertical line, meaning that the value of 15 million spermatozoa 
falls somewhere between the lower and upper boundaries), a few do not. 

Up until now I’ve avoided the issue of how we might calculate the 
intervals. The crucial thing with confidence intervals is to construct 
them in such a way that they tell us something useful. Therefore, we 
calculate them so that they have certain properties: in particular, they 
tell us the likelihood that they contain the true value of the thing we’re 
trying to estimate (in this case, the mean). 

Typically we look at 95% confidence intervals, and sometimes 99% 
confidence intervals, but they all have a similar interpretation: they are 
limits constructed such that for a certain percentage of the time (be that 
95% or 99%) the true value of the population mean will fall within 
these limits. So, when you see a 95% confidence interval for a mean, 
think of it like this: if we’d collected 100 samples, calculated the mean 
and then calculated a confidence interval for that mean (a bit like in Figure 2.8) then for 
95 of these samples, the confidence intervals we constructed would contain the true value 
of the mean in the population. 

To calculate the confidence interval, we need to know the limits within which 95% of 
means will fall. How do we calculate these limits? Remember back in section 1.7.4 that I 
said that 1.96 was an important value of z (a score from a normal distribution with a mean 
of 0 and standard deviation of 1) because 95% of z-scores fall between −1.96 and 1.96.
This means that if our sample means were normally distributed with a mean of 0 and a
standard error of 1, then the limits of our confidence interval would be −1.96 and +1.96.
Luckily we know from the central limit theorem that in large samples (above about 30) the 
sampling distribution will be normally distributed (see section 2.5.1). It’s a pity then that 
our mean and standard deviation are unlikely to be 0 and 1; except not really because, as 
you might remember, we can convert scores so that they do have a mean of 0 and standard 
deviation of 1 (z-scores) using equation (1.2): 



$$z = \frac{X - \bar{X}}{s}$$

If we know that our limits are −1.96 and 1.96 in z-scores, then to find out the corresponding
scores in our raw data we can replace z in the equation (because there are two values,
we get two equations):


$$1.96 = \frac{X - \bar{X}}{s} \qquad\qquad -1.96 = \frac{X - \bar{X}}{s}$$


We rearrange these equations to discover the value of X: 


$$1.96 \times s = X - \bar{X} \qquad\qquad -1.96 \times s = X - \bar{X}$$
$$(1.96 \times s) + \bar{X} = X \qquad\qquad (-1.96 \times s) + \bar{X} = X$$
FIGURE 2.8 The confidence intervals of the sperm counts of Japanese quail (horizontal axis) for 50 different samples (vertical axis)
Therefore, the confidence interval can easily be calculated once the standard deviation (s
in the equation above) and mean (X̄ in the equation) are known. However, in fact we use
the standard error and not the standard deviation because we’re interested in the variability
of sample means, not the variability in observations within the sample. The lower boundary
of the confidence interval is, therefore, the mean minus 1.96 times the standard error,
and the upper boundary is the mean plus 1.96 standard errors:

$$\text{lower boundary of confidence interval} = \bar{X} - (1.96 \times SE)$$
$$\text{upper boundary of confidence interval} = \bar{X} + (1.96 \times SE)$$

As such, the mean is always in the centre of the confidence interval. If the mean represents
the true mean well, then the confidence interval of that mean should be small.
We know that 95% of confidence intervals contain the true mean, so we can assume this 
confidence interval contains the true mean; therefore, if the interval is small, the sample 
mean must be very close to the true mean. Conversely, if the confidence interval is very 
wide then the sample mean could be very different from the true mean, indicating that it is 
a bad representation of the population. You’ll find that confidence intervals will come up 
time and time again throughout this book. 
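
As a rough check on the idea that about 95% of such intervals contain the true mean, the sketch below simulates the repeated-experiment scenario. The true mean of 15 (million) is taken from the quail example; the population standard deviation of 5, the sample size of 50, the normal population and the function names are all assumptions made for illustration.

```python
import random
import statistics

def ci95(mean, se):
    """Lower and upper 95% limits: the mean minus/plus 1.96 standard errors."""
    return mean - 1.96 * se, mean + 1.96 * se

def coverage(true_mean=15.0, sd=5.0, n=50, n_samples=2000, seed=1):
    """Proportion of simulated samples whose interval contains true_mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        se = statistics.stdev(sample) / n ** 0.5  # estimated standard error
        lower, upper = ci95(statistics.fmean(sample), se)
        if lower <= true_mean <= upper:
            hits += 1
    return hits / n_samples

print(coverage())  # should be in the region of 0.95
```

Most of the simulated intervals capture the true mean and a few miss it, which is the pattern Figure 2.8 describes.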
2.5.2.2. Calculating other confidence intervals


The example above shows how to compute a 95% confidence interval (the most common 
type). However, we sometimes want to calculate other types of confidence interval such as 
a 99% or 90% interval. The −1.96 and 1.96 in the equations above are the limits within
which 95% of z-scores occur. Therefore, if we wanted a 99% confidence interval we could
use the values within which 99% of z-scores occur (−2.58 and 2.58). In general, then, we
could say that confidence intervals are calculated as: 


$$\text{lower boundary of confidence interval} = \bar{X} - \left(z_{\frac{1-p}{2}} \times SE\right)$$
$$\text{upper boundary of confidence interval} = \bar{X} + \left(z_{\frac{1-p}{2}} \times SE\right)$$


in which p is the probability value for the confidence interval. So, if you want a 95% confidence
interval, then you want the value of z for (1 − 0.95)/2 = 0.025. Look this up in the
‘smaller portion’ column of the table of the standard normal distribution (see the Appendix)
and you’ll find that z is 1.96. For a 99% confidence interval we want z for (1 − 0.99)/2 =
0.005, which from the table is 2.58. For a 90% confidence interval we want z for (1 − 0.90)/2
= 0.05, which from the table is 1.64. These values of z are multiplied by the standard error
(as above) to calculate the confidence interval. Using these general principles, we could 
work out a confidence interval for any level of probability that takes our fancy. 
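
Instead of looking z up in a table, the same values can be obtained from the inverse of the standard normal distribution. The sketch below uses `NormalDist` from Python's standard `statistics` module; the function names are mine, and the mean and standard error in the usage line are hypothetical.

```python
from statistics import NormalDist

def z_for_level(level):
    """Value of z that cuts off (1 - level)/2 in the upper tail."""
    tail = (1 - level) / 2
    return NormalDist().inv_cdf(1 - tail)

def confidence_interval(mean, se, level=0.95):
    """Limits of the interval: mean minus/plus z times the standard error."""
    z = z_for_level(level)
    return mean - z * se, mean + z * se

for level in (0.90, 0.95, 0.99):
    print(level, round(z_for_level(level), 2))

# Hypothetical mean and standard error, for illustration only.
print(confidence_interval(17.0, 2.0, level=0.99))
```

Rounded to two decimal places, the three values of z agree with the table values quoted above.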


2.5.2.3. Calculating confidence intervals in small samples

The procedure that I have just described is fine when samples are large, but for small 
samples, as I have mentioned before, the sampling distribution is not normal, it has a 
t-distribution. The t-distribution is a family of probability distributions that change shape 
as the sample size gets bigger (when the sample is very big, it has the shape of a normal distribution).
To construct a confidence interval in a small sample we use the same principle
as before but instead of using the value for z we use the value for t:

$$\text{lower boundary of confidence interval} = \bar{X} - (t_{n-1} \times SE)$$
$$\text{upper boundary of confidence interval} = \bar{X} + (t_{n-1} \times SE)$$

The n − 1 in the equations is the degrees of freedom (see Jane Superbrain Box 2.2) and tells
us which of the t-distributions to use. For a 95% confidence interval we find the value of t
for a two-tailed test with probability of .05, for the appropriate degrees of freedom. 
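
If you have no table of t to hand, the critical value can be approximated numerically. The sketch below is one way to do it with nothing but Python's standard library: it integrates the density of the t-distribution with Simpson's rule and searches for the critical value by bisection. This is an illustrative approach of my own, not the method used in the text, and the mean, standard deviation and sample size in the usage line are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0, using Simpson's rule on [0, x]."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(df, level=0.95):
    """Two-tailed critical value of t, found by bisection on the CDF."""
    target = 1 - (1 - level) / 2
    lo, hi = 0.0, 200.0  # search range; assumes the answer lies below 200
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def t_confidence_interval(mean, sd, n, level=0.95):
    """Interval based on t with n - 1 degrees of freedom."""
    se = sd / math.sqrt(n)
    t = t_critical(n - 1, level)
    return mean - t * se, mean + t * se

# Hypothetical small sample: mean 50, standard deviation 10, 16 scores.
print(t_confidence_interval(50.0, 10.0, 16))
```

With large degrees of freedom `t_critical` approaches 1.96, which matches the statement that the t-distribution takes the shape of a normal distribution in very big samples.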



SELF-TEST 

✓ In section 1.7.2.2 we came across some data
about the number of friends that 11 people had on
Facebook. We calculated the mean for these data as
96.64 and standard deviation as 61.27. Calculate a
95% confidence interval for this mean.
✓ Recalculate the confidence interval assuming that
the sample size was 56.
2.5.2.4. Showing confidence intervals visually

Confidence intervals provide us with very important information about the 
mean, and, therefore, you often see them displayed on graphs. (We will discover 
more about how to create these graphs in Chapter 4.) The confidence interval 
is usually displayed using something called an error bar, which just looks like 
the letter ‘I’. An error bar can represent the standard deviation, or the standard 
error, but more often than not it shows the 95% confidence interval of the 
mean. So, when you see a graph showing the mean, perhaps displayed as
a bar or a symbol (section 4.9), it is often accompanied by this funny I-shaped
bar. Why is it useful to see the confidence interval visually?

We have seen that the 95% confidence interval is an interval constructed such 
that in 95% of samples the true value of the population mean will fall within its 
limits. We know that it is possible that any two samples could have slightly different means 
(and the standard error tells us a little about how different we can expect sample means 
to be). Now, the confidence interval tells us the limits within which the population mean 
is likely to fall (the size of the confidence interval will depend on the size of the standard 
error). By comparing the confidence intervals of different means we can start to get some 
idea about whether the means came from the same population or different populations. 

Taking our previous example of quail sperm, imagine we had a sample of quail and 
the mean sperm release had been 9 million sperm with a confidence interval of 2 to 16. 
Therefore, we know that the population mean is probably between 2 and 16 million sperm. 
What if we now took a second sample of quail and found the confidence interval ranged 
from 4 to 15? This interval overlaps a lot with our first sample: 




[Figure: the two confidence intervals plotted against Sperm (Millions)]


The fact that the confidence intervals overlap in this way tells us that these means could 
plausibly come from the same population: in both cases the intervals are likely to contain 
the true value of the mean (because they are constructed such that in 95% of studies they 
will), and both intervals overlap considerably, so they contain many similar values. What if 
the confidence interval for our second sample ranges from 18 to 28? If we compared this 
to our first sample we’d get: 



[Figure: the two confidence intervals plotted on a common axis]
Now, these confidence intervals don’t overlap at all. So, one confidence interval, which
is likely to contain the population mean, tells us that the population mean is somewhere 
between 2 and 16 million, whereas the other confidence interval, which is also likely to 
contain the population mean, tells us that the population mean is somewhere between 18 
and 28. This suggests that either our confidence intervals both do contain the population 
mean, but they come from different populations (and, therefore, so do our samples), or 
both samples come from the same population but one of the confidence intervals doesn’t 
contain the population mean. If we’ve used 95% confidence intervals then we know that 
the second possibility is unlikely (this happens only 5 times in 100 or 5% of the time), so 
the first explanation is more plausible. 

OK, I can hear you all thinking ‘so what if the samples come from a different population?’
Well, it has a very important implication in experimental research. When we do an
experiment, we introduce some form of manipulation between two or more conditions 
(see section 1.6.2). If we have taken two random samples of people, and we have tested 
them on some measure (e.g., fear of statistics textbooks), then we expect these people to 
belong to the same population. If their sample means are so different as to suggest that, 
in fact, they come from different populations, why might this be? The answer is that our 
experimental manipulation has induced a difference between the samples. 

To reiterate, when an experimental manipulation is successful, we expect to find that our 
samples have come from different populations. If the manipulation is unsuccessful, then 
we expect to find that the samples came from the same population (e.g., the sample means 
should be fairly similar). Now, the 95% confidence interval tells us something about the 
likely value of the population mean. If we take samples from two populations, then we 
expect the confidence intervals to be different (in fact, to be sure that the samples were from 
different populations we would not expect the two confidence intervals to overlap). If we 
take two samples from the same population, then we expect, if our measure is reliable, the 
confidence intervals to be very similar (i.e., they should overlap completely with each other). 

This is why error bars showing 95% confidence intervals are so useful on graphs, because 
if the bars of any two means do not overlap then we can infer that these means are from 
different populations - they are significantly different. 
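
The overlap rule is easy to express in code. A minimal sketch, using the intervals from the quail examples above (the function name is my own):

```python
def intervals_overlap(a, b):
    """True if two (lower, upper) intervals share at least one value."""
    (a_low, a_high), (b_low, b_high) = a, b
    return a_low <= b_high and b_low <= a_high

first = (2, 16)      # first sample of quail
second = (4, 15)     # second sample, overlapping the first
third = (18, 28)     # alternative second sample, no overlap

print(intervals_overlap(first, second))  # True
print(intervals_overlap(first, third))   # False
```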



this range in 95% of samples. 

The confidence interval is not an interval within which we are 95% confident that the population mean will fall. 


2.6. Using statistical models to test research questions


In Chapter 1 we saw that research was a five-stage process: 

1 Generate a research question through an initial observation (hopefully backed up by 
some data). 

2 Generate a theory to explain your initial observation. 
3 Generate hypotheses: break your theory down into a set of testable predictions.

4 Collect data to test the theory: decide on what variables you need to measure to test 
your predictions and how best to measure or manipulate those variables. 

5 Analyse the data: fit a statistical model to the data - this model will test your 
original predictions. Assess this model to see whether or not it supports your initial 
predictions. 

This chapter has shown that we can use a sample of data to estimate what’s happening 
in a larger population to which we don’t have access. We have also seen (using the mean 
as an example) that we can fit a statistical model to a sample of data and assess how well 
it fits. However, we have yet to see how fitting models like these can help us to test our 
research predictions. How do statistical models help us to test complex hypotheses such as 
‘is there a relationship between the amount of gibberish that people speak and the amount 
of vodka jelly they’ve eaten?’ or ‘is the mean amount of chocolate I eat higher when I’m 
writing statistics books than when I’m not?’. We’ve seen in section 1.7.5 that hypotheses 
can be broken down into a null hypothesis and an alternative hypothesis. 



SELF-TEST 

✓ What are the null and alternative hypotheses for the
following questions:

1. ‘Is there a relationship between the amount of
gibberish that people speak and the amount of
vodka jelly they’ve eaten?’

2. ‘Is the mean amount of chocolate eaten higher when
writing statistics books than when not?’


Most of this book deals with inferential statistics, which tell us whether the alternative 
hypothesis is likely to be true - they help us to confirm or reject our predictions. Crudely 
put, we fit a statistical model to our data that represents the alternative hypothesis and see 
how well it fits (in terms of the variance it explains). If it fits the data well (i.e., explains 
a lot of the variation in scores) then we assume our initial prediction is true: we gain 
confidence in the alternative hypothesis. Of course, we can never be completely sure that 
either hypothesis is correct, and so we calculate the probability that our model would fit if 
there were no effect in the population (i.e., the null hypothesis is true). As this probability 
decreases, we gain greater confidence that the alternative hypothesis is actually correct and 
that the null hypothesis can be rejected. This works provided we make our predictions 
before we collect the data (see Jane Superbrain Box 2.4). 

To illustrate this idea of whether a hypothesis is likely, Fisher (1925/1991) (Figure 2.9) 
describes an experiment designed to test a claim by a woman that she could determine, by 
tasting a cup of tea, whether the milk or the tea was added first to the cup. Fisher thought 
that he should give the woman some cups of tea, some of which had the milk added first 
and some of which had the milk added last, and see whether she could correctly identify 
them. The woman would know that there are an equal number of cups in which milk was 
added first or last but wouldn’t know in which order the cups were placed. If we take the 
simplest situation in which there are only two cups then the woman has a 50% chance of 
guessing correctly. If she did guess correctly we wouldn’t be that confident in concluding 
that she can tell the difference between cups in which the milk was added first from those 
in which it was added last, because even by guessing she would be correct half of the time. 
JANE SUPERBRAIN 2.4

Cheating in research

The process I describe in this chapter works only if you 
generate your hypotheses and decide on your criteria for 
whether an effect is significant before collecting the data. 
Imagine I wanted to place a bet on who would win the 
Rugby World Cup. Being an Englishman, I might want 
to bet on England to win the tournament. To do this I’d: 
(1) place my bet, choosing my team (England) and odds 
available at the betting shop (e.g., 6/4); (2) see which 
team wins the tournament; (3) collect my winnings (if 
England do the decent thing and actually win). 

To keep everyone happy, this process needs to be 
equitable: the betting shops set their odds such that 
they’re not paying out too much money (which keeps 
them happy), but so that they do pay out sometimes 
(to keep the customers happy). The betting shop can 
offer any odds before the tournament has ended, but it 
can’t change them once the tournament is over (or the 
last game has started). Similarly, I can choose any team
before the tournament, but I can’t then change my mind
halfway through, or after the final game!

The situation in research is similar: we can choose any 
hypothesis (rugby team) we like before the data are collected,
but we can’t change our minds halfway through
data collection (or after data collection). Likewise we
have to decide on our probability level (or betting odds)
before we collect data. If we do this, the process works.
However, researchers sometimes cheat. They don’t write
down their hypotheses before they conduct their experiments,
sometimes they change them when the data are
collected (like me changing my team after the World Cup
is over), or, worse still, decide on them after the data are
collected! With the exception of some complicated procedures
called post hoc tests, this is cheating. Similarly,
researchers can be guilty of choosing which significance 
level to use after the data are collected and analysed, like 
a betting shop changing the odds after the tournament. 

Every time that you change your hypothesis or the 
details of your analysis you appear to increase the chance 
of finding a significant result, but in fact you are making 
it more and more likely that you will publish results that 
other researchers can’t reproduce (which is very embarrassing!).
If, however, you follow the rules carefully and
do your significance testing at the 5% level you at least 
know that in the long run at most only 1 result out of every 
20 will risk this public humiliation. 

(With thanks to David Hitchin for this box, and with 
apologies to him for turning it into a rugby example!) 


However, what about if we complicated things by having six cups? There are 20 orders 
in which these cups can be arranged and the woman would guess the correct order only 
1 time in 20 (or 5% of the time). If she got the correct order we would be much more 
confident that she could genuinely tell the difference (and bow down in awe of her finely 
tuned palate). If you’d like to know more about Fisher and his tea-tasting antics see David
Salsburg’s excellent book The Lady Tasting Tea (Salsburg, 2002). For our purposes the
take-home point is that only when there was a very small probability that the woman could
complete the tea-task by luck alone would we conclude that she had genuine skill in detecting
whether milk was poured into a cup before or after the tea.
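
The figures of 2 and 20 orders come from counting the ways of choosing which half of the cups had the milk added first. A short sketch of that count, assuming (as in the text) an equal number of milk-first and tea-first cups; the function name is mine:

```python
from math import comb

def chance_of_guessing(cups):
    """Number of possible orders and the probability of guessing the right one.

    Assumes half of the cups had the milk added first.
    """
    milk_first = cups // 2
    arrangements = comb(cups, milk_first)
    return arrangements, 1 / arrangements

print(chance_of_guessing(2))  # (2, 0.5)
print(chance_of_guessing(6))  # (20, 0.05)
```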

It’s no coincidence that I chose the example of six cups above (where the tea-taster had 
a 5% chance of getting the task right by guessing), because Fisher suggested that 95% is a 
useful threshold for confidence: only when we are 95% certain that a result is genuine (i.e., 
not a chance finding) should we accept it as being true. 9 The opposite way to look at this 
is to say that if there is only a 5% chance (a probability of .05) of something occurring by 
chance then we can accept that it is a genuine effect: we say it is a statistically significant 
finding (see Jane Superbrain Box 2.5 to find out how the criterion of .05 became popular!). 


9 Of course, in reality, it might not be true - we’re just prepared to believe that it is! 
FIGURE 2.9 Sir Ronald A. Fisher, probably the cleverest person ever (p < .0001)




JANE SUPERBRAIN 2.5 

Why do we use .05?

This criterion of 95% confidence, or a .05 probability, forms 
the basis of modern statistics, and yet there is very little 
justification for it. How it arose is a complicated mystery to 
unravel. The significance testing that we use today is a blend 
of Fisher’s idea of using the probability value p as an index of 
the weight of evidence against a null hypothesis, and Jerzy 
Neyman and Egon Pearson’s idea of testing a null hypothesis
against an alternative hypothesis. Fisher objected to
Neyman’s use of an alternative hypothesis (among other 
things), and Neyman objected to Fisher’s exact probability 
approach (Berger, 2003; Lehmann, 1993). The confusion 
arising from both parties’ hostility to each other’s ideas led 
scientists to create a sort of bastard child of both approaches. 

This doesn’t answer the question of why we use .05. 
Well, it probably comes down to the fact that back in the 
days before computers, scientists had to compare their 
test statistics against published tables of ‘critical values’ 
(they did not have R to calculate exact probabilities for 
them). These critical values had to be calculated by exceptionally
clever people like Fisher. In his incredibly influential
textbook Statistical Methods for Research Workers


(Fisher, 1925) 10 Fisher produced tables of these critical 
values, but to save space produced tables for particular 
probability values (.05, .02 and .01). The impact of this 
book should not be underestimated (to get some idea of 
its influence 25 years after publication see Mather, 1951; 
Yates, 1951) and these tables were very frequently used 
- even Neyman and Pearson admitted the influence that 
these tables had on them (Lehmann, 1993). This disastrous
combination of researchers confused about the
Fisher and Neyman-Pearson approaches and the availability
of critical values for only certain levels of probability
led to a trend to report test statistics as being significant 
at the now infamous p < .05 and p < .01 (because critical 
values were readily available at these probabilities). 

However, Fisher acknowledged that the dogmatic 
use of a fixed level of significance was silly: ‘no scientific 
worker has a fixed level of significance at which from year 
to year, and in all circumstances, he rejects hypotheses; 
he rather gives his mind to each particular case in the 
light of his evidence and his ideas’ (Fisher, 1956).

The use of effect sizes (section 2.6.4) strikes a balance 
between using arbitrary cut-off points such as p < .05 
and assessing whether an effect is meaningful within the 
research context. The fact that we still worship at the shrine 
of p < .05 and that research papers are more likely to be 
published if they contain significant results does make 
me wonder about a parallel universe where Fisher had 
woken up in a p < .10 kind of mood. My filing cabinet full
of research with p just bigger than .05 gets published and I 
am Vice-Chancellor of my university (although, if this were 
true, the parallel universe version of my university would 
be in utter chaos, but it would have a campus full of cats). 


10 You can read this online at http://psychclassics.yorku.ca/Fisher/Methods/ 
Test statistics


We have seen that we can fit statistical models to data that represent the hypotheses that we 
want to test. Also, we have discovered that we can use probability to see whether scores are 
likely to have happened by chance (section 1.7.4). If we combine these two ideas then we 
can test whether our statistical models (and therefore our hypotheses) are significant fits of 
the data we collected. To do this we need to return to the concepts of systematic and unsystematic
variation that we encountered in section 1.6.2.2. Systematic variation is variation
that can be explained by the model that we’ve fitted to the data (and, therefore, due to the 
hypothesis that we’re testing). Unsystematic variation is variation that cannot be explained 
by the model that we’ve fitted. In other words, it is error, or variation not attributable to 
the effect we’re investigating. The simplest way, therefore, to test whether the model fits the 
data, or whether our hypothesis is a good explanation of the data we have observed, is to 
compare the systematic variation against the unsystematic variation. In doing so we compare 
how good the model/hypothesis is at explaining the data against how bad it is (the error): 


$$\text{test statistic} = \frac{\text{variance explained by the model}}{\text{variance not explained by the model}} = \frac{\text{effect}}{\text{error}}$$


This ratio of systematic to unsystematic variance or effect to error is a test statistic, and 
you’ll discover later in the book there are lots of them: t, F and χ² to name only three. The
exact form of this equation changes depending on which test statistic you’re calculating, 
but the important thing to remember is that they all, crudely speaking, represent the same 
thing: the amount of variance explained by the model we’ve fitted to the data compared to 
the variance that can’t be explained by the model (see Chapters 7 and 9 in particular for a 
more detailed explanation). The reason why this ratio is so useful is intuitive really: if our 
model is good then we’d expect it to be able to explain more variance than it can’t explain. 
In this case, the test statistic will be greater than 1 (but not necessarily significant). 
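
To make the ratio concrete, here is a sketch of one way it can be computed for scores that fall into groups. It is an assumption on my part to use the F-ratio as the example: the 'model' is simply the group means, the explained variation is how far those means lie from the grand mean, and the unexplained variation is how far individual scores lie from their own group mean. The function name and the data are invented for illustration.

```python
from statistics import fmean

def variance_ratio(groups):
    """Ratio of variance explained by group means to variance left over."""
    scores = [x for g in groups for x in g]
    grand_mean = fmean(scores)
    # Systematic variation: group means around the grand mean.
    ss_model = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Unsystematic variation: scores around their own group mean.
    ss_error = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    df_model = len(groups) - 1
    df_error = len(scores) - len(groups)
    return (ss_model / df_model) / (ss_error / df_error)

# Two hypothetical groups of three scores each.
print(variance_ratio([[1, 2, 3], [4, 5, 6]]))  # 13.5
```

Here the group means differ a lot relative to the spread within each group, so the ratio is well above 1; whether a value of that size is significant is a separate question, as the text notes.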

A test statistic is a statistic that has known properties; specifically, we know how frequently 
different values of this statistic occur. By knowing this, we can calculate the probability of 
obtaining a particular value (just as we could estimate the probability of getting a score of a certain
size from a frequency distribution in section 1.7.4). This allows us to establish how likely it
would be that we would get a test statistic of a certain size if there were no effect (i.e., the null 
hypothesis were true). Field and Hole (2003) use the analogy of the age at which people die. 
Past data have told us the distribution of the age of death. For example, we know that on average
men die at about 75 years old, and that this distribution is top heavy; that is, most people
die above the age of about 50 and it’s fairly unusual to die in your twenties. So, the frequencies
of the age of demise at older ages are very high but are lower at younger ages. From these
data, it would be possible to calculate the probability of someone dying at a certain age. If we 
randomly picked someone and asked them their age, and it was 53, we could tell them how 
likely it is that they will die before their next birthday (at which point they’d probably punch 
us!). Also, if we met a man of 110, we could calculate how probable it was that he would have 
lived that long (it would be a very small probability because most people die before they reach 
that age). The way we use test statistics is rather similar: we know their distributions and this 
allows us, once we’ve calculated the test statistic, to discover the probability of having found a 
value as big as we have. So, if we calculated a test statistic and its value was 110 (rather like our 
old man) we can then calculate the probability of obtaining a value that large. The more variation
our model explains (compared to the variance it can’t explain), the bigger the test statistic
will be, and the more unlikely it is to occur by chance (like our 110-year-old man). So, as test 
statistics get bigger, the probability of them occurring becomes smaller. When this probability 
falls below .05 (Fisher’s criterion), we accept this as giving us enough confidence to assume that 
the test statistic is as large as it is because our model explains a sufficient amount of variation to 
reflect what’s genuinely happening in the real world (the population). The test statistic is said 
JANE SUPERBRAIN 2.6

What we can and can’t conclude from a 
significant test statistic © 

The importance of an effect: We’ve seen already that
the basic idea behind hypothesis testing involves us generating
an experimental hypothesis and a null hypothesis,
fitting a statistical model to the data, and assessing
that model with a test statistic. If the probability of obtaining
the value of our test statistic by chance is less than
.05 then we generally accept the experimental hypothesis
as true: there is an effect in the population. Normally
we say ‘there is a significant effect of ...’. However, don’t
be fooled by that word ‘significant’, because even if the
probability of our effect being a chance result is small
(less than .05) it doesn’t necessarily follow that the effect
is important. Very small and unimportant effects can turn
out to be statistically significant just because huge numbers
of people have been used in the experiment (see
Field & Hole, 2003: 74).

Non-significant results: Once you’ve calculated your 
test statistic, you calculate the probability of that test statistic
occurring by chance; if this probability is greater than
.05 you reject your alternative hypothesis. However, this 
does not mean that the null hypothesis is true. Remember 
that the null hypothesis is that there is no effect in the 
population. All that a non-significant result tells us is that 
the effect is not big enough to be anything other than a 
chance finding - it doesn't tell us that the effect is zero. As 
Cohen (1990) points out, a non-significant result should 
never be interpreted as (despite the fact that it often is) ‘no 
difference between means’ or ‘no relationship between 
variables’. Cohen also points out that the null hypothesis 
is never true because we know from sampling distributions
(see section 2.5.1) that two random samples will
have slightly different means, and even though these differences
can be very small (e.g., one mean might be 10
and another might be 10.00001) they are nevertheless 
different. In fact, even such a small difference would be 
deemed as statistically significant if a big enough sample 
were used. So, significance testing can never tell us that 
the null hypothesis is true, because it never is! 
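
To see how sample size alone can drag a trivially small difference below .05, here is a minimal Python sketch. It assumes a two-group comparison tested with a simple z-test and a population standard deviation of 1; both are illustrative assumptions of mine, as is the helper name two_sample_p, and none of them comes from Cohen.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_p(mean_diff, sd, n_per_group):
    """Two-tailed p-value for a difference between two group means (z-test, SD assumed known)."""
    standard_error = sd * sqrt(2 / n_per_group)   # standard error of the difference
    z = mean_diff / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

tiny_difference = 0.00001                          # e.g., means of 10 and 10.00001
for n in (100, 10**6, 10**10, 10**11):
    p = two_sample_p(tiny_difference, sd=1.0, n_per_group=n)
    print(f"n per group = {n:>15,}   p = {p:.4f}")
```

Under these assumptions the difference stays comfortably non-significant for any realistic sample, and only slips below .05 once each group contains tens of billions of scores. The effect has not changed at all; only the sample size has.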


Significant results: OK, we may not be able to 
accept the null hypothesis as being true, but we can at 
least conclude that it is false when our results are significant,
right? Wrong! A significant test statistic is based
on probabilistic reasoning, which severely limits what 
we can conclude. Again, Cohen (1994), who was an 
incredibly lucid writer on statistics, points out that formal 
reasoning relies on an initial statement of fact followed 
by a statement about the current state of affairs, and 
an inferred conclusion. This syllogism illustrates what I 
mean: 

• If a man has no arms then he can’t play guitar: 

o This man plays guitar, 
o Therefore, this man has arms. 

The syllogism starts with a statement of fact that allows 
the end conclusion to be reached because you can deny 
the man has no arms (the antecedent) by denying that he 
can’t play guitar (the consequent).¹¹ A comparable version
of the null hypothesis is:

• If the null hypothesis is correct, then this test statistic 
cannot occur: 

o This test statistic has occurred, 
o Therefore, the null hypothesis is false. 

This is all very nice except that the null hypothesis is not 
represented in this way because it is based on probabilities.
Instead it should be stated as follows:

• If the null hypothesis is correct, then this test statistic 
is highly unlikely: 

o This test statistic has occurred, 
o Therefore, the null hypothesis is highly unlikely. 

If we go back to the guitar example we could get a similar 
statement: 

• If a man plays guitar then he probably doesn’t play 
for Fugazi (this is true because there are thousands of 
people who play guitar but only two who play guitar in 
the band Fugazi!): 

o Guy Picciotto plays for Fugazi. 
o Therefore, Guy Picciotto probably doesn't play 
guitar. 

This should hopefully seem completely ridiculous - the 
conclusion is wrong because Guy Picciotto does play
guitar. This illustrates a common fallacy in hypothesis
testing. In fact significance testing allows us to say very 
little about the null hypothesis. 


11 Thanks to Philipp Sury for unearthing footage that disproves my point (http://www.parcival.org/2007/05/22/ 
when-syllogisms-fail/). 
to be significant (see Jane Superbrain Box 2.6 for a discussion of what statistically significant
actually means). Given that the statistical model that we fit to the data reflects the hypothesis 
that we set out to test, then a significant test statistic tells us that the model would be unlikely 
to fit this well if there were no effect in the population (i.e., the null hypothesis were true).
Therefore, we can reject our null hypothesis and gain confidence that the alternative hypothesis 
is true (but, remember, we don’t accept it - see section 1.7.5). 
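
The logic of looking up how probable a test statistic is can be sketched in a few lines of Python. This is only an illustration: it assumes that the statistic follows a standard normal distribution when the null hypothesis is true (real test statistics such as t or F have their own distributions), and upper_tail_p is a made-up helper name.

```python
from statistics import NormalDist

# Assumed distribution of the test statistic when there is no effect.
null = NormalDist(mu=0, sigma=1)

def upper_tail_p(statistic):
    """Probability of a value at least this large if the null hypothesis were true."""
    return 1 - null.cdf(statistic)

for value in (0.5, 1.0, 2.0, 3.0):
    print(f"test statistic = {value:.1f}   probability = {upper_tail_p(value):.4f}")
```

The bigger the statistic, the smaller the probability; once that probability falls below .05 we would call the statistic significant.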


2.6.2. One- and two-tailed tests ©


We saw in section 1.7.5 that hypotheses can be directional (e.g., ‘the more someone reads 
this book, the more they want to kill its author’) or non-directional (i.e., ‘reading more 
of this book could increase or decrease the reader’s desire to kill its author’). A statistical 
model that tests a directional hypothesis is called a one-tailed test, whereas one testing a 
non-directional hypothesis is known as a two-tailed test. 



FIGURE 2.10 Diagram to show the difference between one- and two-tailed tests (horizontal axis: test statistic, from −4 to 4)


Imagine we wanted to discover whether reading this book increased or decreased the 
desire to kill me. We could do this either (experimentally) by taking two groups, one who 
had read this book and one who hadn’t, or (correlationally) by measuring the amount of 
this book that had been read and the corresponding desire to kill me. If we have no directional
hypothesis then there are three possibilities. (1) People who read this book want to
kill me more than those who don’t so the difference (the mean for those reading the book 
minus the mean for non-readers) is positive. Correlationally, the more of the book you 
read, the more you want to kill me - a positive relationship. (2) People who read this book 
want to kill me less than those who don’t so the difference (the mean for those reading the 
book minus the mean for non-readers) is negative. Correlationally, the more of the book 
you read, the less you want to kill me - a negative relationship. (3) There is no difference 
between readers and non-readers in their desire to kill me - the mean for readers minus 
the mean for non-readers is exactly zero. Correlationally, there is no relationship between 
reading this book and wanting to kill me. This final option is the null hypothesis. The 
direction of the test statistic (i.e., whether it is positive or negative) depends on whether 
the difference is positive or negative. Assuming there is a positive difference or
relationship (reading this book makes you want to kill me), then to detect this 
difference we have to take account of the fact that the mean for readers is bigger 
than for non-readers (and so derive a positive test statistic). However, if we’ve 
predicted incorrectly and actually reading this book makes readers want to kill 
me less then the test statistic will actually be negative. 

What are the consequences of this? Well, if at the .05 level we needed to get a 
test statistic bigger than say 10 and the one we get is actually −12, then we would
reject the hypothesis even though a difference does exist. To avoid this we can 
look at both ends (or tails) of the distribution of possible test statistics. This means 
we will catch both positive and negative test statistics. However, doing this has a 
price because to keep our criterion probability of .05 we have to split this probability
across the two tails: so we have .025 at the positive end of the distribution
and .025 at the negative end. Figure 2.10 shows this situation - the tinted areas are the areas 
above the test statistic needed at a .025 level of significance. Combine the probabilities (i.e., 
add the two tinted areas together) at both ends and we get .05, our criterion value. Now if 
we have made a prediction, then we put all our eggs in one basket and look only at one end 
of the distribution (either the positive or the negative end, depending on the direction of the 
prediction we make). So, in Figure 2.10, rather than having two small tinted areas at either 
end of the distribution that show the significant values, we have a bigger area (the lined 
area) at only one end of the distribution that shows significant values. Consequently, we can 
just look for the value of the test statistic that would occur by chance with a probability of 
.05. In Figure 2.10, the lined area is the area above the positive test statistic needed at a .05 
level of significance. Note on the graph that the value that begins the area for the .05 level 
of significance (the lined area) is smaller than the value that begins the area for the .025 level 
of significance (the tinted area). This means that if we make a specific prediction then we 
need a smaller test statistic to find a significant result (because we are looking in only one 
tail of the distribution), but if our prediction happens to be in the wrong direction then we’ll 
miss out on detecting the effect that does exist. In this context it’s important to remember 
what I said in Jane Superbrain Box 2.4: you can’t place a bet or change your bet when the 
tournament is over. If you didn’t make a prediction of direction before you collected the 
data, you are too late to predict the direction and claim the advantages of a one-tailed test. 
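
The difference between the two cut-offs can be made concrete with a small Python sketch. It assumes, purely for illustration, a test statistic that is standard normal under the null hypothesis; is_significant and its arguments are hypothetical names of my own.

```python
from statistics import NormalDist

null = NormalDist()                                # assumed null distribution (standard normal)
ALPHA = 0.05

one_tailed_cutoff = null.inv_cdf(1 - ALPHA)        # all of .05 in one tail
two_tailed_cutoff = null.inv_cdf(1 - ALPHA / 2)    # .025 in each tail

def is_significant(statistic, tails, predicted_sign=1):
    """Compare a test statistic with the cut-off for a one- or two-tailed test."""
    if tails == 2:
        return abs(statistic) > two_tailed_cutoff
    # One-tailed: only values in the predicted direction can count.
    return statistic * predicted_sign > one_tailed_cutoff

print(round(one_tailed_cutoff, 2), round(two_tailed_cutoff, 2))
print(is_significant(1.8, tails=1), is_significant(1.8, tails=2))
print(is_significant(-2.5, tails=1), is_significant(-2.5, tails=2))
```

A statistic of 1.8 clears the smaller one-tailed cut-off but not the two-tailed one, whereas a statistic of −2.5 is caught by the two-tailed test and missed entirely by a one-tailed test that predicted a positive effect.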



2.6.3. 


Type I and Type II errors © 


We have seen that we use test statistics to tell us about the true state of the world (to a certain
degree of confidence). Specifically, we’re trying to see whether there is an effect in our
population. There are two possibilities in the real world: there is, in reality, an effect in the 
population, or there is, in reality, no effect in the population. We have no way of knowing 
which of these possibilities is true; however, we can look at test statistics and their associated 
probability to tell us which of the two is more likely. Obviously, it is important that we’re as 
accurate as possible, which is why Fisher originally said that we should be very conservative 
and only believe that a result is genuine when we are 95% confident that it is - or when 
there is only a 5% chance that the results could occur if there was not an effect (the null 
hypothesis is true). However, even if we’re 95% confident there is still a small chance that 
we get it wrong. In fact there are two mistakes we can make: a Type I and a Type II error. A 
Type I error occurs when we believe that there is a genuine effect in our population, when in 
fact there isn’t. If we use Fisher’s criterion then the probability of this error is .05 (or 5%) 
when there is no effect in the population - this value is known as the α-level. Assuming there
is no effect in our population, if we replicated our data collection 100 times we could expect 
that on five occasions we would obtain a test statistic large enough to make us think that
there was a genuine effect in the population even though there isn’t. The opposite is a Type 
II error, which occurs when we believe that there is no effect in the population when, in reality,
there is. This would occur when we obtain a small test statistic (perhaps because there is
a lot of natural variation between our samples). In an ideal world, we want the probability 
of this error to be very small (if there is an effect in the population then it’s important that 
we can detect it). Cohen (1992) suggests that the maximum acceptable probability of a Type 
II error would be .2 (or 20%) - this is called the β-level. That would mean that if we took
100 samples of data from a population in which an effect exists, we would fail to detect that 
effect in 20 of those samples (so we’d miss 1 in 5 genuine effects). 

There is obviously a trade-off between these two errors: if we lower the probability of 
accepting an effect as genuine (i.e., make α smaller) then we increase the probability that
we’ll reject an effect that does genuinely exist (because we’ve been so strict about the level 
at which we’ll accept that an effect is genuine). The exact relationship between the Type I 
and Type II error is not straightforward because they are based on different assumptions: 
to make a Type I error there has to be no effect in the population, whereas to make a Type 
II error the opposite is true (there has to be an effect that we’ve missed). So, although we 
know that as the probability of making a Type I error decreases, the probability of making
a Type II error increases, the exact nature of the relationship is usually left for the
researcher to make an educated guess (Howell, 2006, gives a great explanation of the 
trade-off between errors). 
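
Both error rates, and the trade-off between them, can be seen in a small simulation. The Python sketch below is only an illustration under assumptions of my own: scores are normally distributed with a standard deviation of 1, each sample has 25 scores, the test is a two-tailed z-test of whether the population mean is zero, and the 'genuine' effect is a population mean of 0.5. The function name rejection_rate is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def rejection_rate(true_mean, alpha=0.05, n=25, reps=4000, seed=1):
    """Proportion of simulated samples whose two-tailed z-test is significant."""
    rng = random.Random(seed)
    cutoff = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        z = fmean(sample) / (1.0 / sqrt(n))        # null hypothesis: population mean is 0
        if abs(z) > cutoff:
            rejections += 1
    return rejections / reps

type_1_rate = rejection_rate(true_mean=0.0)                    # no effect, yet 'significant'
type_2_rate = 1 - rejection_rate(true_mean=0.5)                # effect exists, yet missed
type_2_strict = 1 - rejection_rate(true_mean=0.5, alpha=0.01)  # same effect, stricter criterion
print(type_1_rate, type_2_rate, type_2_strict)
```

With no effect in the population, roughly 5% of the simulated samples should still come out significant (Type I errors). With a genuine effect, some samples miss it (Type II errors), and tightening the criterion from .05 to .01 makes those misses more frequent.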


2.6.4. Effect sizes


The framework for testing whether effects are genuine that I’ve just presented has a few 
problems, most of which have been briefly explained in Jane Superbrain Box 2.6. The 
first problem we encountered was knowing how important an effect is: just because a test 
statistic is significant doesn’t mean that the effect it measures is meaningful or important. 
The solution to this criticism is to measure the size of the effect that we’re testing in a standardized
way. When we measure the size of an effect (be that an experimental manipulation
or the strength of a relationship between variables) it is known as an effect size. An
effect size is simply an objective and (usually) standardized measure of the magnitude of
the observed effect. The fact that the measure is standardized just means that we can compare
effect sizes across different studies that have measured different variables, or have used 
different scales of measurement (so an effect size based on speed in milliseconds could be 
compared to an effect size based on heart rates). Such is the utility of effect size estimates 
that the American Psychological Association is now recommending that all psychologists
report these effect sizes in the results of any published work. So, it’s a
habit well worth getting into. 

Many measures of effect size have been proposed, the most common of which 
are Cohen’s d, Pearson’s correlation coefficient r (Chapter 6) and the odds ratio 
(Chapter 18). Many of you will be familiar with the correlation coefficient as 
a measure of the strength of relationship between two variables (see Chapter 6 
if you’re not); however, it is also a very versatile measure of the strength of an 
experimental effect. It’s a bit difficult to reconcile how the humble correlation 
coefficient can also be used in this way; however, this is only because students are 
typically taught about it within the context of non-experimental research. I don’t 
want to get into it now, but as you read through Chapters 6, 9 and 10 it will (I 
hope!) become clear what I mean. Personally, I prefer Pearson’s correlation coefficient,
r, as an effect size measure because it is constrained to lie between 0 (no
effect) and 1 (a perfect effect).¹² However, there are situations in which d may be favoured;
for example, when group sizes are very discrepant r can be quite biased compared to d 
(McGrath & Meyer, 2006). 

Effect sizes are useful because they provide an objective measure of the importance of an 
effect. So, it doesn’t matter what effect you’re looking for, what variables have been measured, 
or how those variables have been measured - we know that a correlation coefficient of 0 means 
there is no effect, and a value of 1 means that there is a perfect effect. Cohen (1988, 1992) has 
also made some widely used suggestions about what constitutes a large or small effect: 

• r = .10 (small effect): In this case the effect explains 1% of the total variance. 

• r = .30 (medium effect): The effect accounts for 9% of the total variance. 

• r = .50 (large effect): The effect accounts for 25% of the variance.

It’s worth bearing in mind that r is not measured on a linear scale, so an effect with r = .6 
isn’t twice as big as one with r = .3. Although these guidelines can be a useful rule of thumb
to assess the importance of an effect (regardless of the significance of the test statistic), it is 
worth remembering that these ‘canned’ effect sizes are no substitute for evaluating an effect size 
within the context of the research domain where it is being used (Baguley, 2004; Lenth, 2001). 
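
The percentages attached to Cohen’s guidelines are simply r squared, which also shows why r is not on a linear scale. A minimal Python sketch (variance_explained is just an illustrative helper name):

```python
def variance_explained(r):
    """Proportion of the total variance accounted for by an effect of size r."""
    return r ** 2

for label, r in (("small", 0.10), ("medium", 0.30), ("large", 0.50)):
    print(f"{label:<6} r = {r:.2f}  explains {variance_explained(r):.0%} of the variance")

# Doubling r from .3 to .6 quadruples the variance explained (9% versus 36%).
ratio = variance_explained(0.6) / variance_explained(0.3)
print(round(ratio, 2))
```

So, judged by the variance accounted for, an effect of r = .6 is about four times the size of one of r = .3, not twice.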

A final thing to mention is that when we calculate effect sizes we calculate them for a 
given sample. When we looked at means in a sample we saw that we used them to draw 
inferences about the mean of the entire population (which is the value in which we’re actually
interested). The same is true of effect sizes: the size of the effect in the population is the
value in which we’re interested, but because we don’t have access to this value, we use the 
effect size in the sample to estimate the likely size of the effect in the population. We can also 
combine effect sizes from different studies researching the same question to get better estimates
of the population effect sizes. This is called meta-analysis - see Field (2001, 2005b).


2.6.5. Statistical power


Effect sizes are an invaluable way to express the importance of a research finding. The effect 
size in a population is intrinsically linked to three other statistical properties: (1) the sample 
size on which the sample effect size is based; (2) the probability level at which we will accept 
an effect as being statistically significant (the α-level); and (3) the ability of a test to detect an
effect of that size (known as the statistical power, not to be confused with statistical powder, 
which is an illegal substance that makes you understand statistics better). As such, once we 
know three of these properties, then we can always calculate the remaining one. It will also 
depend on whether the test is a one- or two-tailed test (see section 2.6.2). Typically, in psychology
we use an α-level of .05 (see earlier) so we know this value already. The power of a test is
the probability that a given test will find an effect assuming that one exists in the population. 
If you think back you might recall that we’ve already come across the probability of failing to 
detect an effect when one genuinely exists (β, the probability of a Type II error). It follows that
the probability of detecting an effect if one exists must be the opposite of the probability of not
detecting that effect (i.e., 1 − β). I’ve also mentioned that Cohen (1988, 1992) suggests that
we would hope to have a .2 probability of failing to detect a genuine effect, and so the corresponding
level of power that he recommended was 1 − .2, or .8. We should aim to achieve
a power of .8, or an 80% chance of detecting an effect if one genuinely exists. The effect size 
in the population can be estimated from the effect size in the sample, and the sample size is 


12 The correlation coefficient can also be negative (but not below −1), which is useful when we’re measuring a relationship
between two variables because the sign of r tells us about the direction of the relationship, but in experimental
research the sign of r merely reflects the way in which the experimenter coded their groups (see Chapter 6).
determined by the experimenter anyway so that value is easy to calculate. Now, there are two
useful things we can do knowing that these four variables are related: 

1 Calculate the power of a test: Given that we’ve conducted our experiment, we will 
have already selected a value of α, we can estimate the effect size based on our
sample, and we will know how many participants we used. Therefore, we can use
these values to calculate 1 − β, the power of our test. If this value turns out to be .8 or
more we can be confident that we achieved sufficient power to detect any effects that 
might have existed, but if the resulting value is less, then we might want to replicate 
the experiment using more participants to increase the power. 

2 Calculate the sample size necessary to achieve a given level of power: Given that we 
know the value of α and β, we can use past research to estimate the size of effect that we
would hope to detect in an experiment. Even if no one had previously done the exact
experiment that we intend to do, we can still estimate the likely effect size based on similar
experiments. We can use this estimated effect size to calculate how many participants
we would need to detect that effect (based on the values of α and β that we’ve chosen).


The latter use is the more common: to determine how many participants should be used 
to achieve the desired level of power. The actual computations are very cumbersome, but 
fortunately there are now computer programs available that will do them for you (one 
example is G*Power, which is free and can be downloaded from a link on the companion 
website; another is nQuery Adviser, but this has to be bought!). Also, Cohen (1988) provides
extensive tables for calculating the number of participants for a given level of power
(and vice versa). Based on Cohen (1992), we can use the following guidelines: if we take
the standard α-level of .05 and require the recommended power of .8, then we need 783
participants to detect a small effect size (r = .1), 85 participants to detect a medium effect 
size (r = .3) and 28 participants to detect a large effect size (r = .5). 
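
To get a feel for how these four quantities hang together, here is a rough Python sketch. It is not the method behind Cohen’s tables or G*Power: it uses a common textbook approximation based on the Fisher z transformation of r with a two-tailed test, and the function names n_for_correlation and approx_power are my own inventions.

```python
from math import atanh, ceil, sqrt
from statistics import NormalDist

Z = NormalDist()   # standard normal distribution

def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate number of participants needed to detect a correlation of size r."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)   # two-tailed criterion
    z_power = Z.inv_cdf(power)
    return ceil(((z_alpha + z_power) / atanh(r)) ** 2 + 3)

def approx_power(r, n, alpha=0.05):
    """Approximate power (1 - beta) for a given effect size and sample size."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    return Z.cdf(atanh(r) * sqrt(n - 3) - z_alpha)

for label, r in (("small", 0.1), ("medium", 0.3), ("large", 0.5)):
    n = n_for_correlation(r)
    print(f"{label:<6} r = {r}: about {n} participants, power = {approx_power(r, n):.2f}")
```

Under these assumptions the approximation lands on the figures quoted above for small and medium effects and gives a slightly larger number than 28 for a large effect, which is a reminder that it is an approximation and no substitute for proper power software.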




What have I discovered about statistics? © 


OK, that has been your crash course in statistical theory! Hopefully your brain is still 
relatively intact. The key point I want you to understand is that when you carry out 
research you’re trying to see whether some effect genuinely exists in your population 
(the effect you’re interested in will depend on your research interests and your specific 
predictions). You won’t be able to collect data from the entire population (unless you 
want to spend your entire life, and probably several after-lives, collecting data) so you 
use a sample instead. Using the data from this sample, you fit a statistical model to test 
your predictions, or, put another way, detect the effect you’re looking for. Statistics boil 
down to one simple idea: observed data can be predicted from some kind of model and 
an error associated with that model. You use that model (and usually the error associated 
with it) to calculate a test statistic. If that model can explain a lot of the variation in the 
data collected (the probability of obtaining that test statistic is less than .05) then you 
infer that the effect you’re looking for genuinely exists in the population. If the probability
of obtaining that test statistic is more than .05, then you conclude that the effect
was too small to be detected. Rather than rely on significance, you can also quantify the
effect in your sample in a standard way as an effect size and this can be helpful in gauging
the importance of that effect. We also discovered that I managed to get myself into
trouble at nursery school. It was soon time to move on to primary school and to new 
and scary challenges. It was a bit like using R for the first time! 
Key terms that I’ve discovered


α-level
β-level
Central limit theorem
Confidence interval
Degrees of freedom
Deviance
Effect size
Fit
Linear model
Meta-analysis
One-tailed test
Population
Power
Sample
Sampling distribution
Sampling variation
Standard deviation
Standard error
Standard error of the mean (SE)
Sum of squared errors (SS)
Test statistic
Two-tailed test
Type I error
Type II error
Variance




Smart Alex’s tasks 


• Task 1: Why do we use samples? © 

• Task 2: What is the mean and how do we tell if it’s representative of our data? © 




• Task 3: What’s the difference between the standard deviation and the standard error? ©


• Task 4: In Chapter 1 we used an example of the time taken for 21 heavy smokers to 
fall off a treadmill at the fastest setting (18, 16, 18, 24, 23, 22, 22, 23, 26, 29, 32, 34,
34, 36, 36, 43, 42, 49, 46, 46, 57). Calculate the sums of squares, variance, standard 
deviation, standard error and 95% confidence interval of these data. © 

• Task 5: What do the sum of squares, variance and standard deviation represent? How 
do they differ? © 

• Task 6: What is a test statistic and what does it tell us? © 



• Task 7: What are Type I and Type II errors? © 

• Task 8: What is an effect size and how is it measured? © 

• Task 9: What is statistical power? © 

Answers can be found on the companion website. 


Further reading 


Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304-1312. 

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997-1003. (A couple 
of beautiful articles by the best modern writer of statistics that we’ve had.) 






CHAPTER 2 EVERYTHING YOU EVER WANTED TO KNOW ABOUT STATISTICS (WELL, SORT OF) 


Field, A. P., & Hole, G. J. (2003). How to design and report experiments. London: Sage. (I am rather
biased, but I think this is a good overview of basic statistical theory.) 

Miles, J. N. V., & Banyard, P. (2007). Understanding and using statistics in psychology: a practical
introduction. London: Sage. (A fantastic and amusing introduction to statistical theory.) 

Wright, D. B., & London, K. (2009). First steps in statistics (2nd ed.). London: Sage. (This book has 
very clear introductions to sampling, confidence intervals and other important statistical ideas.) 


Interesting real research 


Domjan, M., Blesbois, E., & Williams, J. (1998). The adaptive significance of sexual conditioning: 
Pavlovian control of sperm release. Psychological Science, 9(5), 411-415. 




The R environment 
FIGURE 3.1 All I want for Christmas is ... some tasteful wallpaper


3.1. What will this chapter tell me? © 


At about 5 years old I moved from nursery (note that I moved, I was not ‘kicked out’ for 
showing my ...) to primary school. Even though my older brother was already there, I 
remember being really scared about going. None of my nursery school friends were going 
to the same school and I was terrified about meeting lots of new children. I arrived in my 
classroom, and as I’d feared, it was full of scary children. In a fairly transparent ploy to
make me think that I’d be spending the next 6 years building sand castles, the teacher told
me to play in the sand pit. While I was nervously trying to discover whether I could build a 
pile of sand high enough to bury my head in, a boy came and joined me. He was Jonathan 
Land, and he was really nice. Within an hour he was my new best friend (5-year-olds are 
fickle ...) and I loved school. Sometimes new environments seem scarier than they really 
are. This chapter introduces you to a scary new environment: R. The R environment is a 
generally more unpleasant environment in which to spend time than your normal environment;
nevertheless, we have to spend time there if we are to analyse our data. The purpose
of this chapter is, therefore, to put you in a sand pit with a 5-year-old called Jonathan. I will 
orient you in your new home and reassure you that everything will be fine. We will explore 
how R works and the key windows in R (the console, editor and graphics/quartz windows).
We will also look at how to create variables, data sets, and import and manipulate data. 


3.2. Before you start © 


R is a free software environment for statistical computing and graphics. It is what’s known 
as ‘open source’, which means that unlike commercial software companies that protectively
hide away the code on which their software is based, the people who developed R
allow everyone to access their code. This open source philosophy allows anyone, anywhere 
to contribute to the software. Consequently, the capabilities of R dynamically expand as 
people from all over the world add to it. R very much embodies all that is good about the 
World Wide Web. 


3.2.1. The R-chitecture ©


In essence, R exists as a base package with a reasonable amount of functionality. Once you 
have downloaded R and installed it on your own computer, you can start doing some data 
analysis and graphs. However, the beauty of R is that it can be expanded by downloading
packages that add specific functionality to the program. Anyone with a big enough
brain and a bit of time and dedication can write a package for other people to use. These 
packages, as well as the software itself, are stored in a central location known as the CRAN 
(Comprehensive R Archive Network). Once a package is stored in the CRAN, anyone with 
an Internet connection can download it from the CRAN and install it to use within their 
own copy of R. R is basically a big global family of fluffy altruistic people contributing to 
the goal of producing a versatile data analysis tool that is free for everyone to use. It’s a 
statistical embodiment of The Beatles’ utopian vision of peace, love and humanity: a sort 
of ‘give ps a chance’. 

The CRAN is central to using R: it is the place from where you download the software 
and any packages that you want to install. It would be a shame, therefore, if the CRAN 
were one day to explode or be eaten by cyber-lizards. The statistical world might collapse.
Even assuming the cyber-lizards don’t rise up and overthrow the Internet, it is still
a busy place. Therefore, rather than have a single CRAN location that everyone accesses, 
the CRAN is ‘mirrored’ at different places across the globe. ‘Mirrored’ simply means that 
there are identical versions of the CRAN scattered across the world. As a resident of the 
UK, I might access a CRAN location in the UK, whereas if you are in a different country 
you would likely access the copy of the CRAN in your own country (or one nearby). Bigger 
countries, such as the US, have multiple CRANs to serve them: the basic philosophy is to 
choose a CRAN that is geographically close to you. 
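
The choice of mirror can also be made from the command line. The following is a minimal sketch: options() and getOption() are standard R functions, but the URL is only an assumed example and should be swapped for a mirror near you.

```r
# Point R at a particular CRAN mirror for the current session.
# The URL is an assumption for illustration; substitute a mirror near you.
options(repos = c(CRAN = "https://cloud.r-project.org"))

# Check which mirror R will now use when installing packages.
getOption("repos")
```

When working interactively, chooseCRANmirror() offers a list of mirrors to pick from instead of typing a URL.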

FIGURE 3.2 Users download R and install packages (uploaded by statisticians around the world) to their own computer via their nearest CRAN [diagram labels: Big Brains, CRAN, Mirrors, Your Computer]

Figure 3.2 shows schematically what we have just learnt. At the centre of the diagram is 
the CRAN: a repository of the base R software and hundreds of packages. People with big 
brains from all over the world write new packages and upload them into the CRAN for 
others to use. The CRAN itself is mirrored at different places across the globe (which just 
means there are multiple copies of it). As a user of R you download the software, and install 
any packages that you want to use via your nearest CRAN. 

The idea of needing to install ‘packages’ into a piece of software to get it to do something 
for you might seem odd. However, whether you realize it or not many programs work in 
this way (just less obviously so). For example, the statistical package SPSS has a base version,
but also has many modules (for example, the bootstrapping module, advanced statistics,
exact tests and so on). If you have not paid for these modules then certain options
will be unavailable to you. Many students do not realize that SPSS has this modular format 
because they use it at a university and the university has paid for all of the modules that 
they need. Similarly, in Microsoft Excel you need to load the data analysis add-in before 
you can use certain facilities. R is not unusual in having a modular system, and in being 
modular it has enormous flexibility: as new statistical techniques are developed, contributors
can react quickly to produce a package for R; a commercial organization would likely
take much longer to include this new technique. 


3.2.2. Pros and cons of R ©


The main advantages of using R are that it is free, and it is a versatile and dynamic environment.
Its open source format and the ability of statisticians to contribute packages to
the CRAN mean that there are many things that you can do that cannot be done in commercially
available packages. In addition, it is a rapidly expanding tool and can respond
quickly to new developments in data analysis. These advantages make R an extremely 
powerful tool. 

The downside to R is mainly ease of use. The ethos of R is to work with a command line 
rather than a graphical user interface (GUI). In layman’s terms this means typing instructions 
rather than pointing, clicking, and dragging things with a mouse. This might seem weird at
first and a rather ‘retro’ way of working but I believe that once you have mastered a few fairly 
simple things, R’s written commands are a much more efficient way to work. 


3.2.3. Downloading and installing R ©


To install R onto your computer you need to visit the project website (http://www.R-project.org/).
Figure 3.3 shows the process of obtaining the installation files. On the main
project page, on the left-hand side, click on the link labelled ‘CRAN’. Remember from
the previous section that there are various copies (mirrors) of the CRAN across the globe;
therefore, the link to the CRAN will navigate you to a page of links to the various ‘mirror’
sites. Scroll down this list to find a mirror near to you (for example, in the diagram
I have highlighted the mirror closest to me, http://www.stats.bris.ac.uk/R/) and click the
link. Once you have been redirected to the CRAN mirror that you selected, you will see
a web page that asks you which platform you use (Linux, MacOS or Windows). Click the
link that applies to you. We’re assuming that most readers use either Windows or MacOS.

FIGURE 3.3 Downloading R

If you click on the ‘Windows’ link, then you’ll be taken to another page with some more
links; click on ‘base’, which will redirect you to the webpage with the link to the setup file.
Once there, click on the link that says ‘Download R 2.12.2 for Windows’ (see footnote 1), which
will initiate the download of the R setup file. Once this file has been downloaded, double-click on
it and you will enter a (hopefully) familiar install procedure.

If you click on the ‘MacOS’ link you will be taken directly to a page from where 
you can download the install package by clicking on the link labelled ‘R-2.12.2.pkg’ 
(please read the footnote about version numbers). Clicking this link will download 
the install file; once downloaded, double-click on it and you will enter the normal 
MacOS install procedure. 


Versions of R © 



At the time of writing, the current version of R is 2.12.2; however, the software
updates fairly regularly so we are confident that by the time anyone is actually reading
this, there will be a newer release (possibly several). Notice that the format of the
version number is major.minor.patch, which means that we are currently on major
version 2, minor version 12 and patch 2. Changes in the patch number happen fairly
frequently and usually reflect fixes of minor bugs (so, for example, version 2.12.3 will
come along pretty quickly but won’t really be a substantial change to the software, just
some housekeeping). Minor versions come less regularly (about every 6 months) and
still reflect a collection of bug fixes and minor housekeeping that keeps the software
running optimally. Major releases are quite rare (the switch from version 1 to version 2
happened in 2004). As such, apart from minor fixes, don’t worry if you are using a more
recent version of R than the one we’re using: it won’t make any difference, or
shouldn’t do. The best advice is to update every so often but other than
that don’t worry too much about which version you’re using; there are
more important things in life to worry about.
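
If you are unsure which version you have installed, R can tell you. The sketch below uses only functions and objects that ship with R (R.version.string, getRversion() and R.version); the object name myVersion is an assumption for illustration.

```r
# Full version string of the copy of R that is currently running.
R.version.string

# The version as an object that can be compared with other version numbers.
myVersion <- getRversion()
myVersion >= "2.12.2"   # TRUE if this copy is at least as recent as the book's

# The components: 'major' holds the major number, 'minor' holds minor.patch.
R.version$major
R.version$minor
```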


3.3. Getting started © 

Once you have installed R you can activate it in the usual way. In Windows go to the
Start menu (the big Windows icon in the bottom left of the screen), select ‘All Programs’,
then scroll down to the folder labelled ‘R’, click on it, and then click on the R icon
(Figure 3.4). In MacOS, go to your ‘Applications’ folder, scroll down to the R icon and
click on it (Figure 3.4).


1 At the time of writing the current version of R is 2.12.2, but by the time you read this book there will have been 
an update (or possibly several), so don’t be surprised if the ‘2.12.2’ in the link has changed to a different number. 
This difference is not cause for panic: the link will simply reflect the current version number of R.

3.3.1. The main windows in R ©


There are three windows that you will use in R. The main window is called the console 
(Figure 3.4) and it is where you can both type commands and see the results of executing 
these commands (in other words, see the output of your analysis). Rather than writing 
commands directly into the console you can also write them in a separate window (known 
as the editor window). Working with this window has the advantage that you can save collections
of commands as a file that you can reuse at another point in time (perhaps to rerun
the analysis, or to run a similar analysis on a different set of data). I generally tend to work 
in this way rather than typing commands into the console because it makes sense to me 
to save my work in case I need to replicate it, and as you do more analyses you begin to 
have a repository of R commands that you can quickly adapt when running a new analysis. 
Ultimately you have to do what works for you. Finally, if you produce any graphics or 
graphs they will appear in the graphics window (this window is labelled quartz in MacOS). 



FIGURE 3.4 

Getting R started 



3.3.2. Menus in R ©


Once R is up and running you’ll notice a menu bar similar to the ones you might have seen 
in other programs. Figure 3.4 shows the console window and the menu bar associated with 
this window. There are some subtle differences between Windows and MacOS versions of 
R and we will look at each version in the following two sections. At this stage, simply note 
that there are several menus at the top of the screen that can be activated
by using the computer mouse to move the on-screen arrow onto the desired menu and
then pressing the left mouse button once (I’ll call pressing this button clicking). When 
you have clicked on a menu, a menu box will appear that displays a list of options that 
can be activated by moving the on-screen arrow so that it is pointing at the desired 
option and then clicking with the mouse. Often, selecting an option from a menu 
makes a window appear; these windows are referred to as dialog boxes. When referring 
to selecting options in a menu I will use arrows to notate the menu paths; for example, 
if I were to say that you should select the Save As ... option in the File menu, you will 
see File=>Save As ... 

Before we look at Windows and MacOS versions of R, it’s worth saying that there are no 
substantive differences: all of the commands in the book work equally well on Windows
or MacOS. Other than pointing out a few differences in the next two sections, we won’t
talk about Windows and MacOS again because it won’t make a difference to how you follow
the book. If you happen to use Windows and see a screenshot from MacOS (or vice
versa), this is not cause for a nervous breakdown - I promise. 


3.3.2.1. R in Windows © 


In R for Windows, the menus available depend upon which window is active; Table 3.1 
provides an overview of the main menus and their contents. The specific content of a 
particular menu also changes according to the window that’s active. For example, when 
you are in the graphics and editor windows the File menu pretty much only gives you the 
option to save, copy or print the graphic or text displayed in the window, but in the console 
window you have many more options. Most options in the menus can also be accessed with 
keyboard shortcuts (see R’s Souls’ Tip 3.1). 



R’s Souls’ Tip 3.1: Keyboard shortcuts ©


Within the menus of software packages on Windows some letters are underlined: these underlined letters represent
the keyboard shortcut for accessing that function. It is possible to select many functions without using
the mouse, and the experienced keyboard user may find these shortcuts faster than manoeuvring the mouse
arrow to the appropriate place on the screen. The letters underlined in the menus indicate that the option can be
obtained by simultaneously pressing Alt on the keyboard and the underlined letter. So, to access the Save As...
option, using only the keyboard, you should press Alt and F on the keyboard simultaneously (which activates the
File menu), then, keeping your finger on the Alt key, press A (which is the underlined letter). If these underlined
letters are not visible, they can be displayed by pressing the Alt key.


As well as the menus there is also a set of icons at the top of the data editor window (see 
Figure 3.4) that are shortcuts to specific facilities. All of these facilities can be accessed via 
the menu system but using the icons will save you time. Table 3.2 gives a brief overview of 
these icons and their functions. 

Table 3.1 Overview of the menus in R for Windows


Menu 

Console 

Editor 

Graphics 

File: This menu allows you to do general things such as
saving the workspace (i.e., analysis output - see section
3.4), scripts or graphs. Likewise, you can open previously
saved files and print graphs, data or output. In essence, it
contains all of the options that are customarily found in File
menus.


/ 

/ 

Edit: This menu contains edit functions such as cut and 
paste. From here you can also clear the console (i.e., 
remove all of the text from it), activate a rudimentary data 
editor, and change how the GUI looks (for example, by 
default the console shows black text on white background, 
you can change the colour of both the background and 
text). 

/ 

/ 


View: This menu lets you select whether or not to see the 
toolbar (the buttons at the top of the window) and whether 
to show a status bar at the bottom of the window (which 
isn't particularly interesting). 

/ 



Misc: This menu contains options to stop ongoing 
computations (although the ESC key does a quicker job), 
to list any objects in your working environment (these 
would be objects that you have created in the current 
session - see section 3.4), and also to select whether R 
autocompletes words and filenames for you (by default it 
does). 

/ 



Packages: This menu is very important because it is where 
you load, install and update packages. You can also set 
your default CRAN mirror so that you always head to that 
location. 

/ 

/ 


Window: If you have multiple windows, this menu allows 
you to change how the windows in R are arranged. 

/ 

/ 

/ 

Help: This is an invaluable menu because it offers you 
online help (links to frequently asked questions, the R 
webpage etc.), offline help (pdf manuals, and system help 
files). 


/ 


Resize: This menu is for resizing the image in the graphics 
window so that it is a fixed size, it is scaled to fit the window 
but retains its aspect ratio (fit to window), or it expands to fit 
the window but does not maintain its aspect ratio (R mode). 



/ 


Table 3.2 Overview of the icons in R for Windows


3.3.2.2. R in MacOS ©


As with any software package for MacOS, the R menus appear at the top of the screen.
Table 3.3 provides an overview of the main menus and their contents. We will refer back
to these menus at various points so by all means feel free to explore them, but don’t worry
too much at this stage about what specific menu options do. As well as the menus there is
a set of icons at the top of both the editor and console windows, which provide shortcuts
to specific facilities. All of these facilities can be accessed via the menu system or by typing
commands, but using the icons can save you time. Table 3.4 gives an overview of these icons and
their functions.

Table 3.3 Overview of the menus in R for MacOS


Menu 


File: This menu allows you to do general things such as saving scripts or graphs. Likewise, you 
can open previously saved files and print graphs, data or output. In essence, it contains all of the 
options that are customarily found in File menus. 

Edit: This menu contains edit functions such as cut and paste. From here you can also clear the 
console (i.e., remove all of the text from it), execute commands, find a particular bit of text and so 
on. 

Format: This menu lets you change the text styles used (colour, font, etc.). 

Workspace: This menu enables you to save the workspace (i.e., analysis output - see section 
3.4), load an old workspace or browse your recent workspace files. 

Packages & Data: This menu is very important because it is where you load, install and update 
packages. 

Misc: This menu enables you to set or change the working directory. The working directory is the 
default location where R will search for and save files (see section 3.4.4). 

Window: If you have multiple windows, this menu allows you to change how the windows in R 
are arranged. 

Help: This is an invaluable menu because it offers you a searchable repository of help and 
frequently asked questions. 


3.4. Using R ©

3.4.1. Commands, objects and functions ©


I have already said that R uses ‘commands’ that are typed into the console window. As 
such, unlike other data analysis packages with which you might be familiar (e.g., SPSS, 
SAS), there are no friendly dialog boxes that you can activate to run analyses. Instead, 
everything you want to do has to be typed into the console (or executed from a script file). 
This might sound like about as much fun as having one of the living dead slowly chewing 
on your brain, but there are advantages to working in this way: although there is a steep 
initial learning curve, after time it becomes very quick to run analyses. 

Commands in R are generally made up of two parts: objects and functions. These are
separated by ‘<-’, which you can think of as meaning ‘is created from’. As such, the general
form of a command is:

object<-function

Which means ‘object is created from function’. An object is anything created in R. It could 
be a variable, a collection of variables, a statistical model, etc. Objects can be single values 
(such as the mean of a set of scores) or collections of information; for example, when you 
run an analysis, you create an object that contains the output of that analysis, which means 
that this object contains many different values and variables. Functions are the things that 
you do in R to create your objects. In the console, to execute a command you simply type 
it into the console and press the return key. (You can put more than one command on a 
single line if you prefer - see R’s Souls’ Tip 3.2) 
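
To make the ‘object is created from function’ idea concrete, here is a minimal sketch using the mean of a set of scores mentioned above. The object names and the score values are made up purely for illustration.

```r
# A small made-up set of scores (illustrative values only).
scores <- c(4, 7, 9, 10)

# The object 'meanScore' is created from the function mean().
meanScore <- mean(scores)

# Typing the name of an object displays its contents.
meanScore
```

Here ‘scores’ is an object holding a collection of values, whereas ‘meanScore’ is an object holding a single value.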

Table 3.4 Overview of the icons in R for MacOS

Description

Clicking this button stops the R processor from whatever it is doing.

Clicking this button opens a dialog box that enables you to select a previously saved script or data file.

Clicking this button opens a new graphics (quartz) window.

Clicking this button opens the X11 window; X11 is a device that some R packages use.

Clicking this button opens a dialog box into which you can enter your system password. This will enable R to run system commands. Frankly, I have never touched this button and I suspect it is to be used only by people who actually know what they’re doing.

Clicking this button activates a sidebar on the console window that lists all of your recently executed commands.

Clicking this button opens the Preferences dialog box, from which you can change the console colours (amongst other things).

Clicking this button opens a dialog box from which you can select and open a previously saved script file. This file will open in the editor window.

Clicking this button opens a new editor window in which you can create a new script file.

This icon activates a dialog box for printing whatever you are currently working on (what is printed depends on which window is active).

Clicking this button saves the script file that you’re working on. If you have not already saved the file, clicking this button activates a Save As ... dialog box.

Clicking this button quits R.


Figure 3.5 shows a very simple example in which we have created an object called ‘metallica’,
which is made up of the four band members’ (pre 2001) names. The function used
is the concatenate function or c(), which groups things together. As such, we have written
each band member’s name (in speech marks and separated by commas), and by enclosing
them in c() we bind them into a single entity or object, which we have called ‘metallica’. If
we type this command into the console then when we hit the return key on the keyboard 
the object that we have called ‘metallica’ is created. This object is stored in memory so 
we can refer back to it in future commands. Throughout this book, we denote commands 
entered into the command line in this way: 

R’s Souls’ Tip 3.2: Running multiple commands at once


The command line format of R tends to make you think that you have to run commands one at a time. Even if you 
use the R editor it is tempting to put different commands on a new line. There's nothing wrong with doing this, 
and it can make it easier to decipher your commands if you go back to a long script months after you wrote it. 
However, it can be useful to run several commands in a single line. Separating them with a semicolon does this. 
For example, the two commands: 


metallica<-metallica[metallica != "Jason"] 


metallica<-c(metallica, "Rob")

can be run in a single line by using a semicolon to separate them: 

metallica<-metallica[metallica != "Jason"]; metallica<-c(metallica, "Rob") 



FIGURE 3.5 Using the command line in R [the command metallica<-c("Lars", "James", "Jason", "Kirk") annotated to show that ‘metallica’ is the Object and c("Lars", "James", "Jason", "Kirk") is the Function]

metallica<-c("Lars","James","Jason","Kirk")

Now we have created an object called ‘metallica’ we can do things with it. First, we can 
have a look at its contents by typing ‘metallica’ (or ‘print(metallica)’ works too) into the 
command line and hitting the return key: 

metallica 

The contents of the object ‘metallica’ will be displayed in the console window. Throughout 
the book we display output as follows: 

[1] "Lars" "James" "Jason" "Kirk" 

Note that R has printed into the console the contents of the object ‘metallica’, and the 
contents are simply the four band members’ names. You need to be very careful when 
you type commands and create objects in R, because it is case sensitive (see R’s Souls’ 
Tip 3.3). 



R’s Souls’ Tip 3.3: R is case sensitive


R is case sensitive, which means that if the same things are written in upper or lower case, R thinks that they 
are completely different things. For example, we created a variable called metallica; if we asked to see the contents
of Metallica (note the capital M), R would tell us that this object didn’t exist. If we wanted to completely
confuse ourselves we could actually create a variable called Metallica (with a capital M) and put different data
into it than in the variable metallica (with a small m), and R would have no problem with us doing so. As far
as R is concerned, metallica and Metallica are as different to each other as variables called earwax and
doseOfBellendium.

This case sensitivity can create problems if you don’t pay attention. Functions are generally lower case so you 
just need to avoid accidentally using capitals, but every so often you find a function that has a capital letter (such 
as as.Date() used in this chapter) and you need to make sure you have typed it correctly. For example, if you want
to use the function data.frame() but type data.Frame() or Data.Frame() you will get an error. If you get an error,
check that you have typed any functions or variable names exactly as they should be. 
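
A short sketch of case sensitivity in action follows. exists() and tryCatch() are standard R functions; the object name errorText is an assumption, and the example presumes that you have not already created an object called Metallica with a capital M.

```r
metallica <- c("Lars", "James", "Jason", "Kirk")

exists("metallica")   # TRUE: this object has been created
exists("Metallica")   # FALSE, assuming no object with a capital M exists

# Asking for the misspelt name produces an error. tryCatch() is used here only
# so that the error message is captured and shown rather than stopping a script.
errorText <- tryCatch(Metallica, error = function(e) conditionMessage(e))
errorText
```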


We can do other things with our newly created object too. The Metallica fans amongst 
you will probably be furious at me for listing the pre-2001 line-up of the band. In 2001
bassist Jason Newsted left the band and was replaced by Rob Trujillo. Even as I type, there
are hordes of Metallica fans with precognition about the contents of this book queuing
outside my house and they have dubbed me unforgiven. Personally I’m a big fan of Rob
Trujillo: he’s given the band a solid kick up the backside, and so let’s put him in his rightful
place in the band. We currently have a ‘metallica’ object that contains Jason. First we can 
change our object to eject Jason (harsh, I know). To get rid of Jason in R we can use this 
command: 

metallica<-metallica[metallica != "Jason"] 

This just means that we’re re-creating the object ‘metallica’: the ‘<-’ means that ‘we’re
creating it from’ and our function is metallica[metallica != "Jason"], which means ‘use the
object metallica, but keep only the elements that are not equal to (!=) "Jason"’. A simple line of text and Jason is gone, which
was probably a lot less hassle than his actual ousting from the band. If only Lars and James
had come to me for advice. If we have a look at our ‘metallica’ object now we’ll see that
it contains only three names. We can do this by simply typing ‘metallica’ and hitting the
return key. The command and its output are shown below:

metallica 

[1] "Lars" "James" "Kirk" 

Now let’s add Rob Trujillo to the band. To do this we can again create an object called 
‘metallica’ (which will overwrite our previous object), and we can use the concatenate command
to take the old ‘metallica’ object and add "Rob" to it. The command looks like this:

metallica<-c(metallica, "Rob")

If we execute this command (by pressing return) and again look at the contents of ‘metallica’
we will see that Rob has been added to the band:

metallica 

[1] "Lars" "James" "Kirk" "Rob" 
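
It can help to see what the square brackets are actually given when Jason is removed. The sketch below repeats the two steps but stores the comparison in an intermediate object; the name keep is an assumption for illustration.

```r
metallica <- c("Lars", "James", "Jason", "Kirk")

# The comparison produces one TRUE or FALSE per element.
keep <- metallica != "Jason"
keep

# Placing that logical vector in square brackets keeps only the TRUE elements.
metallica <- metallica[keep]

# c() then appends the new member to the end.
metallica <- c(metallica, "Rob")
metallica
```

Because the comparison is made element by element, the same command would remove every "Jason" had the name appeared more than once.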



SELF-TEST 

Create an object that represents your favourite band
(unless it’s Metallica, in which case use your second 
favourite band) and that contains the names of each 
band member. If you don’t have a favourite band, 
then create an object called friends that contains the 
names of your five best friends. 


3.4.2. Using scripts ©


Although you can execute commands from the console, I think it is better to write commands
in the R editor and execute them from there. A document of commands written in
the R editor is known as a script. There are several advantages to this way of working. First, 
at the end of your session you can save the script file, which can be reloaded in the future 
if you need to re-create your analysis. Rerunning analyses, therefore, becomes a matter of 
loading a file and hitting a few buttons - it will take less than 10 seconds. Often in life you 
need to run analyses that are quite similar to ones that you have run before; if you have a 
repository of scripts then it becomes fairly quick to create new ones by editing an existing 
one or cutting and pasting commands from existing scripts and then editing the variable 
names. Personally I find that using old scripts to create new ones speeds things up a lot, but 
this could be because I’m pretty hopeless at remembering how to do things in R. Finally, 
I often mess things up and run commands that throw error messages back in my face; if 
these commands are written directly into the console then you have to rewrite the whole 
command (or cut and paste the wrong command and edit it), whereas if you ran the command
from the editor window then you can edit the command directly without having to
cut and paste it (or rewrite it), and execute it. Again, it’s a small saving in time, but these 
savings add up until eventually the savings outweigh the actual time you’re spending doing 
the task and then time starts to run backwards. I was 56 when I started writing this book, 
but thanks to using the editor window in R I am now 37. 

FIGURE 3.6
the R editor

Figure 3.6 shows how to execute commands from the editor window. Assuming you
have written some commands, all you need to do is to place the cursor in the line containing
the command that you want to execute, or if you want to execute many commands in
one go then highlight a block of commands by dragging over them while holding down the 
left mouse button. Once your commands are highlighted, you can execute them in one of 
several ways. 

In Windows, you have a plethora of choices: you can (1) click on the run icon in the toolbar; (2) click the right
mouse button while in the editor window to activate a menu, then click with the left mouse 
button on the top option which is to run the command (see Figure 3.6); (3) go through 
the main menus by selecting Edit=>Run line or selection; or (4) press and hold down the 
Ctrl key, and while holding it down press and release the letter R on the keyboard (this 
is by far the quickest option). In the book we notate pressing a key while another is held 
down as ‘hold + press’, for example Ctrl + R means press the R key while holding down 
the Ctrl key. 

In MacOS you can run the highlighted commands, or the current line, through the 
menus by selecting Edit=>Execute, but as with Windows the keyboard shortcut is much 
quicker: press and hold down the cmd key (⌘), and while holding it down press and release
the return key (↵). In case you skipped the previous paragraph, we will notate pressing a
key while another is held down as ‘hold + press’, for example ⌘ + ↵ means press the ↵
key while holding down the ⌘.

You’ll notice that the commands appear in the console window as they are executed, 
along with any consequences of those commands (for example, if one of your commands 
asks to view an object the contents will be printed in the console just the same as if you had 
typed the command directly into the console). 
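
A saved script can also be run in its entirety from the console, without opening it in the editor. This is not one of the methods described above; it is a sketch using the standard source() function. A temporary file stands in for a script saved from the editor, and the object names are assumptions.

```r
# Write two commands to a temporary script file (a stand-in for a script saved
# from the editor; the file name is generated by tempfile()).
scriptFile <- tempfile(fileext = ".R")
writeLines(c('band <- c("Lars", "James", "Kirk", "Rob")',
             'nMembers <- length(band)'),
           scriptFile)

# Run every command in the script; echo = TRUE prints each command as it runs.
source(scriptFile, echo = TRUE)

# Objects created by the script are now available in the session.
nMembers
```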


3.4.3. The R workspace ©


As you work on a given data set or analysis, you will create many objects, all of which are 
stored in memory. The collection of objects and things you have created in a session is 
known as your workspace. When you quit R it will ask you if you want to save your current 
workspace. If you choose to save the workspace then you are saving the contents of the 
console window and any objects that have been created. The file is known as an R image
and is saved in a file with .RData at the end. You can save the workspace at any time using 
the File=>Save Workspace ... menu in Windows or in MacOS make sure you are in the 
console window and select File=>Save As .... 
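
The same thing can be done with commands. The sketch below uses the standard functions ls(), save(), rm() and load(); the temporary file name is an assumption, and in practice you would choose your own. The function save.image() is a shortcut that saves the entire workspace in one step.

```r
metallica <- c("Lars", "James", "Kirk", "Rob")

ls()                                  # list the objects currently in memory

# Save the objects to an .RData file (a temporary file is used here so that
# the example runs anywhere).
imageFile <- tempfile(fileext = ".RData")
save(list = ls(), file = imageFile)

rm(metallica)                         # remove the object from memory ...
load(imageFile)                       # ... and restore it from the saved file
metallica
```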


3.4.4. Setting a working directory ©


By default, when you try to do anything (e.g., open a file) from R it will go to the directory 
in which the program is stored on your computer. This is fine if you happen to store all of 
your data and output in that folder, but it is highly unlikely that you do. If you don’t then 
every time you want to load or save a file you will find yourself wasting time using the 
menus to navigate around your computer to try to find files, and you will probably lose 
track of things you save because they have been dumped in R’s home folder. You will also 
end up having to specify the exact file path for every file you save/access. For example, 
assuming that you’re using Windows, your user name is ‘Andy F’ (because you’ve stolen my 
identity), you have a folder in your main documents folder called ‘Data’ and within that 
you have another folder called ‘R Book Examples’, then if you want to access this folder 
(to save or load a file) you’d have to use this file path: 

C:/Users/Andy F/Documents/Data/R Book Examples 

So, to load a file called data.dat from this location you would need to execute the following
command:

myData = read.delim("C:/Users/Andy F/Documents/Data/R Book Examples/data.dat")

Don’t worry about what this command means (we’ll get to that in due course), I just 
want you to notice that it is going to get pretty tedious to keep typing ‘C:/Users/Andy F/ 
Documents/Data/R Book Examples’ every time you want to load or save something. 

If you use R as much as I do then all this time typing locations has two consequences: (1) 
all those seconds have added up and I have probably spent weeks typing file paths when I 
could have been doing something useful like playing my drum kit; (2) I have increased my 
chances of getting RSI in my wrists, and if I’m going to get RSI in my wrists I can think 
of more enjoyable ways to achieve it than typing file paths (drumming again, obviously). 

The best piece of advice I can give you is to establish a working directory at the beginning 
of your R session. This is a directory in which you want to store your data files, any scripts 
associated with the analysis or your workspace. Basically, anything to do with a session. 
To begin with, create this folder (in the usual way in Windows or MacOS) and place the 
data files you’ll be using in that folder. Then, when you start your session in R change the 
working directory to be the folder that you have just created. Let’s assume again that you’re 
me (Andy F), that you have a folder in ‘My Documents’ called ‘Data’ and within that you 
have created a folder called ‘R Book Examples’ in which you have placed some data files 
that you want to analyse. To set the working directory to be this folder, we use the setwd() 
command to specify this newly created folder as the working directory: 

setwd("C:/Users/Andy F/Documents/Data/R Book Examples")

By executing this command, we can now access files in that folder directly without having 
to reference the full file path. For example, if we wanted to load our data.dat file again, we 
can now execute this command: 

myData = read.delim("data.dat") 

Compare this command with the one we wrote earlier; it is much shorter because we can
now specify only the file name safe in the knowledge that R will automatically try to find 
the file in ‘C:/Users/Andy F/Documents/Data/R Book Examples’. If you want to check what 
the working directory is then execute this command: 

getwd() 

Executing this command will display the current working directory in the console
window.²

In MacOS you can do much the same thing except that you won’t have a C drive.
Assuming you are likely to work in your main user directory, the easiest thing to do is to
use the ‘~’ symbol, which is a shorthand for your user directory. So, if we use the same
file path as we did for Windows, we can specify this as:

setwd("~/Documents/Data/R Book Examples")

The ~ specifies the MacOS equivalent of ‘C:/Users/Andy F’. Alternatively, you can navigate
to the directory that you want to use using the Misc=>Change Working Directory menu
path (or ⌘ + D).

Throughout the book I am going to assume that for each chapter you have stored the 
data files somewhere that makes sense to you and that you have set this folder to be your 
working directory. If you do not do this then you’ll find that commands that load and save 
files will not work. 
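
If you would like to see the whole routine in one go, here is a minimal sketch. It assumes nothing about your own folders: a temporary folder created by R stands in for ‘R Book Examples’, and the tiny data.dat file (with made-up id and score columns) is written purely so that there is something to load.

```r
# Minimal sketch: a temporary folder stands in for 'R Book Examples'
exampleDir <- file.path(tempdir(), "R Book Examples")
dir.create(exampleDir, showWarnings = FALSE)

# Write a small tab-delimited file (made-up data) using its full path
fullPath <- file.path(exampleDir, "data.dat")
write.table(data.frame(id = 1:3, score = c(10, 20, 30)),
            file = fullPath, sep = "\t", row.names = FALSE, quote = FALSE)

oldDir <- setwd(exampleDir)           # setwd() returns the previous working directory
myData <- read.delim("data.dat")      # the file name alone is now enough
workingDirName <- basename(getwd())   # last part of the current working directory
setwd(oldDir)                         # put the working directory back as it was
```

Storing the value returned by setwd() is simply a convenient way to return to wherever you started once you have finished.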


Installing packages


Earlier on I mentioned that R comes with some base functions ready for you to use. 
However, to get the most out of it we need to install packages that enable us to do
particular things. For example, in the next chapter we look at graphs, and to create the graphs
in that chapter we use a package called ggplot2. This package does not come pre-installed
in R so to work through the next chapter we would have to install ggplot2 so that R can
access its functions.

You can install packages in two ways: through the menus or using a command. If you 
know the package that you want to install then the simplest way is to execute this command: 

install.packages("package.name") 

in which ‘package.name’ is replaced by the name of the package that you’d like installed. 
For example, we have (hopefully) written a package containing some functions that are 
used in the book. This package is called DSUR, therefore, to install it we would execute: 

install.packages("DSUR")

Note that the name of the package must be enclosed in speech marks. 

Once a package is installed you need to reference it for R to know that you’re using it.
You need to install the package only once,³ but you need to reference it each time you start a
new session of R. To reference a package, we simply execute this general command:

library(package.name)


² In Windows, the filepaths can also be specified using ‘\\’ to indicate directories, so that
“C:/Users/Andy F/Documents/Data/R Book Examples” is exactly the same as
“C:\\Users\\Andy F\\Documents\\Data\\R Book Examples”. R tends to return filepaths in the ‘\\’ form,
but will accept it if you specify them using ‘/’. Try not to be confused by these two different
formats. MacOS users don’t have these tribulations.

³ This isn’t strictly true: if you upgrade to a new version of R you will have to reinstall all of your packages.



in which ‘package.name’ is replaced by the name of the package that you’d like to use. 
Again, if we want to use the DSUR package we would execute: 

library(DSUR) 

Note that in this command the name of the package is not enclosed in speech marks. 
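
Because installing happens once but loading happens every session, it can be handy to check whether a package is already installed before asking R to fetch it. The sketch below is one way of doing that; installIfMissing() is a hypothetical helper of my own naming (it is not part of R or of DSUR), and it is tried out on the stats package, which ships with R, so nothing is actually downloaded.

```r
# Hypothetical helper (not from any package): install a package only if needed
installIfMissing <- function(packageName) {
  # requireNamespace() returns TRUE if the package is installed and can be loaded
  if (!requireNamespace(packageName, quietly = TRUE)) {
    install.packages(packageName)
  }
  invisible(requireNamespace(packageName, quietly = TRUE))
}

# 'stats' ships with R, so nothing is downloaded here
statsAvailable <- installIfMissing("stats")
library(stats)   # the name is not quoted here; library("stats") is also accepted
```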

Alternatively you can manage packages through the menu system. Figure 3.7 overviews
the menus for managing packages. In Windows if you select Packages=>Install package(s)...
a window will open that first asks you to select a CRAN mirror. Once you have selected the
mirror nearest to you from the list and clicked on OK, a new dialog box will open that lists all
of the available packages. Click on the one or ones that you want (you can select several
by holding down the Ctrl key as you click) and then click on OK. This will have the
same effect as using the install.packages() command. You can load packages by selecting
Packages=>Load package..., which opens a dialog box with all of the available packages
that you could load. Select the one(s) you want to load and then click on OK. This has
the same effect as the library() command.

In MacOS if you select Packages & Data=>Package Installer a window will open. Click
on Get List and a list of all the available packages appears. Click on the one or ones that
you want (you can select several by holding down the ⌘ key as you click) and then click on
Install Selected. This will have the same effect as using the install.packages() command. You
can load packages by selecting Packages & Data=>Package Manager, which opens a dialog box
with all of the available packages that you could load. Click on the tick boxes next to the one(s)
you want to load. This has the same effect as the library() command.

One entertaining (by which I mean annoying) consequence of any Tom, Dick or Harriet 
being able to contribute packages to R is that you sometimes encounter useful functions 
that have the same name as different functions in different packages. For example, there is 
a recode() function that exists in both the Hmisc and car packages. Therefore, if you have 
both of these packages loaded you will need to tell R which particular recode function you 
want to use (see R’s Souls’ Tip 3.4). 



R’s Souls’ Tip 3.4: Disambiguating functions


Occasionally you might stumble across two functions in two different packages that have the same name. For
example, there is a recode() function in both the Hmisc and car packages. If you have both packages loaded and
you try to use recode(), R won’t know which one to use or will have a guess (perhaps incorrectly). This situation
is easy to rectify: you can specify the package when you use the function as follows:


package::function() 


For example, if we want to use the recode() function in the car package we would write:

car::recode()

but to use the one in Hmisc we would write:

Hmisc::recode()

Here is an example where we recode a variable using recode() from the car package:

variableName <- car::recode(variableName, "2=0;0=2")
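
If you don’t have car or Hmisc installed, you can watch the same thing happen with functions that come with R. In this sketch (my own illustration, not an example from either package) we deliberately write a function called sd(), which hides the sd() function from the stats package, and then use package::function() to get at the original.

```r
# Our own function deliberately reuses the name of sd() from the stats package
sd <- function(x) {
  "this is my own sd(), not the one from stats"
}

scores <- c(2, 4, 4, 4, 5, 5, 7, 9)

ownResult   <- sd(scores)          # R finds our version first
statsResult <- stats::sd(scores)   # package::function() picks the stats version

rm(sd)   # remove our version so that sd() means stats::sd() again
```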


Getting help


There is an enormous amount of information on the Internet about using R, and I generally
find that if I get stuck I can find help with a quick Google (or whatever search engine
you use) search. However, there is help built into R as well. If you are using a particular
function and you want to know more about it then you can get help by executing the help()
command:

help(function) 

or by executing: 

?function 

In both cases function is the name of the function about which you need help. For example,
we used the concatenate function, c(), earlier on; if we wanted help with this function we
could execute either:

help(c)

or

?c

These commands open a new window with the help documentation for that function. Be
aware that the help files are active only if you have loaded the package to which the
function belongs. Therefore, if you try to use help but the help files are not found, check that
you have loaded the relevant package with the library() command.


3.5. Getting data into R

Creating variables


You can enter data directly into R. As we saw earlier on, you can use the c() function to
create objects that contain data. The example we used was a collection of names, but you can
do much the same with numbers. Earlier we created an object containing the names of the
four members of metallica. Let’s do this again, but this time call the object metallicaNames.
We can create this object by executing the following command:

metallicaNames<-c("Lars","James","Kirk","Rob") 

We now have an object called metallicaNames containing the band members’ names. When 
we create objects it is important to name them in a meaningful way and you should put 
some thought into the names that you choose (see R’s Souls’ Tip 3.7). 

Let’s say we wanted another object containing the ages of each band member. At the time 
of writing, their ages are 47, 47, 48 and 46, respectively. We can create a new object called 
metallicaAges in the same way as before, by executing: 

metallicaAges<-c(47, 47, 48, 46) 

Notice that when we specified names we placed the names in quotes, but when we
entered their ages we did not. The quotes tell R that the data are not numeric. Variables
that consist of data that are text are known as string variables. Variables that contain
data that are numbers are known as numeric variables. R and its associated packages
tend to be able to treat data fairly intelligently. In other words, we don’t need to tell
R whether a variable is numeric or not: it sort of works it out for itself - most of the time
at least. However, string values should always be placed in quotes, and numeric values
are never placed in quotes (unless you want them to be treated as text rather than
numbers).
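
You can check what R has decided about a variable with the class() function. The short sketch below uses the two objects from the text plus one extra object, agesAsText, which I have made up to show what happens when numbers are typed in quotes.

```r
metallicaNames <- c("Lars", "James", "Kirk", "Rob")
metallicaAges  <- c(47, 47, 48, 46)
agesAsText     <- c("47", "47", "48", "46")   # quoted, so R treats these as text

class(metallicaNames)   # "character": what the text calls a string variable
class(metallicaAges)    # "numeric"
class(agesAsText)       # "character"

mean(metallicaAges)            # 47
mean(as.numeric(agesAsText))   # 47, once the text is converted back to numbers
```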


3.5.2. Creating dataframes


We currently have two separate objects: metallicaNames and metallicaAges. Wouldn’t it be 
nice to combine them into a single object? We can do this by creating a dataframe. You can 
think of a dataframe as a spreadsheet (so, like the contents of the data editor in SPSS, or a 
worksheet in Excel). It is an object containing variables. There are other ways to combine 
variables in R but dataframes are the way we will most commonly use because of their
versatility (R’s Souls’ Tip 3.5). If we want to combine metallicaNames and metallicaAges into
a dataframe we can use the data.frame() function:

metallica<-data.frame(Name = metallicaNames, Age = metallicaAges)

In this command we create a new object (called metallica) and we create it from the
function data.frame(). The text within the data.frame() command tells R how to build the
dataframe. First it tells R to create a variable called ‘Name’, which is equal to the existing
object metallicaNames. Then it tells R to create a variable called ‘Age’, which is equal to the
existing object metallicaAges. We can look at the contents of the dataframe by executing:

metallica 

You will see the following displayed in the console: 

   Name Age
1  Lars  47
2 James  47
3  Kirk  48
4   Rob  46

As such, our dataframe consists of two variables (Name and Age): the first is the band
member’s name, and the second is their age. Now that the dataframe has been created we
can refer to these variables at any point using the general form:

dataframe$variableName 

For example, if we wanted to use the ages of metallica, we could refer to this variable as:

metallica$Age

Similarly, if we want the Name variable we could use:

metallica$Name

Let’s add a new variable that contains the age of each member’s eldest child; we will call 
this variable childAge. According to an Internet search, James’s (Cali) and Lars’s (Myles) 
eldest children were both born in 1998, Kirk’s (Angel) was born in 2006 and Rob’s (Tye- 
Orion) in 2004. At the time of writing, this makes them 12, 12, 4 and 6, respectively. We 
can add this variable using the c() function as follows: 

metallica$childAge<-c(12, 12, 4, 6) 

This command is fairly straightforward: metallica$childAge simply creates the variable
childAge in the pre-existing dataframe metallica. As always the ‘<-’ means ‘create from’,
then the c() function allows us to collect together the numbers representing each member’s
eldest child’s age (in the appropriate order).

We can look at the contents of the dataframe by executing: 

metallica 

You will see the following displayed in the console: 



   Name Age childAge
1  Lars  47       12
2 James  47       12
3  Kirk  48        4
4   Rob  46        6


Notice that the new variable has been added. 

Sometimes, especially with large dataframes, it can be useful to list the variables in the 
dataframe. This can be done using the names() function. You simply specify the name 
of the dataframe within the brackets; so, if we want to list the variables in the metallica 
dataframe, we would execute: 

names(metallica) 

The output will be a list of the variable names in the dataframe: 

[1] "Name" "Age" "childAge" 

In this case, R lists the names of the three variables in the dataframe. 
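
As a rough sketch of a few other ways to get a quick overview of a dataframe, the code below rebuilds metallica from scratch (so that it runs on its own) and then asks how many rows and variables it has. The nrow(), ncol() and str() functions are part of base R; I have not shown the output of str() because the details vary between versions of R.

```r
metallica <- data.frame(Name = c("Lars", "James", "Kirk", "Rob"),
                        Age  = c(47, 47, 48, 46))
metallica$childAge <- c(12, 12, 4, 6)

nrow(metallica)   # 4 rows, one per band member
ncol(metallica)   # 3 variables
str(metallica)    # compact summary: each variable's name, type and first values
```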


R’s Souls’ Tip 3.5: The list() and cbind() functions


Dataframes are not the only way to combine variables in R: throughout the book you will find us using the list() 
and cbind() functions to combine variables. The list() function creates a list of separate objects; you can imagine 
it as though it is your handbag (or manbag) but nicely organized. Your handbag contains lots of different objects: 
lipstick, phone, iPod, pen, etc. Those objects can be different, but that doesn’t stop them being collected into the 
same bag. The list() function creates a sort of bag into which you can place objects that you have created in R. 
However, it’s a well-organized bag and so objects that you place in it are given a number to indicate whether they 
are the first, second etc. object in the bag. For example, if we executed this command:

metallica<-list(metallicaNames, metallicaAges)

instead of the data.frame() function from the chapter, we would create an R-like handbag called metallica that looks
like this:

[[1]]
[1] "Lars"  "James" "Kirk"  "Rob"

[[2]]
[1] 47 47 48 46

Object [[1]] in the bag is the list of names, and object [[2]] in the bag is the list of ages.

The function cbind() is used simply for pasting columns of data together (you can also use rbind() to combine
rows of data together). For example, if we execute:

metallica<-cbind(metallicaNames, metallicaAges)

instead of the data.frame() function from the chapter, we would create a matrix called metallica that looks like this:

     metallicaNames metallicaAges
[1,] "Lars"         "47"
[2,] "James"        "47"
[3,] "Kirk"         "48"
[4,] "Rob"          "46"

Notice that the end result is that the two variables have been pasted together as different columns in the same
object. However, notice that the numbers are in quotes; this is because the variable containing names is text, so it
causes the ages to be text as well. For this reason, cbind() is most useful for combining variables of the same type.

In general, dataframes are a versatile way to store variables: unlike cbind(), data.frame() stores variables of
different types together (trivia note: if you give cbind() a dataframe as one of its inputs it hands the job over to
data.frame(), so in that situation the result is a dataframe rather than a matrix).
Therefore, we tend to work with dataframes; however, we will use list() sometimes because some functions like to
work with lists of variables, and we will sometimes use cbind() as a quick method for combining numeric variables.
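
To see the three approaches side by side, here is a small sketch using the same two objects. The object names asList, asMatrix, asFrame and numericMatrix are my own and are used only for this illustration.

```r
metallicaNames <- c("Lars", "James", "Kirk", "Rob")
metallicaAges  <- c(47, 47, 48, 46)

asList   <- list(metallicaNames, metallicaAges)
asMatrix <- cbind(metallicaNames, metallicaAges)
asFrame  <- data.frame(Name = metallicaNames, Age = metallicaAges)

asList[[2]]            # the second object in the 'bag': the ages, still numeric
class(asMatrix[, 2])   # "character": cbind() converted the ages to text
class(asFrame$Age)     # "numeric": the dataframe kept the ages as numbers

# With numeric inputs only, cbind() gives a numeric matrix
numericMatrix <- cbind(metallicaAges, childAge = c(12, 12, 4, 6))
is.numeric(numericMatrix)   # TRUE
```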



Calculating new variables from existing ones


Although we’re not going to get into it too much here (but see Chapter 5), we can also 
use arithmetic and logical operators to create new variables from existing ones. Table 3.5 
overviews some of the basic operators that are used in R. As you can see, there are many 
operations with which you will be familiar (but see R’s Souls’ Tip 3.6) that you can use on 
variables: you can add them (using +), subtract them (using -), divide them (using /), and 
multiply them (using *). We will encounter these and the others in the table as we progress 
through the book. For now, though, we will look at a simple example to give you a sense 
that dataframes are versatile frameworks for storing and manipulating data. 

Table 3.5 Some of the main operators that can be used in R

Operator     What it does
+            Adds things together
-            Subtracts things
*            Multiplies things
/            Divides things
^ or **      Exponentiation (i.e., to the power of, so x^2 or x**2 is x squared,
             x^3 is x cubed, and so on)
<            Less than
<=           Less than or equal to
>            Greater than
>=           Greater than or equal to
==           Exactly equals to (this might confuse you because you’ll be used to
             using ‘=’ as the symbol for ‘equals’, but in R you usually use ‘==’)
!=           Not equal to
!x           Not x
x | y        x OR y (e.g., name == "Lars" | name == "James" means ‘the variable
             name is equal to either Lars or James’)
x & y        x AND y (e.g., age == 47 & name == "James" means ‘the variable age is
             equal to 47 and the variable name is equal to James’)
isTRUE(x)    Test if x is TRUE



R’s Souls’ Tip 3.6: Equals signs


A common cause of errors in R is that you will have spent your whole life using the symbol ‘=’ when you want
to say ‘equals’. For example, you’ll all be familiar with the idea that age = 37 is interpreted as ‘age equals 37’.
However, in a transparent attempt to wilfully confuse us, R uses the symbol ‘==’ instead. At first, you might
find that if you get error messages it is because you have used ‘=’ when you should have used ‘==’. It’s worth
checking your command to see whether you have inadvertently let everything you have ever learnt about equals
signs get the better of you.


If we wanted to find out how old (roughly) each band member was when he had his
first child, then we can subtract his eldest child’s age from his current age. We can store
this information in a new variable (fatherhoodAge). We would create this new variable as
follows:

metallica$fatherhoodAge<- metallica$Age - metallica$childAge 

This command is again straightforward: metallica$fatherhoodAge simply creates the
variable called fatherhoodAge in the existing dataframe (metallica). The ‘<-’ means ‘create
from’, then follows the instructions about how to create it; we ask that the new variable
is the child’s age (which is the variable childAge in the metallica data set, referred to as
metallica$childAge) subtracted from (-) the member’s age (metallica$Age). Again, if we
look at the dataframe by executing

metallica

we see that a new variable has been created containing the age of each band member when 
they had their first child. We can see from this that James and Lars were both 35 years old, 
Kirk was 44 and Rob was 40. 



   Name Age childAge fatherhoodAge
1  Lars  47       12            35
2 James  47       12            35
3  Kirk  48        4            44
4   Rob  46        6            40
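
The logical operators in Table 3.5 work on dataframe variables in the same way as the arithmetic ones. As a quick sketch (the particular questions asked here are my own, chosen only to illustrate the operators), each comparison below produces one TRUE or FALSE per band member, and those values can then be used to pick out rows.

```r
metallica <- data.frame(Name = c("Lars", "James", "Kirk", "Rob"),
                        Age  = c(47, 47, 48, 46),
                        childAge = c(12, 12, 4, 6))
metallica$fatherhoodAge <- metallica$Age - metallica$childAge

# Comparisons return one TRUE/FALSE per row
metallica$fatherhoodAge >= 40                          # FALSE FALSE TRUE TRUE
metallica$Name == "Lars" | metallica$Name == "James"   # TRUE TRUE FALSE FALSE
metallica$Age == 47 & metallica$Name == "James"        # FALSE TRUE FALSE FALSE

# Those TRUE/FALSE values can select rows: members aged 40 or over at fatherhood
olderFathers <- metallica[metallica$fatherhoodAge >= 40, ]
```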



R’s Souls’ Tip 3.7: Naming variables


There are conventions about naming variables and objects in R. Unfortunately these conventions sometimes 
contradict each other. For example, the Google style guide for R recommends that ‘Variable names should have
all lower case letters and words separated with dots’. So, for example, if you had a variable representing
children’s anxiety levels you might name it child.anxiety but should not name it child_anxiety and definitely not
Child_Anxiety. However, Hadley (see the second URL at the end of this tip) recommends ‘Variable names ...
should be lowercase. Use _ to separate words within a name. ... Strive for concise but meaningful names’. In
which case, child_anxiety would be fine.

I tend to use an old programming convention of capitalizing all but the first word. So, I would name the variable 
childAnxiety, which waves its buttocks at the aforementioned conventions. I also sometimes use underscores 
... that’s just the kind of rebellious guy I am. 

The one thing that we can all agree on is that variable names should be meaningful and concise. This skill 
can take some time and effort to perfect, and I can imagine that you might think that it is a waste of your time. 
However, as you go through your course accumulating script files, you will be grateful that you did. Imagine you 
had a variable called ‘number of times I wanted to shoot myself during Andy Field’s statistics lecture’; then you 
might have called the variable ‘shoot’. All of your analysis and output will simply refer to ‘shoot’. That’s all well and 
good, but what happens in three weeks’ time when you look at your analysis again? The chances are that you’ll 
probably think ‘What did shoot stand for? Number of shots at goal? Number of shots I drank?’ Imagine the chaos 
you could get into if you had used an acronym for the variable ‘workers attending news kiosk’. Get into a good 
habit and spend a bit of time naming objects in R in a meaningful way. The aforementioned style guides might 
also help you to become more consistent than I am in your approach to naming: 


• http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html 

• https://github.com/hadley/devtools/wiki/Style 


Organizing your data


When inputting a new set of data, you must do so in a logical way. The most logical way
(and consistent with other packages like SPSS and SAS) that we usually use is known as the
wide format. In the wide format each row represents data from one entity while each
column represents a variable. There is no discrimination between independent and dependent
variables: each should be placed in its own column. The key point is that each
row represents one entity’s data (be that entity a human, mouse, tulip, business, or water
sample). Therefore, any information about that case should be entered across the data
editor. For example, imagine you were interested in sex differences in perceptions of pain
created by hot and cold stimuli. You could place some people’s hands in a bucket of very
cold water for a minute and ask them to rate how painful they thought the experience was
on a scale of 1 to 10. You could then ask them to hold a hot potato and again measure their
perception of pain. Imagine I was a participant. You would have a single row representing
my data, so there would be a different column for my name, my gender, my pain
perception for cold water and my pain perception for a hot potato: Andy, male, 7, 10.

The column with the information about my gender is a grouping variable (also known as 
a factor): I can belong to either the group of males or the group of females, but not both. 
As such, this variable is a between-group variable (different entities belong to different 
groups). Rather than representing groups with words alone, R uses numbers paired with labels. This
involves assigning each group a number, and a label that describes the group. Therefore,
between-group variables are represented by a single column in which the group to which
the person belonged is defined using a number and label (see section 3.5.4.3). For example,
we might decide that if a person is male then we give them the number 0, and if they’re
female we give them the number 1. We then have to tell R that every time it sees a 1 in a
particular column the person is a female, and every time it sees a 0 the person is a male.
Variables that specify to which of several groups a person belongs can be used to split up
data files (so in the pain example you could run an analysis on the male and female
participants separately - see section 5.5.3).

Finally, the two measures of pain are a repeated measure (all participants were subjected 
to hot and cold stimuli). Therefore, levels of this variable (see R’s Souls’ Tip 3.8) can be 
entered in separate columns (one for pain perception for a hot stimulus and one for pain 
perception for a cold stimulus). 
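
As a rough sketch of what the pain example might look like in the wide format: only the first row (Andy, male, 7, 10) comes from the text above. The other two participants, their ratings and the variable names are invented purely for illustration, and gender is left as the raw 0/1 codes.

```r
# Only the first row (Andy, male, 7, 10) comes from the text;
# the other participants and their ratings are made up for illustration.
pain <- data.frame(
  name      = c("Andy", "Participant2", "Participant3"),
  gender    = c(0, 1, 1),    # 0 = male, 1 = female
  coldWater = c(7, 5, 8),    # pain rating (1-10) for the cold water
  hotPotato = c(10, 6, 9)    # pain rating (1-10) for the hot potato
)

# One row per person, one column per variable; the grouping variable
# can be used to pick out one group's rows
femaleOnly <- pain[pain$gender == 1, ]
```

Notice that the repeated measure (pain) takes up two columns, one per stimulus, whereas the between-group variable (gender) needs only one.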



R’s Souls’ Tip 3.8: Entering data


There is a simple rule for how variables are typically arranged in an R dataframe: data from different things go in 
different rows of the dataframe, whereas data from the same things go in different columns of the dataframe. As 
such, each person (or mollusc, goat, organization, or whatever you have measured) is represented in a different 
row. Data within each person (or mollusc, etc.) go in different columns. So, if you’ve prodded your mollusc, or 
human, several times with a pencil and measured how much it twitches as an outcome, then each prod will be 
represented by a column. 

In experimental research this means that any variable measured with the same participants (a repeated
measure) should be represented by several columns (each column representing one level of the repeated-measures
variable). However, any variable that defines different groups of things (such as when a between-group design
is used and different participants are assigned to different levels of the independent variable) is defined using
a single column. This idea will become clearer as you learn about how to carry out specific procedures. (This
golden rule is not as golden as it seems at first glance - often data need to be arranged in a different format - but
it’s a good place to start and it’s reasonably easy to rearrange a dataframe - see section 3.9.)


Imagine we were interested in looking at the differences between lecturers and students.
We took a random sample of five psychology lecturers from the University of Sussex and
five psychology students and then measured how many friends they had, their weekly
alcohol consumption (in units), their yearly income and how neurotic they were (higher
score is more neurotic). These data are in Table 3.6.

Table 3.6 Some data with which to play

Name     Birth Date    Job        No. of Friends   Alcohol (units)   Income (p.a.)   Neuroticism
Ben      03-Jul-1977   Lecturer    5               10                20,000          10
Martin   24-May-1969   Lecturer    2               15                40,000          17
Andy     21-Jun-1973   Lecturer    0               20                35,000          14
Paul     16-Jul-1970   Lecturer    4                5                22,000          13
Graham   10-Oct-1949   Lecturer    1               30                50,000          21
Carina   05-Nov-1983   Student    10               25                 5,000           7
Karina   08-Oct-1987   Student    12               20                   100          13
Doug     16-Sep-1989   Student    15               16                 3,000           9
Mark     20-May-1973   Student    12               17                10,000          14
Zoe      12-Nov-1984   Student    17               18                    10          13


3.5.4.1. Creating a string variable

The first variable in our data set is the name of the lecturer/student. This variable consists 
of names; therefore, it is a string variable. We have seen how to create string variables 
already: we use the c() function and list all values in quotations so that R knows that it is 
string data. As such, we can create a variable called name as follows: 

name<-c("Ben", "Martin", "Andy", "Paul", "Graham", "Carina", "Karina", 
"Doug", "Mark", "Zoe") 

We do not need to specify the level at which this variable was measured (see section 
1.5.1.2) because R will automatically treat it as nominal because it is a string variable, and 
therefore represents only names of cases and provides no information about the order of 
cases, or the magnitude of one case compared to another. 


3.5.4.2. Creating a date variable

Notice that the second column in our table contains dates (birth dates, to be exact). To 
enter date variables into R we use much the same procedure as with a string variable, except 
that we need to use a particular format, and we need to tell R that the data are dates if we 
want to do any date-related computations. We can convert dates written as text into date 
objects using the as.Date() function. This function takes strings of text, and converts them 
into dates; this is important if you want to do things like subtract dates from one another. 
For example, if you want to work out how old someone was when you tested him or her, 
you could take the date on which they were tested and subtract from it the date they were 
born. If you have not converted these objects from strings to date objects this subtraction 
won’t work (see R’s Souls’ Tip 3.9). 


R’s Souls’ Tip 3.9: Dates


If you want to do calculations involving dates then you need to tell R to treat a variable as a date object. Let’s look 
at what happens if we don’t. Imagine two variables (husband and wife) that contain the birthdates of four men 
and their respective wives. We might create these variables and enter these birthdates as follows: 


husband<-c("1973-06-21", "1970-07-16", "1949-10-08", "1969-05-24")
wife<-c("1984-11-12", "1973-08-02", "1948-11-11", "1983-07-23")


If we want to now calculate the age gap between these partners, then we could create a new variable, agegap, 
which is the difference between the two variables (husband - wife): 

agegap <- husband-wife 


We’d find this rather disappointing message in the console: 

Error in husband - wife : non-numeric argument to binary operator 


This message is R’s way of saying ‘What the hell are you trying to get me to do? These are words; I can’t subtract
letters from each other.’

However, if we use the as.Date() function when we create the variables then R knows that the strings of text
are dates:


husband<-as.Date(c("1973-06-21", "1970-07-16", "1949-10-08", "1969-05-24"))
wife<-as.Date(c("1984-11-12", "1973-08-02", "1948-11-11", "1983-07-23"))


If we try again to calculate the difference between the two variables:

agegap <- husband-wife
agegap

we get a more sensible output: 

Time differences in days 
[1] -4162 -1113 331 -5173 

This output tells us that in the first couple the wife is 4162 days younger than her husband (about 11 years),
whereas in the third couple the wife is 331 days older (just under a year).


The as.Date() function is placed around the function that we would normally use to enter 
a series of strings. Normally if we enter strings we use the form: 

variable<-c("string 1", "string 2", "string 3", etc.) 

For dates, these strings need to be in the form yyyy-mm-dd. In other words, if we want to 
enter the date 21 June 1973, then we would enter it as “1973-06-21”. As such, we could 
create a variable called birth_date containing the dates of birth by executing the following 
command: 

birth_date<-as.Date(c("1977-07-03", "1969-05-24", "1973-06-21", "1970-07-16", 
"1949-10-10", "1983-11-05", "1987-10-08", "1989-09-16", "1973-05-20", 
"1984-11-12")) 

Note that we have entered each date as a text string (in quotations) in the appropriate 
format (yyyy-mm-dd). By enclosing these data in the as.Date() function, these strings are 
converted to date objects. 
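
Two things are worth a quick sketch here. First, if your dates are written in some other layout, as.Date() has a format argument that lets you describe that layout. Second, once dates are date objects you can subtract them to get ages. The testing date of 1 January 2011 below is an assumption made up for the example, and dividing by 365.25 gives only a rough age in years.

```r
birth_date <- as.Date(c("1977-07-03", "1969-05-24", "1973-06-21"))

# Dates written another way can be converted by describing their layout in 'format'
sameDate <- as.Date("21/06/1973", format = "%d/%m/%Y")
sameDate == birth_date[3]   # TRUE

# Hypothetical testing date, used only to illustrate the subtraction
testDate   <- as.Date("2011-01-01")
ageInDays  <- as.numeric(testDate - birth_date)   # strip the 'days' label
ageInYears <- floor(ageInDays / 365.25)           # rough age in whole years
```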


3.5.4.3. Creating coding variables/factors

A coding variable (also known as a grouping variable or factor) is a variable that uses
numbers to represent different groups of data. As such, it is a numeric variable, but these
numbers represent names (i.e., it is a nominal variable). These groups of data could be levels
of a treatment variable in an experiment, different groups of people (men or women, an 
experimental group or a control group, ethnic groups, etc.), different geographic locations, 
different organizations, etc. 

In experiments, coding variables represent independent variables that have been
measured between groups (i.e., different participants were assigned to different groups). If you
were to run an experiment with one group of participants in an experimental condition 
and a different group of participants in a control group, you might assign the experimental 
group a code of 1 and the control group a code of 0. When you come to put the data into 
R you would create a variable (which you might call group) and type in the value 1 for 
any participants in the experimental group, and 0 for any participant in the control group. 
These codes tell R that all of the cases that have been assigned the value 1 should be treated 
as belonging to the same group, and likewise for the cases assigned the value 0. In situations 
other than experiments, you might simply use codes to distinguish naturally occurring 
groups of people (e.g., you might give students a code of 1 and lecturers a code of 0). These 
codes are completely arbitrary; for the sake of convention people typically use 0, 1, 2, 3, 
etc., but in practice you could have a code of 495 if you were feeling particularly arbitrary. 

We have a coding variable in our data: the one describing whether a person was a lecturer
or student. To create this coding variable, we follow the steps for creating a normal
variable, but we also have to tell R that the variable is a coding variable/factor and which 
numeric codes have been assigned to which groups. 

First, we can enter the data and then worry about turning these data into a coding variable.
In our data we have five lecturers (who we will code with 1) and five students (who
we will code with 2). As such, we need to enter a series of 1s and 2s into our new variable,
which we’ll call job. The way the data are laid out in Table 3.6 we have the five lecturers
followed by the five students, so we can enter the data as:

job<-c(1,1,1,1,1,2,2,2,2,2)

In situations like this, in which all cases in the same group are grouped together in the
data file, we could do the same thing more quickly using the rep() function. This function
takes the general form of rep(number to repeat, how many repetitions). As such, rep(1, 5)
will repeat the number 1 five times. Therefore, we could generate our job variable as
follows:

job<-c(rep(1, 5),rep(2, 5))

Whichever method you use, the end result is the same:

job

[1] 1 1 1 1 1 2 2 2 2 2
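
To convince yourself that the two ways of entering the codes really are equivalent, a sketch like the following can help. The names job_typed, job_rep and job_each are invented for this illustration, and the each option is a further shortcut that is not used in the text.

```r
# Typing every code by hand
job_typed <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)

# Building the same vector from two calls to rep()
job_rep <- c(rep(1, 5), rep(2, 5))

# 'each' repeats every element of the first argument before moving on
job_each <- rep(c(1, 2), each = 5)

identical(job_typed, job_rep)    # TRUE when the two vectors match exactly
identical(job_typed, job_each)
```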

To turn this variable into a factor, we use the factor() function. This function takes the 
general form: 

factor(variable, levels = c(x,y, ... z), labels = c("label1", "label2", ... "label3"))

This looks a bit scary, but it’s not too bad really. Let’s break it down: factor(variableName)
is all you really need to create the factor - in our case factor(job) would do the trick.
However, we need to tell R which values we have used to denote different groups and
we do this with levels = c(1,2,3,4, ...); as usual we use the c() function to list the values
we have used. If we have used a regular series such as 1, 2, 3, 4 we can abbreviate this
as c(1:4), where the colon simply means ‘all the values between’; so, c(1:4) is the same
as c(1,2,3,4) and c(0:6) is the same as c(0,1,2,3,4,5,6). In our case, we used 1 and 2 to
denote the two groups, so we could specify this as c(1:2) or c(1,2). The final step is to
assign labels to these levels using labels = c("label", ...). Again, we use c() to list the labels
that we wish to assign. You must list these labels in the same order as your numeric levels,
and you need to make sure you have provided a label for each level. In our case, 1 corresponds
to lecturers and 2 to students, so we would want to specify labels of “Lecturer”
and “Student”. As such, we could write labels = c("Lecturer", "Student"). If we put all
of this together we get this command, which we can execute to transform job into a coding
variable:

job<-factor(job, levels = c(1:2), labels = c("Lecturer", "Student"))
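
If you want to check that the conversion has done what you expect, the following sketch re-creates job and inspects it. The checks themselves (is.factor(), table(), as.integer()) are my suggestion rather than steps from the text.

```r
# Numeric codes: five lecturers (1) followed by five students (2)
job <- c(rep(1, 5), rep(2, 5))

# Convert the codes to a labelled factor, as in the command above
job <- factor(job, levels = c(1:2), labels = c("Lecturer", "Student"))

is.factor(job)     # TRUE once the conversion has worked
levels(job)        # the labels, in the order of the numeric codes
table(job)         # how many cases fall in each group
as.integer(job)    # the underlying codes are still there
```

The labels are matched to the levels by position, which is why they must be listed in the same order.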

Having converted job to a factor, R will treat it as a nominal variable. A final way to generate
factors is to use the gl() function - the ‘gl’ stands for generate (factor) levels. This function
takes the general form:

newFactor<-gl(number of levels, cases in each level, total cases, labels =
c("label1", "label2"...))

which creates a factor variable called newFactor; you specify the number of levels or groups
of the factor, how many cases are in each level/group, optionally the total number of cases
(the default is to multiply the number of groups by the number of cases per group), and
you can also use the labels option to list names for each level/group. We could generate the
variable job as follows:

job<-gl(2, 5, labels = c("Lecturer", "Student")) 

The end result is a fully-fledged coding variable (or factor): 

[1] Lecturer Lecturer Lecturer Lecturer Lecturer Student Student Student 
Student Student 
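
The optional ‘total cases’ value is easiest to understand by example. In this sketch the factor called alternating is hypothetical (it is not part of the lecturer data): it asks for one case per level but ten cases in total, so the pattern of levels is simply repeated.

```r
# Five cases per level, two levels: 10 cases in blocks
job <- gl(2, 5, labels = c("Lecturer", "Student"))

# One case per level, 10 cases in total: the levels alternate
alternating <- gl(2, 1, 10, labels = c("Lecturer", "Student"))

as.integer(job)           # codes in two blocks of five
as.integer(alternating)   # codes alternating between 1 and 2
```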

With any factor variable you can see the factor levels and their order by using the levels() 
function, in which you enter the name of the factor. So, to see the levels of our variable job 
we could execute: 

levels(job) 

which will produce this output: 

[1] "Lecturer" "Student" 

In other words, we know that the variable job has two levels and they are (in this order) 
Lecturer and Student. We can also use this function to set the levels of a variable. For example, 
imagine we wanted these levels to be called Medical Lecturer and Medical Student, we 
could execute: 

levels(job)<-c("Medical Lecturer", "Medical Student") 

This command will rename the levels associated with the variable job (note, the new names 
are entered as text with speech marks, and are wrapped up in the c() function). You can also 
use this function to reorder the levels of a factor - see R’s Souls’ Tip 3.13. 

This example should clarify why in experimental research grouping variables are used 
for variables that have been measured between participants: because by using a coding 
variable it is impossible for a participant to belong to more than one group. This situation 
should occur in a between-group design (i.e., a participant should not be tested in both 
the experimental and the control group). However, in repeated-measures designs (within 
subjects) each participant is tested in every condition and so we would not use this sort of 
coding variable (because each participant does take part in every experimental condition).

3.5.4.4. Creating a numeric variable

Numeric variables are the easiest ones to create and we have already created several of
these in this chapter. Our next four variables are friends, alcohol, income and neurotic.
These are all numeric variables and you can use what you have learnt so far to create
them (I hope!).



SELF-TEST

Use what you have learnt about creating variables in
R to create variables called friends, alcohol, income
and neurotic containing the data in Table 3.6.


Hopefully you have tried out the exercise, and if so you should have executed the following
commands:

friends<-c(5,2,0,4,1,10,12,15,12,17)
alcohol<-c(10,15,20,5,30,25,20,16,17,18)
income<-c(20000,40000,35000,22000,50000,5000,100,3000,10000,10)
neurotic<-c(10,17,14,13,21,7,13,9,14,13)



SELF-TEST

Having created the variables in Table 3.6, construct a
dataframe containing them all called lecturerData.


Having created the individual variables we can bind these together in a dataframe. We 
do this by executing this command: 

lecturerData<-data.frame(name,birth_date,job,friends,alcohol,income, 
neurotic) 

If we look at the contents of this dataframe we should hopefully see the same as Table 3.6:

> lecturerData
     name birth_date      job friends alcohol income neurotic
1     Ben 1977-07-03 Lecturer       5      10  20000       10
2  Martin 1969-05-24 Lecturer       2      15  40000       17
3    Andy 1973-06-21 Lecturer       0      20  35000       14
4    Paul 1970-07-16 Lecturer       4       5  22000       13
5  Graham 1949-10-10 Lecturer       1      30  50000       21
6  Carina 1983-11-05  Student      10      25   5000        7
7  Karina 1987-10-08  Student      12      20    100       13
8    Doug 1989-09-16  Student      15      16   3000        9
9    Mark 1973-05-20  Student      12      17  10000       14
10    Zoe 1984-11-12  Student      17      18     10       13
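
Printing the dataframe shows the values but not how R has stored each variable. The sketch below rebuilds the dataframe from scratch so that it runs on its own. It assumes that name holds the ten names shown in the output above and uses gl() for job. The object column_types is a made-up name for illustration.

```r
# Values copied from the output above; 'name' is assumed to have been
# created earlier in the chapter with these ten names
name <- c("Ben", "Martin", "Andy", "Paul", "Graham",
          "Carina", "Karina", "Doug", "Mark", "Zoe")
birth_date <- as.Date(c("1977-07-03", "1969-05-24", "1973-06-21", "1970-07-16",
                        "1949-10-10", "1983-11-05", "1987-10-08", "1989-09-16",
                        "1973-05-20", "1984-11-12"))
job <- gl(2, 5, labels = c("Lecturer", "Student"))
friends <- c(5, 2, 0, 4, 1, 10, 12, 15, 12, 17)
alcohol <- c(10, 15, 20, 5, 30, 25, 20, 16, 17, 18)
income <- c(20000, 40000, 35000, 22000, 50000, 5000, 100, 3000, 10000, 10)
neurotic <- c(10, 17, 14, 13, 21, 7, 13, 9, 14, 13)

lecturerData <- data.frame(name, birth_date, job, friends, alcohol, income, neurotic)

dim(lecturerData)    # number of rows, then number of columns
str(lecturerData)    # one line per variable, showing how it is stored

# The first class of each column, collected into a named vector
column_types <- sapply(lecturerData, function(x) class(x)[1])
column_types
```

This kind of check is a cheap way to confirm that job really is a factor and birth_date really is a date before you start analysing anything.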

3.5.5. Missing values


Although as researchers we strive to collect complete sets of data, it is often the case that we 
have missing data. Missing data can occur for a variety of reasons: in long questionnaires 
participants accidentally (or, depending on how paranoid you’re feeling, deliberately just to 
annoy you) miss out questions; in experimental procedures mechanical faults can lead to a 
datum not being recorded; and in research on delicate topics (e.g., sexual behaviour) participants
may exercise their right not to answer a question. However, just because we have missed
out on some data for a participant doesn’t mean that we have to ignore the data we do have 
(although it sometimes creates statistical difficulties). Nevertheless, we do need to tell R that 
a value is missing for a particular case. The principle behind missing values is quite similar to 
that of coding variables in that we use a code to represent the missing data point. In R, the 
code we use is NA (in capital letters), which stands for ‘not available’. As such, imagine that 
participants 3 and 10 had not completed their neuroticism questionnaire, then we could 
have recorded their missing data as follows when we created the variable: 

neurotic<-c(10,17,NA,13,21,7,13,9,14,NA) 

Note that if you have missing values then you sometimes need to tell functions in R to 
ignore them (see R’s Souls’ Tip 3.10). 



Missing values and functions


Many functions include a command that tells R how to deal with missing values. For example, many functions 
include the command na.rm = TRUE , which means remove the NA values before doing the computation. For 
example, the function mean() returns the mean of a variable, so that 


mean(metallica$childAge) 


will give us the mean age of Metallica’s eldest children. However, if we have missing data we can include the 
command na.rm = TRUE to tell R to ignore missing values before computing the mean: 

mean(metallica$childAge, na.rm = TRUE) 

This function is covered in more detail in Chapter 5. For now, just appreciate that individual functions often have 
commands for dealing with missing values and that we will try to flag these as we go along. 
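
To see the effect of na.rm on our own data rather than the Metallica example, here is a minimal sketch using the version of neurotic with two missing values. The calls to is.na() are an extra that the text does not cover.

```r
# neurotic with participants 3 and 10 missing
neurotic <- c(10, 17, NA, 13, 21, 7, 13, 9, 14, NA)

mean(neurotic)                 # NA: one missing value spoils the whole mean
mean(neurotic, na.rm = TRUE)   # mean of the eight values that are present

sum(is.na(neurotic))           # how many values are missing
which(is.na(neurotic))         # which cases they belong to
```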


3.6. Entering data with R Commander


It is also possible to do some basic data editing (and analysis) using a package called Rcmdr
(short for R Commander). This package loads a windows style interface for basic data
manipulation and analysis. This tool is very useful for novices or people who are freaked
out by typing commands. It is particularly useful for making minor changes to dataframes.
To install and load Rcmdr, use the menus (see section 3.4.5) or execute these commands:

install.packages("Rcmdr", dependencies = TRUE)
library(Rcmdr)

It is important that you remember the capital ‘R’ in ‘Rcmdr’ (R’s Souls’ Tip 3.3). Note that
when we install it we specify dependencies = TRUE. When a package uses other packages,
these are known as dependencies (because the package depends upon them to work).
Rcmdr is a windows interface for using lots of different functions; therefore, it relies on
a lot of other packages. If we don’t install all of these packages as well, then much of the
functionality of Rcmdr will be lost. By setting dependencies = TRUE we install not just
Rcmdr but also all of the other packages upon which it relies (because it uses a lot, installing
it can take a few minutes). 4



FIGURE 3.8 The main window of R Commander


When you have executed library(Rcmdr) you will notice that a new window appears
(Figure 3.8). This window has a lot of new menus that you can access to do various things 
(such as edit data or run basic analyses). These menus offer a windows-based interface for 
running functions within different packages. We think that as you gain experience with R 
you will prefer to use commands, but for some commonly used analyses we will show you 
how to use R Commander to get you started. The menu structure is basically identical on 
Windows and MacOS. 


4 If you have installed other packages then it’s possible that Rcmdr has been installed by one of them; nevertheless, 
it is worth installing it yourself and including the setting dependencies = TRUE to ensure that all of the packages 
upon which Rcmdr depends are installed also. 

Creating variables and entering data with R Commander


One particularly useful feature of R Commander is that it offers a basic spreadsheet style
interface for entering data (i.e., like Excel). As such, we can enter data in a way that is probably
already familiar to us. To create a new dataframe select Data=>New data set..., which
opens a dialog box that enables you to name the dataframe (Figure 3.9). For the lecturer
data let’s stick with the name lecturerData; enter this name into the box labelled Enter
name for data set and then click on OK. A spreadsheet style window will open. You can
create variables by clicking at the top of a column, which opens a dialog box into which
you can enter the name of the variable, and whether the variable is numeric or text/string
(labelled character in the dialog box). Each row represents a different entity and, having
named the variables, you can enter the relevant information for each entry - as shown for
the current data in Figure 3.9. To save the data simply close this window. (You cannot create
a new data set in this way in MacOS; however, you can edit an existing dataframe by
selecting Data=>Load data set....)


FIGURE 3.9 Entering data using R Commander

3.6.2. Creating coding variables with R Commander


The variable job represents different groups of people so we need to convert this variable 
to a factor or coding variable. We saw how to do this in section 3.5.4.3 using the factor() 
function. We can do the same in R Commander by selecting the Data=>Manage variables
in active data set=>Convert numeric variables to factors... menu. This activates a dialog
box with a list of the variables in your data set on the left (Figure 3.10). Select the variable
that you want to convert (in this case job). If you want to create the coding variable as a
new variable in your dataframe then type a name for this new variable in the space labelled
New variable name or prefix for multiple variables; otherwise leave this space blank (as
I have in the figure) and it will overwrite the existing variable. If you want to type some
labels for the levels of your coding variable (generally I would recommend that you do)
then select Supply level names and click on OK. A new dialog box will open with spaces in
which you can type the labels associated with each level of your coding variable. As you
can see, in Figure 3.10 I have typed in ‘Lecturer’ and ‘Student’ next to the numbers that
represent them in the variable job. When you have added these levels click on OK and
job will be converted to a factor. 

FIGURE 3.10 Creating a coding variable using R Commander


3.7. Using other software to enter and edit data


Although you can enter data directly into R, if you have a large complicated data set then 
the chances are that you’ll want to use a different piece of software that has a spreadsheet 
style window into which you can enter data. We will assume in this section that you are 
going to use Microsoft Excel, because it is widely available and, therefore, it’s more likely 
that you have it on your computer than specialist packages such as SPSS and SAS. If you
want to know how to enter data into SPSS and SAS then please consult my other books
(Field, 2009; Field & Miles, 2010). If you do not have Excel then OpenOffice is an excellent
free alternative for both MacOS and Windows (http://www.openoffice.org/).


OLIVER TWISTED

Please, Sir, can I have some more ... SPSS?

‘Secret Party for Statistics Slaves?’ froths Oliver as he drowns
in a puddle of his own saliva. No, Oliver, it’s a statistics package.
‘Bleagehammm’ splutters Oliver as his excitement grows into a rabid
frenzy. If you would like to know how to set up data files in SPSS then
there is an excerpt from my other book on the companion website.


To enter data you will typically use the wide format so you should apply the same rule as
we have already mentioned in this chapter: each row represents data from one entity while
each column represents a variable or levels of a variable. In Figure 3.11 I have entered the
lecturer data in Excel in this format. Notice that each person is represented by a row in the
spreadsheet, whereas each variable is represented as a column. Notice also that I have entered
the values for job as numbers rather than text. In Excel we could have entered ‘Lecturer’
and ‘Student’ rather than the values of 1 and 2. R would then import this variable as a string
variable, rather than as a numeric variable. Often R will treat these sorts of string
variables intelligently (i.e., in this case it would realize that this variable is a factor or coding
variable and treat it accordingly), but it can be useful not to assume that R will do what you
think it will and explicitly define variables as factors once the data have been imported.

FIGURE 3.11 Laying out wide format data in Excel and exporting to an R-friendly format


Importing data


Once your data are entered into Excel, OpenOffice, SPSS or whatever, we need a way to
get the data file into a dataframe in R. The usual way to do this is to export the file from
Excel/SPSS etc. in a format that R can import; however, the foreign package can be used to
directly import data files from SPSS (.sav), STATA (.dta), Systat (.sys, .syd), Minitab (.mtp),
and SAS (XPORT files). It is probably safest (in terms of knowing that what you’re actually
importing is what you think you’re importing) to export from your software of choice
into an R-friendly format.

The two most commonly used R-friendly formats are tab-delimited text (.txt in Excel and
.dat in SPSS) and comma-separated values (.csv). Both are essentially plain text files (see R’s
Souls’ Tip 3.11). It is very easy to export these types of files from Excel and other software
packages. Figure 3.11 shows the process. Once the data are entered in the desired format,
simply use the Save As... menu to open the Save As... dialog box. Select the location in
which you’d like the file to be saved (a sensible choice is the working directory that you have
set in R). By default, Excel will try to save the file as an Excel file (.xlsx or .xls); however, we
can change the format by clicking on the drop-down list labelled Save as type (Format on
MacOS). The drop-down list contains a variety of file types, but the two that are best for R
are Text (Tab delimited) and CSV (Comma delimited). Select one of these file types, type a
name for your file and click on Save. The end result will be either a .txt file or a .csv file.

The process for exporting data from SPSS (and other packages) is much the same. 

If we have saved the data as a CSV file, then we can import these data to a dataframe 
using the read.csv() function. The general form of this function is:

dataframe.name<-read.csv("filename.extension", header = TRUE) 

Let’s imagine we had stored our lecturer data in a CSV file called Lecturer Data.csv (you 
can find this file on the companion website). To load these data into a dataframe we could 
execute the following command: 

lecturerData = read.csv("C:/Users/Andy F/Documents/Data/R Book Examples/Lecturer Data.csv", header = TRUE)

This command will create a dataframe called lecturerData based on the file called ‘Lecturer 
Data.csv’ which is stored in the location ‘C:/Users/Andy F/Documents/Data/R Book 
Examples/’. 5 I urged you in section 3.4.4 to set up a working directory that relates to the 
location of your files for the current session. If we executed this command: 6 


5 For MacOS users the equivalent command would be

lecturerData = read.csv("~/Documents/Data/R Book Examples/Lecturer Data.csv", 
header = TRUE) 

6 On a Mac the equivalent command would be 

setwd("~/Documents/Data/R Book Examples") 

R’s Souls’ Tip 3.11

CSV and tab-delimited file formats


Comma-separated values (CSV) and tab-delimited file formats are really common ways to save data. Most software
that deals with numbers will recognize these formats, and when exporting and importing data it is wise to
choose one of them. The beauty of these formats is that they store the data as plain text, without any additional
nonsense that might confuse a particular piece of software. The formats differ only in which character is used to 
separate different values (CSV uses a comma, tab-delimited uses a tab space). If we think back to our Metallica 
data, this would be stored in a tab-delimited file as: 


Name	Age	childAge	fatherhoodAge
Lars	47	12	35
James	47	12	35
Kirk	48	4	44
Rob	46	6	40


Notice that each piece of data is separated by a tab space. In a CSV file, the data would look like this: 

Name,Age,childAge,fatherhoodAge
Lars,47,12,35
James,47,12,35
Kirk,48,4,44
Rob,46,6,40


The information is exactly the same as in the tab-delimited file, except that a comma instead of a tab separates
each value. When a piece of software (R, Excel, SPSS, etc.) reads a file like this into a spreadsheet, it knows
(although sometimes you have to tell it) that when it ‘sees’ a comma or a tab it simply places the next value in a
different column than the previous one.
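
You can see for yourself that the two formats carry the same information with a sketch like this one. Rather than reading files from disk, it hands the text straight to read.csv() and read.delim() through their text option. The object names csv_text, tab_text, from_csv and from_tab are invented for the illustration.

```r
# The Metallica data as comma-separated text
csv_text <- "Name,Age,childAge,fatherhoodAge
Lars,47,12,35
James,47,12,35
Kirk,48,4,44
Rob,46,6,40"

# The same text with every comma swapped for a tab
tab_text <- gsub(",", "\t", csv_text)

# read.csv() splits on commas, read.delim() splits on tabs
from_csv <- read.csv(text = csv_text, header = TRUE)
from_tab <- read.delim(text = tab_text, header = TRUE)

from_csv
all.equal(from_csv, from_tab)   # TRUE if the two dataframes match
```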


setwd("C:/Users/Andy F/Documents/Data/R Book Examples")

then we could access the file by executing this less cumbersome command: 

lecturerData<-read.csv("Lecturer Data.csv", header = TRUE) 

The header = TRUE in the command tells R that the data file has variable names in the 
first row of the file (if you have saved the file without variable names then you should use 
header = FALSE). If you’re really struggling with the concept of file paths, which would be 
perfectly understandable, then see R’s Souls’ Tip 3.12. 
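
The difference that header makes is easy to demonstrate without touching your own files. This sketch writes a tiny, made-up CSV file to a temporary location (tempfile() picks the location for us) and reads it back both ways. The file contents and the names csv_path, with_header and no_header are all hypothetical.

```r
# Write a small CSV file with variable names in the first row
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,friends", "Ben,5", "Martin,2"), csv_path)

# header = TRUE: the first row supplies the variable names
with_header <- read.csv(csv_path, header = TRUE)
names(with_header)
nrow(with_header)

# header = FALSE: the first row is treated as data and R invents names
no_header <- read.csv(csv_path, header = FALSE)
names(no_header)
nrow(no_header)

unlink(csv_path)   # tidy up the temporary file
```

Getting header wrong therefore costs you twice: an extra row of ‘data’ and numeric columns that are read as text.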

Let’s look at the data:

> lecturerData
     name birth_date job friends alcohol income neurotic
1     Ben  03-Jul-77   1       5      10  20000       10
2  Martin  24-May-69   1       2      15  40000       17
3    Andy  21-Jun-73   1       0      20  35000       14
4    Paul  16-Jul-70   1       4       5  22000       13
5  Graham  10-Oct-49   1       1      30  50000       21
6  Carina  05-Nov-83   2      10      25   5000        7
7  Karina  08-Oct-87   2      12      20    100       13
8    Doug  23-Jan-89   2      15      16   3000        9
9    Mark  20-May-73   2      12      17  10000       14
10    Zoe  12-Nov-84   2      17      18     10       13


Note that the dates have been imported as strings, and the job variable contains numbers.
So that R knows that this variable is a factor we would have to convert it using the
factor() function.

SELF-TEST

Using what you have learnt about how to use the
factor() function, see if you can work out how to
convert the job variable to a factor.
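
If you get stuck, here is one possible answer, sketched on a cut-down stand-in for the imported dataframe (only name and job are included so that the example runs on its own). The key point is that the variable now lives inside a dataframe, so we refer to it with the $ symbol.

```r
# Stand-in for the imported dataframe: only the two columns needed here
lecturerData <- data.frame(
  name = c("Ben", "Martin", "Andy", "Paul", "Graham",
           "Carina", "Karina", "Doug", "Mark", "Zoe"),
  job  = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
)

# Overwrite the numeric column with a labelled factor
lecturerData$job <- factor(lecturerData$job, levels = c(1:2),
                           labels = c("Lecturer", "Student"))

levels(lecturerData$job)
table(lecturerData$job)
```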


Similarly, if you had saved the file as a tab-delimited text file from Excel (Lecturer Data.txt)
or SPSS (Lecturer Data.dat), you could use the read.delim() function to import these
files. This function takes the same form as the read.csv() function, except that you specify
a tab-delimited file. Assuming you had set your working directory correctly, we would
execute one of these commands:

lecturerData<-read.delim("Lecturer Data.dat", header = TRUE)
lecturerData<-read.delim("Lecturer Data.txt", header = TRUE)

Typically we provide data files for chapters as .dat files, so you will use the read.delim()
function a lot.



The file.choose() function


Some people really struggle with the idea of specifying file locations in R. This confusion isn’t a reason to be
ashamed; most of us have spent our lives selecting files through dialog boxes rather than typing horribly long
strings of text. If you set your working directory and manage your files, I think the process of locating files
becomes manageable; but if you really can’t get to grips with that way of working, the alternative is to use the
file.choose() function. Executing this function opens a standard dialog box allowing you to navigate to the file you want.

You can incorporate this function into read.csv() and read.delim() as follows:


lecturerData<-read.csv(file.choose(), header = TRUE) 
lecturerData<-read.delim(file.choose(), header = TRUE) 


The effect that this has is that when you execute the command, a dialog box will appear and you can select the 
file that you want to import. 



Importing SPSS data files directly


You can also import directly from SPSS data files (and other popular packages). To give you 
some practice, we have provided the data as a .sav file (Lecturer Data.sav). First we need 
to install and load the package called foreign either using the menus (see section 3.4.5) or 
by executing these commands: 

install.packages("foreign") 
library(foreign) 

The command to read in SPSS data files is read.spss() and it works in a similar way to the
other import functions that we have already seen; however, there are a couple of extra
things that we need to think about. First, let’s just execute the command to import our
SPSS data file:

lecturerData<-read.spss("Lecturer Data.sav", use.value.labels=TRUE, to.data.frame=TRUE)

The basic format is the same as before: we have created a dataframe called lecturerData, 
and we have done this from the file named Lecturer Data.sav. There are two additional 
instructions that we have used, the first is use.value.labels = TRUE. This command tells R 
that if a variable is set up as a factor or coding variable in SPSS then it should be imported 
as a factor. If you set this value to FALSE, then it is imported as a numeric variable (in
this case you would get a variable containing 1s and 2s). The second command is to.data.frame=TRUE,
which self-evidently tells R to import the file as a dataframe. Without this
command (or if it is set to FALSE), you get lots of horrible junk imported and nobody likes 
junk. Let’s have a look at the dataframe:

> lecturerData
     name  birth_date      job friends alcohol income neurotic
1     Ben 12456115200 Lecturer       5      10  20000       10
2  Martin 12200198400 Lecturer       2      15  40000       17
3    Andy 12328848000 Lecturer       0      20  35000       14
4    Paul 12236313600 Lecturer       4       5  22000       13
5  Graham 11581056000 Lecturer       1      30  50000       21
6  Carina 12656217600  Student      10      25   5000        7
7  Karina 12780028800  Student      12      20    100       13
8    Doug 12820896000  Student      15      16   3000        9
9    Mark 12326083200  Student      12      17  10000       14
10    Zoe 12688444800  Student      17      18     10       13


Two things to note: first, unlike when we imported the CSV file, job has been imported
as a factor rather than a numeric variable (this is because we used the use.value.labels =
TRUE command). Importing this variable as a factor saves us having to convert it in a separate
command as we did for the CSV file. Second, the dates look weird. In fact, they
look very weird. They barely even resemble dates. Unfortunately, the explanation for this
is a little complicated and involves the way in which dates are stored: R stores dates as days
relative to 1 January 1970 (don’t ask me why), whereas SPSS stores them as the number of
seconds since 14 October 1582. What has happened is that birth_date, which was set up in
SPSS as a date variable, has been imported as these raw numbers of seconds. To convert it
to a form that we can actually understand we need to execute this command:

lecturerData$birth_date<-as.Date(as.POSIXct(lecturerData$birth_date, origin="1582-10-14"))


This takes the variable birth_date from the lecturerData dataframe (lecturerData$birth_date)
and re-creates it as a date variable. Hours poking around the Internet to work out
the underlying workings of this command have led me to the conclusion that I should just 
accept that it works and not question the magic. Anyway, if we execute this command and 
have another look at the dataframe we find that the dates now appear as sensible dates:

> lecturerData
     name birth_date      job friends alcohol income neurotic
1     Ben 1977-07-03 Lecturer       5      10  20000       10
2  Martin 1969-05-24 Lecturer       2      15  40000       17
3    Andy 1973-06-21 Lecturer       0      20  35000       14
4    Paul 1970-07-16 Lecturer       4       5  22000       13
5  Graham 1949-10-10 Lecturer       1      30  50000       21
6  Carina 1983-11-05  Student      10      25   5000        7
7  Karina 1987-10-08  Student      12      20    100       13
8    Doug 1989-01-23  Student      15      16   3000        9
9    Mark 1973-05-20  Student      12      17  10000       14
10    Zoe 1984-11-12  Student      17      18     10       13
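
For the curious, the conversion can be unpicked on a couple of the raw values. This is only a sketch: it relies on the imported numbers being seconds counted from 14 October 1582 (which is what the origin in the command suggests), and it fixes the time zone to UTC so that the result does not depend on your computer’s settings. The names spss_seconds, via_posix and via_days are invented.

```r
# The first two raw birth_date values from the imported SPSS file
spss_seconds <- c(12456115200, 12200198400)

# Route 1: treat the numbers as seconds from the origin, then keep the date
via_posix <- as.Date(as.POSIXct(spss_seconds, origin = "1582-10-14", tz = "UTC"),
                     tz = "UTC")

# Route 2: convert seconds to days (86400 seconds in a day) and count
# that many days from the same origin
via_days <- as.Date(spss_seconds / 86400, origin = "1582-10-14")

via_posix
via_days
```

Both routes should give the birth dates of Ben and Martin shown in the output above.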






Importing data with R Commander


You can access the read.delim(), read.csv() and read.spss() commands through the R
Commander interface too. Select Data=>Import data to activate a submenu that enables
you to open a text file, SPSS, Minitab, STATA or Excel file (Figure 3.12). If you select a text
file then a dialog box is opened into which you can place a name for the dataframe (in the
box labelled Enter name for data set:), select whether variable names are included in the
file, and what characters you have used to indicate missing values. By default, it assumes
you want to open a file on your computer, and that a white space separates data values. For
CSV and tab-delimited files you need to change this default to Commas or Tabs respectively
(you can also specify a non-standard text character). Finally, by default it is assumed
that a full stop denotes a decimal point, but in some locations a comma is used: if you live
in one of these locations you should again change the default. Having set these options,
click on OK to open a standard ‘open file’ dialog box, choose the file you want to open
and then open it.

Opening an SPSS file is much the same except that there are fewer options (Figure 3.12). 
The dialog box for importing an SPSS file again asks for a name for the dataframe, but then 
asks only whether variables that you have set up as coding variables should be converted 
to factors (see section 3.5.4.3). The default is to say yes (which is the same as specifying 
use.value.labels = TRUE, see section 3.7.2). Again, once you have set these options, click
on OK to open a standard dialog box that enables you to navigate to the file you want to
open, select it, and then click on the button that opens it.


Things that can go wrong


You can come across problems when importing data into R. One common problem is if you 
have used spaces in your variable names. Programs like SPSS don’t allow you to do this, but 
in Excel there are no such restrictions. One way to save yourself a lot of potential misery 
is just never to use variable names with spaces. Notice, for example, that for the variable 
birth_date I used an underscore (or ‘hard space’) to denote the space between the words; 
other people prefer to use a period (i.e., birth.date). Whatever you choose, avoiding spaces 
can prevent many import problems. 

Another common problem is if you forget to replace missing values with ‘NA’ in the data 
file (see section 3.5.5). If you get an error when trying to import, double-check that you 
have put ‘NA’ and not left missing values as blank. 

Finally, R imports variables with text in them intelligently: if different rows have the 
same text strings in them, R assumes that the variable is a factor and creates a factor 
variable with levels corresponding to the text strings. It orders these levels alphabetically. 
However, you might want the factor levels in a different order, in which case you need to 
reorder them - see R’s Souls’ Tip 3.13. 



Changing the order of groups in a factor variable


Imagine we imported a variable, job, that contained information about which of three jobs a person had in a
hospital: “Porter”, “Nurse”, “Surgeon”. R will order the levels alphabetically, so the resulting factor levels will be:


1. Nurse 

2. Porter 

3. Surgeon 


However, you might want them to be ordered differently. For example, perhaps you consider a porter to be a 
baseline against which you want to compare nurses and surgeons. It might be useful to have porter as the first 
level rather than the second. 

We can reorder the factor levels by executing: 

variableName<-factor(variableName, levels = levels(variableName)[c(2, 1, 3)])

in which variableName is the name of the variable. For our job variable, this command would, therefore, be:

job<-factor(job, levels = levels(job)[c(2, 1, 3)])

This command uses the factor() function to reorder the levels of the job variable. It re-creates the job variable
based on itself, but then uses the levels() function to reorder the groups. We put the order of the levels that we'd like in
the c() function, so in this case we have asked for the levels to be ordered 2, 1, 3, which means that the current second
group (porter) will become the first group, the current first group (nurse) will become the second group and the
current third group (surgeon) stays as the third group. Having executed this command, our groups will be ordered:


1. Porter 

2. Nurse 

3. Surgeon 
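
If you want to try this tip without importing anything, here is a minimal sketch. The job variable below is a made-up stand-in for an imported variable (it is not one of the book's data files), and jobByPosition and jobByName are names invented for this illustration. The second version, which names the levels outright, is an alternative I find easier to read; it should give the same result.

```r
# Hypothetical stand-in for an imported 'job' variable
job <- factor(c("Porter", "Nurse", "Surgeon", "Nurse", "Porter"))
levels(job)  # alphabetical order: Nurse, Porter, Surgeon

# Reorder by position, as in the tip: current 2nd, then 1st, then 3rd level
jobByPosition <- factor(job, levels = levels(job)[c(2, 1, 3)])
levels(jobByPosition)

# Alternative: spell out the order you want (must match the labels exactly)
jobByName <- factor(job, levels = c("Porter", "Nurse", "Surgeon"))

# Only the ordering of the levels changes; the data themselves do not
identical(as.character(job), as.character(jobByName))
```

Naming the levels has the advantage that the command still does what you expect if the alphabetical order is not what you assumed.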
3.8. Saving data

Having spent hours typing in data, you might want to save it. As with importing data, 
you can export data from R in a variety of formats. Again, for the sake of flexibility we 
recommend exporting to tab-delimited text or CSV (see R’s Souls’ Tip 3.11) because these 
formats can be imported easily into a variety of different software packages (Excel, SPSS, 
SAS, STATA, etc.). To save data as a tab-delimited file, we use the write.table() command 
and for a CSV we can use write.csv(). 

The write.table() command takes the general form:

write.table(dataframe, "Filename.txt", sep="\t", row.names = FALSE)

We replace dataframe with the name of the dataframe that we would like to save and
“Filename.txt” with the name of the file (see footnote 7). The command sep = “” sets the character to be
used to separate data values: whatever you place between the quotes will be used to separate
data values. As such, if we want to create a CSV file we could write sep = “,” (which tells
R to separate values with a comma), but to create a tab-delimited text file we would write
sep = “\t” (where we have written \t between quotes, which represents the tab key), and
we could also create a space-delimited text file by using sep = “ ” (note that there is a space
between the quotes). Finally, row.names = FALSE just prevents R from exporting a column
of row numbers (the reason for preventing this is because R does not name this column so
it throws the variable names out of sync). Earlier on we created a dataframe called metallica.
To export this dataframe to a tab-delimited text file called Metallica Data.txt, we
would execute this command:

write.table(metallica, "Metallica Data.txt", sep="\t", row.names = FALSE)

The write.csv() command takes the general form:

write.csv(dataframe, "Filename.csv")

As you can see, it is much the same as the write.table() function. In fact, it is the write.table()
function but with sep = “,” as the default (see footnote 8). So, to save the metallica dataframe as a
CSV file we can execute:

write.csv(metallica, "Metallica Data.csv") 
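
A quick way to convince yourself that an export has worked is to read the file straight back in. The sketch below does this under some assumptions of mine: demoData and the file names are invented for the illustration (they are not the metallica data), the files are written to R's temporary folder using tempdir() rather than to your working directory, and I have added row.names = FALSE to write.csv() so that both files contain the same columns.

```r
# Made-up data, including one missing value
demoData <- data.frame(name  = c("A", "B", "C"),
                       score = c(1.5, 2, NA))

# Build file paths in R's temporary folder (nothing is saved permanently)
tabFile <- file.path(tempdir(), "Demo Data.txt")
csvFile <- file.path(tempdir(), "Demo Data.csv")

# Export as tab-delimited text and as CSV
write.table(demoData, tabFile, sep = "\t", row.names = FALSE)
write.csv(demoData, csvFile, row.names = FALSE)

# Import both files again and compare them
fromTab <- read.delim(tabFile, header = TRUE)
fromCsv <- read.csv(csvFile, header = TRUE)
all.equal(fromTab, fromCsv)

# Missing values are written as NA and come back as NA
is.na(fromTab$score)
```

If the two imported dataframes match, the choice between tab-delimited and CSV is purely about which format your other software prefers.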

3.9. Manipulating data

3.9.1. Selecting parts of a dataframe


Sometimes (especially with large dataframes) you might want to select only a small portion 
of your data. This could mean choosing particular variables, or selecting particular cases. 
One way to achieve this goal is to create a new dataframe that contains only the variables 
or cases that you want. To select cases, we can execute the general command: 

newDataframe <- oldDataframe[rows, columns] 


7 Remember that if you have not set a working directory during your session then this filename will need to include
the full location information. For example, “C:/Users/Andy F/Documents/Data/R Book Examples/Filename.txt”
or “~/Documents/Data/R Book Examples/Filename.txt” in MacOS. Hopefully, it is becoming ever clearer
why setting the working directory is a good thing to do.

8 If you live in certain parts of western Europe, you might want to use write.csv2() instead which outputs the file
in the format conventional for that part of the world: it uses ‘;’ to separate values, and ‘,’ instead of ‘.’ to represent
the decimal point.
This command creates a new dataframe (called newDataframe) that contains the specified
rows and columns from the old dataframe (called oldDataframe). Let’s return to our lecturer
data (in the dataframe that we created earlier called lecturerData); imagine that we
wanted to look only at the variables that reflect some aspect of their personality (for example,
alcohol intake, number of friends, and neuroticism). We can create a new dataframe
(lecturerPersonality) that contains only these three variables by executing this command:

lecturerPersonality <- lecturerData[, c("friends", "alcohol", "neurotic")] 

Note first that we have not specified rows (there is nothing before the comma); this means 
that all rows will be selected. Note also that we have specified columns as a list of variables 
with each variable placed in quotes (be careful to spell them exactly as they are in the 
original dataframe); because we want several variables, we put them in a list using the c() 
function. If you look at the contents of the new dataframe you’ll see that it now contains 
only the three variables that we specified: 

> lecturerPersonality

   friends alcohol neurotic
1        5      10       10
2        2      15       17
3        0      20       14
4        4       5       13
5        1      30       21
6       10      25        7
7       12      20       13
8       15      16        9
9       12      17       14
10      17      18       13


Similarly, we can select specific cases of data by specifying an instruction for rows in 
the general function. This is done using a logical argument based on one of the operators 
listed in Table 3.5. For example, let’s imagine that we wanted to keep all of the variables, 
but look only at the lecturers’ data. We could do this by creating a new dataframe
(lecturerOnly) by executing this command:

lecturerOnly <- lecturerData[job=="Lecturer",] 

Note that we have not specified columns (there is nothing after the comma); this means
that all variables will be selected. However, we have specified rows using the condition job
== “Lecturer”. Remember that ‘==’ means ‘equal to’, so we have basically asked R
to select any rows for which the variable job is exactly equal to the word ‘Lecturer’ (spelt
exactly as we have). The new dataframe contains only the lecturers’ data:

> lecturerOnly

    Name        DoB      job friends alcohol income neurotic
1    Ben 1977-07-03 Lecturer       5      10  20000       10
2 Martin 1969-05-24 Lecturer       2      15  40000       17
3   Andy 1973-06-21 Lecturer       0      20  35000       14
4   Paul 1970-07-16 Lecturer       4       5  22000       13
5 Graham 1949-10-10 Lecturer       1      30  50000       21


We can be really cunning and specify both rows and columns. Imagine that we wanted 
to select the personality variables but only for people who drink more than 10 units of 
alcohol. We could do this by executing: 

alcoholPersonality <- lecturerData[alcohol > 10, c("friends", "alcohol", 
"neurotic")] 
Note that we have specified rows using the condition alcohol > 10, which means ‘select
any cases for which the value of the variable alcohol is greater than 10’. Also, we have specified
columns as in our original example, c(“friends”, “alcohol”, “neurotic”), which means
we will select only the three listed variables. You’ll see that the new dataframe contains the
same data as the lecturerPersonality dataframe except that cases 1 and 4 have been dropped
because their scores on alcohol were not greater than 10:

> alcoholPersonality

   friends alcohol neurotic
2        2      15       17
3        0      20       14
5        1      30       21
6       10      25        7
7       12      20       13
8       15      16        9
9       12      17       14
10      17      18       13
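
One thing worth knowing about the commands in this section: a condition such as job == "Lecturer" uses the bare name job, which R can find only if a variable called job exists outside the dataframe (as it does if you created the variables before combining them). The sketch below is a more defensive version that points R at the column inside the dataframe using $. It rebuilds a cut-down lecturerData from the values printed above; the column name name and the omission of the dates of birth are my simplifications, not the chapter's.

```r
# Cut-down copy of the lecturer data (dates of birth left out for brevity)
lecturerData <- data.frame(
  name     = c("Ben", "Martin", "Andy", "Paul", "Graham",
               "Carina", "Karina", "Doug", "Mark", "Zoe"),
  job      = rep(c("Lecturer", "Student"), each = 5),
  friends  = c(5, 2, 0, 4, 1, 10, 12, 15, 12, 17),
  alcohol  = c(10, 15, 20, 5, 30, 25, 20, 16, 17, 18),
  income   = c(20000, 40000, 35000, 22000, 50000, 5000, 100, 3000, 10000, 10),
  neurotic = c(10, 17, 14, 13, 21, 7, 13, 9, 14, 13)
)

# Columns only: nothing before the comma, so every row is kept
lecturerPersonality <- lecturerData[, c("friends", "alcohol", "neurotic")]

# Rows only: the condition refers to the column inside the dataframe
lecturerOnly <- lecturerData[lecturerData$job == "Lecturer", ]

# Rows and columns together
alcoholPersonality <- lecturerData[lecturerData$alcohol > 10,
                                   c("friends", "alcohol", "neurotic")]
alcoholPersonality
```

Written this way the commands do not depend on what else happens to be in your workspace, which makes them easier to reuse in a script.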


3.9.2. Selecting data with the subset() function


Another way to select parts of your dataframe is to use the subset() function. This function 
takes the general form: 

newDataframe<-subset(oldDataframe, cases to retain, select = c(list of 
variables)) 

Therefore, you create a new dataframe (newDataframe) from an existing dataframe (oldDataframe).
As in the previous section, you have to specify a condition that determines
which cases are retained. This is usually some kind of logical argument based on one or
more of the operators listed in Table 3.5; for example in our lecturerData if we wanted to
retain cases who drank a lot we could set a condition of alcohol > 10, if we wanted neurotic
alcoholics we could set a condition of alcohol > 10 & neurotic > 15. The select command
is optional, but can be used to select specific variables from the original dataframe.

Let’s re-create a couple of the examples from the previous section but using the subset() 
command. By comparing these commands to the ones in the previous section you can get 
an idea of the similarity between the methods. First, if we want to select only the lecturers’ 
data we could do this by executing: 

lecturerOnly <- subset(lecturerData, job=="Lecturer") 

Second, if we want to select the personality variables but only for people who drink more 
than 10 units of alcohol we could execute this command: 

alcoholPersonality <- subset(lecturerData, alcohol > 10, select = c("friends", 
"alcohol", "neurotic")) 

Note that we have specified rows using the condition alcohol >10, which means ‘select 
any cases for which the value of the variable alcohol is greater than 10’. Also, we have 
specified that we want only the variables friends, alcohol, and neurotic by listing them as 
part of the select command. The resulting alcoholPersonality dataframe will be the same as
the one in the previous section.

As a final point, it is worth noting that some functions have a subset option within
them that allows you to select particular cases of data in much the same way as we have
done here (i.e., using logical arguments).
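
To make the conditions described above concrete, here is a small sketch using subset(). It rebuilds a cut-down lecturerData from the values printed earlier (the column name name and the choice of columns are my simplifications), and the names neuroticDrinkers and studentFriends are invented for this illustration. Inside subset() the bare variable names are looked up in the dataframe itself, so no $ is needed.

```r
# Cut-down copy of the lecturer data
lecturerData <- data.frame(
  name     = c("Ben", "Martin", "Andy", "Paul", "Graham",
               "Carina", "Karina", "Doug", "Mark", "Zoe"),
  job      = rep(c("Lecturer", "Student"), each = 5),
  friends  = c(5, 2, 0, 4, 1, 10, 12, 15, 12, 17),
  alcohol  = c(10, 15, 20, 5, 30, 25, 20, 16, 17, 18),
  neurotic = c(10, 17, 14, 13, 21, 7, 13, 9, 14, 13)
)

# Two conditions joined with & (both must be true); all variables kept
neuroticDrinkers <- subset(lecturerData, alcohol > 10 & neurotic > 15)
neuroticDrinkers

# One condition plus select to keep only some variables
studentFriends <- subset(lecturerData, job == "Student",
                         select = c("name", "friends"))
studentFriends
```

Swapping & for | would keep cases that meet either condition rather than both.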
SELF-TEST

Using the lecturerData dataframe, create new
dataframes containing (1) the name, income and job 
of anyone earning 10,000 or more; (2) the name, job, 
income and number of friends of anyone drinking 
12 units per week or less; and (3) all of the variables 
for those who drink 20 units or more or have a 
neuroticism score greater than 14. 


3.9.3. Dataframes and matrices


So far in this chapter we have looked at storing data within dataframes. Dataframes are a 
useful way to store data because they can contain data of different types (i.e., both numeric 
and string variables). Sometimes, however, functions in R do not work on dataframes - 
they are designed instead to work on a matrix. Frankly, this is a nuisance. Luckily for us we 
can convert a dataframe to a matrix using the as.matrix() function. This function takes the 
general form: 

newMatrix <- as.matrix(dataframe) 

in which newMatrix is the matrix that you create, and dataframe is the dataframe from 
which you create it. 

Despite what Hollywood would have you believe, a matrix does not enable you to 
jump acrobatically through the air, Ninja style, as time seemingly slows so that you can 
gracefully contort to avoid high-velocity objects. I have worked with matrices many 
times, and I have never (to my knowledge) stopped time, and would certainly end up in 
a pool of my own innards if I ever tried to dodge a bullet. The sad reality is that a matrix 
is just a grid of numbers. In fact, it’s a lot like a dataframe. The main difference between
a dataframe and a matrix is that every value in a matrix must be of the same type, so a matrix
that we want to treat as numbers can contain only numeric variables (it cannot mix them with
string variables or dates). As such, we convert only the numeric bits of a
dataframe to a matrix. If you try to convert any string variables or dates, your ears will
become turnips. Probably.

If we want to create a matrix we have to first select only numeric variables. We did this
in the previous section when we created the alcoholPersonality dataframe. Sticking with
this dataframe then, we could convert it to a matrix (which I’ve called alcoholPersonalityMatrix)
by executing this command:

alcoholPersonalityMatrix <- as.matrix(alcoholPersonality) 

This command creates a matrix called alcoholPersonalityMatrix from the alcoholPersonality
dataframe. Remember from the previous section that alcoholPersonality was originally
made up of parts of the lecturerData dataframe; it would be equally valid to create the
matrix directly from this dataframe but selecting the bits that we want in the matrix just as
we did when creating alcoholPersonality:

alcoholPersonalityMatrix <- as.matrix(lecturerData[alcohol > 10, 
c("friends", "alcohol", "neurotic")]) 

Notice that the commands in the brackets are identical to those we used to create alcoholPersonality
in the previous section.
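
If you are curious about what actually happens when a string variable sneaks into as.matrix() (turnips aside), the sketch below shows it. The two small dataframes are made up from a few of the lecturer values, and their names are invented for this illustration. As far as I know, R does not complain: it quietly converts everything to text, which is rarely what you want before doing sums.

```r
# Numeric variables only
personality <- data.frame(friends  = c(2, 0, 1),
                          alcohol  = c(15, 20, 30),
                          neurotic = c(17, 14, 21))
personalityMatrix <- as.matrix(personality)
is.matrix(personalityMatrix)   # a matrix
is.numeric(personalityMatrix)  # and still numbers
dim(personalityMatrix)         # rows, then columns

# A string variable mixed in with a numeric one
mixed <- data.frame(name    = c("Martin", "Andy", "Graham"),
                    alcohol = c(15, 20, 30))
mixedMatrix <- as.matrix(mixed)
is.character(mixedMatrix)      # everything has become text
mixedMatrix
```

Checking is.numeric() on a new matrix is a cheap way to catch this before passing the matrix to another function.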








Reshaping data


Once you have typed your data into R, Sod’s law says you’ll discover that it’s in the wrong 
format. Throughout this chapter we have taught you to use the wide format of data entry; 
however, there is another format known as the long or molten format. Figure 3.13 shows 
the difference between wide and long/molten format data. As we have seen, in wide format 
each person’s data is contained in a single row of the data. Scores on different variables are 
placed in different columns. In Figure 3.13, the first participant has a score of 32 on the first
variable, a score of 12 in the second variable and a score of 25 on the third. In long/molten
format, scores on different variables are placed in a single column. It’s as though the columns
representing the different variables have been ‘stacked’ on top of each other to make a single
column. Notice in the figure that the same scores are present but they are now in a single
column. So that we know to what variable a score belongs a new variable has been added
(called an index variable) that indicates whether the score was from the first, second or third
variable. If we look at our first participant again, we can see that his three scores of 32, 12
and 25 are still present, but they are in the same column now; the index variable tells us to
which variable the score relates. These formats are quite different, but fortunately there are
functions that convert between the two. This final section looks at these functions.

FIGURE 3.13: ‘Wide’ format data places each person’s scores on several variables in different
columns, whereas ‘long format’ or ‘molten’ data places scores for all variables in a single column.
(Labels in the figure: Index, Variable and Score; wide format is reached with unstack() or cast();
long format/‘molten’ data is reached with stack() or melt().)

Let’s look at an example of people who had their life satisfaction measured at four points 
in time (if you want to know more about this example, see section 19.7.2). The data are in 
the file Honeymoon Period.dat. Let’s first create a dataframe called satisfactionData based 
on this file by executing the following command: 

satisfactionData = read.delim("Honeymoon Period.dat", header = TRUE)

Figure 3.14 shows the contents of this dataframe. The data have been inputted in wide 
format: each row represents a person. Notice also that four different columns represent 
the repeated-measures variable of time. However, there might be a situation (such as in 
Chapter 19), where we need the variable Time to be represented by a single column (i.e., 
in long format). This format is shown in Figure 3.15. To put the hypothetical example in 
Figure 3.13 into a real context, let’s again compare the two data structures. 


FIGURE 3.14: The life satisfaction data

FIGURE 3.15: The life satisfaction data in the ‘long’ or ‘molten’ form (see footnote 9)

In the wide format (Figure 3.14), each person is represented by a single row of data.
Their life satisfaction is represented at four points in time by four columns. In contrast,
the long format (Figure 3.15) replaces the four columns representing different time points
with two new variables:


• An outcome variable: This variable contains the scores for each person at each time 
point. In this case it contains all of the values for life satisfaction that were previously 
contained in four columns. It is the column labelled ‘value’ in Figure 3.15. 

• An index variable: A variable that tells you from which column the data originate. It
is the column labelled ‘variable’ in Figure 3.15. Note that it takes on four values that
represent baseline, 6 months, 12 months and 18 months. As such, this variable contains
information about the time point to which each life satisfaction score belongs.

Each person’s data, therefore, is now represented by four rows (one for each time point)
instead of one. Variables such as Gender that are invariant over the time points have the
same value within each person at each time point; however, our outcome variable (life
satisfaction) does vary over the four time points (the four rows for each person).


9 If you look at your own data then you will probably see something a bit different because your data will be
ordered by variable. I wanted to show how each person had 4 rows of data so I created a new dataframe
(restructuredData.sorted) that sorted the data by Person rather than Time; I did this using:
restructuredData.sorted<-restructuredData[order(Person),].
FIGURE 3.16: The satisfaction data after running the stack() command.

To change between wide and long data formats we can use the melt() and cast() commands
from the reshape package, or for simple data sets we can use stack() and unstack().
Let’s look at stack() and unstack() first. These functions pretty much do what they say on
the tin: one of them stacks columns and the other unstacks them. We can use the stack()
function in the following general form:

newDataFrame<-stack(oldDataFrame, select = c(variable list)) 

In short, we create a new dataframe based on an existing one. The select = c() is optional, 
but is a way to select a subset of variables that you want to stack. So, for the current data, 
we want to stack only the life satisfaction scores (we do not want to stack Gender as well). 
Therefore, we could execute: 

satisfactionStacked<-stack(satisfactionData, select = c("Satisfaction_Base",
"Satisfaction_6_Months", "Satisfaction_12_Months", "Satisfaction_18_Months"))

This command will create a dataframe called satisfactionStacked, which contains the variables
Satisfaction_Base, Satisfaction_6_Months, Satisfaction_12_Months, and Satisfaction_18_Months
from the dataframe satisfactionData stacked up on top of each other. You can see
the result in Figure 3.16 or by executing:

satisfactionStacked 

Notice in Figure 3.16 that the scores for life satisfaction are now stored in a single column 
(called values), and an index variable (called ind) has been created that tells us from which 
column the data originate. If we want to undo our handiwork, we can use the unstack()
function in much the same way:

satisfactionUnstacked<-unstack(satisfactionStacked)

Executing this command creates a new dataframe called satisfactionUnstacked that is based
on unstacking the satisfactionStacked dataframe. In this case, R could make an intelligent
guess at how to unstack the data because we’d just used the stack() function to create it;
however, sometimes you will need to tell R how to do the unstacking. In this case, the command
takes the following general form:

newDataFrame<-unstack(oldDataFrame, scores ~ columns) 

in which scores is the name of the variable containing your scores (for 
our current dataframe this is values) and columns is the name of the 
variable that indicates the variable to which the score belongs (ind in 
the current dataframe). Therefore, to make sure it’s going to unstack in 
the way we want it to, we could fully specify the function as: 

satisfactionUnstacked<-unstack(satisfactionStacked, values ~ ind)

Note that values~ind tells R that within the satisfactionStacked 
dataframe, values contains the scores to be unstacked, and ind indicates 
the columns into which these scores are unstacked. 
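
If you don't have the Honeymoon Period.dat file to hand, you can still see stack() and unstack() at work with a tiny example. In the sketch below the dataframe satisfactionDemo and its scores are made up (three imaginary people, not the real data), and the names stackedDemo and unstackedDemo are invented for this illustration; only the column names follow the chapter.

```r
# Made-up wide-format data: one row per person
satisfactionDemo <- data.frame(
  Person                 = 1:3,
  Gender                 = c("F", "M", "F"),
  Satisfaction_Base      = c(6, 7, 4),
  Satisfaction_6_Months  = c(6, 7, 6),
  Satisfaction_12_Months = c(5, 8, 2),
  Satisfaction_18_Months = c(2, 4, 2)
)

# Stack only the four satisfaction columns (Person and Gender are left out)
stackedDemo <- stack(satisfactionDemo,
                     select = c("Satisfaction_Base", "Satisfaction_6_Months",
                                "Satisfaction_12_Months", "Satisfaction_18_Months"))
stackedDemo  # 'values' holds the scores, 'ind' says where each came from

# Unstack again, spelling out scores ~ columns
unstackedDemo <- unstack(stackedDemo, values ~ ind)
unstackedDemo
```

Notice that the unstacked data contain only the four satisfaction columns: stack() never saw Person or Gender, so unstack() cannot bring them back. That limitation is one reason to want more control over the restructuring.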

The stack() and unstack() functions are fine for simple operations, but to gain more control
over the data restructuring we should use the reshape package. To install and load this package
execute:

install.packages("reshape")
library(reshape)

This package contains two functions: melt() for ‘melting’ wide data into the long format,
and cast() for ‘casting’ so-called molten data (i.e., long format) into a new form (in our current
context we’ll cast it into a wide format, but you can do other things too).

To restructure the satisfactionData dataframe we create a new dataframe (which I have
unimaginatively called restructuredData). This dataframe is based on the existing data
(satisfactionData), but we use melt() to turn it into ‘molten’ data. This function takes the
general form:

newDataFrame<-melt(oldDataFrame, id = c(constant variables), measured = 
c(variables that change across columns)) 

We will have a look at each option in turn: 

• id: This option specifies any variables in the dataframe that do not vary over time. 
For these data we have two variables that don’t vary over time, the first is the person’s 
identifier (Person), and the second is their gender (Gender). We can specify these 
variables as id = c(“Person”, “Gender”). 

• measured: This option specifies the variables that do vary over time or are
repeated measures (i.e., scores within the same entity). In other words, it specifies
the names of variables currently in different columns that you would like to
be restructured so that they are in different rows. We have four columns that we
want to restructure (Satisfaction_Base, Satisfaction_6_Months, Satisfaction_12_Months,
Satisfaction_18_Months). These can be specified as: measured =
c(“Satisfaction_Base”, “Satisfaction_6_Months”, “Satisfaction_12_Months”, “Satisfaction_18_Months”).

If we piece all of these options together, we get the following command: 

restructuredData<-melt(satisfactionData, id = c("Person", "Gender"),
measured = c("Satisfaction_Base", "Satisfaction_6_Months",
"Satisfaction_12_Months", "Satisfaction_18_Months"))

If you execute this command, you should find that your data has been restructured to look
like Figure 3.15.

To get data from a molten state into the wide format we use the cast() function, which 
takes the general form: 

newData<-cast(moltenData, variables coded within a single column ~ 
variables coded across many columns, value = "outcome variable") 

This can be quite confusing. Essentially you write a formula that specifies on the left any 
variables that do not vary within an entity. These are the variables that we specified in the id 
option when we made the data molten. In other words, they are things that do not change 
(such as name, gender) and that you would enter as a coding variable in the wide format. On 
the right-hand side of the formula you specify any variable that represents something that 
changes within the entities in your data set. These are the variables that we specified in the 
measured option when we made the data molten. So these could be measures of the same variable
taken at different time points (such as in a repeated-measures or longitudinal design). In
other words, this is the variable that you would like to be split across multiple columns in the 
wide format. The final option, value, enables you to specify a variable in the molten data that 
contains the actual scores. In our current example we have only one outcome variable so we 
don’t need to include this option (R will work out which column contains the scores), but it 
is useful to know about if you have more complicated data sets that you want to restructure. 

If we look at the data that we have just melted (restructuredData), we have four variables
(Figure 3.15): 

• Person: This variable tells us to which person the data belong. Therefore, this variable 
does not change within an entity (it identifies them). 

• Gender: This variable tells us the gender of a person. This variable does not change 
within an entity (for a given person its value does not change). 

• variable: This variable identifies different time points at which life satisfaction was 
measured. As such it does vary within each person (note that each person has four 
different time points within the column labelled ‘variable’). 

• value: This variable contains the life satisfaction scores. 

Given that we put variables that don’t vary on the left of the formula and those that do on the 
right, we need to put Gender and Person on the left, and variable on the right; our formula 
will, therefore, be ‘Person + Gender ~ variable’. The variable called value contains the scores 
that we want to restructure, so we can specify this by including the option value = “value” 
(although note that because we have only one outcome variable we actually don’t need this 
option, I’m including it just so you understand what it does). Our final command will be: 

wideData<-cast(restructuredData, Person + Gender ~ variable, value = "value") 

Executing this command creates a new dataframe (wideData) that should, hopefully, look
a bit like Figure 3.14.



OLIVER TWISTED 

Please, Sir, can I 
have some more ... data 
restructuring? 


‘Why don’t you teach us about reshape()?’ taunts Oliver. ‘Is
it because your brain is the size of a grape?’ No, Oliver, it’s
because I think cast() and melt() are simpler. ‘Grape brain, grape
brain, grape brain...’ sings Oliver as I reach for my earplugs.
It is true that there is a reshape() function that can be used to
restructure data; there is a tutorial on the companion website.
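
For the impatient, here is a rough sketch of what the built-in reshape() function looks like in use. It is not the companion website's tutorial, just my own minimal illustration: the dataframe wideDemo and its scores are made up (three imaginary people), and the names timeCols, longDemo and wideAgain are invented here. I have called the new columns variable and value so that the result resembles the molten data described above.

```r
# Made-up wide-format data: one row per person
wideDemo <- data.frame(
  Person                 = 1:3,
  Gender                 = c("F", "M", "F"),
  Satisfaction_Base      = c(6, 7, 4),
  Satisfaction_6_Months  = c(6, 7, 6),
  Satisfaction_12_Months = c(5, 8, 2),
  Satisfaction_18_Months = c(2, 4, 2)
)
timeCols <- c("Satisfaction_Base", "Satisfaction_6_Months",
              "Satisfaction_12_Months", "Satisfaction_18_Months")

# Wide to long: the four columns in 'varying' are stacked into 'value'
longDemo <- reshape(wideDemo,
                    direction = "long",
                    varying   = timeCols,    # columns to stack
                    v.names   = "value",     # name for the column of scores
                    timevar   = "variable",  # name for the index variable
                    times     = timeCols,    # labels stored in the index variable
                    idvar     = "Person")    # identifies each person
longDemo

# Long to wide: one column per label found in 'variable'
wideAgain <- reshape(longDemo,
                     direction = "wide",
                     idvar     = "Person",
                     timevar   = "variable",
                     v.names   = "value")
wideAgain  # columns come back named value.Satisfaction_Base and so on
```

Six arguments to do what melt() does with two is, I think, a fair illustration of why the chapter sticks with melt() and cast().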
What have I discovered about statistics?


This chapter has provided a basic introduction to the R environment. We’ve seen that R is 
free software that you can download from the Internet. People with big brains contribute 
packages that enable you to carry out different tasks in R. They upload these packages to 
a mystical entity known as the CRAN, and you download them from there into your computer.
Once you have installed and loaded a package you can use the functions within it.

We also saw that R operates through written commands. When conducting tasks in 
R, you write commands and then execute them (either in the console window, or using 
a script file). It was noteworthy that we learned that we cannot simply write “R, can 
you analyse my data for me please” but actually have to use specific functions and commands.
Along the way, we discovered that R will do its best to place obstacles in our
way: it will pedantically fail to recognize functions and variables if they are not written 
exactly as they should be, it will spew out vitriolic error messages if we miss punctuation 
marks, and it will act aloof and uninterested if we specify incorrectly even the smallest 
detail. It believes this behaviour to be character building. 

You also created your first data set by specifying some variables and inputting some data. 
In doing so you discovered that we can code groups of people using numbers (coding variables)
and discovered that rows in the data represent different entities (or cases of data) and
columns represent different variables. Unless of course you use the long format, in which
case a completely different set of rules applies. That’s OK, though, because we learnt how to
transform data from wide to long format. The joy that brought to us can barely be estimated. 

We also discovered that I was scared of my new school. However, with the help of 
Jonathan Land my confidence grew. With this new confidence I began to feel comfortable
not just at school but in the world at large. It was time to explore.


R packages used in this chapter 


foreign 

Rcmdr 

R functions used in this chapter 

as.Date()

as.matrix()

print()

Key terms that I’ve discovered


Console window 

CRAN 

Dataframe 

Date variable 

Editor window 

Factor 

Function 

Graphics window 
Long format data 
Matrix 


Numeric variable 

Package 

Object 

Quartz window 
Script

String variable 
Wide format data 
Working directory 
Workspace 



Smart Alex’s tasks 


• Task 1: Smart Alex’s first task for this chapter is to save the data that you’ve entered 
in this chapter. Save it somewhere on the hard drive of your computer (or a USB 
stick if you’re not working on your own computer). Give it a sensible title and save 
it somewhere easy to find (perhaps create a folder called ‘My Data Files’ where you 
can save all of your files when working through this book). 


• Task 2: Your second task is to enter the data below. These data show the score (out 
of 20) for 20 different students, some of whom are male and some female, and some 
of whom were taught using positive reinforcement (being nice) and others who were 
taught using punishment (electric shock). Just to make it hard, the data should not be 
entered in the same way that they are laid out below: 


            Male                          Female
Electric Shock   Being Nice   Electric Shock   Being Nice
      15             10              6             12
      14              9              7             10
      20              8              5              7
      13              8              4              8
      13              7              8             13


• Task 3: Research has looked at emotional reactions to infidelity and found that men 
get homicidal and suicidal and women feel undesirable and insecure (Shackelford, 
LeBlanc, & Drass, 2000). Let’s imagine we did some similar research: we took some 
men and women and got their partners to tell them they had slept with someone else. 
We then took each person to two shooting galleries and each time gave them a gun 
and 100 bullets. In one gallery was a human-shaped target with a picture of their own 
face on it, and in the other was a target with their partner’s face on it. They were left 
alone with each target for 5 minutes and the number of bullets used was measured. 
The data are below; enter them into R and save them as Infidelity.csv (clue: they are 
not entered in the format in the table!). 










            Male                        Female
Partner’s Face   Own Face   Partner’s Face   Own Face
      69            33            70            97
      76            26            74            80
      70            10            64            88
      76            51            43           100
      72            34            51           100
      65            28            93            58
      82            27            48            95
      71             9            51            83
      71            33            74            97
      75            11            73            89
      52            14            41            69
      34            46            84            82



Answers can be found on the companion website. 


Further reading 


There are many good introductory R books on the market that go through similar material to this 
chapter. Here are a few:

Crawley, M. (2007). The R book. Chichester: Wiley. (A really good and thorough book. You could 
also try his Statistics: An Introduction Using R, published by Wiley in 2005.) 

Venables, W. N., & Smith, D. M., and the R Development Core Team (2002). An introduction to R.
Bristol: Network Theory.

Zuur, A. F., Ieno, E. N., & Meesters, E. H. W. G. (2009). A beginner’s guide to R. Dordrecht:
Springer-Verlag.

There are also many good web resources: 

• The main project website: http://www.r-project.org/ 

• Quick-R, a particular favourite of mine, is an excellent introductory website:
http://www.statmethods.net/index.htm

• John Fox’s R Commander website: http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/ 






Exploring data with graphs 



FIGURE 4.1 Explorer Field borrows a bike and gets ready to ride it recklessly around a caravan site



4.1. What will this chapter tell me?


As I got a bit older I used to love exploring. At school they would teach you about maps 
and how important it was to know where you were going and what you were doing. I 
used to have a more relaxed view of exploration and there is a little bit of a theme of me 
wandering off to whatever looked most exciting at the time. I got lost at a holiday camp 
once when I was about 3 or 4. I remember nothing about this but apparently my parents 
were frantically running around trying to find me while I was happily entertaining myself 
(probably by throwing myself head first out of a tree or something). My older brother, who 
was supposed to be watching me, got a bit of flak for that but he was probably working out
equations to bend time and space at the time. He did that a lot when he was 7. The careless 
explorer in me hasn’t really gone away: in new cities I tend to just wander off and hope for 
the best, and usually get lost and fortunately usually don’t die (although I tested my luck 
once by wandering through part of New Orleans where apparently tourists get mugged a 
lot - it seemed fine to me). When exploring data you can’t afford not to have a map; to 
explore data in the way that the 6-year-old me used to explore the world is to spin around 
8000 times while drunk and then run along the edge of a cliff. Wright (2003) quotes 
Rosenthal who said that researchers should ‘make friends with their data’. This wasn’t 
meant to imply that people who use statistics may as well befriend their data because the 
data are the only friend they’ll have; instead Rosenthal meant that researchers often rush 
their analysis. Wright makes the analogy of a fine wine: you should savour the bouquet and 
delicate flavours to truly enjoy the experience. That’s perhaps overstating the joys of data 
analysis, but rushing your analysis is, I suppose, a bit like gulping down a bottle of wine: 
the outcome is messy and incoherent! To negotiate your way around your data you need a 
map. Maps of data are called graphs, and it is into this tranquil and tropical ocean that we 
now dive (with a compass and ample supply of oxygen, obviously). 


4.2. The art of presenting data

Why do we need graphs?


Graphs are a really useful way to look at your data before you get to 
the nitty-gritty of actually analysing them. You might wonder why you 
should bother drawing graphs - after all, you are probably drooling 
like a rabid dog to get into the statistics and to discover the answer 
to your really interesting research question. Graphs are just a waste 
of your precious time, right? Data analysis is a bit like Internet dating 
(actually it’s not, but bear with me): you can scan through the vital 
statistics and find a perfect match (good IQ, tall, physically fit, likes 
arty French films, etc.) and you’ll think you have found the perfect 
answer to your question. However, if you haven’t looked at a picture, 
then you don’t really know how to interpret this information - your 
perfect match might turn out to be Rimibald the Poisonous, King of the 
Colorado River Toads, who has genetically combined himself with a 
human to further his plan to start up a lucrative rodent farm (they like 
to eat small rodents).¹ Data analysis is much the same: inspect your data with a picture, see
how it looks and only then think about interpreting the more vital statistics. 




What makes a good graph?


Before we get down to the nitty-gritty of how to draw graphs in R, I want to begin by 
talking about some general issues when presenting data. R (and other packages) make 
it very easy to produce very snazzy-looking graphs, and you may find yourself losing 


¹ On the plus side, he would have a long sticky tongue and if you smoke his venom (which, incidentally, can kill
a dog) you’ll hallucinate (if you’re lucky, you’d hallucinate that he wasn’t a Colorado river toad-human hybrid).









consciousness at the excitement of colouring your graph bright pink (really, it’s amazing
how excited my undergraduate psychology students get at the prospect of bright pink
graphs - personally I’m not a fan of pink). Much as pink graphs might send a twinge of 
delight down your spine, I want to urge you to remember why you’re doing the graph - 
it’s not to make yourself (or others) purr with delight at the pinkness of your graph, it’s to 
present information (dull, perhaps, but true). 

Tufte (2001) wrote an excellent book about how data should be presented. He points out 
that graphs should, among other things: 

• Show the data. 

• Induce the reader to think about the data being presented (rather than some other 
aspect of the graph, like how pink it is). 

• Avoid distorting the data. 

• Present many numbers with minimum ink. 

• Make large data sets (assuming you have one) coherent. 

• Encourage the reader to compare different pieces of data. 

• Reveal data. 


However, graphs often don’t do these things (see Wainer, 1984, for some examples). 

Let’s look at an example of a bad graph. When searching around for the worst example 
of a graph that I have ever seen, it turned out that I didn’t need to look any further than 
myself - it’s in the first edition of the SPSS version of this book (Field, 2000). Overexcited 
by SPSS’s ability to put all sorts of useless crap on graphs (like 3-D effects, fill effects and 
so on - Tufte calls these chartjunk), I literally went into some weird orgasmic state and 
produced an absolute abomination (I’m surprised Tufte didn’t kill himself just so he could 
turn in his grave at the sight of it). The only consolation was that because the book was 
published in black and white, it’s not pink! The graph is reproduced in Figure 4.2 (you
should compare this to the more sober version in this edition, Figure 16.4). What’s wrong
with this graph?

FIGURE 4.2 A cringingly bad example of a graph from the first edition of the SPSS version of this book (error bars show 95% CI of the mean; bars show means)

✗ The bars have a 3-D effect: Never use 3-D plots for a graph plotting two variables: it
obscures the data.² In particular it makes it hard to see the values of the bars because
of the 3-D effect. This graph is a great example because the 3-D effect makes the
error bars almost impossible to read.

✗ Patterns: The bars also have patterns, which, although very pretty, merely distract the
eye from what matters (namely the data). These are completely unnecessary.

✗ Cylindrical bars: What’s that all about, eh? Again, they muddy the data and distract
the eye from what is important.

✗ Badly labelled y-axis: ‘number’ of what? Delusions? Fish? Cabbage-eating sea lizards
from the eighth dimension? Idiots who don’t know how to draw graphs?

Now take a look at the alternative version of this graph (Figure 4.3). Can you see what 
improvements have been made? 

✓ A 2-D plot: The completely unnecessary third dimension is gone, making it much
easier to compare the values across therapies and thoughts/behaviours.

✓ The y-axis has a more informative label: We now know that it was the number of
obsessive thoughts or actions per day that was being measured.

✓ Distractions: There are fewer distractions like patterns, cylindrical bars and the like!

✓ Minimum ink: I’ve got rid of superfluous ink by getting rid of the axis lines and by
using lines on the bars rather than grid lines to indicate values on the y-axis. Tufte
would be pleased.³


FIGURE 4.3 Figure 4.2 drawn properly (x-axis: Therapy, with CBT, BT and No Treatment; separate bars for Thoughts and Actions; error bars show 95% CI)


² If you do 3-D plots when you’re plotting only two variables then a bearded statistician will come to your house,
lock you in a room and make you write I μυστ νοτ δο 3-Δ γραφσ 75,172 times on the blackboard. Really, they will.

³ Although he probably over-prescribes this advice: grid lines are more often than not very useful for interpreting
the data.

Lies, damned lies, and ... erm ... graphs


Governments lie with statistics, but scientists shouldn’t. How you present your data makes 
a huge difference to the message conveyed to the audience. As a big fan of cheese, I’m often 
curious about whether the urban myth that it gives you nightmares is true. Shee (1964) 
reported the case of a man who had nightmares about his workmates: ‘He dreamt of one,
terribly mutilated, hanging from a meat-hook.⁴ Another he dreamt of falling into a bottomless
abyss. When cheese was withdrawn from his diet the nightmares ceased.’ This would
not be good news if you were the minister for cheese in your country. 


FIGURE 4.4 Two graphs about cheese (x-axis in both panels: Cheese Group)



Figure 4.4 shows two graphs that, believe it or not, display exactly the same data: the 
number of nightmares had after eating cheese. The left-hand panel shows how the graph 
should probably be scaled. The y-axis reflects the maximum of the scale, and this creates
the correct impression: that people have more nightmares about colleagues hanging
from meat-hooks if they eat cheese before bed. However, as minister for cheese, you want 
people to think the opposite; all you have to do is rescale the graph (by extending the 
y-axis way beyond the average number of nightmares) and there suddenly seems to be little 
difference. Tempting as it is, don’t do this (unless, of course, you plan to be a politician at 
some point in your life). 


CRAMMING SAM’S TIPS 


Graphs


✓ The vertical axis of a graph is known as the y-axis of the graph.
✓ The horizontal axis of a graph is known as the x-axis of the graph.


If you want to draw a good graph follow the cult of Tufte:


✓ Don’t create false impressions of what the data actually show (likewise, don’t hide effects!) by scaling the y-axis in some
weird way.

✓ Abolish chartjunk: Don’t use patterns, 3-D effects, shadows, pictures of spleens, photos of your Uncle Fred or anything else.
✓ Avoid excess ink: don’t include features unless they are necessary to interpret or understand the data.


⁴ I have similar dreams, but that has more to do with some of my workmates than cheese!

4.3. Packages used in this chapter


The basic version of R comes with a plot() function, which can create a wide variety of 
graphs (type ?plot in the command line for details) and the lattice package is also helpful.
However, throughout this chapter I use Hadley Wickham’s ggplot2 package (Wickham, 
2009). I have chosen to focus on this package because I like it. I wouldn’t take it out for 
a romantic meal, but I do get genuinely quite excited by some of the stuff it can do. Just 
to be very clear about this, it’s a very different kind of excitement than that evoked by a 
romantic meal with my wife. 

The ggplot2 package excites me because it is a wonderfully versatile tool. It takes a bit of 
time to master it (I still haven’t really got to grips with the finer points of it), but once you 
have, it gives you an extremely flexible framework for displaying and annotating data. The 
second great thing about ggplot2 is it is based on Tufte’s recommendations about displaying
data and Wilkinson’s grammar of graphics (Wilkinson, 2005). Therefore, with basically no
editing we can create Tufte-pleasing graphs. You can install ggplot2 by executing the following
command:

install.packages("ggplot2") 

You then need to activate it by executing the command: 
library(ggplot2) 
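
If you rerun your script often, you may not want to reinstall the package every time. One possible way round this, offered as a sketch rather than the official route, is to check whether the package is already available before installing it (requireNamespace() is part of base R; the check itself is my own addition, not something the chapter requires):

```r
# Install ggplot2 only if it is not already available on this machine,
# then activate it as usual.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
library(ggplot2)
```

Either way you end up in the same place: the package is installed once and activated in every new session.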


4.4. Introducing ggplot2


There are two ways to plot graphs with ggplot2: (1) do a quick plot using the qplot() function;
and (2) build a plot layer by layer using the ggplot() function. Undoubtedly the qplot()
function will get you started quicker; however, the ggplot() function offers greater versatility
so that is the function that I will use throughout the chapter. I like a challenge.

There are several concepts to grasp that help you to understand how ggplot2 builds 
graphs. Personally, I find some of the terminology a bit confusing so I apologize if occasionally
I use different terms than those you might find in the ggplot2 documentation.


The anatomy of a plot


A graph is made up of a series of layers. You can think of a layer as a plastic transparency 
with something printed on it. That ‘something’ could be text, data points, lines, bars, pictures
of chickens, or pretty much whatever you like. To make a final image, these transparencies
are placed on top of each other. Figure 4.5 illustrates this process: imagine you begin
with a transparent sheet that has the axes of the graph drawn on it. On a second transparent
sheet you have bars representing different mean scores. On a third transparency you
have drawn error bars associated with each of the means. To make the final graph, you put 
these three layers together: you start with the axes, lay the bars on top of that, and finally 
lay the error bars on top of that. The end result is an error bar graph. You can extend the 
idea of layers beyond the figure: you could imagine having a layer that contains labels for 
the axes, or a title, and again, you simply lay these on top of the existing image to add more 
features to the graph. 

As can be seen in Figure 4.5, each layer contains visual objects such as bars, data points, 
text and so on. Visual elements are known as geoms (short for ‘geometric objects’) in
ggplot2. Therefore, when we define a layer, we have to tell R what geom we want displayed
on that layer (do we want a bar, line, dot, etc.?). These geoms also have aesthetic properties
that determine what they look like and where they are plotted (do we want red bars or
green ones? do we want our data point to be a triangle or a square? etc.). These aesthetics
(aes() for short) control the appearance of graph elements (for example, their colour, size,
style and location). Aesthetics can be defined in general for the whole plot, or individually 
for a specific layer. We’ll come back to this point in due course. 


FIGURE 4.5 In ggplot2 a plot is made up of layers



To recap, the finished plot is made up of layers, each layer contains some geometric
element (such as bars, points, lines, text) known as a geom, and the appearance and location
of these geoms (e.g., size, colour, shape used) is controlled by the aesthetic properties
(aes()). These aesthetics can be set for all layers of the plot (i.e., defined in the plot as a
whole) or can be set individually for each geom in a plot (Figure 4.6). We will learn more
about geoms and aesthetics in the following sections.
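
To make the transparency idea a little more concrete, here is a rough sketch of the three-sheet error bar graph described above. The dataframe scores and every number in it are invented purely for illustration (they are not data from this book), and stat = "identity" simply tells the bars to use the means as given:

```r
library(ggplot2)

# Hypothetical summary data, made up only so there is something to draw
scores <- data.frame(
  group = c("A", "B", "C"),
  mean  = c(10, 14, 12),
  lower = c(8, 12, 9),
  upper = c(12, 16, 15)
)

# First sheet: the plot object, holding the data and the axes
bars <- ggplot(scores, aes(x = group, y = mean))

# Second sheet: bars showing the means exactly as supplied
bars <- bars + geom_bar(stat = "identity", fill = "grey70")

# Third sheet: error bars laid on top of the bars
bars <- bars + geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2)

bars  # display the finished graph
```

Each `+` lays one more sheet on the pile, which is why the order of the commands mirrors the order in which the transparencies are stacked.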

FIGURE 4.6 The anatomy of a graph


Geometric objects (geoms)


There are a variety of geom functions that determine what kind of geometric object is 
printed on a layer. Here is a list of a few of the more common ones that you might use (for 
a full list see the ggplot2 website http://had.co.nz/ggplot2/): 

• geom_bar() : creates a layer with bars representing different statistical properties. 

• geom_point(): creates a layer showing the data points (as you would see on a 
scatterplot). 

• geom_line(): creates a layer that connects data points with a straight line.

• geom_smooth(): creates a layer that contains a ‘smoother’ (i.e., a line that summarizes 
the data as a whole rather than connecting individual data points). 

• geom_histogram(): creates a layer with a histogram on it. 

• geom_boxplot(): creates a layer with a box-whisker diagram. 

• geom_text() : creates a layer with text on it. 

• geom_density(): creates a layer with a density plot on it. 

• geom_errorbar() : creates a layer with error bars displayed on it. 

• geom_hline() and geom_vline(): create a layer with a user-defined horizontal or vertical
line, respectively.

Notice that each geom is followed by ‘()’, which means that it can accept aesthetics that
specify how the layer looks. Some of these aesthetics are required and others are optional. 
For example, if you want to use the text geom then you have to specify the text that you 
want to print and the position at which you want to print it (using x and y coordinates), 
but you do not have to specify its colour. 
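
As a small illustration of required versus optional aesthetics, the sketch below uses the text geom. The dataframe pts and its contents are hypothetical; the point is only that x, y and label must be supplied, whereas colour is left to take its default:

```r
library(ggplot2)

# Hypothetical coordinates and labels, invented for this example
pts <- data.frame(
  x    = c(1, 2, 3),
  y    = c(2, 4, 3),
  name = c("a", "b", "c")
)

# x and y are set for the whole plot; label is required by geom_text()
# and is supplied within that geom. No colour is given, so the default is used.
textPlot <- ggplot(pts, aes(x, y)) +
  geom_point() +
  geom_text(aes(label = name), vjust = -1)

textPlot
```

Leaving out label here should make ggplot2 complain that a required aesthetic is missing, whereas leaving out colour causes no trouble at all.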

In terms of required aesthetics, the bare minimum is that each geom needs you to specify 
the variable or variables that the geom represents. It should be self-evident that ggplot2 
can’t create the geom without knowing what it is you want to plot! Optional aesthetics 
take on default values but you can override these defaults by specifying a value. These are 
attributes of the geom such as the colour of the geom, the colour to fill the geom, the type 
of line to use (solid, dashed, etc.), the shape of the data point (triangle, square, etc.), the
size of the geom, and the transparency of the geom (known as alpha). Table 4.1 lists some
common geoms and their required and optional aesthetic properties. Note that many of
these aesthetics are common across geoms: for example, alpha, colour, linetype and fill can
be specified for most of the geoms listed in the table.


Table 4.1 Aesthetic properties associated with some commonly used geoms

geom_bar()
  Required: x: the variable to plot on the x-axis
  Optional: colour, size, fill, linetype, weight, alpha

geom_point()
  Required: x: the variable to plot on the x-axis
            y: the variable to plot on the y-axis
  Optional: shape, colour, size, fill, alpha

geom_line()
  Required: x: the variable to plot on the x-axis
            y: the variable to plot on the y-axis
  Optional: colour, size, linetype, alpha

geom_smooth()
  Required: x: the variable to plot on the x-axis
            y: the variable to plot on the y-axis
  Optional: colour, size, fill, linetype, weight, alpha

geom_histogram()
  Required: x: the variable to plot on the x-axis
  Optional: colour, size, fill, linetype, weight, alpha

geom_boxplot()
  Required: x: the variable to plot
            ymin: lower limit of ‘whisker’
            ymax: upper limit of ‘whisker’
            lower: lower limit of the ‘box’
            upper: upper limit of the ‘box’
            middle: the median
  Optional: colour, size, fill, weight, alpha

geom_text()
  Required: x: the horizontal coordinate of where the text should be placed
            y: the vertical coordinate of where the text should be placed
            label: the text to be printed
            (all of these can be single values or variables containing coordinates and labels for multiple items)
  Optional: colour, size, angle, hjust (horizontal adjustment), vjust (vertical adjustment), alpha

geom_density()
  Required: x: the variable to plot on the x-axis
            y: the variable to plot on the y-axis
  Optional: colour, size, fill, linetype, weight, alpha

geom_errorbar()
  Required: x: the variable to plot
            ymin, ymax: lower and upper value of error bar
  Optional: colour, size, linetype, width, alpha

geom_hline(), geom_vline()
  Required: yintercept = value
            xintercept = value
            (where value is the position on the x- or y-axis where you want the vertical/horizontal line)
  Optional: colour, size, linetype, alpha


Aesthetics


We have already seen that aesthetics control the appearance of elements within a geom or 
layer. As already mentioned, you can specify aesthetics for the plot as a whole (such as the 
variables to be plotted, the colour, shape, etc.) and these instructions will filter down to 
any geoms in the plot. However, you can also specify aesthetics for individual geoms/layers 
and these instructions will override those of the plot as a whole. It is efficient, therefore, to 
specify things like the data to be plotted when you create the plot (because most of the time 
you won’t want to plot different data for different geoms) but to specify idiosyncratic features
of the geom’s appearance within the geom itself. Hopefully, this process will become
clear in the next section. 

For now, we will simply look at how to specify aesthetics in a general sense. Figure 4.7
shows the ways in which aesthetics are specified. First, aesthetics can be set to a specific value 
(e.g., a colour such as red) or can be set to vary as a function of a variable (e.g., displaying 
data for different experimental groups in different colours). If you want to set an aesthetic to 
a specific value then you don’t specify it within the aes() function, but if you want an aesthetic
to vary then you need to place the instruction within aes(). Finally, you can set both specific
and variable aesthetics at the layer or geom level of the plot, but you cannot set specific values
at the plot level. In other words, if we want to set a specific value of an aesthetic we must do
it within the geom() that we’re using to create the particular layer of the plot.

FIGURE 4.7 Specifying aesthetics in ggplot2
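
The difference between setting an aesthetic to a specific value and letting it vary with a variable is easier to see side by side. The following is only a sketch: the dataframe d and the variable gender are made up for the purpose.

```r
library(ggplot2)

# Hypothetical data, invented for illustration
d <- data.frame(
  x      = 1:4,
  y      = c(2, 3, 5, 4),
  gender = c("Male", "Female", "Male", "Female")
)

# Variable aesthetic: colour sits inside aes(), so it follows the gender variable
mapped <- ggplot(d, aes(x, y)) + geom_point(aes(colour = gender))

# Specific value: colour sits outside aes(), within the geom, so every point is red
fixed <- ggplot(d, aes(x, y)) + geom_point(colour = "red")
```

In the first plot ggplot2 chooses one colour per group and draws a legend; in the second there is nothing to explain, so no legend appears.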

Table 4.2 lists the main aesthetics and how to specify each one. There are, of course, 
others, and I can’t cover the entire array of different aesthetics, but I hope to give you an 
idea of how to change some of the more common attributes that people typically want to 
change. For a comprehensive guide read Hadley Wickham’s book (Wickham, 2009). It
should be clear from Table 4.2 that most aesthetics are specified simply by writing the name
of the aesthetic, followed by an equals sign, and then something that sets the value: this can
be a variable (e.g., colour = gender, which would produce different coloured aesthetics for
males and females) or a specific value (e.g., colour = "Red").


Table 4.2 Specifying optional aesthetics

Aesthetic   Option                  Outcome

Linetype    linetype = 1            Solid line (default)
            linetype = 2            Hashed
            linetype = 3            Dotted
            linetype = 4            Dot and hash
            linetype = 5            Long hash
            linetype = 6            Dot and long hash

Size        size = value            Replace ‘value’ with a value in mm (default size = 0.5).
                                    Larger values than 0.5 give you fatter lines/larger text/bigger
                                    points than the default whereas smaller values will produce
                                    thinner lines/smaller text and points than the default.
            e.g., size = 0.25       Produces lines/points/text of 0.25 mm

Shape       shape = integer,        The integer is a value between 0 and 25, each of which
            shape = "x"             specifies a particular shape. Some common examples are
                                    below. Alternatively, specify a single character in quotes to
                                    use that character (shape = "A" will plot each point as the
                                    letter A).
            shape = 0               Hollow square (15 is a filled square)
            shape = 1               Hollow circle (16 is a filled circle)
            shape = 2               Hollow triangle (17 is a filled triangle)
            shape = 3               ‘+’
            shape = 5               Hollow rhombus (18 for filled)
            shape = 6               Hollow inverted triangle

Colour      colour = "Name"         Simply type the name of a standard colour. For example,
                                    colour = "Red" will make the geom red.
            colour = "#RRGGBB"      Specify exact colours using the RRGGBB system. For
                                    example, colour = "#3366FF" produces a shade of blue,
                                    whereas colour = "#336633" produces a dark green.

Alpha       alpha(colour, value)    Colours can be made transparent by specifying alpha, which
                                    can range from 0 (fully transparent) to 1 (fully opaque). For
                                    example, alpha("Red", 0.5) will produce a half transparent
                                    red.
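
A few of the options in Table 4.2 can be tried out together. This is a minimal sketch using a made-up dataframe d; the particular values chosen (a hashed line, filled triangles, a half transparent red) are just examples taken from the table:

```r
library(ggplot2)

# Hypothetical data, invented for illustration
d <- data.frame(x = 1:5, y = c(1, 3, 2, 5, 4))

styled <- ggplot(d, aes(x, y)) +
  # linetype = 2 gives a hashed line; the colour is an exact RRGGBB shade of blue
  geom_line(linetype = 2, colour = "#3366FF") +
  # shape = 17 is a filled triangle; alpha() makes the red half transparent
  geom_point(shape = 17, size = 3, colour = alpha("red", 0.5))

styled
```

Because all of these are specific values rather than variables, none of them is wrapped in aes().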

The anatomy of the ggplot() function


The command structure in ggplot2 follows the anatomy of the plot described above in a 
very literal way. You begin by creating an object that specifies the plot. You can, at this stage,
set any aesthetic properties that you want to apply to all layers (and geoms) within the plot. 
Therefore, it is customary to define the variables that you want to plot at this top level. A 
general version of the command might look like this: 

myGraph <- ggplot(myData, aes(variable for x axis, variable for y axis)) 

In this example, we have created a new graph object called myGraph, we have told ggplot 
to use the dataframe called myData, and we put the names of the variables to be plotted on 
the x (horizontal) and y (vertical) axis within the aes() function. Not to labour the point, 
but we could also set other aesthetic values at this top level. As a simple example, if we 
wanted our layers/geoms to display data from males and females in different colours then 
we could specify (assuming the variable gender defines whether a datum came from a man 
or a woman): 

myGraph <- ggplot(myData, aes(variable for x axis, variable for y axis,
colour = gender))

In doing so any subsequent geom that we define will take on the aesthetic of producing 
different colours for males and females, assuming that this is a valid aesthetic for the particular
geom (if not, the colour specification is ignored) and that we don’t override it by
defining a different colour aesthetic within the geom itself. 

At this level you can also define options using the opts() function. The most common 
option to set at this level is a title: 

+ opts(title = "Title")

Whatever text you put in the quotations will appear as your title exactly as you have typed 
it, so punctuate and capitalize appropriately. 
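
A word of caution, offered as an aside: in ggplot2 releases that appeared after this chapter was written, opts() has been retired, so the command above may not run on an up-to-date installation. In those versions a title is normally added with labs() (or ggtitle()). A minimal sketch, assuming a recent ggplot2 and a made-up dataframe d:

```r
library(ggplot2)

# Hypothetical data, invented for illustration
d <- data.frame(x = 1:3, y = c(2, 4, 3))

# labs(title = ...) plays the role that opts(title = ...) plays in the text
titled <- ggplot(d, aes(x, y)) +
  geom_point() +
  labs(title = "Title")

titled
```

The advice about punctuation and capitalization applies just the same: the title appears exactly as typed.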

So far we have created only the graph object: there are no graphical elements, and if you 
try to display myGraph you’ll get an error. We need to add layers to the graph containing 
geoms or other elements such as labels. To add a layer we literally use the ‘add’ symbol ( + ). 
So, let’s assume we want to add bars to the plot, we can execute this command: 

myGraph + geom_bar() 

This command takes the object myGraph that we have already created, and adds a layer 
containing bars to it. Now that there are graphical elements, ggplot2 will print the graph to 
a window on your screen. If we want to also add points representing the data to this graph 
then we add ‘+ geom_point()’ to the command and rerun it: 

myGraph + geom_bar() + geom_point() 

As you can see, every time you use a ‘+’ you add a layer to the graph, so the above example
now has two layers: bars and points. You can add any or all of the geoms that we have 
already described to build up your graph layer by layer. Whenever we specify a geom we 
can define an aesthetic for it that overrides any aesthetic setting for the plot as a whole. So, 
let’s say we have defined a new graph as: 

myGraph <- ggplot(myData, aes(variable for x axis, variable for y axis, 
colour = gender)) 

but we want to add points that are blue (and do not vary by gender), then we can do 
this as: 

myGraph + geom_point(colour = "Blue") 

Note that because we’ve set a specific value we have not used the aes() function to set the
colour. If we wanted our points to be blue triangles then we could simply add the shape 
command into the geom specification too: 

myGraph + geom_point(shape = 17, colour = "Blue")

We can also add a layer containing things other than geoms. For example, axis labels can
be added by using the labs() function:

myGraph + geom_bar() + geom_point() + labs(x = "Text", y = "Text")

in which you replace the word “Text” (again keep the quotations) with the label that you 
want. You can also apply themes, faceting and options in a similar manner (see sections 
4.4.6 and 4.10). 
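
Pulling the pieces of this section together, here is a sketch with real (if invented) variables in place of the placeholders. The dataframe myData below is hypothetical, as are the axis labels; the structure of the commands is the one described above:

```r
library(ggplot2)

# Hypothetical data standing in for myData
myData <- data.frame(
  x      = c(1, 2, 3, 4),
  y      = c(3, 5, 4, 6),
  gender = c("Male", "Female", "Male", "Female")
)

# Plot-level aesthetics: axes plus a colour that varies by gender
myGraph <- ggplot(myData, aes(x, y, colour = gender))

# This layer inherits colour = gender from the plot
byGender <- myGraph + geom_point()

# This layer overrides the plot-level colour with a specific value,
# and a further layer supplies the axis labels
blueTriangles <- myGraph +
  geom_point(shape = 17, colour = "Blue") +
  labs(x = "Predictor", y = "Outcome")
```

Note that myGraph itself is never changed: each line builds a new plot from it, so the same starting object can be reused for as many variations as you like.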


Stats and geoms


We have already encountered various geoms that map onto common plots used in research: 
geom_histogram, geom_boxplot, geom_smooth, geom_bar etc. (see Table 4.1). At face value
it seems as though these geoms require you to generate the data necessary to plot them. For 
example, the boxplot geom requires that you tell it the minimum and maximum values of 
the box and the whiskers as well as the median. Similarly, the errorbar geom requires you 
to feed in the minimum and maximum values of the bars. Entering the values that the geom 
needs is certainly an option, but more often than not you’ll want to just create plots directly 
from the raw data without having to faff about computing summary statistics. Luckily, 
ggplot2 has some built-in functions called ‘stats’ that can be either used by a geom to get
the necessary values to plot, or used directly to create visual elements on a layer of a plot. 

Table 4.3 shows a selection of stats that geoms use to generate plots. I have focused only 
on the stats that will actually be used in this book, but there are others (for a full list see 
http://had.co.nz/ggplot2/). Mostly, these stats work behind the scenes: a geom uses them 
without you knowing about it. However, it’s worth knowing about them because they 
enable you to adjust the properties of a plot. For example, imagine we want to plot a histogram,
we can set up our plot object (myHistogram) as:

myHistogram <- ggplot(myData, aes(variable))

which has been defined as plotting the variable called variable from the dataframe myData. 
As we saw in the previous section, if we want a histogram, then we simply add a layer to 
the plot using the histogram geom: 

myHistogram + geom_histogram() 

That’s it: a histogram will magically appear. However, behind the scenes the histogram 
geom is using the bin stat to generate the necessary data (i.e., to bin the data). We could get 
exactly the same histogram by writing: 

myHistogram + geom_histogram(aes(y = ..count..)) 

The aes(y = ..count..) is simply telling geom_histogram to set the y-axis to be the count 
output variable from the bin stat, which geom_histogram will do by default. As we can see 
from Table 4.3, there are other variables we could use though. Let’s say we wanted our 
histogram to show the density rather than the count. Then we can’t rely on the defaults 
and we would have to specify that geom_histogram plots the density output variable from 
the bin stat on the y-axis: 

myHistogram + geom_histogram(aes(y = ..density..)) 
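
If you want to check what the bin stat has actually computed, the following sketch may help. It does not use the book's data: myData here is a made-up dataframe of 200 simulated scores. It also assumes a reasonably recent ggplot2 (3.3.0 or later), in which after_stat() is the newer way of writing the ..count.. and ..density.. notation used above. The layer_data() function returns the values that the stat generated for a layer.

```r
library(ggplot2)

# Hypothetical stand-in for myData: 200 simulated scores
set.seed(42)
myData <- data.frame(variable = rnorm(200))

myHistogram <- ggplot(myData, aes(variable))

# after_stat(count) and after_stat(density) refer to output variables of the bin stat
countPlot   <- myHistogram + geom_histogram(aes(y = after_stat(count)), bins = 30)
densityPlot <- myHistogram + geom_histogram(aes(y = after_stat(density)), bins = 30)

# Inspect what the bin stat produced for the density version
binData <- layer_data(densityPlot)

# The counts add up to the number of scores; the densities integrate to 1
totalCount <- sum(binData$count)
totalArea  <- sum(binData$density * (binData$xmax - binData$xmin))
```

Printing countPlot or densityPlot draws the histograms; totalCount and totalArea are only there to show how the count and density output variables relate to each other.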



CHAPTER 4 EXPLORING DATA WITH GRAPHS 




Table 4.3 Some of the built-in ‘stats’ in ggplot2

| Stat | Function | Output variables | Useful parameters | Associated geom |
|---|---|---|---|---|
| bin | Bins data | count: number of points in bin; density: density of points in bin, scaled to integrate to 1; ncount: count, scaled to maximum of 1; ndensity: density, scaled to maximum of 1 | binwidth: bin width; breaks: override bin width with specific breaks to use; width: width of bars | histogram |
| boxplot | Computes the data necessary to plot a boxplot | width: width of boxplot; ymin: lower whisker; lower: lower hinge, 25% quantile; middle: median; upper: upper hinge, 75% quantile; ymax: upper whisker | | boxplot |
| density | Density estimation | density: density estimate; count: density × number of points; scaled: density estimate, scaled to maximum of 1 | | density |
| qq | Compute data for Q-Q plots | sample: sample quantiles; theoretical: theoretical quantiles | quantiles | point |
| smooth | Create a smoother plot | y: predicted value; ymin: lower pointwise CI around the mean; ymax: upper pointwise CI around the mean; se: standard error | method: e.g., lm, glm, gam, loess; formula: formula for smoothing; se: display CI (true by default); level: level of CI to use (0.95 by default) | smooth |
| summary | Summarize data | | fun.y: determines the function to plot on the y-axis (e.g., fun.y = mean) | bar, errorbar, pointrange, |

Similarly, by default, geom_histogram uses a bin width of the range of scores divided by 30.
We can use the parameters of the bin stat to override this default: 

myHistogram + geom_histogram(aes(y = ..count..), binwidth = 0.4) 
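
To see the effect of the binwidth parameter without plotting anything, here is a small sketch. Again the data are invented (100 simulated scores between 0 and 4), so treat the object names as illustrative assumptions rather than anything from the book's files. It compares the bins produced by 30 bins with those produced by binwidth = 0.4.

```r
library(ggplot2)

# Hypothetical data: 100 simulated scores between 0 and 4
set.seed(1)
myData <- data.frame(variable = runif(100, min = 0, max = 4))

myHistogram <- ggplot(myData, aes(variable))

# Bins produced with 30 bins versus an explicit bin width of 0.4
defaultBins <- layer_data(myHistogram + geom_histogram(bins = 30))
customBins  <- layer_data(myHistogram + geom_histogram(binwidth = 0.4))

# Each bin in the second version should be 0.4 units wide
binWidths <- customBins$xmax - customBins$xmin
```

Wider bins mean fewer of them, so customBins should have noticeably fewer rows than defaultBins.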

As such, it is helpful to have in mind the relationship between geoms and stats when plotting
graphs. As we go through the chapter you will see how stats can be used to control what
is produced by a geom, but also how stats can be used directly to make a layer of a plot.


Avoiding overplotting


Plots can become cluttered or unclear because (1) there is too much data to present in a 
single plot, and (2) data on a plot overlap. There are several positioning tools in ggplot2 
that help us to overcome these problems. The first is a position adjustment, defined very 
simply as: 

position = "x" 

in which x is one of five words: 

• dodge: positions objects so that there is no overlap at the side. 

• stack and fill: positions objects so that they are stacked. The stack instruction stacks
objects on top of one another, so that their heights add up. The fill instruction also
stacks objects on top of one another, but rescales each stack to the same height (so the
stacks are of equal height and are partitioned by the stacking variable).

• identity: no position adjustment. 

• jitter: adds a random offset to objects so that they don’t overlap. 
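
As a rough illustration of these position adjustments, the sketch below builds the same bar chart three times from a tiny invented dataframe (the names responses, site and answer are hypothetical) and uses layer_data() to look at where the tops of the bars end up.

```r
library(ggplot2)

# Hypothetical data: two sites, each with two 'yes', two 'no' and two 'maybe' answers
responses <- data.frame(
  site   = rep(c("A", "B"), each = 6),
  answer = rep(c("yes", "no", "maybe"), times = 4)
)

bars <- ggplot(responses, aes(site, fill = answer))

dodged  <- bars + geom_bar(position = "dodge")  # bars side by side
stacked <- bars + geom_bar(position = "stack")  # bars on top of one another
filled  <- bars + geom_bar(position = "fill")   # stacks rescaled to equal height

# Heights of the tallest point reached under each adjustment
topDodged  <- max(layer_data(dodged)$ymax)
topStacked <- max(layer_data(stacked)$ymax)
topFilled  <- max(layer_data(filled)$ymax)
```

With these made-up data the dodged bars each reach 2, the stacked bars reach 6 (2 + 2 + 2), and the filled bars reach 1 because each stack is expressed as a proportion.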

Another useful tool for avoiding overplotting is faceting, which basically means splitting 
a plot into subgroups. There are two ways to do this. The first is to produce a grid that splits 
the data displayed by the plot by combinations of other variables. This is achieved using 
facet_grid(). The second way is to split the data displayed by the plot by a single variable 
either as a long ribbon of individual graphs, or to wrap the ribbon onto the next line after 
a certain number of plots such that a grid is formed. This is achieved using facet_wrap().

Figure 4.8 shows the differences between facet_grid() and facet_wrap() using a concrete
example. Social networking sites such as Facebook offer an unusual opportunity to
carefully manage your self-presentation to others (i.e., do you want to appear to be cool 
when in fact you write statistics books, appear attractive when you have huge pustules all 
over your face, fashionable when you wear 1980s heavy metal band t-shirts and so on). 


FIGURE 4.8

A study was done that examined the relationship between narcissism and other people’s
ratings of your profile picture on Facebook (Ong et al., 2011). The pictures were rated
on each of four dimensions: coolness, glamour, fashionableness and attractiveness. In
addition, each person was measured on introversion/extroversion and their gender was
recorded. Let’s say we wanted to plot the relationship between narcissism and the
profile picture ratings. We would have a lot of data because we have different types of
rating, males and females and introverts and extroverts. We could use facet_grid() to
produce plots of narcissism vs. photo rating for each combination of gender and
extroversion. We’d end up with a grid of four plots (Figure 4.8). Alternatively, we could use
facet_wrap() to split the plots by the type of rating (cool, glamorous, fashionable,
attractive). Depending on how we set up the command, this would give us a ribbon of four
plots (one for each type of rating) or we could wrap the plots to create a grid formation
(Figure 4.8).

To use faceting in a plot, we add one of the following commands: 

+ facet_wrap( ~ y, nrow = integer, ncol = integer) 

+ facet_grid(x ~ y) 

In these commands, x and y are the variables by which you want to facet, and for
facet_wrap nrow and ncol are optional instructions to control how the graphs are wrapped: they
enable you to specify (as an integer) the number of rows or columns that you would like.
For example, if we wanted to facet by the variables gender and extroversion, we would add
this command:

+ facet_grid(gender ~ extroversion)

If we wanted to draw different graphs for the four kinds of rating (Rating_Type), we could 
add: 

+ facet_wrap( ~ Rating_Type)

This would give us an arrangement of graphs of one row and four columns (Figure 4.8);
if we wanted to arrange these in a 2 by 2 grid (Figure 4.8) then we simply specify that we
want two columns:

+ facet_wrap( ~ Rating_Type, ncol = 2)

or, indeed, two rows:

+ facet_wrap( ~ Rating_Type, nrow = 2)
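
The faceting commands above can be tried out with the sketch below. It uses simulated data rather than the real study, so the dataframe simData and its columns (narcissism, rating, gender, extroversion) are assumptions made purely for illustration; only Rating_Type borrows its name from the text.

```r
library(ggplot2)

# Hypothetical data: 80 simulated ratings
set.seed(2011)
n <- 80
simData <- data.frame(
  narcissism   = rnorm(n, mean = 10, sd = 3),
  rating       = sample(1:5, n, replace = TRUE),
  gender       = rep(c("Female", "Male"), times = n / 2),
  extroversion = rep(c("Introvert", "Extrovert"), each = n / 2),
  Rating_Type  = rep(c("Attractive", "Cool", "Fashion", "Glamour"),
                     each = 2, times = n / 8)
)

basePlot <- ggplot(simData, aes(narcissism, rating)) + geom_point()

# A grid: one panel for each combination of gender and extroversion
gridPlot <- basePlot + facet_grid(gender ~ extroversion)

# A ribbon of four panels, and the same panels wrapped into two columns
ribbonPlot <- basePlot + facet_wrap( ~ Rating_Type, nrow = 1)
squarePlot <- basePlot + facet_wrap( ~ Rating_Type, ncol = 2)

# Number of panels that each plot will draw
gridPanels   <- length(unique(layer_data(gridPlot)$PANEL))
squarePanels <- length(unique(layer_data(squarePlot)$PANEL))
```

Both gridPlot and squarePlot end up with four panels, but they are split by different variables: the first by two variables at once, the second by a single variable.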


Saving graphs


Having created the graph of your dreams, you’ll probably want to save it somewhere. 
There are lots of options here. The simplest (but least useful in my view) is to use the File 
menu to save the plot as a pdf file. Figure 4.9 shows the stages in creating and saving a 
graph. Like anything in R, you first write a set of instructions to generate the graph. You 
select and execute these instructions. Having done this your graph appears in a new window.
Click inside this window to make it active, then go to the File => Save As menu to open
a standard dialog box to save the file in a location of your choice. 

Personally, I prefer to use the ggsave() function, which is a versatile exporting function 
that can export as PostScript (.eps/.ps), tex (pictex), pdf, jpeg, tiff, png, bmp, svg and wmf 
(in Windows only). In its basic form, the structure of the function is very simple: 

ggsave(filename)

FIGURE 4.9 Saving a graph manually

Here filename should be a text string that defines where you want to save the plot and the 
filename you want to use. The function automatically determines the format to which you 
want to export from the file extension that you put in the filename, so: 

ggsave("Outlier Amazon.png") 

will export as a png file, whereas 

ggsave("Outlier Amazon.tiff") 

will export as a tiff file. In the above examples I have specified only a filename, and these
files will, therefore, be saved in the current working directory (see section 3.4.4). You can,
however, use a text string that defines an exact location, or create an object containing the 
file location that is then passed into the ggsave() function (see R’s Souls’ Tip 4.1). There are 
several other options you can specify, but mostly the defaults are fine. However, sometimes 
you might want to export to a specific size, and this can be done by defining the width and 
height of the image in inches: thus 

ggsave("Outlier Amazon.tiff", width = 2, height = 2) 
should save a tiff file that is 2 inches wide by 2 inches high. 
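
Here is a minimal sketch of ggsave() in action. The plot and the file name are invented for the example, and the file is written to R's temporary folder (via tempdir()) so that nothing lands in your working directory. I have used a pdf extension only because that format should be available on any installation; the same call with a .png or .tiff extension should behave in the same way if your system supports those formats. The plot is passed explicitly through the plot argument rather than relying on the last graph displayed.

```r
library(ggplot2)

# Hypothetical plot to save
simData <- data.frame(x = 1:10, y = (1:10)^2)
simPlot <- ggplot(simData, aes(x, y)) + geom_point()

# Build a file path inside R's temporary directory
outFile <- file.path(tempdir(), "example-plot.pdf")

# Save a 2 inch by 2 inch image; the format is taken from the file extension
ggsave(outFile, plot = simPlot, width = 2, height = 2)

savedOk <- file.exists(outFile)
```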


4.4.8. Putting it all together: a quick tutorial


We have covered an enormous amount of ground in a short time, and have still only
scratched the surface of what can be done with ggplot2. Also, we haven’t actually plotted
anything yet! In this section we will do a quick tutorial in which we put into practice various
things that we have discussed in this chapter to give you some concrete experience of
using ggplot2 and to illustrate how some of the basic functionality of the package works.


R’s Souls’ Tip 4.1 Saving graphs


By default ggsave() saves plots in your working directory (which hopefully you have set as something sensible). I 
find it useful sometimes to set up a specific location for saving images and to feed this into the ggsave() function. 
For example, executing: 


imageDirectory<-file.path(Sys.getenv("HOME"), "Documents", "Academic", "Books", 
"Discovering Statistics", "DSUR I Images") 


uses the file.path() and Sys.getenv() functions to create an object imageDirectory which is a text string defining 
a folder called ‘DSUR I Images', which is in a folder called ‘Discovering Statistics’ in a folder called ‘Books’ in a 
folder called ‘Academic’ which is in my main 'Documents’ folder. On my computer (an iMac) this command sets 
imageDirectory to be: 

"/Users/andyfield/Documents/Academic/Books/Discovering Statistics/DSUR I Images" 


Sys.getenv("HOME") is a quick way to get the filepath of your home directory (in my case
/Users/andyfield/), and we use the file.path() function to paste the specified folder names
together with the separator that R uses for file paths. Because I use a Mac it has connected
the folders using a ‘/’; R also uses ‘/’ in file paths on Windows, even though Windows itself
conventionally uses ‘\’ to denote folders.

Having defined this location, we can use it to create a file path for a new image: 

imageFile <- file.path(imageDirectory,"Graph.png") 
ggsave(imageFile) 


This produces a text string called imageFile, which is the filepath we have just defined (imageDirectory) with the 
filename that we want (Graph.png) added to it. We can reuse this code for a new graph by just changing the 
filename specified in imageFile: 

imageFile <- file.path(imageDirectory,"Outlier Amazon.png") 
ggsave(imageFile) 
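
If the folder you name doesn't exist yet, saving into it will fail, so it can be worth checking first. The sketch below is one way to do that with base R. The folder name is hypothetical, and I have placed it inside R's temporary directory so that the example is harmless to run; in practice you would build imageDirectory from your own folders as above.

```r
# Hypothetical image folder inside R's temporary directory
imageDirectory <- file.path(tempdir(), "Images")

# Create the folder (and any missing parent folders) if it is not there already
if (!dir.exists(imageDirectory)) {
  dir.create(imageDirectory, recursive = TRUE)
}

# Build the full path for a new image
imageFile <- file.path(imageDirectory, "Graph.png")

# basename() and dirname() pull the path apart again
fileName   <- basename(imageFile)
folderName <- basename(dirname(imageFile))
```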


Earlier in the chapter we mentioned a study that looked at ratings of Facebook profile
pictures (rated on coolness, fashion, attractiveness and glamour) and at predicting them from
how highly the person posting the picture scores on narcissism (Ong et al., 2011). The data
are in the file FacebookNarcissism.dat.

First set your working directory to be the location of the data file (see section 3.4.4). 
Then create a dataframe called facebookData by executing the following command: 

facebookData <- read.delim("FacebookNarcissism.dat", header = TRUE) 

Figure 4.10 shows the contents of the dataframe. There are four variables: 

1 id: a number indicating from which participant the profile photo came. 

2 NPQC_R_Total: the total score on the narcissism questionnaire.

3 Rating_Type: whether the rating was for coolness, glamour, fashion or attractiveness 
(stored as strings of text). 

4 Rating: the rating given (on a scale from 1 to 5).

FIGURE 4.10

First we need to create the plot object, which I have called, for want of a more original
idea, graph. Remember that we initiate this object using the ggplot() function, which takes
the following form:

graph <- ggplot(myData, aes(variable for x axis, variable for y axis))

To begin with, let’s plot the relationship between narcissism (NPQC_R_Total) and the 
profile ratings generally (Rating). As such, we want NPQC_R_Total plotted on the x-axis 
and Rating on the y-axis. The dataframe containing these variables is called facebookData 
so we type and execute this command: 

graph <- ggplot(facebookData, aes(NPQC_R_Total, Rating))

This command simply creates an object based on the facebookData dataframe and specifies
the aesthetic mapping of variables to the x- and y-axes. Note that these mappings are
contained within the aes() function. When you execute this command nothing will happen:
we have created the object, but there is nothing to print.

If we want to see something then we need to take our object (graph) and add some visual 
elements. Let’s start with something simple and add dots for each data point. This is done 
using the geom_point() function. If you execute the following command you’ll see the 
graph in the top left panel of Figure 4.11 appear in a window on your screen: 

graph + geom_point() 

If we don’t like the circles then we can change the shape of the points by specifying this for 
the geom. For example, executing: 

graph + geom_point(shape = 17) 

will change the dots to triangles (top right panel of Figure 4.11). By changing the number 
assigned to shape to other values you will see different shaped points (see section 4.4.3). 
If we want to change the size of the dots rather than the shape, this is easily done too by 
specifying a value (in mm) that you want to use for the ‘size’ aesthetic. Executing: 

graph + geom_point(size = 6)

creates the graph in the middle left panel of Figure 4.11. Note that the default shape has
been used (because we haven’t specified otherwise), but the size is larger than by default. At
this stage we don’t know whether a rating represented coolness, attractiveness or whatever.
It would be nice if we could differentiate different ratings, perhaps by plotting them in
different colours. We can do this by setting the colour aesthetic to be the variable
Rating_Type. Executing this command:

graph + geom_point(aes(colour = Rating_Type))

creates the graph in the middle right panel of Figure 4.11, in which, onscreen, different 
types of ratings are now presented in different colours. 5 

5 Note that here we set the colour aesthetic by enclosing it in aes() whereas in the previous examples we did not. 
This is because we’re setting the value of colour based on a variable, rather than a single value.

We potentially have a problem of overplotting because there were a limited number of
responses that people could give (notice that the data points fall along horizontal lines that
represent each of the five possible ratings). To avoid this overplotting we could use the
position option to add jitter:

graph + geom_point(aes(colour = Rating_Type), position = "jitter")

Notice that the command is the same as before; we have just added position = “jitter”. 
The results are shown in the bottom left panel of Figure 4.11; the dots are no longer in 
horizontal lines because a random value has been added to them to spread them around 
the actual value. It should be clear that many of the data points were sitting on top of each 
other in the previous plot. 

Finally, if we wanted to differentiate rating types by their shape rather than using a 
colour, we could change the colour aesthetic to be the shape aesthetic: 

graph + geom_point(aes(shape = Rating_Type), position = "jitter")

Note how we have literally just changed colour = Rating_Type to shape = Rating_Type.
The resulting graph in the bottom right panel of Figure 4.11 is the same as before except
that the different types of ratings are now displayed using different shapes rather than
different colours.

This very rapid tutorial has hopefully demonstrated how geoms and aesthetics work 
together to create graphs. As we now turn to look at specific kinds of graphs, you should 
hopefully have everything you need to make sense of how these graphs are created. 
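
If you don't have the data file to hand, the sketch below reproduces the main ideas of the tutorial with simulated data. The dataframe simFacebook is invented (only its column names are borrowed from the real file), so the patterns in it mean nothing. It shows the difference between setting an aesthetic to a single value and mapping it to a variable inside aes(), and what jitter does to the plotted values.

```r
library(ggplot2)

# Hypothetical stand-in for facebookData: 40 simulated ratings
set.seed(4)
simFacebook <- data.frame(
  NPQC_R_Total = round(rnorm(40, mean = 30, sd = 5)),
  Rating_Type  = rep(c("Attractive", "Cool", "Fashion", "Glamour"), times = 10),
  Rating       = sample(1:5, 40, replace = TRUE)
)

graph <- ggplot(simFacebook, aes(NPQC_R_Total, Rating))

# Setting: every point gets the same single colour (no aes() needed)
fixedColour <- graph + geom_point(colour = "red")

# Mapping: colour depends on a variable, so it goes inside aes()
mappedColour <- graph + geom_point(aes(colour = Rating_Type))

# Jitter adds a small random offset to each point to reduce overplotting
jittered <- graph + geom_point(aes(colour = Rating_Type), position = "jitter")

coloursWhenSet    <- unique(layer_data(fixedColour)$colour)
coloursWhenMapped <- unique(layer_data(mappedColour)$colour)
jitteredRatings   <- layer_data(jittered)$y
```

coloursWhenSet contains one colour, whereas coloursWhenMapped contains one colour per type of rating; jitteredRatings are no longer whole numbers because of the random offsets.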


4.5. Graphing relationships: the scatterplot



Sometimes we need to look at the relationships between variables. A scatterplot
is a graph that plots each person’s score on one variable against their
score on another. A scatterplot tells us several things about the data, such 
as whether there seems to be a relationship between the variables, what 
kind of relationship it is and whether any cases are markedly different from 
the others. We saw earlier that a case that differs substantially from the 
general trend of the data is known as an outlier and such cases can severely 
bias statistical procedures (see Jane Superbrain Box 4.1 and section 7.7.1.1 
for more detail). We can use a scatterplot to show us if any cases look like 
outliers. 


Simple scatterplot



This type of scatterplot is for looking at just two variables. For example, a psychologist 
was interested in the effects of exam stress on exam performance. So, she devised and 
validated a questionnaire to assess state anxiety relating to exams (called the Exam Anxiety 
Questionnaire, or EAQ). This scale produced a measure of anxiety scored out of 100. 
Anxiety was measured before an exam, and the percentage mark of each student on the 
exam was used to assess the exam performance. The first thing that the psychologist should 
do is draw a scatterplot of the two variables. Her data are in the file Exam Anxiety.dat and
you should load this file into a dataframe called examData by executing:

examData <- read.delim("Exam Anxiety.dat", header = TRUE)

Figure 4.12 shows the contents of the dataframe. There are five variables:

1 Code: a number indicating from which participant the scores came. 

2 Revise: the total hours spent revising. 

3 Exam: mark on the exam as a percentage. 

4 Anxiety: the score on the EAQ. 

5 Gender: whether the participant was male or female (stored as strings of text).

First we need to create the plot object, which I have called scatter. Remember that we
initiate this object using the ggplot() function. The contents of this function specify the
dataframe to be used (examData) and any aesthetics that apply to the whole plot. I’ve said
before that one aesthetic that is usually defined at this level is the variables that we want to
plot. To begin with, let’s plot the relationship between exam anxiety (Anxiety) and exam
performance (Exam). We want Anxiety plotted on the x-axis and Exam on the y-axis.
To specify these variables as an aesthetic we type aes(Anxiety, Exam), so
the final command that we execute is:

scatter <- ggplot(examData, aes(Anxiety, Exam)) 

This command creates an object based on the examData dataframe and specifies the aesthetic
mapping of variables to the x- and y-axes. When you execute this command nothing
will happen: we have created the object, but there is nothing to print.

If we want to see something then we need to take our object (scatter) and add a layer
containing visual elements. For a scatterplot we essentially want to add dots, which is done 
using the geom_point() function. 

scatter + geom_point() 

If we want to add some nice labels to our axes then we can also add a layer with these 
on using labs(): 

scatter + geom_point() + labs(x = "Exam Anxiety", y = "Exam Performance %")

FIGURE 4.13

If you execute this command you’ll see the graph in Figure 4.13. The scatterplot tells us
that the majority of students suffered from high levels of anxiety (there are very few cases
that had anxiety levels below 60). Also, there are no obvious outliers in that most points
seem to fall within the vicinity of other points. There also seems to be some general trend
in the data, such that low levels of anxiety are almost always associated with high examination
marks (and high anxiety is associated with a lot of variability in exam marks). Another
noticeable trend in these data is that there were no cases having low anxiety and low exam
performance - in fact, most of the data are clustered in the upper region of the anxiety scale.


Adding a funky line


You often see scatterplots that have a line superimposed over the top that summarizes the 
relationship between variables (this is called a regression line and we will discover more 
about it in Chapter 7). The scatterplot you have just produced won’t have a funky line on 
it yet, but don’t get too depressed because I’m going to show you how to add this line now. 

In ggplot2 terminology a regression line is known as a ‘smoother’ because it
smooths out the lumps and bumps of the raw data into a line that summarizes 
the relationship. The geom_smooth() function provides the functionality to 
add lines (curved or straight) to summarize the pattern within your data. 

To add a smoother to our existing scatterplot, we would simply add the 
geom_smooth() function and execute it: 

scatter + geom_point() + geom_smooth() + labs(x = "Exam Anxiety",
    y = "Exam Performance %")

Note that the command is exactly the same as before except that we have 
added a smoother in a new layer by typing + geom_smooth(). The resulting 
graph is shown in Figure 4.14. Note that the scatterplot now has a curved 
line (a ‘smoother’) summarizing the relationship between exam anxiety and 
exam performance. The shaded area around the line is the 95% confidence interval around 
the line. We’ll see in due course how to remove this shaded area or to recolour it.

The smoothed line in Figure 4.14 is very pretty, but often we want to fit a straight line
(or linear model) instead of a curved one. To do this, we need to change the ‘method’
associated with the smooth geom. In Table 4.3 we saw several methods that could be used
for the smooth geom: lm fits a linear model (i.e., a straight line) and you could use rlm for
a robust linear model (i.e., less affected by outliers). 6 So, to add a straight line (rather than
curved) we change geom_smooth() to include this instruction:

+ geom_smooth(method = "lm")

We can also change the appearance of the line: by default it is blue, but if we wanted a red 
line then we can simply define this aesthetic within the geom: 

+ geom_smooth(method = "lm", colour = "Red")

Putting this together with the code for the simple scatterplot, we would execute:

scatter <- ggplot(examData, aes(Anxiety, Exam))

scatter + geom_point() + geom_smooth(method = "lm", colour = "Red") +
    labs(x = "Exam Anxiety", y = "Exam Performance %")

The resulting scatterplot is shown in Figure 4.15. Note that it looks the same as Figure 
4.13 and Figure 4.14 except that a red (because we specified the colour as red) regression 
line has been added. 7 As with our curved line, the regression line is surrounded by the 95% 
confidence interval (the grey area). We can switch this off by simply adding se = F (which 
is short for ‘standard error = False’) to the geom_smooth() function: 

+ geom_smooth(method = "lm", se = F)

We can also change the colour and transparency of the confidence interval using the fill and 
alpha aesthetics, respectively. For example, if we want the confidence interval to be blue 
like the line itself, and we want it fairly transparent we could specify: 

+ geom_smooth(method = "lm", alpha = 0.1, fill = "Blue")


6 You must have the MASS package loaded to use this method. 

7 You’ll notice that the figure doesn’t have a red line but what you see on your screen does. That’s because this
book isn’t printed in colour, which makes it tricky for us to show you the colourful delights of R. In general, use
the figures in the book as a guide only and read the text with reference to what you actually see on your screen.

FIGURE 4.15

Note that transparency can take a value from 0 (fully transparent) to 1 (fully opaque)
and so we have set a fairly transparent colour by using 0.1 (after all we want to see the data 
points underneath). The impact of these changes can be seen in Figure 4.16. 
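
To convince yourself that method = "lm" really does draw the ordinary regression line, you could try something like the sketch below. The data are simulated (simExam is a made-up stand-in for examData, with an arbitrary negative relationship built in), and the formula argument is spelled out only to make the default explicit. The slope of the plotted line is recovered from layer_data() and compared with the slope from lm().

```r
library(ggplot2)

# Hypothetical stand-in for examData: 100 simulated students
set.seed(7)
simExam <- data.frame(Anxiety = runif(100, min = 20, max = 100))
simExam$Exam <- 90 - 0.4 * simExam$Anxiety + rnorm(100, sd = 8)

scatter <- ggplot(simExam, aes(Anxiety, Exam))

# Straight red line with a faint blue confidence band
linePlot <- scatter + geom_point() +
  geom_smooth(method = "lm", formula = y ~ x,
              colour = "red", fill = "blue", alpha = 0.1) +
  labs(x = "Exam Anxiety", y = "Exam Performance %")

# The smoother is the second layer; take the slope between its end points
smoothData   <- layer_data(linePlot, 2)
lastRow      <- nrow(smoothData)
plottedSlope <- (smoothData$y[lastRow] - smoothData$y[1]) /
                (smoothData$x[lastRow] - smoothData$x[1])

# Slope from fitting the linear model directly
modelSlope <- unname(coef(lm(Exam ~ Anxiety, data = simExam))["Anxiety"])
```

The two slopes should agree (apart from rounding error), because the smoother is simply plotting the predictions from the same linear model.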


Grouped scatterplot


What if we want to see whether male and female students had different reactions to exam 
anxiety? To do this, we need to set Gender as an aesthetic. This is fairly straightforward. 
First, we define gender as a colour aesthetic when we initiate the plot object: 

scatter <- ggplot(examData, aes(Anxiety, Exam, colour = Gender))

Note that this command is exactly the same as the previous example, except that we have
added ‘colour = Gender’ so that any geoms we define will be coloured differently for men
and women. Therefore, if we then execute:

scatter + geom_point() + geom_smooth(method = "lm")

we would have a scatterplot with different coloured dots and regression lines for men and 
women. It’s as simple as that. However, our lines would have confidence intervals and both 
intervals would be shaded grey, so we could be a little more sophisticated and add some 
instructions into geom_smooth() that tell it to also colour the confidence intervals according
to the Gender variable:

scatter + geom_point() + geom_smooth(method = "lm", aes(fill = Gender), alpha = 0.1)

Note that we have used fill to specify that the confidence intervals are coloured according
to Gender (note that because we are specifying a variable rather than a single colour we
have to place this option within aes()). As before, we have also manually set the transparency
of the confidence intervals to be 0.1.

As ever, let’s add some labels to the graph: 

+ labs(x = "Exam Anxiety", y = "Exam Performance %", colour = "Gender")

Note that by specifying a label for ‘colour’ I am setting the label that will be used on the 
legend of the graph. The finished command to be executed will be: 

scatter + geom_point() + geom_smooth(method = "lm", aes(fill = Gender), alpha = 0.1) +
    labs(x = "Exam Anxiety", y = "Exam Performance %", colour = "Gender")

Figure 4.17 shows the resulting scatterplot. The regression lines tell us that the relationship
between exam anxiety and exam performance was slightly stronger in males (the line
is steeper) indicating that men’s exam performance was more adversely affected by anxiety
than women’s. (Whether this difference is significant is another issue - see
section 6.7.1.) 
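
Judging steepness by eye can be tricky, so here is a sketch of how you might put numbers on the two lines by fitting a separate linear model for each group. The data are simulated, with a steeper slope deliberately built in for males, so the values say nothing about the real exam anxiety data; simExam and slopeFor() are names I have made up for the example.

```r
# Hypothetical data: 50 females and 50 males with different true slopes
set.seed(17)
n <- 50
simExam <- data.frame(
  Anxiety = runif(2 * n, min = 20, max = 100),
  Gender  = rep(c("Female", "Male"), each = n)
)
trueSlope    <- ifelse(simExam$Gender == "Male", -0.8, -0.3)
simExam$Exam <- 100 + trueSlope * simExam$Anxiety + rnorm(2 * n, sd = 5)

# Fit Exam ~ Anxiety within one gender and return the slope
slopeFor <- function(gender) {
  groupData <- simExam[simExam$Gender == gender, ]
  unname(coef(lm(Exam ~ Anxiety, data = groupData))["Anxiety"])
}

maleSlope   <- slopeFor("Male")
femaleSlope <- slopeFor("Female")
```

A more negative slope corresponds to a steeper downward line, which is what geom_smooth(method = "lm") would draw for that group when colour is mapped to Gender.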



FIGURE 4.17 Scatterplot of exam anxiety and exam performance

SELF-TEST

• Go back to the Facebook narcissism data from the earlier tutorial. Plot a graph that
shows the pattern in the data using only a line.

• Plot different coloured lines for the different types of rating (cool, fashionable,
attractive, glamorous).

• Add a layer displaying the raw data as points.

• Add labels to the axes.


4.6. Histograms: a good way to spot obvious problems



In this section we’ll look at how we can use frequency distributions to screen our data. 8 
We’ll use an example to illustrate what to do. A biologist was worried about the potential
health effects of music festivals. So, one year she went to the Download Music
Festival 9 (for those of you outside the UK, you can pretend it is Roskilde Festival, Ozzfest,
Lollapalooza, Wacken or something) and measured the hygiene of 810 concert-goers over
the three days of the festival. In theory each person was measured on each day but because 
it was difficult to track people down, there were some missing data on days 2 and 3. 
Hygiene was measured using a standardized technique (don’t worry, it wasn’t licking the 
person’s armpit) that results in a score ranging between 0 (you smell like a corpse that’s 
been left to rot up a skunk’s arse) and 4 (you smell of sweet roses on a fresh spring day). 
Now I know from bitter experience that sanitation is not always great at these places (the 
Reading Festival seems particularly bad) and so this researcher predicted that personal 
hygiene would go down dramatically over the three days of the festival. The data file, 
DownloadFestival.dat, can be found on the companion website. We encountered histograms
(frequency distributions) in Chapter 1; we will now learn how to create one in R
using these data. 



SELF-TEST 

• What does a histogram show?


Load the data into a dataframe (which I’ve called festivalData); if you need to refresh 
your memory on data files and dataframes see section 3.5. Assuming you have set the 
working directory to be where the data file is stored, you can create the dataframe by 
executing this command: 

festivalData <- read.delim("DownloadFestival.dat", header = TRUE) 

8 An alternative way to graph the distribution is a density plot, which we’ll discuss later.

9 http://www.downloadfestival.co.uk

Now we need to create the plot object and define any aesthetics that apply to the plot as
a whole. I have called the object festivalHistogram, and have created it using the ggplot()
function. The contents of this function specify the dataframe to be used (festivalData) and
any aesthetics that apply to the whole plot. I’ve said before that one aesthetic that is usually
defined at this level is the variables that we want to plot. To begin with let’s plot the
hygiene scores for day 1, which are in the variable day1. Therefore, to specify this variable
as an aesthetic we type aes(day1). I have also decided to turn the legend off so I have added
opts(legend.position = "none") to do this (see R’s Souls’ Tip 4.2):

festivalHistogram <- ggplot(festivalData, aes(day1)) + opts(legend.position = "none")

Remember that having executed the above command we have an object but no graphical
layers, so we will see nothing. To add the graphical layer we need to add the histogram
geom to our existing plot:

festivalHistogram + geom_histogram()

Executing this command will create a graph in a new window. If you are happy using the 
default options then this is all there is to it; sit back and admire your efforts. However, we 
can tidy the graph up a bit. First, we could change the bin width. I would normally play 
around with different bin widths to get a feel for the distribution. To save time, let’s just 
change it to 0.4. We can do this by inserting a command within the histogram geom: 

+ geom_histogram(binwidth = 0.4) 

We should also provide more informative labels for our axes using the labs() function: 

+ labs(x = "Hygiene (Day 1 of Festival)", y = "Frequency") 

As you can see, I have simply typed in the labels I want (within quotation marks) for the
horizontal (x) and vertical (y) axes. Making these two changes leaves us with this command,
which we must execute to see the graph:

festivalHistogram + geom_histogram(binwidth = 0.4) +
    labs(x = "Hygiene (Day 1 of Festival)", y = "Frequency")

The resulting histogram is shown in Figure 4.18. The first thing that should leap out at
you is that there appears to be one case that is very different than the others. All of the
scores appear to be squashed up at one end of the distribution because they are all less
than 5 (yielding a very pointy distribution) except for one, which has a value of 20. This is
an outlier: a score very different than the rest (Jane Superbrain Box 4.1). Outliers bias the
mean and inflate the standard deviation (you should have discovered this from the self-test
tasks in Chapters 1 and 2) and screening data is an important way to detect them. You can
look for outliers in two ways: (1) graph the data with a histogram (as we have done here)
or a boxplot (as we will do in the next section); or (2) look at z-scores (this is quite
complicated, but if you want to know see Jane Superbrain Box 4.2).

FIGURE 4.18

The outlier shown on the histogram is particularly odd because it has a score of 20, 
which is above the top of our scale (remember our hygiene scale ranged only from 0 to 
4) and so it must be a mistake (or the person had obsessive compulsive disorder and had 
washed themselves into a state of extreme cleanliness). 



R’s Souls’ Tip 4.2: Removing legends 


By default ggplot2 produces a legend on the right-hand side of the plot. Mostly this legend is a useful thing to 
have. However, there are occasions when you might like it to go away. This is achieved using the opts() function 
either when you set up the plot object, or when you add layers to the plot. To remove the legend just add: 


+ opts(legend.position="none") 


For example, either 

myGraph <- ggplot(myData, aes(variable for x axis, variable for y axis)) + opts(legend.position="none") 


or 

myGraph <- ggplot(myData, aes(variable for x axis, variable for y axis)) 
myGraph + geom_point() + opts(legend.position="none") 

will produce a graph without a figure legend. 
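
One caveat if you are running a recent release of ggplot2: opts() has since been retired, and theme() now does the same job. Here is a minimal sketch using made-up data (myData, its variables and the object noLegend are all invented for illustration):

```r
library(ggplot2)

# Made-up data: mapping a grouping variable to colour creates a legend by default
myData <- data.frame(x = 1:6,
                     y = c(2, 4, 3, 6, 5, 7),
                     group = rep(c("A", "B"), times = 3))

myGraph <- ggplot(myData, aes(x, y, colour = group))

# theme(legend.position = "none") plays the role of opts(legend.position = "none")
noLegend <- myGraph + geom_point() + theme(legend.position = "none")
noLegend
```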


4.7. Boxplots (box-whisker diagrams) 



Boxplots or box-whisker diagrams are really useful ways to display your data. 
At the centre of the plot is the median, which is surrounded by a box the top 
and bottom of which are the limits within which the middle 50% of observations 
fall (the interquartile range). Sticking out of the top and bottom of the box 
are two whiskers that extend to one and a half times the interquartile range. 
First, we will plot some using ggplot2 and then we’ll look at what they tell us in 
more detail. In the data file of hygiene scores we also have information about 
the gender of the concert-goer. Let’s plot this information as well. To make 
our boxplot of the day 1 hygiene scores for males and females, we will need 
to set the variable Gender as an aesthetic. The simplest way to do this is just 
to specify Gender as the variable to be plotted on the x-axis, and the hygiene 
scores (day1) to be the variable plotted on the y-axis. As such, when we initiate 
JANE SUPERBRAIN 4.1 

What is an outlier? 

An outlier is a score very different from the rest of the 
data. When we analyse data we have to be aware of such 
values because they bias the model we fit to the data. A 
good example of this bias can be seen by looking at the 
mean. When I published my first book (the first edition of 
the SPSS version of this book), I was quite young, I was 
very excited and I wanted everyone in the world to love 
my new creation and me. Consequently, I obsessively 
checked the book’s ratings on Amazon.co.uk. These 
ratings can range from 1 to 5 stars. Back in 2002, my 
first book had seven ratings (in the order given) of 2, 5, 
4, 5, 5, 5, and 5. All but one of these ratings are fairly 
similar (mainly 5 and 4) but the first rating was quite different 
from the rest - it was a rating of 2 (a mean and 
horrible rating). The graph plots seven reviewers on the 
horizontal axis and their ratings on the vertical axis and 
there is also a horizontal line that represents the mean 
rating (4.43 as it happens). It should be clear that all of 
the scores except one lie close to this line. The score of 2 
is very different and lies some way below the mean. This 
score is an example of an outlier - a weird and unusual 
person (sorry, I mean score) that deviates from the rest of 
humanity (I mean, data set). The dashed horizontal line 
represents the mean of the scores when the outlier is not 
included (4.83). This line is higher than the original mean, 
indicating that by ignoring this score the mean increases 
(it increases by 0.4). This example shows how a single 
score, from some mean-spirited badger turd, can bias 
the mean; in this case the first rating (of 2) drags the average 
down. In practical terms this had a bigger implication 
because Amazon rounded off to half numbers, so that 
single score made a difference between the average rating 
reported by Amazon as a generally glowing 5 stars 
and the less impressive 4.5 stars. (Nowadays Amazon 
sensibly produces histograms of the ratings and has a 
better rounding system.) Although I am consumed with 
bitterness about this whole affair, it has at least given me 
a great example of an outlier! (Data for this example were 
taken from http://www.amazon.co.uk/ in about 2002.) 
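
If you'd like to check my arithmetic, the two means are easy to reproduce in R. This sketch just types in the seven ratings quoted above; the object names (ratings, meanAll, meanNoOutlier) are my own inventions for the purpose.

```r
# The seven Amazon ratings quoted in the text, in the order given
ratings <- c(2, 5, 4, 5, 5, 5, 5)

# Mean of all seven ratings (the solid line on the graph)
meanAll <- mean(ratings)

# Mean with the first rating (the outlier of 2) left out (the dashed line)
meanNoOutlier <- mean(ratings[-1])

round(meanAll, 2)                  # 4.43
round(meanNoOutlier, 2)            # 4.83
round(meanNoOutlier - meanAll, 2)  # 0.4
```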

our plot object rather than set a single variable as an aesthetic as we did for the histogram 
(aes(day1)), we set Gender and day1 as variables (aes(gender, day1)). Having initiated the 
plot object (I’ve called it festivalBoxplot), we can simply add the boxplot geom as a layer 
(+ geom_boxplot()) and add some axis labels with the labs() function as we did when we 
created a histogram. To see the graph we therefore simply execute these two lines of code: 

festivalBoxplot <- ggplot(festivalData, aes(gender, day1)) 

festivalBoxplot + geom_boxplot() + labs(x = "Gender", y = "Hygiene (Day 1 of Festival)") 

The resulting boxplot is shown in Figure 4.19. It shows a separate boxplot for the men 
and women in the data. Note that the outlier that we detected in the histogram is shown 
up as a point on the boxplot (we can also tell that this case was a female). An outlier is an 
extreme score, so the easiest way to find it is to sort the data: 

festivalData <- festivalData[order(festivalData$day1),] 
SELF-TEST 

✓ Remove the outlier and replot the histogram. 



JANE SUPERBRAIN 4.2 

Using z-scores to find outliers 

To check for outliers we can look at z-scores. We saw in 
section 1.7.4 that z-scores are simply a way of standardizing 
a data set by expressing the scores in terms of a 
distribution with a mean of 0 and a standard deviation of 
1. In doing so we can use benchmarks that we can apply 
to any data set (regardless of what its original mean and 
standard deviation were). To look for outliers we could 
convert our variable to z-scores and then count how many 
fall within certain important limits. If we take the absolute 
value (i.e., we ignore whether the z-score is positive or 
negative) then in a normal distribution we’d expect about 
5% to have absolute values greater than 1.96 (we often 
use 2 for convenience), and 1% to have absolute values 
greater than 2.58, and none to be greater than about 3.29. 

I have written a function that gets R to count them 
for you called outlierSummary(). To use the function you 
need to load the package associated with this book (see 
section 3.4.5); you then simply insert the name of the variable 
that you would like summarized into the function and 
execute it. For example, to count the number of z-scores 
with absolute values above our three cut-off values in the 
day2 variable, we can execute: 

outlierSummary(festivalData$day2) 

Absolute z-score greater than 1.96 = 6.82 % 

Absolute z-score greater than 2.58 = 2.27 % 

Absolute z-score greater than 3.29 = 0.76 % 

The output produced by this function is shown 
above. We would expect to see 5% (or less) with an 
absolute value greater than 1.96, 1% (or less) with an 
absolute value greater than 2.58, and we'd expect no 
cases above 3.29 (these cases are significant outliers). 
For hygiene scores on day 2 of the festival, 6.82% of 
z-scores had absolute values greater than 1.96. This is 
slightly more than the 5% we would expect in a normal 
distribution. Looking at values above 2.58, we would 
expect to find only 1%, but again here we have a higher 
value of 2.27%. Finally, we find that 0.76% of cases 
were above 3.29 (so 0.76% are significant outliers). This 
suggests that there may be slightly too many outliers in 
this variable and we might want to do something about 
them. 
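
If you're curious what a function like this has to do, here is a rough sketch of the idea. I should stress that zScoreSummary() is a hypothetical stand-in that I've written for this illustration (it is not the outlierSummary() function from the book's package), and fakeScores is made-up data.

```r
# A hypothetical stand-in for outlierSummary(): reports the percentage of
# absolute z-scores that exceed each of the three cut-offs used in the text
zScoreSummary <- function(variable, cutoffs = c(1.96, 2.58, 3.29)) {
    # Standardize: subtract the mean, then divide by the standard deviation
    zScores <- (variable - mean(variable, na.rm = TRUE)) / sd(variable, na.rm = TRUE)
    absZ <- abs(zScores[!is.na(zScores)])
    # Percentage of cases beyond each cut-off
    percentages <- sapply(cutoffs, function(cut) 100 * mean(absZ > cut))
    names(percentages) <- paste("Absolute z-score greater than", cutoffs)
    round(percentages, 2)
}

# Made-up scores with one extreme value, purely for illustration
set.seed(42)
fakeScores <- c(rnorm(99, mean = 1.8, sd = 0.7), 20)
zSummary <- zScoreSummary(fakeScores)
zSummary
```

The three percentages can then be compared with the 5%, 1% and 0% benchmarks in exactly the way described above.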



‘Graphs are for laughs, and functions are full of fun’ thinks Oliver 
as he pops a huge key up his nose and starts to wind the clockwork 
mechanism of his brain. We don’t look at functions for another 
couple of chapters, which is why I’ve skipped over the details of how 
the outlierSummary() function works. If, like Oliver, you like to wind up 
your brain, the additional material for this chapter, on the companion 
website, explains how I wrote the function. If that doesn't quench your thirst for knowledge then you’re a grain of salt. 


OLIVER TWISTED 

Please, Sir, can I 
have some more... 
complicated stuff? 
FIGURE 4.19 

Boxplot of hygiene scores on day 1 of the Download Festival split by gender 


This command takes festivalData and sorts it by the variable day1. All we have to do 
now is to look at the last case (i.e., the largest value of day1) and change it. The offending 
case turns out to be a score of 20.02, which is probably a mistyping of 2.02. We’d have to 
go back to the raw data and check. We’ll assume we’ve checked the raw data and it should 
be 2.02, and that we’ve used R Commander’s data editor (see section 3.6 or the online 
materials for this chapter) to replace the value 20.02 with the value 2.02 before we 
continue this example. 
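
If you would rather not open a data editor, the same correction can be made in code. This is an alternative of my own, sketched on a made-up miniature dataframe (miniFestival is invented, not the real file), and it assumes that you have already checked the raw data.

```r
# Made-up miniature version of the festival data (NOT the real file)
miniFestival <- data.frame(ticknumb = 1:5,
                           day1 = c(2.64, 0.97, 20.02, 1.35, 3.02))

# Sort by day1 so that the largest score ends up in the last row
miniFestival <- miniFestival[order(miniFestival$day1), ]
tail(miniFestival, n = 1)   # the suspicious case sits in the last row

# Overwrite the largest (mistyped) score with the checked value
miniFestival$day1[which.max(miniFestival$day1)] <- 2.02
max(miniFestival$day1)      # 3.02
```

Using which.max() targets the single largest score, so it only makes sense when there is exactly one value to correct.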



SELF-TEST 

✓ Now that we have removed the outlier in the data, try 
replotting the boxplot. The resulting graph should look 
like Figure 4.20. 


Figure 4.20 shows the boxplots for the hygiene scores on day 1 after the outlier has been 
corrected. Let’s look now in more detail at what the boxplot represents. First, it shows 
us the lowest score (the lowest point of the bottom whisker, or a dot below it) and the 
highest (the highest point of the top whisker of each plot, or a dot above it). Comparing 
the males and females we can see they both had similar low scores (0, or very smelly) but 
the women had a slightly higher top score (i.e., the most fragrant female was more hygienic 
than the cleanest male). 

The lowest edge of the white box is the lower quartile (see section 1.7.3); therefore, the 
distance between the bottom of the vertical line and the lowest edge of the white box is the 
range between which the lowest 25% of scores fall. This range is slightly larger for women 
than for men, which means that if we take the most unhygienic 25% of females then there is 
more variability in their hygiene scores than the lowest 25% of males. The box (the white 
area) shows the interquartile range (see section 1.7.3): that is, 50% of the scores are bigger 
than the lowest part of the white area but smaller than the top part of the white area. These 
boxes are of similar size in the males and females. 

The top edge of the white box shows the value of the upper quartile (see section 1.7.3); 
therefore, the distance between the top edge of the white box and the top of the vertical 
line shows the range between which the top 25% of scores fall. In the middle of the white 
box is a line that represents the value of the median (see section 1.7.2). The median for 
females is higher than for males, which tells us that the middle female scored higher, or was 
more hygienic, than the middle male. 

Boxplots show us the range of scores, the range between which the middle 50% of scores 
fall, and the median, the upper quartile and lower quartile score. Like histograms, they 
also tell us whether the distribution is symmetrical or skewed. If the whiskers are the same 
length then the distribution is symmetrical (the range of the top and bottom 25% of scores 
is the same); however, if the top or bottom whisker is much longer than the opposite whisker 
then the distribution is asymmetrical (the range of the top and bottom 25% of scores 
is different). Finally, you’ll notice some dots above the male boxplot. These are cases that 
are deemed to be outliers. In Chapter 5 we’ll see what can be done about these outliers. 
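
To make the parts of a boxplot a bit more concrete, here is a small sketch that computes them by hand. The scores are made up, the object names are mine, and I'm assuming the usual rule that a case is drawn as a dot when it lies more than one and a half times the interquartile range beyond the edge of the box.

```r
# Made-up hygiene scores for illustration only
scores <- c(0.2, 0.9, 1.3, 1.5, 1.7, 1.8, 2.0, 2.2, 2.5, 3.9)

# Lower quartile, median and upper quartile: the box edges and the middle line
quartiles <- unname(quantile(scores, probs = c(0.25, 0.50, 0.75)))
quartiles

# Interquartile range: the height of the box
boxHeight <- IQR(scores)

# Fences: scores beyond these limits are plotted as dots rather than
# being included in the whiskers
lowerFence <- quartiles[1] - 1.5 * boxHeight
upperFence <- quartiles[3] + 1.5 * boxHeight
flagged <- scores[scores < lowerFence | scores > upperFence]
flagged
```

In this invented example only the score of 3.9 falls outside the fences, so it is the only case that would appear as a dot.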
SELF-TEST 

✓ Produce boxplots for the day 2 and day 3 hygiene 
scores and interpret them. 


4.8. Density plots 


Density plots are rather similar to histograms except that they smooth the distribution into 
a line (rather than bars). We can produce a density plot in exactly the same way as a 
histogram, except using the density geom: geom_density(). Assuming you have removed the 
outlier for the festival data set,¹⁰ initiate the plot (which I have called density) in the same 
way as for the histogram: 

density <- ggplot(festivalData, aes(day1)) 


¹⁰ If you haven’t, there is a data file with it removed and you can load this into a dataframe called festivalData by 
executing: 

festivalData <- read.delim("DownloadFestival(No Outlier).dat", header = TRUE) 
Then, to get the plot simply add the geom_density() function: 

density + geom_density() 

We can also add some labels by including: 

+ labs(x = "Hygiene (Day 1 of Festival)", y = "Density Estimate") 
in the command. The resulting plot is shown in Figure 4.21. 
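
Putting those pieces together, the whole thing might look like the sketch below. It runs on simulated scores rather than the real festival data (fakeFestival and densityPlot are names I've made up), so the shape of the curve will not match Figure 4.21.

```r
library(ggplot2)

# Simulated stand-in for the corrected day 1 scores (NOT the real data)
set.seed(2)
fakeFestival <- data.frame(
    day1 = pmin(pmax(rnorm(200, mean = 1.8, sd = 0.7), 0), 4)
)

# Plot object, density geom and axis labels in one go
density <- ggplot(fakeFestival, aes(day1))
densityPlot <- density + geom_density() +
    labs(x = "Hygiene (Day 1 of Festival)", y = "Density Estimate")
densityPlot
```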



FIGURE 4.21 

A density plot of the Download Festival data 


4.9. Graphing means 

4.9.1. Bar charts and error bars 


Bar charts are a common way for people to display means. The ggplot2 package does not 
differentiate between research designs, so you plot bar charts in the same way regardless of 
whether you have an independent, repeated-measures or mixed design. Imagine that a film 
company director was interested in whether there was really such a thing as a ‘chick flick’ 
(a film that typically appeals to women more than men). He took 20 men and 20 women 
and showed half of each sample a film that was supposed to be a ‘chick flick’ (Bridget 
Jones’s Diary), and the other half of each sample a film that didn’t fall into the category 
of ‘chick flick’ (Memento, a brilliant film by the way). In all cases he measured their 
physiological arousal as an indicator of how much they enjoyed the film. The data are in a file 
called ChickFlick.dat on the companion website. Load this file into a dataframe called 
chickFlick by executing this command (I’m assuming you have set the working directory to 
be where the data file is stored): 



chickFlick <- read.delim("ChickFlick.dat", header = TRUE) 
Figure 4.22 shows the data. Note there are three variables: 

• gender: specifies the gender of the participant as text. 

• film: specifies the film watched as text. 

• arousal: is their arousal score. 

Each row in the data file represents a different person. 


FIGURE 4.22 

The ChickFlick.dat data 
4.9.1.1. Bar charts for one independent variable 


To begin with, let’s just plot the mean arousal score (y-axis) for each film (x-axis). We can 
set this up by first creating the plot object and defining any aesthetics that apply to the plot 
as a whole. I have called the object bar, and have created it using the ggplot() function. The 
function specifies the dataframe to be used (chickFlick) and has set film to be plotted on 
the x-axis, and arousal to be plotted on the y-axis: 

bar <- ggplot(chickFlick, aes(film, arousal)) 

This is where things get a little bit tricky; because we want to plot a summary of the data 
(the mean) rather than the raw scores themselves, we have to use a stat (section 4.4.5) to 
do this for us. Actually, we already used a stat when we plotted the boxplot in an earlier 
section, but we didn’t notice because the boxplot geom sneaks off when we’re not looking 
and uses the boxplot stat without us having to really do anything. However, if we want means 
then we have no choice but to dive head first into the pit of razors that is a stat. Specifically 
we are going to use stat_summary(). 

The stat_summary() function takes the following general form: 

stat_summary(function = x, geom = y) 

Functions can be specified either for individual points (fun.y) or for the data as a whole 
(fun.data) and are set to be common statistical functions such as ‘mean’, ‘median’ and so 
on. As you might expect, the geom option is a way of telling the stat which geom to use to 
represent the function, and this can take on values such as ‘errorbar’, ‘bar’ and ‘pointrange’ 
(see Table 4.3). The stat_summary() function takes advantage of several built-in functions 
Table 4.4 Using stat_summary() to create graphs 

Option                        Plots                                          Common geom 

fun.y = mean                  The mean                                       geom = "bar" 
fun.y = median                The median                                     geom = "bar" 
fun.data = mean_cl_normal()   95% confidence intervals assuming normality    geom = "errorbar" 
                                                                             geom = "pointrange" 
fun.data = mean_cl_boot()     95% confidence intervals based on a bootstrap  geom = "errorbar" 
                              (i.e., not assuming normality)                 geom = "pointrange" 
fun.data = mean_sdl()         Sample mean and standard deviation             geom = "errorbar" 
                                                                             geom = "pointrange" 
fun.data = median_hilow()     Median and upper and lower quantiles           geom = "pointrange" 

from the Hmisc package, which should automatically be installed. Table 4.4 summarizes 
these functions and how they are specified within the stat_summary() function. 

If we want to add the mean, displayed as bars, we can simply add this as a layer to ‘bar’ 
using the stat_summary() function: 

bar + stat_summary(fun.y = mean, geom = "bar", fill = "White", colour = "Black") 

As shown in Table 4.4, fun.y = mean computes the mean for us, geom = "bar" 
displays these values as bars, fill = "White" makes the bars white (the default is 
dark grey and you can replace with a different colour if you like), and colour = 
"Black" makes the outline of the bars black. 

If we want to add error bars to create an error bar chart, we can again add these 
as a layer using stat_summary(): 

+ stat_summary(fun.data = mean_cl_normal, geom = "pointrange") 

This command adds a standard 95% confidence interval in the form of the pointrange 
geom. Again, if you like you could change the colour of the pointrange geom by setting its 
colour as described in Table 4.2. 

Finally, let’s add some nice labels to the graph using labs(): 

+ labs(x = "Film", y = "Mean Arousal") 

To sum up, if we put all of these commands together we can create the graph by executing 
the following command: 

bar + stat_summary(fun.y = mean, geom = "bar", fill = "White", colour = 
"Black") + stat_summary(fun.data = mean_cl_normal, geom = "pointrange") + 
labs(x = "Film", y = "Mean Arousal") 

Figure 4.23 shows the resulting bar chart. This graph displays the means (and the 95% 
confidence interval of those means) and shows us that on average, people were more aroused 
by Memento than they were by Bridget Jones’s Diary. However, we originally wanted to 
look for gender effects, so we need to add this variable into the mix. 
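
A note for anyone on a newer installation: from ggplot2 version 3.3.0 onwards the fun.y option is called fun. The sketch below redraws the simple bar chart in that newer style. Everything in it is an assumption made for illustration: fakeChick is simulated (it is not ChickFlick.dat), and meanCI() is a little helper I've invented so that the example doesn't depend on Hmisc; it computes the mean and a 95% confidence interval assuming normality.

```r
library(ggplot2)

# Simulated stand-in for ChickFlick.dat (NOT the real data): 40 people,
# 10 per combination of gender and film
set.seed(3)
fakeChick <- data.frame(
    gender = rep(c("Male", "Female"), each = 20),
    film = rep(rep(c("Bridget Jones's Diary", "Memento"), each = 10), times = 2),
    arousal = round(rnorm(40, mean = 20, sd = 6))
)

# Hypothetical helper: mean and 95% confidence interval assuming normality,
# returned in the y/ymin/ymax layout that fun.data expects
meanCI <- function(scores) {
    halfWidth <- qt(0.975, df = length(scores) - 1) * sd(scores) / sqrt(length(scores))
    data.frame(y = mean(scores),
               ymin = mean(scores) - halfWidth,
               ymax = mean(scores) + halfWidth)
}

bar <- ggplot(fakeChick, aes(film, arousal))
barPlot <- bar +
    stat_summary(fun = mean, geom = "bar", fill = "white", colour = "black") +
    stat_summary(fun.data = meanCI, geom = "pointrange") +
    labs(x = "Film", y = "Mean Arousal")
barPlot
```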


SELF-TEST 

✓ Change the geom for the error bar to ‘errorbar’ and 
change its colour to red. Replot the graph. 
✓ Plot the graph again but with bootstrapped confidence 
intervals. 
FIGURE 4.23 

Bar chart of the mean arousal for each of the two films 

[Axes: Film (Bridget Jones’s Diary, Memento) on the x-axis; Mean Arousal, from 0 to 35, on the y-axis] 

4.9.1.2. Bar charts for several independent variables 

If we want to factor in gender we could do this in several ways. First we could set an 
aesthetic (such as colour) to represent the different genders, but we could also use faceting to 
create separate plots for men and women. We could also do both. Let’s first look at 
separating men and women on the same graph. This takes a bit of work, but if we build up the 
code bit by bit the process should become clear. 

First, as always we set up our plot object (again I’ve called it bar). This command is the 
same as before, except that we have set the fill aesthetic to be the variable gender. This 
means that any geom specified subsequently will be filled with different colours for men 
and women. 

bar <- ggplot(chickFlick, aes(film, arousal, fill = gender)) 

If we want to add the mean, displayed as bars, we can simply add this as a layer to bar 
using the stat_summary() function as we did before, but with one important difference: we 
have to specify position = "dodge" (see section 4.4.6) so that the male and female bars are 
forced to stand side-by-side, rather than behind each other. 

bar + stat_summary(fun.y = mean, geom = "bar", position="dodge") 

As before, fun.y = mean computes the mean for us, geom = "bar" displays these values as 
bars. 

If we want to add error bars we can again add these as a layer using stat_summary(): 

+ stat_summary(fun.data = mean_cl_normal, geom = "errorbar", position = 
position_dodge(width = 0.90), width = 0.2) 

This command is a bit more complicated than before. Note we have changed the geom to 
errorbar; by default these bars will be as wide as the bars displaying the mean, which looks 
a bit nasty, so I have reduced their width with width = 0.2, which should make them 20% 
of the width of the bar (which looks nice in my opinion). The other part of the command 
is that we have again had to use the dodge position to make sure that the error bars stand 
side-by-side. In this case position = position_dodge(width = 0.90) does the trick, but you 
might have to play around with the values of width to get what you want. 
Finally, let’s add some nice labels to the graph using labs(): 

+ labs(x = "Film", y = "Mean Arousal", fill = "Gender") 

Notice that as well as specifying titles for each axis, I have specified a title for fill. This will 
give a title to the legend on the graph (if we omit this option the legend will be given the 
variable name as a title, which might be OK for you if you are less anally retentive than I am). 

To sum up, if we put all of these commands together we can create the graph by executing 
the following command: 

bar + stat_summary(fun.y = mean, geom = "bar", position = "dodge") + 
stat_summary(fun.data = mean_cl_normal, geom = "errorbar", 
position = position_dodge(width = 0.90), width = 0.2) + 
labs(x = "Film", y = "Mean Arousal", fill = "Gender") 
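
Here is the same idea as a self-contained sketch that you can run without the data file. As before, treat it as an illustration under my own assumptions: fakeChick is simulated, meanCI() is a made-up helper standing in for mean_cl_normal, and fun is the newer name for fun.y (ggplot2 3.3.0 and later).

```r
library(ggplot2)

# Simulated stand-in for ChickFlick.dat (NOT the real data)
set.seed(3)
fakeChick <- data.frame(
    gender = rep(c("Male", "Female"), each = 20),
    film = rep(rep(c("Bridget Jones's Diary", "Memento"), each = 10), times = 2),
    arousal = round(rnorm(40, mean = 20, sd = 6))
)

# Hypothetical helper: mean and normal-theory 95% confidence interval
meanCI <- function(scores) {
    halfWidth <- qt(0.975, df = length(scores) - 1) * sd(scores) / sqrt(length(scores))
    data.frame(y = mean(scores),
               ymin = mean(scores) - halfWidth,
               ymax = mean(scores) + halfWidth)
}

# fill = gender gives one bar per gender; the dodge puts them side by side
bar <- ggplot(fakeChick, aes(film, arousal, fill = gender))
dodgedPlot <- bar +
    stat_summary(fun = mean, geom = "bar", position = "dodge") +
    stat_summary(fun.data = meanCI, geom = "errorbar",
                 position = position_dodge(width = 0.90), width = 0.2) +
    labs(x = "Film", y = "Mean Arousal", fill = "Gender")
dodgedPlot
```

If the error bars don't sit in the middle of their bars, the width inside position_dodge() is the value to play with.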
FIGURE 4.24 

Bar chart of the mean arousal for each of the two films 


Figure 4.24 shows the resulting bar chart. It looks pretty good, I think. It is possible to 
customise the colours that are used to fill the bars also (see R’s Souls’ Tip 4.3). Like the 
simple bar chart, this graph tells us that arousal was overall higher for Memento than for 
Bridget Jones’s Diary, but it also splits this information by gender. The mean arousal for 
Bridget Jones’s Diary shows that males were actually more aroused during this film than 
females. This indicates they enjoyed the film more than the women did. Contrast this with 
Memento, for which arousal levels are comparable in males and females. On the face of it, 
this contradicts the idea of a ‘chick flick’: it actually seems that men enjoy chick flicks more 
than the so-called ‘chicks’ do (probably because it’s the only help we get to understand the 
complex workings of the female mind). 

The second way to express gender would be to use this variable as a facet so that we 
display different plots for males and females: 

bar <- ggplot(chickFlick, aes(film, arousal, fill = film)) 

Executing the above command sets up the graph in the same way as before. Note, however, 
that we do not need to use ‘fill = gender’ because we do not want to vary the colour by 
gender. (You can omit the fill command altogether, but I have set it so that the bars 
representing the different films are filled with different colours.) We set up the bar in the same 
way as before, except that we do not need to set the position to dodge because we are no 
longer plotting different bars for men and women on the same graph: 

bar + stat_summary(fun.y = mean, geom = "bar") 

We set up the error bar in the same way as before, except again we don’t need to include 
a dodge: 

+ stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width =0.2) 

To get different plots for men and women we use the facet option and specify gender as the 
variable by which to facet: 

+ facet_wrap( ~ gender) 

We add labels as we did before: 

+ labs(x = "Film", y = "Mean Arousal") 

I’ve added an option to get rid of the graph legend as well (see R’s Souls’ Tip 4.2). I’ve 
included this option because we specified different colours for the different films so ggplot 
will create a legend; however, the labels on the x-axis will tell us to which film each bar 
relates so we don’t need a colour legend as well: 

+ opts(legend.position = "none") 

The resulting graph is shown in Figure 4.25; compare this with Figure 4.24 and note how 
using gender as a facet rather than an aesthetic results in different panels for men and 
women. The graphs show the same pattern of results though: men and women differ little 
in responses to Memento, but men showed more arousal to Bridget Jones’s Diary. 
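
For completeness, here is the faceted version assembled into one runnable sketch. The usual warnings apply: fakeChick is simulated rather than the real data, meanCI() is a helper I've invented in place of mean_cl_normal, and I've used the newer fun and theme() spellings that recent versions of ggplot2 expect.

```r
library(ggplot2)

# Simulated stand-in for ChickFlick.dat (NOT the real data)
set.seed(3)
fakeChick <- data.frame(
    gender = rep(c("Male", "Female"), each = 20),
    film = rep(rep(c("Bridget Jones's Diary", "Memento"), each = 10), times = 2),
    arousal = round(rnorm(40, mean = 20, sd = 6))
)

# Hypothetical helper: mean and normal-theory 95% confidence interval
meanCI <- function(scores) {
    halfWidth <- qt(0.975, df = length(scores) - 1) * sd(scores) / sqrt(length(scores))
    data.frame(y = mean(scores),
               ymin = mean(scores) - halfWidth,
               ymax = mean(scores) + halfWidth)
}

# Colour by film, then split into one panel per gender
bar <- ggplot(fakeChick, aes(film, arousal, fill = film))
facetPlot <- bar +
    stat_summary(fun = mean, geom = "bar") +
    stat_summary(fun.data = meanCI, geom = "errorbar", width = 0.2) +
    facet_wrap(~ gender) +
    labs(x = "Film", y = "Mean Arousal") +
    theme(legend.position = "none")
facetPlot
```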


FIGURE 4.25 

The mean arousal (and 95% confidence interval) for two different films displayed as 
different graphs for men and women using facet_wrap() 
R’s Souls’ Tip 4.3: Custom colours 


If you want to override the default fill colours, you can do this using the scale_fill_manual() function. For our chick 
flick data, for example, if we wanted blue bars for females and green for males then we can add the following 
command: 


+ scale_fill_manual("Gender", c("Female" = "Blue", "Male" = "Green")) 

Alternatively, you can use very specific colours by specifying colours using the RRGGBB system. For example, 
the following produces very specifically coloured blue and green bars: 

+ scale_fill_manual("Gender", c("Female" = "#3366FF", "Male" = "#336633")) 


Try adding these commands to the end of the command we used to generate Figure 4.24 and see the effect it 
has on the bar colours. Then experiment with other colours. 
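
If you try this in a recent version of ggplot2, I'd suggest naming the vector of colours explicitly with values =, because that is the argument in which scale_fill_manual() expects to find them. A minimal sketch on simulated data (fakeChick and colouredPlot are invented names, and fun is the newer spelling of fun.y):

```r
library(ggplot2)

# Simulated stand-in for ChickFlick.dat (NOT the real data)
set.seed(3)
fakeChick <- data.frame(
    gender = rep(c("Male", "Female"), each = 20),
    film = rep(rep(c("Bridget Jones's Diary", "Memento"), each = 10), times = 2),
    arousal = round(rnorm(40, mean = 20, sd = 6))
)

bar <- ggplot(fakeChick, aes(film, arousal, fill = gender))

# Named vector: each level of gender is matched to its own colour
colouredPlot <- bar +
    stat_summary(fun = mean, geom = "bar", position = "dodge") +
    scale_fill_manual("Gender", values = c("Female" = "blue", "Male" = "green"))
colouredPlot
```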


4.9.2. Line graphs 


4.9.2.1. Line graphs of a single independent variable 


Hiccups can be a serious problem: Charles Osborne apparently got a case of 
hiccups while slaughtering a hog (well, who wouldn’t?) that lasted 67 years. 
People have many methods for stopping hiccups (a surprise, holding your 
breath), but actually medical science has put its collective mind to the task too. 
The official treatment methods include tongue-pulling manoeuvres, massage 
of the carotid artery, and, believe it or not, digital rectal massage (Fesmire, 
1988). I don’t know the details of what the digital rectal massage involved, 
but I can probably imagine. Let’s say we wanted to put digital rectal massage 
to the test (as a cure for hiccups, I mean). We took 15 hiccup sufferers, and 
during a bout of hiccups administered each of the three procedures (in random 
order and at intervals of 5 minutes) after taking a baseline of how many 
hiccups they had per minute. We counted the number of hiccups in the minute after 
each procedure. Load the file Hiccups.dat from the companion website into a dataframe 
called hiccupsData by executing (again assuming you have set your working directory to 
be where the file is located): 

hiccupsData <- read.delim("Hiccups.dat", header = TRUE) 

Figure 4.26 shows the data. Note there are four variables: 




• Baseline: specifies the number of hiccups at baseline. 

• Tongue: specifies the number of hiccups after tongue pulling. 

• Carotid: specifies the number of hiccups after carotid artery massage. 

• Rectum: specifies the number of hiccups after digital rectal massage. 
FIGURE 4.26 

The Hiccups.dat data 

Each row in the data file represents a different person, so these data are laid out as a 
repeated-measures design, with each column representing a different treatment condition 
and every person undergoing each treatment. 

These data are in the wrong format for ggplot2 to use. We need all of the scores stacked 
up in a single column and then another variable that specifies the type of intervention. 



SELF-TEST 

✓ Thinking back to Chapter 3, use the stack() function to 
restructure the data into long format. 


We can rearrange the data as follows (see section 3.9.4): 

hiccups <- stack(hiccupsData) 

names(hiccups) <- c("Hiccups", "Intervention") 

Executing these commands creates a new dataframe called hiccups, which has the number 
of hiccups in one column alongside a new variable containing the original variable name 
associated with each score (i.e., the column headings) in the other column (Figure 4.27). 
The names() function just assigns names to these new variables in the order that they 
appear in the dataframe. To plot a categorical variable in ggplot() it needs to be recognized 
as a factor, so we also need to create a new variable in the hiccups dataframe called 
Intervention_Factor, which is just the Intervention variable converted into a factor: 

hiccups$Intervention_Factor <- factor(hiccups$Intervention, levels = 
unique(hiccups$Intervention)) 
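
To see what these commands do without loading the real file, you can try them on a tiny made-up dataframe. In the sketch below wideHiccups holds invented scores for just three people, and longHiccups is my name for the restructured version.

```r
# Made-up wide-format data for three people (NOT the real Hiccups.dat)
wideHiccups <- data.frame(Baseline = c(15, 13, 9),
                          Tongue = c(9, 18, 17),
                          Carotid = c(7, 7, 5),
                          Rectum = c(2, 4, 4))

# stack() puts every score in one column and records the column that
# each score came from in a second column
longHiccups <- stack(wideHiccups)
names(longHiccups) <- c("Hiccups", "Intervention")

# Levels in order of first appearance, so the x-axis follows the column order
longHiccups$Intervention_Factor <- factor(
    longHiccups$Intervention,
    levels = unique(as.character(longHiccups$Intervention))
)

nrow(longHiccups)                        # 12 rows: 3 people x 4 conditions
levels(longHiccups$Intervention_Factor)  # Baseline, Tongue, Carotid, Rectum
```

Setting the levels in order of appearance matters because ggplot2 draws the categories of a factor in the order of its levels.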

We are now ready to plot the graph. As always we first create the plot object and define 
the variables that we want to plot as aesthetics: 

line <- ggplot(hiccups, aes(Intervention_Factor, Hiccups)) 

I have called the object line, and have created it using the ggplot() function. The function 
specifies the dataframe to be used (hiccups) and has set Intervention_Factor to be plotted 
on the x-axis, and Hiccups to be plotted on the y-axis. 
FIGURE 4.27 

The hiccups data in long format 


Just as we did for our bar charts, we are going to use stat_summary() to create the mean 
values within each treatment condition. Therefore, as with the bar chart, we create a layer 
using stat_summary() and add this to the plot: 

line + stat_summary(fun.y = mean, geom = "point") 

Note that this command is exactly the same as for a bar chart, except that we have 
chosen the point geom rather than a bar. At the moment we have a plot with a symbol 
representing each group mean. If we want to connect these symbols with a line then we use 
stat_summary() again, we again specify fun.y to be the mean, but this time choose the line 
geom. To make the line display we also need to set an aesthetic of group = 1; this is because 
we are joining summary points (i.e., points that summarize a group) rather than individual 
data points. Therefore, we specify the line as: 

+ stat_summary(fun.y = mean, geom = "line", aes(group = 1)) 

The above command will add a solid black line connecting the group means. Let’s imagine 
we want this line to be blue, rather than black, and dashed rather than solid. We can simply 
add these aesthetics into the above command as follows: 

+ stat_summary(fun.y = mean, geom = "line", aes(group = 1), colour = "Blue", 
linetype = "dashed") 

Now let’s add an error bar to each group mean. We can do this by adding another layer 
using stat_summary(). When we plotted an error bar on the bar chart we used a normal 
error bar, so this time let’s add an error bar based on bootstrapping. We set the function 
for the data to be mean_cl_boot (fun.data = mean_cl_boot) - see Table 4.4 - and set the 
geom to be errorbar (you could use pointrange as we did for the bar chart if you prefer): 

+ stat_summary(fun.data = mean_cl_boot, geom = "errorbar") 
The default error bars are quite wide, so I recommend setting the width parameter to 0.2 
to make them look nicer: 

+ stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.2) 
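
To get a feel for what a bootstrapped error bar represents, here is a minimal base-R sketch of a percentile bootstrap confidence interval for a mean. It is not the code that mean_cl_boot runs internally (that function wraps a routine from the Hmisc package); the helper name boot_ci_mean, the invented scores and the choice of 1000 resamples are all assumptions made for illustration.

```r
# Percentile bootstrap CI for a mean (illustrative sketch, base R only).
# boot_ci_mean is a hypothetical helper, not part of ggplot2 or Hmisc.
boot_ci_mean <- function(x, n_boot = 1000, conf = 0.95) {
  x <- x[!is.na(x)]
  # Resample the scores with replacement and record the mean of each resample
  boot_means <- replicate(
    n_boot,
    mean(x[sample.int(length(x), replace = TRUE)])
  )
  alpha <- 1 - conf
  limits <- quantile(boot_means, c(alpha / 2, 1 - alpha / 2), names = FALSE)
  # Same three values an error bar needs: centre, lower limit, upper limit
  c(y = mean(x), ymin = limits[1], ymax = limits[2])
}

set.seed(42)
scores <- c(15, 13, 9, 7, 11, 14, 20, 9, 17, 19)  # invented data
ci <- boot_ci_mean(scores)
print(ci)
```

Because the limits come from random resampling, they will change slightly from run to run unless you fix the seed, which is also why two bootstrapped plots of the same data rarely have identical error bars.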

You can, of course, also change the colour and other properties of the error bar in the usual 
way (e.g., by adding colour = "Red" to make them red). Finally, we will add some labels to 
the x- and y-axes using the labs() function: 

+ labs(x = "Intervention", y = "Mean Number of Hiccups") 

If we put all of these commands together, we can create the graph by executing the 
following command: 

line <- ggplot(hiccups, aes(Intervention_Factor, Hiccups)) 

line + stat_summary(fun.y = mean, geom = "point") + 
stat_summary(fun.y = mean, geom = "line", aes(group = 1), colour = "Blue", linetype = "dashed") + 
stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.2) + 
labs(x = "Intervention", y = "Mean Number of Hiccups") 
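
The commands in this section use the argument names from the ggplot2 release that was current when this was written. In more recent releases (3.3.0 onwards) stat_summary() expects fun rather than fun.y. The sketch below rebuilds the same layers under that assumption. The data values are invented purely so that the code runs on its own, and mean_cl_boot still needs the Hmisc package to be installed when the plot is actually drawn.

```r
library(ggplot2)

# Invented long-format data standing in for the hiccups dataframe
interventions <- c("Baseline", "Tongue", "Carotid", "Rectum")
hiccups <- data.frame(
  Intervention = rep(interventions, each = 5),
  Hiccups = c(15, 13, 9, 7, 11,
              14, 12, 10, 8, 12,
              8, 7, 6, 9, 5,
              3, 2, 4, 1, 3)
)

# Keep the levels in order of first appearance rather than alphabetical order
hiccups$Intervention_Factor <- factor(hiccups$Intervention,
                                      levels = unique(hiccups$Intervention))

line <- ggplot(hiccups, aes(Intervention_Factor, Hiccups))

# Each line ends with + so that R knows the command continues
hiccups_plot <- line +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun = mean, geom = "line", aes(group = 1),
               colour = "blue", linetype = "dashed") +
  stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.2) +
  labs(x = "Intervention", y = "Mean Number of Hiccups")

# print(hiccups_plot)  # uncomment to draw the chart (requires Hmisc)
```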


FIGURE 4.28 Line chart with error bars of the mean number of hiccups at baseline and after various interventions 
The resulting graph in Figure 4.28 displays the mean number of hiccups at baseline and 
after the three interventions (and the confidence intervals of those means based on 
bootstrapping). As we will see in Chapter 9, the error bars on graphs of repeated-measures 
designs aren’t corrected for the fact that the data points are dependent; I don’t want to get 
into the reasons why here because I want to keep things simple, but if you’re doing a graph 
of your own data then I would read section 9.2 before you do. 

We can conclude that the amount of hiccups after tongue pulling was about the same as 
at baseline; however, carotid artery massage reduced hiccups, but not by as much as a good 
old fashioned digital rectal massage. The moral here is: if you have hiccups, find something 
digital and go amuse yourself for a few minutes. 
4.9.2. Line graphs for several independent variables © 


We all like to text-message (especially students in my lectures who feel the need to text- 
message the person next to them to say ‘Bloody hell, this guy is so boring I need to poke 
out my own eyes’). What will happen to the children, though? Not only will they develop 
super-sized thumbs, they might not learn correct written English. Imagine we conducted 
an experiment in which a group of 25 children was encouraged to send text messages on 
their mobile phones over a six-month period. A second group of 25 children was forbidden 
from sending text messages for the same period. To ensure that kids in this latter group 
didn’t use their phones, this group was given armbands that administered painful shocks in 
the presence of radio waves (like those emitted from phones). 11 The outcome was a score 
on a grammatical test (as a percentage) that was measured both before and after the 
intervention. The first independent variable was, therefore, text message use (text messagers 
versus controls) and the second independent variable was the time at which grammatical 
ability was assessed (baseline or after 6 months). The data are in the file TextMessages.dat. 

Load this file into a dataframe called textData by executing this command (I’m assuming 
you have set the working directory to be where the data file is stored): 



textData <- read.delim("TextMessages.dat", header = TRUE) 


Figure 4.29 shows the data. Note there are three variables: 


• Group: specifies whether they were in the text message group or the control group. 

• Baseline: grammar scores at baseline. 

• Six_months: grammar scores after 6 months. 


Each row in the data file represents a different person. These data are again in the wrong 
format for ggplot2. Instead of the current wide format, we need the data in long (i.e., 
molten) format (see section 3.9.4). This format will have the following variables: 

• Group: specifies whether they were in the text message group or the control group. 

• Time: specifies whether the score relates to baseline or 6 months. 

• Grammar_Score: the grammar scores. 



SELF-TEST 

Restructure the data to a new dataframe called 
textMessages that is in long format. Use the factor() 
function (see section 3.5.4.3) to convert the ‘Time’ 
variable to a factor with levels called ‘Baseline’ and ‘6 
Months’. 
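
If you get stuck, one possible route is sketched here using only base R, rather than whichever restructuring function you used in Chapter 3. The four rows of scores are invented so that the example is self-contained; only the variable names come from the description above.

```r
# Invented wide-format data with the same three variables as textData
textData <- data.frame(
  Group = c("Text Messagers", "Text Messagers", "Controls", "Controls"),
  Baseline = c(52, 68, 65, 57),
  Six_months = c(32, 48, 62, 61)
)

# Stack the two score columns on top of each other and record which
# column each score came from in a new Time variable
textMessages <- data.frame(
  Group = rep(textData$Group, times = 2),
  Time = rep(c("Baseline", "6 Months"), each = nrow(textData)),
  Grammar_Score = c(textData$Baseline, textData$Six_months)
)

# Convert Time to a factor, fixing the order of the levels so that
# Baseline is plotted before 6 Months
textMessages$Time <- factor(textMessages$Time,
                            levels = c("Baseline", "6 Months"))

print(textMessages)
```

Each person now contributes two rows (one per time point), which is what makes the format "long".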


11 Although this punished them for any attempts to use a mobile phone, because other people’s phones also emit 
microwaves, an unfortunate side effect was that these children acquired a pathological fear of anyone talking on 
a mobile phone. 

FIGURE 4.29 

Assuming that you have done the self-test, you should now have a dataframe called 
textMessages that is formatted correctly for ggplot2. As ever, we set up our plot object 
(I’ve called it line). This command is the same as before, except that we have set the ‘colour’ 
aesthetic to be the variable Group. This means that any geom specified subsequently will 
be drawn in different colours for text messagers and the control group. Note that we 
have specified the data to be the textMessages dataframe, and for Time to be plotted on the 
x-axis and Grammar_Score on the y-axis. 

line <- ggplot(textMessages, aes(Time, Grammar_Score, colour = Group)) 

If we want to add the means, displayed as symbols, we can add this as a layer to line using 
the stat_summary() function just as we did in the previous section: 

line + stat_summary(fun.y = mean, geom = "point") 

To add lines connecting the means we can add these as a layer using stat_summary() in 
exactly the same way as we did in the previous section. The main difference is that because 
in this example we have more than one group, rather than setting aes(group = 1) as we did 
before, we now set this aesthetic to be the variable (Group) that differentiates the different 
sets of means (aes(group = Group)): 

+ stat_summary(fun.y = mean, geom = "line", aes(group = Group)) 

We can also add a layer containing error bars and a layer containing labels using the same 
commands as the previous example: 

+ stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.2) + 
labs(x = "Time", y = "Mean Grammar Score", colour = "Group") 

If we put all of these commands together we can create the graph by executing the 
following command: 

line + stat_summary(fun.y = mean, geom = "point") + 
stat_summary(fun.y = mean, geom = "line", aes(group = Group)) + 
stat_summary(fun.data = mean_cl_boot, geom = "errorbar", width = 0.2) + 
labs(x = "Time", y = "Mean Grammar Score", colour = "Group") 
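
It can help to see the numbers that stat_summary() is joining up. The base-R sketch below computes the mean for every combination of Group and Time, using invented scores in the long format described earlier. Each row of the resulting table corresponds to one line on the chart, which is why the group aesthetic has to be Group rather than 1.

```r
# Invented long-format data: two children per group, measured twice
textMessages <- data.frame(
  Group = rep(c("Text Messagers", "Text Messagers", "Controls", "Controls"),
              times = 2),
  Time = factor(rep(c("Baseline", "6 Months"), each = 4),
                levels = c("Baseline", "6 Months")),
  Grammar_Score = c(52, 68, 65, 57, 32, 48, 62, 61)
)

# Mean grammar score for every Group x Time cell
cell_means <- tapply(textMessages$Grammar_Score,
                     list(textMessages$Group, textMessages$Time),
                     mean)
print(cell_means)
```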
SELF-TEST 

Use what you have learnt to repeat the text message 
data plot but to also have different symbols for text 
messagers and controls and different types of lines. 


Figure 4.30 shows the resulting chart. It shows that at baseline (before the intervention) 
the grammar scores were comparable in our two groups; however, after the intervention, 
the grammar scores were lower in the text messagers than in the controls. Also, if you look 
at the dark blue line you can see that text messagers’ grammar scores have fallen over the 
6 months; compare this to the controls (the red line on your screen, or black in the figure) 
whose grammar scores are fairly similar over time. We could, therefore, conclude that text 
messaging has a detrimental effect on children’s understanding of English grammar and 
civilization will crumble, with Abaddon rising cackling from his bottomless pit to claim our 
wretched souls. Maybe. 
FIGURE 4.30 

4.10. Themes and options © 


I mentioned earlier that ggplot2 produces Tufte-friendly graphs. In fact, it has two built-in 
themes. The default is called theme_grey(), which follows Tufte’s advice in that it uses grid 
lines to ease interpretation but makes them have low visual impact so that they do not 
distract the eye from the data. The second theme is a more traditional black and white theme 
called theme_bw(). The two themes are shown in Figure 4.31. 

FIGURE 4.31 

As well as these global themes, the opts() function allows you to control the look of 
specific parts of the plot. For example, you can define a title and set the properties of that title 
(size, font, colour, etc.). You can also change the look of axes, grid lines, background panels 
and text. You apply theme and formatting instructions by adding a layer to the plot: 

myGraph + geom_point() + opts() 

Table 4.5 shows these themes, their aesthetic properties and the elements of the plot 
associated with them. The table makes clear that there are four types of theme that 
determine the appearance of text (theme_text), lines (theme_line), axes (theme_segment) 
and rectangles (theme_rect). 

Table 4.5 Summary of theme elements and their properties 

Theme            Properties    Elements            Element Description 
theme_text()     family        axis.text.x         Horizontal tick labels 
                 face          axis.text.y         Vertical tick labels 
                 colour        axis.title.x        x-axis label 
                 size          axis.title.y        y-axis label 
                 hjust         legend.text         Legend labels 
                 vjust         legend.title        Legend name 
                 angle         plot.title          Plot title 
                 lineheight    strip.text.x        Horizontal facet label text 
                               strip.text.y        Vertical facet label text 
theme_line()     colour        panel.grid.major    Major grid lines 
                 size          panel.grid.minor    Minor grid lines 
                 linetype 
theme_segment()  colour        axis.line           Line along an axis 
                 size          axis.ticks          Axis tick marks 
                 linetype 
theme_rect()     colour        legend.background   Background of legend 
                 size          legend.key          Background under legend key 
                 linetype      panel.background    Background of panel 
                 fill          panel.border        Border of panel 
                               plot.background     Background of the entire plot 
                               strip.background    Background of facet labels 

Each of these themes has properties that can be adjusted: for all of them you can adjust 
size and colour, for text you can also adjust things like the font family and angle, for 
rectangles you can change the fill colour, and so on. Different elements of a plot can be 
changed by adjusting the particular theme attached to that element. So, for example, if we 
wanted to change the colour of the major grid lines to blue, we would have to do this by 
setting the colour aesthetic of the panel.grid.major element using theme_line(). Aesthetic 
properties are set in the same way as described in section 4.4.3. 
Therefore, we would do this as follows: 

+ opts(panel.grid.major = theme_line(colour = "Blue")) 

Similarly, we could make the axes have blue lines with: 

+ opts(axis.line = theme_segment(colour = "Blue")) 
or dashed lines by using: 

+ opts(axis.line = theme_segment(linetype = 2)) 

The possibilities are endless, and I can’t explain them all without killing several more 
rainforests, but I hope that you get the general idea. 
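
A word of caution if you are running a recent installation: opts() and the theme_*() functions in Table 4.5 belong to the ggplot2 release that this chapter was written for. In later releases they were replaced: theme() takes the place of opts(), and element_text(), element_line() and element_rect() take the place of theme_text(), theme_line()/theme_segment() and theme_rect(). Here is a minimal sketch under that assumption; the dataframe demo_data is invented for the example.

```r
library(ggplot2)

# Invented data, just enough to have something to plot
demo_data <- data.frame(x = 1:5, y = c(2, 4, 3, 6, 5))
myGraph <- ggplot(demo_data, aes(x, y))

themed <- myGraph + geom_point() +
  theme(
    # blue major grid lines (formerly opts(panel.grid.major = theme_line(...)))
    panel.grid.major = element_line(colour = "blue"),
    # blue, dashed axis lines (formerly theme_segment(...))
    axis.line = element_line(colour = "blue", linetype = 2)
  )

# print(themed)  # uncomment to draw the plot
```

The element names (panel.grid.major, axis.line and so on) carried over unchanged, so the table is still a useful map of what can be adjusted.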




What have I discovered about statistics? © 


This chapter has looked at how to inspect your data using graphs. We’ve covered a lot 
of different graphs. We began by covering some general advice on how to draw graphs 
and we can sum that up as minimal is best: no pink, no 3-D effects, no pictures of Errol 
your pet ferret superimposed on the graph - oh, and did I mention no pink? We have 
looked at graphs that tell you about the distribution of your data (histograms, boxplots 
and density plots), that show summary statistics about your data (bar charts, error bar 
charts, line charts, drop-line charts) and that show relationships between variables 
(scatterplots). Throughout the chapter we looked at how we can edit graphs to make them 
look minimal (and of course to colour them pink, but we know better than to do that, 
don’t we?). 

We also discovered that I liked to explore as a child. I was constantly dragging my dad 
(or was it the other way around?) over piles of rocks along any beach we happened to 
be on. However, at this time I also started to explore great literature, although unlike 
my cleverer older brother who was reading Albert Einstein’s papers (well, Isaac Asimov) 
as an embryo, my literary preferences were more in keeping with my intellect, as we 
will see. 


R packages used in this chapter 


ggplot2 
R functions used in this chapter 


file.path() 

geom_boxplot() 

geom_density() 

geom_histogram() 

geom_line() 

geom_point() 

geom_smooth() 


ggplot() 

ggsave() 

labs() 

opts() 

qplot() 

stat_summary() 

Sys.getenv() 


Key terms that I’ve discovered 


Bar chart 

Boxplot (box-whisker plot) 

Chartjunk 

Density plot 

Error bar chart 


Line chart 
Outlier 

Regression line 
Scatterplot 


Smart Alex’s tasks 




• Task 1: Using the data from Chapter 3 (which you should have saved, but if you 
didn’t, re-enter it from Table 3.6), plot and interpret the following graphs: © 

o An error bar chart showing the mean number of friends for students and lecturers. 
o An error bar chart showing the mean alcohol consumption for students and lecturers. 
o An error line chart showing the mean income for students and lecturers. 
o An error line chart showing the mean neuroticism for students and lecturers. 
o A scatterplot with regression lines of alcohol consumption and neuroticism 
grouped by lecturer/student. 

• Task 2: Using the Infidelity data from Chapter 3 (see Smart Alex’s Task 3), plot a 
clustered error bar chart of the mean number of bullets used against self and partner 
for males and females. © 

Answers can be found on the companion website. 


Further reading 


Tufte, E. R. (2001). The visual display of quantitative information (2nd ed.). Cheshire, CT: Graphics 
Press. 

Wainer, H. (1984). How to display data badly. American Statistician, 38(2), 137-147. 

Wickham, H. (2009). ggplot2: Elegant graphics for data analysis. New York: Springer. 

Wilkinson, L. (2005). The grammar of graphics. New York: Springer-Verlag. 











Wright, D. B., & Williams, S. (2003). Producing bad results sections. The Psychologist, 16, 646-648. 
(This is a very accessible article on how to present data. Dan usually has this article on his website 
so Google Dan Wright to find where his web pages are located.) 


Web resources: 


http://junkcharts.typepad.com/ is an amusing look at bad graphs. 
http://had.co.nz/ggplot2/ is the official ggplot2 website (and very useful it is, too). 


Interesting real research 


Fesmire, F. M. (1988). Termination of intractable hiccups with digital rectal massage. Annals of 
Emergency Medicine, 17(8), 872. 





Exploring assumptions 



FIGURE 5.1 
5.1. What will this chapter tell me? © 


When we were learning to read at primary school, we used to read versions of stories by 
the famous storyteller Hans Christian Andersen. One of my favourites was the story of 
the ugly duckling. This duckling was a big ugly grey bird, so ugly that even a dog would 
not bite him. The poor duckling was ridiculed, ostracized and pecked by the other ducks. 
Eventually, it became too much for him and he flew to the swans, the royal birds, hoping 
that they would end his misery by killing him because he was so ugly. As he stared into the 
water, though, he saw not an ugly grey bird but a beautiful swan. Data are much the same. 
Sometimes they’re just big, grey and ugly and don’t do any of the things that they’re 
supposed to do. When we get data like these, we swear at them, curse them, peck them and 
hope that they’ll fly away and be killed by the swans. Alternatively, we can try to force our 
data into becoming beautiful swans. That’s what this chapter is all about: assessing how 
much of an ugly duckling of a data set you have, and discovering how to turn it into a swan. 
Remember, though, a swan can break your arm. 1 


5.2. What are assumptions? © 


Some academics tend to regard assumptions as rather tedious things about which 
no one really need worry. When I mention statistical assumptions to my fellow 
psychologists they tend to give me that raised eyebrow, ‘good grief, get a life’ 
look and then ignore me. However, there are good reasons for taking assumptions 
seriously. Imagine that I go over to a friend’s house, the lights are on and 
it’s obvious that someone is at home. I ring the doorbell and no one answers. 
From that experience, I conclude that my friend hates me and that I am a terrible, 
unlovable person. How tenable is this conclusion? Well, there is a reality 
that I am trying to tap (i.e., whether my friend likes or hates me), and I have 
collected data about that reality (I’ve gone to his house, seen that he’s at home, 
rung the doorbell and got no response). Imagine that in reality my friend likes me (he’s a 
lousy judge of character); in this scenario, my conclusion is false. Why have my data led me 
to the wrong conclusion? The answer is simple: I had assumed that my friend’s doorbell 
was working and under this assumption the conclusion that I made from my data was accurate 
(my friend heard the bell but chose to ignore it because he hates me). However, this 
assumption was not true - his doorbell was not working, which is why he didn’t answer 
the door - and as a consequence the conclusion I drew about reality was completely false. 
It pays to check assumptions and your doorbell batteries. 

Enough about doorbells, friends and my social life: the point to remember is that when 
assumptions are broken we stop being able to draw accurate conclusions about reality. 
Different statistical models assume different things, and if these models are going to reflect 
reality accurately then these assumptions need to be true. This chapter is going to deal with 
some particularly ubiquitous assumptions so that you know how to slay these particular 
beasts as we battle our way through the rest of the book. However, be warned: some tests 
have their own unique two-headed, fire-breathing, green-scaled assumptions and these will 
jump out from behind a mound of blood-soaked moss and try to eat us alive when we least 
expect them to. Onward into battle ... 



5.3. Assumptions of parametric data © 


Many of the statistical procedures described in this book are parametric tests based on the 
normal distribution (which is described in section 1.7.4). A parametric test is one that 
requires data from one of the large catalogue of distributions that statisticians have 
described, and for data to be parametric certain assumptions must be true. If you use a 
parametric test when your data are not parametric then the results are likely to be 
inaccurate. Therefore, it is very important that you check the assumptions before deciding 
which statistical test is appropriate. Throughout this book you will become aware of my 
obsession with assumptions and checking them. 

1 Although it is theoretically possible, apparently you’d have to be weak boned, and swans are nice and wouldn’t 
do that sort of thing. 

Most parametric tests based on the normal distribution have four basic assumptions that 
must be met for the test to be accurate. Many students find checking assumptions a pretty 
tedious affair, and often get confused about how to tell whether or not an assumption has 
been met. Therefore, this chapter is designed to take you on a step-by-step tour of the 
world of parametric assumptions. Now, you may think that assumptions are not very exciting, 
but they can have great benefits: for one thing, you can impress your supervisor/ 
lecturer by spotting all of the test assumptions that they have violated throughout their 
careers. You can then rubbish, on statistical grounds, the theories they have spent their 
lifetime developing - and they can’t argue with you, 2 but they can poke your eyes out. The 
assumptions of parametric tests are: 

1 Normally distributed data: This is a tricky and misunderstood assumption because it 
means different things in different contexts. For this reason I will spend most of the 
chapter discussing this assumption. In short, the rationale behind hypothesis testing 
relies on having something that is normally distributed (in some cases it’s the 
sampling distribution, in others the errors in the model), and so if this assumption 
is not met then the logic behind hypothesis testing is flawed (we came across these 
principles in Chapters 1 and 2). 

2 Homogeneity of variance: This assumption means that the variances should be the 
same throughout the data. In designs in which you test several groups of participants 
this assumption means that each of these samples comes from populations with the 
same variance. In correlational designs, this assumption means that the variance of 
one variable should be stable at all levels of the other variable (see section 5.7). 

3 Interval data: Data should be measured at least at the interval level. This assumption 
is tested by common sense and so won’t be discussed further (but do read section 
1.5.1.2 again to remind yourself of what we mean by interval data). 

4 Independence: This assumption, like that of normality, is different depending on the 
test you’re using. In some cases it means that data from different participants are 
independent, which means that the behaviour of one participant does not influence the 
behaviour of another. In repeated-measures designs (in which participants are measured 
in more than one experimental condition), we expect scores in the experimental 
conditions to be non-independent for a given participant, but behaviour between 
different participants should be independent. As an example, imagine two people, Paul 
and Julie, were participants in an experiment where they had to indicate whether they 
remembered having seen particular photos earlier on in the experiment. If Paul and 
Julie were to confer about whether they’d seen certain pictures then their answers 
would not be independent: Julie’s response to a given question would depend on Paul’s 
answer, and this would violate the assumption of independence. If Paul and Julie were 
unable to confer (if they were locked in different rooms) then their responses should be 
independent (unless they’re telepathic): Julie’s responses should not influence Paul’s. 
In regression, however, this assumption also relates to the errors in the regression 
model being uncorrelated, but we’ll discuss that more in Chapter 7. 

We will, therefore, focus in this chapter on the assumptions of normality and homogeneity 
of variance. 
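
As a first, informal look at the homogeneity of variance assumption (item 2 in the list above), you could simply compare the variances of your groups. The sketch below does this in base R with invented scores for three groups; it is only a descriptive check, not a formal test, and the names scores, group_vars and variance_ratio are made up for the example.

```r
# Invented scores for three groups with the same mean but different spread
scores <- data.frame(
  group = rep(c("A", "B", "C"), each = 5),
  value = c(4, 5, 6, 5, 5,
            3, 5, 7, 5, 5,
            1, 5, 9, 5, 5)
)

# Variance of the scores within each group
group_vars <- tapply(scores$value, scores$group, var)
print(group_vars)

# Ratio of the largest to the smallest variance: values far from 1
# suggest that the spread differs a lot between groups
variance_ratio <- max(group_vars) / min(group_vars)
print(variance_ratio)
```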


2 When I was doing my Ph.D., we were set a task by our statistics lecturer in which we had to find some published 
papers and criticize the statistical methods in them. I chose one of my supervisor’s papers and proceeded to slag 
off every aspect of the data analysis (and I was being very pedantic about it all). Imagine my horror when my 
supervisor came bounding down the corridor with a big grin on his face and declared that, unbeknownst to me, 
he was the second marker of my essay. Luckily, he had a sense of humour and I got a good mark. 

5.4. Packages used in this chapter © 


Some useful packages for exploring data are car, ggplot2 (for graphs), pastecs (for 
descriptive statistics) and psych. Of course, if you plan to use R Commander then you need the 
Rcmdr package installed too (see section 3.6). If you do not have these packages installed, 
you can install them by executing the following commands: 

install.packages("car"); install.packages("ggplot2"); 
install.packages("pastecs"); install.packages("psych") 

You then need to load these packages by executing the commands: 

library(car); library(ggplot2); library(pastecs); library(psych); 
library(Rcmdr) 
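
If you are not sure which of these packages you already have, a small base-R check like the following can save you from reinstalling everything. This is only a sketch: the variable names wanted and missing_pkgs are made up, and the install line is commented out so that nothing is downloaded unless you choose to.

```r
# Packages named in this section
wanted <- c("car", "ggplot2", "pastecs", "psych")

# Which of them are not in the list of installed packages?
missing_pkgs <- setdiff(wanted, rownames(installed.packages()))

if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install them
} else {
  message("All of the packages are already installed.")
}
```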

5.5. The assumption of normality © 


We encountered the normal distribution back in Chapter 1, we know what it looks like 
and we (hopefully) understand it. You’d think then that this assumption would be easy to 
understand - it just means that our data are normally distributed, right? Actually, no. In 
many statistical tests (e.g., the t-test) we assume that the sampling distribution is normally 
distributed. This is a problem because we don’t have access to this distribution - we can’t 
simply look at its shape and see whether it is normally distributed. However, we know 
from the central limit theorem (section 2.5.1) that if the sample data are approximately 
normal then the sampling distribution will be also. Therefore, people tend to look at their 
sample data to see if they are normally distributed. If so, then they have a little party to 
celebrate and assume that the sampling distribution (which is what actually matters) is also. 
We also know from the central limit theorem that in big samples the sampling distribution 
tends to be normal anyway - regardless of the shape of the data we actually collected 
(and remember that the sampling distribution will tend to be normal regardless of the 
population distribution in samples of 30 or more). As our sample gets bigger, then, we 
can be more confident that the sampling distribution is normally distributed (but see Jane 
Superbrain Box 5.1). 
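
You can watch the central limit theorem at work with a small simulation. The sketch below repeatedly draws samples of 30 scores from a strongly skewed population and stores the mean of each sample. The choice of an exponential population, 5000 repetitions and the variable names are all assumptions made for illustration.

```r
set.seed(123)

n_samples <- 5000    # how many samples we draw
sample_size <- 30    # scores per sample

# Population: exponential with rate 1 (mean 1, standard deviation 1),
# which is heavily positively skewed
sample_means <- replicate(n_samples, mean(rexp(sample_size, rate = 1)))

# The sample means should centre on the population mean ...
mean_of_means <- mean(sample_means)

# ... and their spread should be close to sigma / sqrt(n)
se_observed <- sd(sample_means)
se_expected <- 1 / sqrt(sample_size)

print(c(mean_of_means = mean_of_means,
        se_observed = se_observed,
        se_expected = se_expected))

# hist(sample_means)  # uncomment to see the shape of the sampling distribution
```

A histogram of sample_means looks far more symmetrical than the population the scores came from, which is the point of the theorem.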

The assumption of normality is also important in research using regression (or general 
linear models). General linear models, as we will see in Chapter 7, assume that errors in the 
model (basically, the deviations we encountered in section 2.4.2) are normally distributed. 

In both cases it might be useful to test for normality, and that’s what this section is 
dedicated to explaining. Essentially, we can look for normality visually, look at values that 
quantify aspects of a distribution (i.e., skew and kurtosis) and compare the distribution we 
have to a normal distribution to see if it is different. 
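
To make "values that quantify aspects of a distribution" concrete, here is a minimal base-R sketch of skew and kurtosis computed from moments. The function names skewness and excess_kurtosis are hypothetical helpers written for this example; functions in packages such as psych or pastecs may apply small-sample corrections and so give slightly different numbers.

```r
# Moment-based skew: 0 for a symmetrical distribution,
# positive when the long tail is on the right
skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}

# Moment-based excess kurtosis: 0 for a normal distribution,
# negative for flat distributions, positive for pointy ones
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

symmetric <- c(1, 2, 3, 4, 5)        # invented, perfectly symmetrical
right_tailed <- c(1, 1, 1, 2, 10)    # invented, one large score

print(skewness(symmetric))
print(excess_kurtosis(symmetric))
print(skewness(right_tailed))
```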


Oh no, it’s that pesky frequency distribution again: checking normality visually © 

We discovered in section 1.7.1 that frequency distributions are a useful way to look at 
the shape of a distribution. In addition, we discovered how to plot these graphs in section 
4.4.8. Therefore, we are already equipped to look for normality in our sample using 
a graph. Let’s return to the Download Festival data from Chapter 4. Remember that a 
biologist had visited the Download Festival (a rock and heavy metal festival in the UK) and 
assessed people’s hygiene over the three days of the festival using a standardized technique 
that results in a score ranging between 0 (you smell like a rotting corpse that’s hiding up a 
skunk’s anus) and 4 (you smell of sweet roses on a fresh spring day). The data file can be 
downloaded from the companion website (DownloadFestival.dat) - remember to use the 
version of the data for which the outlier has been corrected (if you haven’t a clue what I 
mean, then read section 4.4.8 or your graphs will look very different from mine!). 



SELF-TEST 

Using what you learnt in Chapter 4, plot histograms 
for the hygiene scores for the three days of the 
Download Festival. (For reasons that will become 
apparent, use geom_histogram(aes(y = ..density..)) 
rather than geom_histogram().) 


When you drew the histograms, this gave you the distributions. It might be nice to also 
have a plot of what a normal distribution looks like, for comparison purposes. Even better 
would be if we could put a normal distribution onto the same plot. Well, we can, using 
the power of ggplot2. First, load in the data: 

dlf <- read.delim("DownloadFestival.dat", header = TRUE) 

To draw the histogram, you should have used code something like: 

hist.day1 <- ggplot(dlf, aes(day1)) + opts(legend.position = "none") + 
geom_histogram(aes(y = ..density..), colour = "black", fill = "white") + 
labs(x = "Hygiene score on day 1", y = "Density") 

hist.day1 

To see what this function is doing we can break down the command: 

• ggplot(dlf, aes(day1)): This tells R to plot the day1 variable from the dlf dataframe. 

• opts(legend.position = "none"): This command gets rid of the legend of the graph. 

• geom_histogram(aes(y = ..density..), colour = "black", fill = "white"): This command 
plots the histogram, sets the line colour to be black and the fill colour to be white. 
Notice that we have asked for a density plot rather than frequency because we want 
to plot the normal curve. 

• labs(x = "Hygiene score on day 1", y = "Density"): This command sets the labels for 
the x- and y-axes. 
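
If the histogram command fails on a recent installation, the likely reason is that two pieces of it have since been renamed in ggplot2: theme() replaced opts(), and after_stat(density) replaced the ..density.. notation (from release 3.3.0). The sketch below is the same histogram written under those assumptions, with a handful of invented hygiene scores standing in for the real data file.

```r
library(ggplot2)

# Invented hygiene scores standing in for dlf$day1 (NA = missing value)
dlf <- data.frame(
  day1 = c(0.5, 1.2, 1.6, 1.8, 2.0, 2.1, 2.3, 2.5, 2.9, 3.4, NA)
)

hist.day1 <- ggplot(dlf, aes(day1)) +
  theme(legend.position = "none") +               # formerly opts(...)
  geom_histogram(aes(y = after_stat(density)),    # formerly ..density..
                 colour = "black", fill = "white",
                 bins = 10, na.rm = TRUE) +
  labs(x = "Hygiene score on day 1", y = "Density")

# print(hist.day1)  # uncomment to draw the histogram
```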

We can add another layer to the chart, which is a normal curve. We need to tell ggplot2 
what mean and standard deviation we’d like on that curve though. And what we’d like is 
the same mean and standard deviation that we have in our data. To add the normal curve, 
we take the existing histogram object (hist.day1) and add a new layer that uses 
stat_function() to produce a normal curve and lay it on top of the histogram: 

hist.day1 + stat_function(fun = dnorm, args = list(mean = mean(dlf$day1, 
na.rm = TRUE), sd = sd(dlf$day1, na.rm = TRUE)), colour = "black", size = 1) 

The stat_function() command draws the normal curve using the function dnorm(). This 
function basically returns the probability (i.e., the density) for a given value from a normal 
distribution of known mean and standard deviation. The rest of the command specifies the 
mean as being the mean of the day1 variable after removing any missing values (mean = 
mean(dlf$day1, na.rm = TRUE)), and the standard deviation as being that of day1 (sd = 
sd(dlf$day1, na.rm = TRUE)). We also set the line colour as black and the line width as 1. 3 
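
A quick way to get a feel for dnorm() is to try it on its own. The base-R sketch below shows that what it returns is the height of the normal curve at a given value: a density rather than a probability in the strict sense, which is why it can exceed 1 when the standard deviation is small, while the area under the whole curve is still 1. The variable names are made up for the example.

```r
# Height of the standard normal curve at its centre: 1 / sqrt(2 * pi)
peak <- dnorm(0, mean = 0, sd = 1)
print(peak)

# With a small standard deviation the curve is tall and thin,
# so the density at the centre is greater than 1
narrow_peak <- dnorm(0, mean = 0, sd = 0.1)
print(narrow_peak)

# The total area under the curve is (numerically very close to) 1
area <- integrate(dnorm, lower = -Inf, upper = Inf)$value
print(area)
```

This is also why the histogram had to be drawn on the density scale: only then are the bars and the curve measured in the same units.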


SELF-TEST 

Add normal curves to the histograms that you drew 
for day2 and day3. 



There is another useful graph that we can inspect to see if a distribution is normal called a 
Q-Q plot (quantile-quantile plot; a quantile is the proportion of cases we find below a certain 
value). This graph plots the cumulative values we have in our data against the cumulative 
probability of a particular distribution (in this case we would specify a normal distribution). 
What this means is that the data are ranked and sorted. Each value is compared to 
the expected value that the score should have in a normal distribution and they are plotted 
against one another. If the data are normally distributed then the actual scores will have the 
same distribution as the score we expect from a normal distribution, and you’ll get a lovely 
straight diagonal line. If values fall on the diagonal of the plot then the variable is normally 
distributed, but deviations from the diagonal show deviations from normality. 
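
The logic of a Q-Q plot can be reproduced by hand in a few lines of base R, which may make the description above less abstract. In this sketch the scores are simulated from a normal distribution (so the points should fall close to a straight line), and the names theoretical, observed and qq_cor are invented for the example.

```r
set.seed(7)
x <- rnorm(50, mean = 2, sd = 0.5)   # simulated scores
n <- length(x)

# Expected values: the quantiles of a standard normal distribution at
# n evenly spread cumulative probabilities
theoretical <- qnorm(ppoints(n))

# Observed values: the scores sorted from smallest to largest
observed <- sort(x)

# If the scores are normal, the two sets of values line up, so their
# correlation is close to 1
qq_cor <- cor(theoretical, observed)
print(qq_cor)

# The built-in qqnorm() pairs up the same values
builtin <- qqnorm(x, plot.it = FALSE)

# plot(theoretical, observed)  # uncomment to draw the Q-Q plot by hand
```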

To draw a Q-Q plot using the ggplot2 package, we can use the qplot() function in conjunction 
with the qq statistic. Execute the following code: 

qqplot.day1 <- qplot(sample = dlf$day1, stat = "qq") 

qqplot.day1 

(Note that by default ggplot2 assumes you want to compare your distribution with a normal 
distribution - you can change that if you want to, but it’s so rare that we’re not going 
to worry about it here.) 



SELF-TEST 

Create Q-Q plots for the variables day2 and day3.


Figure 5.2 shows the histograms (from the self-test task) and the corresponding Q-Q 
plots. The first thing to note is that the data from day 1 look a lot more healthy since we’ve 
removed the data point that was mistyped back in section 4.7. In fact the distribution is 
amazingly normal looking: it is nicely symmetrical and doesn’t seem too pointy or flat - 
these are good things! This is echoed by the Q-Q plot: note that the data points all fall very 
close to the ‘ideal’ diagonal line. 


3 I have built up the histogram and normal plot in two stages because I think it makes it easier to understand what
you’re doing, but you could build the plot in a single command:

hist.day1 <- ggplot(dlf, aes(day1)) + opts(legend.position = "none") +
geom_histogram(aes(y = ..density..), colour = "black", fill = "white") + labs(x =
"Hygiene score on day 1", y = "Density") + stat_function(fun = dnorm, args =
list(mean = mean(dlf$day1, na.rm = TRUE), sd = sd(dlf$day1, na.rm = TRUE)), colour
= "black", size = 1)

hist.day1
FIGURE 5.2 Histograms (left) and Q-Q plots (right) of the hygiene scores over the three days of the Download Festival

However, the distributions for days 2 and 3 are not nearly as symmetrical. In fact, they
both look positively skewed. Again, this can be seen in the Q-Q plots by the data values
deviating away from the diagonal. In general, what this seems to suggest is that by
days 2 and 3, hygiene scores were much more clustered around the low end of the scale.
Remember that the lower the score, the less hygienic the person is, so this suggests that
generally people became smellier as the festival progressed. The skew occurs because a
substantial minority insisted on upholding their levels of hygiene (against all odds!) over
the course of the festival (I find baby wet-wipes are indispensable). However, these skewed
distributions might cause us a problem if we want to use parametric tests. In the next
section we’ll look at ways to try to quantify the skew and kurtosis of these distributions.


5.5.2. Quantifying normality with numbers


It is all very well to look at histograms, but they are subjective and open to abuse (I can 
imagine researchers sitting looking at a completely distorted distribution and saying ‘yep, 
well Bob, that looks normal to me’, and Bob replying ‘yep, sure does’). Therefore, having 
inspected the distribution of hygiene scores visually, we can move on to look at ways to
quantify the shape of the distributions and to look for outliers. To further explore the distribution
of the variables, we can use the describe() function, in the psych package.

describe(dlf$day1)

We can also use the stat.desc() function of the pastecs package,⁴ which takes the general
form: 

stat.desc(variable name, basic = TRUE, norm = FALSE) 

In this function, we simply name our variable and by default (i.e., if we simply name a
variable and don’t include the other commands) we’ll get a whole host of statistics including
some basic ones such as the number of cases (because basic = TRUE by default) but not 
including statistics relating to the normal distribution (because norm = FALSE by default). 
To my mind the basic statistics are not very useful so I usually specify basic = FALSE (to 
get rid of these), but in the current context it is useful to override the default and specify 
norm = TRUE so that we get statistics relating to the distribution of scores. Therefore, we 
could execute: 

stat.desc(dlf$day1, basic = FALSE, norm = TRUE)

Note that we have specified the variable day1 in the dlf dataframe, asked not to see the
basic statistics (basic = FALSE) but asked to see the normality statistics (norm = TRUE).

We can also use describe() and stat.desc() with more than one variable at the same time, 
using the cbind() function to combine two or more variables (see R’s Souls’ Tip 3.5). 

describe(cbind(dlf$day1, dlf$day2, dlf$day3))

stat.desc(cbind(dlf$day1, dlf$day2, dlf$day3), basic = FALSE, norm = TRUE)

Note that in each case we have simply replaced a single variable with cbind(dlf$day1,
dlf$day2, dlf$day3) which combines the three variables day1, day2, and day3 into a single
object.

A second way to describe more than one variable is to select the variable names directly 
from the data set (see section 3.9.1): 

describe(dlf[, c("day1", "day2", "day3")])

stat.desc(dlf[, c("day1", "day2", "day3")], basic = FALSE, norm = TRUE)


4 There’s always a second way to do something with R. And often a third, fourth and fifth way. While writing this 
book Jeremy and I would often look at each other’s bits (and sometimes what we’d written too) and then send an 
email saying ‘oh, I didn’t know you could do that, I always use a different function in a different package’. People 
can become quite attached to their ‘favourite’ way of doing things in R, but obviously we’re way too cool to have 
favourite ways of doing stats, which is why I didn’t at all insist on adding reams of stuff on stat.desc() because I
prefer it to Jeremy’s crappy old describe() function. 
Remember that we can select rows and columns using [rows, columns], therefore, dlf[,
c(“day1”, “day2”, “day3”)] means from the dlf dataframe select all of the rows (because
nothing is specified before the comma) and select the columns labelled day1, day2, and
day3 (because we have specified c(“day1”, “day2”, “day3”)).

R’s Souls’ Tip 5.1

Funny numbers


You might notice that R sometimes reports numbers with the letter ‘e’ placed in the mix just to confuse you. For
example, you might see a value such as 9.612e-02 and many students find this notation confusing. Well, this
notation means 9.612 × 10⁻² (which might be a more familiar notation, or could be even more confusing). OK,
some of you are still confused. Well think of e-02 as meaning ‘move the decimal place 2 places to the left’, so
9.612e-02 becomes 0.09612. If the notation read 9.612e-01, then that would be 0.9612, and if it read
9.612e-03, that would be 0.009612. Likewise, think of e+02 (notice the minus sign has changed) as meaning ‘move
the decimal place 2 places to the right’. So 9.612e+02 becomes 961.2.
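
If you would rather let R do the translating, here is a minimal sketch using the base function format(). The numbers are simply the ones from the explanation above; nothing here depends on the festival data.

```r
# The same number written two ways: R treats them as equal
x <- 9.612e-02
print(isTRUE(all.equal(x, 0.09612)))

# Ask R to print without the 'e' notation
small <- format(9.612e-02, scientific = FALSE)   # "0.09612"
large <- format(9.612e+02, scientific = FALSE)   # "961.2"
print(small)
print(large)

# And the other way round: force the 'e' notation
print(format(0.09612, scientific = TRUE))
```

format() returns text rather than a number, so it is for display only; use the original value in any calculations.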


The results of these commands are shown in Output 5.1 (describe) and Output 5.2
(stat.desc). These outputs basically contain the same values⁵ although they are presented
in a different notation in Output 5.2 (see R’s Souls’ Tip 5.1). We can see that, on average,
hygiene scores were 1.77 (out of 4) on day 1 of the festival, but went down to 0.96 and 
0.98 on days 2 and 3, respectively. The other important measures for our purposes are the 
skew and the kurtosis (see section 1.7.1). The values of skew and kurtosis should be zero 
in a normal distribution. Positive values of skew indicate a pile-up of scores on the left of 
the distribution, whereas negative values indicate a pile-up on the right. Positive values of 
kurtosis indicate a pointy and heavy-tailed distribution, whereas negative values indicate 
a flat and light-tailed distribution. The further the value is from zero, the more likely it is 
that the data are not normally distributed. For day 1 the skew value is very close to zero 
(which is good) and kurtosis is a little negative. For days 2 and 3, though, there is a skew 
of around 1 (positive skew). 

Although the values of skew and kurtosis are informative, we can convert these values 
to z-scores. We saw in section 1.7.4 that a z-score is simply a score from a distribution 
that has a mean of 0 and a standard deviation of 1. We also saw that this distribution has 
known properties that we can use. Converting scores to a z-score can be useful (if treated 
with suitable caution) because (1) we can compare skew and kurtosis values in different 
samples that used different measures, and (2) we can see how likely our values of skew and 
kurtosis are to occur. To transform any score to a z-score you simply subtract the mean of 
the distribution (in this case zero) and then divide by the standard deviation of the
distribution (in this case we use the standard error). Skew and kurtosis are converted to z-scores
in exactly this way. 

$$z_{\text{skewness}} = \frac{S - 0}{SE_{\text{skewness}}} \qquad z_{\text{kurtosis}} = \frac{K - 0}{SE_{\text{kurtosis}}}$$

5 The observant will notice that the values of kurtosis differ, this is because describe() produces an unbiased
estimate (DeCarlo, 1997) whereas stat.desc() produces a biased one.
In the above equations, the values of S (skew) and K (kurtosis) and their respective
standard errors are produced by R. These z-scores can be compared against values that you
would expect to get by chance alone (i.e., known values for the normal distribution shown 
in the Appendix). So, an absolute value greater than 1.96 is significant at p < .05, above 
2.58 is significant at p < .01, and above 3.29 is significant at p < .001. Large samples 
will give rise to small standard errors and so when sample sizes are big, significant values 
arise from even small deviations from normality. In smallish samples it’s OK to look for 
values above 1.96; however, in large samples this criterion should be increased to the 2.58 
one and in very large samples, because of the problem of small standard errors that I’ve 
described, no criterion should be applied. If you have a large sample (200 or more) it is 
more important to look at the shape of the distribution visually and to look at the value of 
the skew and kurtosis statistics rather than calculate their significance. 
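
As a rough check on where the 1.96, 2.58 and 3.29 cut-offs come from, they can be recovered from the normal distribution using the base function qnorm(). This is a sketch rather than part of the analysis: the z-score at the end is a hypothetical value chosen purely for illustration.

```r
# Two-tailed critical values of z for p < .05, .01 and .001
alpha <- c(0.05, 0.01, 0.001)
critical <- qnorm(1 - alpha / 2)
print(round(critical, 2))

# Two-tailed probability for a hypothetical z-score of 2.10
z <- 2.10
p.two.tailed <- 2 * pnorm(-abs(z))
print(p.two.tailed)
```

Any z with an absolute value beyond the first critical value has a two-tailed probability below .05, which is all the rule of thumb above is saying.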


var   n mean   sd median trimmed  mad  min  max range skew kurtosis   se
  1 809 1.77 0.69   1.79    1.77 0.70 0.02 3.69  3.67 0.00    -0.41 0.02
  2 264 0.96 0.72   0.79    0.87 0.61 0.00 3.44  3.44 1.08     0.82 0.04
  3 123 0.98 0.71   0.76    0.90 0.61 0.02 3.41  3.39 1.01     0.73 0.06

Output 5.1










                      day1         day2          day3
median         1.790000000 7.900000e-01  7.600000e-01
mean           1.770828183 9.609091e-01  9.765041e-01
SE.mean        0.024396670 4.436095e-02  6.404352e-02
CI.mean.0.95   0.047888328 8.734781e-02  1.267805e-01
var            0.481514784 5.195239e-01  5.044934e-01
std.dev        0.693912663 7.207801e-01  7.102770e-01
coef.var       0.391857702 7.501022e-01  7.273672e-01
skewness      -0.003155393 1.082811e+00  1.007813e+00
skew.2SE      -0.018353763 3.611574e+00  2.309035e+00
kurtosis      -0.423991408 7.554615e-01  5.945454e-01
kurt.2SE      -1.234611514 1.264508e+00  6.862946e-01
normtest.W     0.995907247 9.083185e-01  9.077513e-01
normtest.p     0.031846386 1.281495e-11  ?.804334e-07

Output 5.2


The stat.desc() function produces skew.2SE and kurt.2SE, which are the skew and
kurtosis value divided by 2 standard errors. Remember that z is significant if it is greater than
2 (well, 1.96), therefore this statistic is simply the equations above in a slightly different
format. We have said that if the skew divided by its standard error is greater than 2 then it
is significant (at p < .05), which is the same as saying that if the skew divided by 2 times
the standard error is greater than 1 then it is significant (at p < .05). In other words, if
skew.2SE or kurt.2SE are greater than 1 (ignoring the plus or minus sign) then you have
significant skew/kurtosis (at p < .05); values greater than 1.29 indicate significance at p
< .01, and above 1.65 indicate significance at p < .001. However, as I have just said, you
would only use this criterion in fairly small samples so you need to interpret these values
of skew.2SE or kurt.2SE cautiously.

For the hygiene scores, the values of skew.2SE are -0.018, 3.612, and 2.309 for days 1,
2 and 3 respectively, indicating significant skew on days 2 and 3; the values of kurt.2SE
are -1.235, 1.265, and 0.686, indicating significant kurtosis on days 1 and 2, but not day
3. However, bear in mind what I just said about large samples because our sample size is
pretty big so the histograms are better indicators of the shape of the distribution.
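
The ‘greater than 1’ rule can be applied in one go. The sketch below types in the rounded skew.2SE and kurt.2SE values quoted above (the object names are my own, not something that stat.desc() creates) and uses only base R.

```r
# Rounded values copied from Output 5.2
skew.2SE <- c(day1 = -0.018, day2 = 3.612, day3 = 2.309)
kurt.2SE <- c(day1 = -1.235, day2 = 1.265, day3 = 0.686)

# Significant at p < .05 if the absolute value exceeds 1
skew.sig <- abs(skew.2SE) > 1
kurt.sig <- abs(kurt.2SE) > 1
print(skew.sig)   # day1 FALSE, day2 TRUE, day3 TRUE
print(kurt.sig)   # day1 TRUE, day2 TRUE, day3 FALSE

# Doubling these values gives back the ordinary z-scores,
# because each is the statistic divided by 2 standard errors
z.skew <- 2 * skew.2SE
print(z.skew)

# The cut-offs for the .2SE statistics are just the usual ones halved
print(c(1.96, 2.58, 3.29) / 2)
```

The same cautions apply as before: with samples this large, treat these flags as a prompt to look at the histograms rather than as a verdict.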

The output of stat.desc() also gives us the Shapiro-Wilk test of normality, which we look 
at in some detail in section 5.6. For the time being, just note that the test and its probability 
value can be found in Output 5.2 labelled as normtest.W and normtest.p. 

R’s Souls’ Tip 5.2

Changing how many decimal places are displayed in your output



Output 5.2 looks pretty horrible because of all of the decimal places and the scientific notation (i.e.,
7.900000e-01). Most of this precision is unnecessary for everyday purposes. However, we can easily convert our output
using the round() function. This function takes the general form:

round(object that we want to round, digits = x)

Therefore, we can stick an object into this function and then set digits to be the number of decimal places that we
want. For example, if we wanted Output 5.2 to be displayed to 3 decimal places we could execute:

round(stat.desc(dlf[, c("day1", "day2", "day3")], basic = FALSE, norm = TRUE), digits
= 3)

Note that we have simply placed the original command (stat.desc(dlf[, c(“day1”, “day2”, “day3”)], basic = FALSE,
norm = TRUE)) within the round() function, and then set digits to be 3. The result is a more palatable output:



               day1  day2  day3
median        1.790 0.790 0.760
mean          1.771 0.961 0.977
SE.mean       0.024 0.044 0.064
CI.mean.0.95  0.048 0.087 0.127
var           0.482 0.520 0.504
std.dev       0.694 0.721 0.710
coef.var      0.392 0.750 0.727
skewness     -0.003 1.083 1.008
skew.2SE     -0.018 3.612 2.309
kurtosis     -0.424 0.755 0.595
kurt.2SE     -1.235 1.265 0.686
normtest.W    0.996 0.908 0.908
normtest.p    0.032 0.000 0.000



CRAMMING SAM’S TIPS 


Skew and kurtosis 


• To check that the distribution of scores is approximately normal, we need to look at the values of skew and kurtosis in the 
output. 

• Positive values of skew indicate too many low scores in the distribution, whereas negative values indicate a build-up of high 
scores. 

• Positive values of kurtosis indicate a pointy and heavy-tailed distribution, whereas negative values indicate a flat and
light-tailed distribution.

• The further the value is from zero, the more likely it is that the data are not normally distributed. 

• You can test the significance of these values of skew and kurtosis, but these tests should not be used in large samples 
(because they are likely to be significant even when skew and kurtosis are not too different from normal). 
5.5.3. Exploring groups of data


Sometimes we have data in which there are different groups of entities (cats 
and dogs, different universities, people with depression and people without, 
for example). There are several ways to produce basic descriptive statistics for 
separate groups of people (and we will come across some of these methods in 
section 5.6.1). However, I intend to use this opportunity to introduce you to 
the by() function and reintroduce the subset() function from Chapter 3. These 
functions allow you to specify a grouping variable which splits the data, or to 
select a subset of cases. 

You’re probably getting sick of the hygiene data from the Download Festival 
so let’s use the data in the file RExam.dat. This file contains data regarding
students’ performance on an R exam. Four variables were measured: exam
(first-year R exam scores as a percentage), computer (measure of computer literacy
as a percentage), lecture (percentage of R lectures attended) and numeracy (a 
measure of numerical ability out of 15). There is a variable called uni indicating whether 
the student attended Sussex University (where I work) or Duncetown University. Let’s 
begin by looking at the data as a whole. 


5.5.3.1. Running the analysis for all data





To begin with, open the file RExam.dat by executing: 

rexam <- read.delim("rexam.dat", header=TRUE) 

The variable uni will have loaded in as numbers rather than as text, because that was 
how it was specified in the data file; therefore, we need to set the variable uni to be a factor 
by executing (see section 3.5.4.3): 

rexam$uni<-factor(rexam$uni, levels = c(0:1), labels = c("Duncetown
University", "Sussex University"))

Remember that this command takes the variable uni from the rexam dataframe (rexam$uni),
specifies the numbers used to code the two universities, 0 and 1 (levels = c(0:1)), and then
assigns labels to them so that 0 represents Duncetown University, and 1 represents Sussex
University (labels = c(“Duncetown University”, “Sussex University”)).
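
To see what factor() does without loading the data file, here is a small sketch on a made-up vector of codes. The five values are hypothetical; only the levels and labels mirror the command above.

```r
# Hypothetical numeric codes: 0 = Duncetown, 1 = Sussex
uni <- c(0, 1, 1, 0, 1)

# Convert the codes to a factor with readable labels
uni <- factor(uni, levels = c(0:1),
              labels = c("Duncetown University", "Sussex University"))

print(levels(uni))   # the two labels, in the order of the codes
print(table(uni))    # how many cases carry each label
```

The order of the labels must match the order of the levels, otherwise the universities end up swapped.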



SELF-TEST 

Using what you have learnt so far, obtain descriptive
statistics and draw histograms of first-year exam 
scores, computer literacy, numeracy and lectures 
attended. 


Assuming you completed the self-test, you should see something similar to what’s in
Output 5.3 (I used stat.desc()) and Figure 5.3. From Output 5.3, we can see that, on
average, students attended nearly 60% of lectures, obtained 58% in their R exam,
scored only 51% on the computer literacy test, and only 4.85 out of 15 on the
numeracy test. In addition, the standard deviation for computer literacy was relatively small
compared to that of the percentage of lectures attended and exam scores. The other
important measures are the skew and the kurtosis, and their associated tests of
significance. We came across these measures earlier on and found that we can interpret
absolute values of kurt.2SE and skew.2SE greater than 1, 1.29, and 1.65 as significant at
p < .05, p < .01, and p < .001, respectively. We can see that for skew, numeracy scores
are significantly positively skewed (p < .001) indicating a pile-up of scores on the left
of the distribution (so most students got low scores). For kurtosis, prior exam scores
are significant (p < .05).

FIGURE 5.3 Histograms of the R exam data (panels: First Year Exam Score, Computer Literacy, Percentage of Lectures Attended, Numeracy)

The histograms show us several things. The exam scores are very interesting because this 
distribution is quite clearly not normal; in fact, it looks suspiciously bimodal (there are two 
peaks, indicative of two modes). This observation corresponds with the earlier
information from the table of descriptive statistics. It looks as though computer literacy is fairly
normally distributed (a few people are very good with computers and a few are very bad, 
but the majority of people have a similar degree of knowledge), as is the lecture attendance. 
Finally, the numeracy test has produced very positively skewed data (i.e., the majority of 
people did very badly on this test and only a few did well). This corresponds to what the 
skew statistic indicated. 
                exam computer lectures numeracy
median        60.000   51.500   62.000    4.000
mean          58.100   50.710   59.765    4.850
SE.mean        2.132    0.826    2.168    0.271
CI.mean.0.95   4.229    1.639    4.303    0.537
var          454.354   68.228  470.230    7.321
std.dev       21.316    8.260   21.685    2.706
coef.var       0.367    0.163    0.363    0.558
skewness      -0.104   -0.169   -0.410    0.933
skew.2SE      -0.215   -0.350   -0.849    1.932
kurtosis      -1.148    0.221   -0.285    0.763
kurt.2SE      -1.200    0.231   -0.298    0.798
normtest.W     0.961    0.987    0.977    0.924
normtest.p     0.005    0.441    0.077    0.000

Output 5.3


Descriptive statistics and histograms are a good way of getting an instant picture of the
distribution of your data. This snapshot can be very useful: for example, the bimodal
distribution of R exam scores instantly indicates a trend that students are typically either very good
at statistics or struggle with it (there are relatively few who fall in between these extremes).
Intuitively, this finding fits with the nature of the subject: statistics is very easy once
everything falls into place, but before that enlightenment occurs it all seems hopelessly difficult.


5.5.3.2. Running the analysis for different groups


If we want to obtain separate descriptive statistics for each of the universities, we can use
the by() function.⁶ The by() function takes the general form:

by(data = dataFrame, INDICES = grouping variable, FUN = a function that you
want to apply to the data)

In other words, we simply enter the name of our dataframe or variables that we’d like to
analyse, we specify a variable by which we want to split the output (in this case, it’s uni, because we
want separate statistics for each university), and we tell it which function we want to apply to
the data (in this case we could use describe or stat.desc). Therefore, to get descriptive statistics
for the variable exam for each university separately using describe, we could execute:

by(data = rexam$exam, INDICES = rexam$uni, FUN = describe)

To do the same, but using stat.desc() instead of describe() we could execute:

by(data = rexam$exam, INDICES = rexam$uni, FUN = stat.desc)

In both cases, we can get away with not explicitly using data, INDICES and FUN as long 
as we order the variables in the order in the functions above; so, these commands have the 
same effect as those above: 

by(rexam$exam, rexam$uni, describe) 
by(rexam$exam, rexam$uni, stat.desc) 

Finally, you can include any options for the function you’re using by adding them in at the 
end; for example, if you’re using stat.desc() you can specify not to have basic statistics and 
to have normality statistics by including those options: 

by(rexam$exam, rexam$uni, stat.desc, basic = FALSE, norm = TRUE) 
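
If you don’t have the psych or pastecs packages to hand, the same pattern can be tried out with base R functions. The sketch below uses a tiny made-up dataframe (toy is a hypothetical name, and the scores are invented), with mean() and quantile() standing in for describe() and stat.desc().

```r
# A made-up dataframe with three students per university
toy <- data.frame(
  exam = c(40, 35, 45, 75, 80, 70),
  uni  = factor(rep(c("Duncetown University", "Sussex University"), each = 3))
)

# Apply a function separately to each group
by.means <- by(toy$exam, toy$uni, mean)
print(by.means)

# Options for the function go at the end, just as with stat.desc()
by.medians <- by(toy$exam, toy$uni, quantile, probs = 0.5)
print(by.medians)

# tapply() gives the same group means in a plainer format
group.means <- tapply(toy$exam, toy$uni, mean)
print(group.means)
```

The output of by() is printed one group at a time, which is why it is convenient for functions that return a whole table of statistics.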


6 by() is what is known as a ‘wrapper’ function - that is, it takes a more complicated function and simplifies it 
for people like me. by() is a wrapper for a very powerful and clever function, called tapply(), which can do all 
sorts of things, but is harder to use, so we use by() instead, which just takes our commands and turns them into 
commands for tapply().
If we want descriptive statistics for multiple variables, then we can use cbind() (see R’s
Souls’ Tip 3.5) to include them within the by() function. For example, to look at the
descriptive statistics of both the previous R exam and the numeracy test, we could execute:

by(cbind(data=rexam$exam,data=rexam$numeracy), rexam$uni, describe) 

or 

by(rexam[, c("exam", "numeracy")], rexam$uni, stat.desc, basic = FALSE, 
norm = TRUE) 

Note that the resulting Output 5.4 (which was created using describe rather than
stat.desc) is split into two sections: first the results for students at Duncetown
University, then the results for those attending Sussex University. From these tables it
is clear that Sussex students scored higher on both their R exam (called V1 here) and
the numeracy test (V2) than their Duncetown counterparts. In fact, looking at the means
reveals that, on average, Sussex students scored an amazing 36 percentage points more
on the R exam than Duncetown students (76.02% compared to 40.18%), and had higher
numeracy scores too (what can I say, my students are the best).

INDICES: Duncetown University
   var  n  mean    sd median trimmed   mad min max range skew kurtosis   se
V1   1 50 40.18 12.59     38   39.85 12.60  15  66    51 0.29    -0.57 1.78
V2   2 50  4.12  2.07      4    4.00  2.22   1   9     8 0.48    -0.48 0.29

INDICES: Sussex University
   var  n  mean    sd median trimmed   mad min max range skew kurtosis   se
V1   1 50 76.02 10.21     75   75.70  8.90  56  99    43 0.26    -0.26 1.44
V2   2 50  5.58  3.07      5    5.28  2.97   1  14    13 0.75     0.26 0.43

Output 5.4

Next, we’ll look at the histograms. It might be possible to use by() with ggplot2() to draw 
histograms, but if it is the command will be so complicated that no one will understand it. 
A simple way, therefore, to create plots for different groups is to use the subset() function, 
which we came across in Chapter 3 (section 3.9.2) to create an object containing only the 
data in which we’re interested. For example, if we wanted to create separate histograms for 
the Duncetown and Sussex Universities then we could create new dataframes that contain 
data from only one of the two universities. For example, execute: 

dunceData<-subset(rexam, rexam$uni=="Duncetown University")
sussexData<-subset(rexam, rexam$uni=="Sussex University")

These commands each create a new dataframe that is based on a subset of the rexam 
dataframe; the subset is determined by the condition in the function. The first command 
contains the condition rexam$uni==“Duncetown University”, which means that if the
value of the variable uni is exactly equal to the phrase “Duncetown University” then
the case is selected. In other words, it will retain all cases for which uni is Duncetown
University. Therefore, I’ve called the resulting dataframe dunceData. The second
command does the same thing but this time specifies that uni must be exactly equal to the
phrase ‘Sussex University’. The resulting dataframe, sussexData, contains only the Sussex
University scores. This is a quick and easy way to split groups; however, you need to be 
careful that the term you specify to select cases (e.g., ‘Duncetown University’) exactly 
matches (including capital letters and spaces) the labelling in the data set otherwise you’ll 
end up with an empty data set. 
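
The warning about exact matching is easy to demonstrate. This sketch uses a small made-up dataframe (toy and its scores are hypothetical) and deliberately mistypes the label in the second command.

```r
# A made-up dataframe with three students per university
toy <- data.frame(
  exam = c(40, 35, 45, 75, 80, 70),
  uni  = factor(rep(c("Duncetown University", "Sussex University"), each = 3))
)

# Label typed exactly as it appears in the data: three cases are kept
dunceToy <- subset(toy, toy$uni == "Duncetown University")
print(nrow(dunceToy))

# Lower-case label does not match anything: no cases are kept
typoToy <- subset(toy, toy$uni == "duncetown university")
print(nrow(typoToy))
```

R does not complain about the second command; it simply returns a dataframe with no rows, so it is worth checking nrow() after subsetting.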

Having created our separate dataframes, we can generate histograms using the same 
commands as before, but specifying the dataframe for the subset of data. For example, to 
create a histogram of the numeracy scores for Duncetown University, we could execute: 
hist.numeracy.duncetown <- ggplot(dunceData, aes(numeracy)) +
opts(legend.position = "none") + geom_histogram(aes(y = ..density..), fill = "white",
colour = "black", binwidth = 1) + labs(x = "Numeracy Score", y = "Density")
+ stat_function(fun = dnorm, args = list(mean = mean(dunceData$numeracy,
na.rm = TRUE), sd = sd(dunceData$numeracy, na.rm = TRUE)), colour = "blue",
size = 1)

hist.numeracy.duncetown 

Compare this code with that in section 5.5.1; note that it is exactly the same, but we have
used the dunceData dataframe instead of using the whole data set.⁷ We could create the
same plot for the Sussex University students by simply using sussexData in place of
dunceData in the command.

We could repeat these commands for the exam scores by replacing ‘numeracy’ with
‘exam’ throughout the commands above (this will have the effect of plotting exam scores
rather than numeracy scores). Figure 5.4 shows the histograms of exam scores and
numeracy split according to the university attended. The first interesting thing to note is that for
exam marks the distributions are both fairly normal. This seems odd because the overall
distribution was bimodal. However, it starts to make sense when you consider that for
Duncetown the distribution is centred on a mark of about 40%, but for Sussex the
distribution is centred on a mark of about 76%. This illustrates how important it is to look at
distributions within groups. If we were interested in comparing Duncetown to Sussex it
wouldn’t matter that overall the distribution of scores was bimodal; all that’s important is
that each group comes from a normal distribution, and in this case it appears to be true.
When the two samples are combined, these two normal distributions create a bimodal
one (one of the modes being around the centre of the Duncetown distribution, and the
other being around the centre of the Sussex data). For numeracy scores, the distribution is
slightly positively skewed (there is a larger concentration at the lower end of scores) in both
the Duncetown and Sussex groups. Therefore, the overall positive skew observed before is
due to the mixture of universities.

FIGURE 5.4 Distributions of exam and numeracy scores for Duncetown University and Sussex University students

7 Note that I have included ‘binwidth = 1’ (see Chapter 4) for the numeracy scores because it makes the
resulting plot look better; for the other variables this option can be excluded because the default bin width produces
nice-looking plots.



SELF-TEST 

Repeat these analyses for the computer literacy and
percentage of lectures attended and interpret the 
results. 


5.6. Testing whether a distribution is normal

Another way of looking at the problem is to see whether the distribution as 
a whole deviates from a comparable normal distribution. The Shapiro-Wilk 
test does just this: it compares the scores in the sample to a normally
distributed set of scores with the same mean and standard deviation. If the test
is non-significant (p > .05) it tells us that the distribution of the sample is
not significantly different from a normal distribution. If, however, the test is
significant (p < .05) then the distribution in question is significantly
different from a normal distribution (i.e., it is non-normal). This test seems great:
in one easy procedure it tells us whether our scores are normally distributed 
(nice!). However, it has limitations because with large sample sizes it is very 
easy to get significant results from small deviations from normality, and so a 
significant test doesn’t necessarily tell us whether the deviation from normality is enough 
to bias any statistical procedures that we apply to the data. I guess the take-home message 
is: by all means use these tests, but plot your data as well and try to make an informed 
decision about the extent of non-normality. 



5.6.1. Doing the Shapiro-Wilk test in R


We have already encountered the Shapiro-Wilk test as part of the output from the
stat.desc() function (see Output 5.2 and, for these data, Output 5.3). However, we can also use
the shapiro.test() function. This function takes the general form: 

shapiro.test(variable) 
in which variable is the name of the variable that you’d like to test for normality. Therefore,
to test the exam and numeracy variables for normality we would execute:

shapiro.test(rexam$exam)
shapiro.test(rexam$numeracy)

The output is shown in Output 5.5. Note that the value of W corresponds to the value
of normtest.W, and the p-value corresponds to normtest.p from the stat.desc() function
(Output 5.3). For each test we see the test statistic, labelled W, and the p-value. Remember
that a significant value (p-value less than .05) indicates a deviation from normality. For
both R exam scores (p = .005) and numeracy (p < .001), the Shapiro-Wilk test is highly
significant, indicating that both distributions are not normal. This result is likely to reflect
the bimodal distribution found for exam scores, and the positively skewed distribution
observed in the numeracy scores. However, these tests confirm that these deviations were
significant (but bear in mind that the sample is fairly big).

Shapiro-Wilk normality test 

data: rexam$exam 

W = 0.9613, p-value = 0.004991 

Shapiro-Wilk normality test 

data: rexam$numeracy 

W = 0.9244, p-value = 2.424e-05 

Output 5.5 
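
If you want to reuse W and the p-value (for example, when writing up the result), you can store what shapiro.test() returns and pull out its statistic and p.value components. The sketch below uses made-up scores; the variable names, the seed and the simulated values are assumptions rather than the RExam data.

```r
# Illustration only: hypothetical scores, not the RExam data.
set.seed(42)
scores <- rnorm(100, mean = 58, sd = 21)

result <- shapiro.test(scores)

W <- unname(result$statistic)  # the test statistic, W
p <- result$p.value            # the p-value

# Print the two values rounded for a report
cat(sprintf("W = %.2f, p = %.3f\n", W, p))
```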

As a final point, bear in mind that when we looked at the exam scores for separate
groups, the distributions seemed quite normal; now if we’d asked for separate Shapiro-Wilk
tests for the two universities we might have found non-significant results. In fact, let’s
try this out, using the by() function we came across earlier. We use shapiro.test as the FUN
instead of describe or stat.desc, which we have used before (although stat.desc would also
give you the Shapiro-Wilk test as part of the output so you could use this function also):

by(rexam$exam, rexam$uni, shapiro.test) 
by(rexam$numeracy, rexam$uni, shapiro.test) 

You should get Output 5.6 for the exam scores, which shows that the percentages on the 
R exam are indeed normal within the two groups (the p-values are greater than .05). This 
is important because if our analysis involves comparing groups, then what’s important is 
not the overall distribution but the distribution in each group. 

rexam$uni: Duncetown University 

Shapiro-Wilk normality test 
data: dd[x, ] 

W = 0.9722, p-value = 0.2829 


rexam$uni: Sussex University 

Shapiro-Wilk normality test 
data: dd[x, ] 

W = 0.9837, p-value = 0.7151 


Output 5.6

For numeracy scores (Output 5.7) the tests are still significant, indicating non-normal
distributions both for Duncetown University (p = .015) and Sussex University (p = .007).

rexam$uni: Duncetown University 

Shapiro-Wilk normality test 
data: dd[x, ] 

W = 0.9408, p-value = 0.01451 


rexam$uni: Sussex University 

Shapiro-Wilk normality test 
data: dd[x, ] 

W = 0.9323, p-value = 0.006787 

Output 5.7 
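
The difference between testing the whole sample and testing each group can be reproduced with simulated data. In the sketch below everything is hypothetical: the dataframe toy, the group labels A and B, and the group means and standard deviations are assumptions chosen so that each group is normal but the pooled scores are bimodal. I use tapply() here rather than by() simply because it returns the p-values as a compact vector.

```r
# Illustration only: two hypothetical groups with normal scores but
# very different means, so the pooled distribution is bimodal.
set.seed(1)
toy <- data.frame(
  uni  = factor(rep(c("A", "B"), each = 50)),
  exam = c(rnorm(50, mean = 40, sd = 10), rnorm(50, mean = 75, sd = 10))
)

# Shapiro-Wilk p-value for the pooled scores
overall_p <- shapiro.test(toy$exam)$p.value

# Shapiro-Wilk p-value within each group
group_p <- tapply(toy$exam, toy$uni, function(x) shapiro.test(x)$p.value)

overall_p
group_p
```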

We can also draw Q-Q plots for the variables, to help us to interpret the results of the 
Shapiro-Wilk test (see Figure 5.5). 

qplot(sample = rexam$exam, stat="qq") 
qplot(sample = rexam$numeracy, stat="qq") 

The normal Q-Q chart plots the values you would expect to get if the distribution were 
normal (theoretical values) against the values actually seen in the data set (sample values). 
If the data are normally distributed, then the observed values (the dots on the chart) should 
fall exactly along a straight line (meaning that the observed values are the same as you 
would expect to get from a normally distributed data set). Any deviation of the dots from 
the line represents a deviation from normality. So, if the Q-Q plot looks like a wiggly snake
then you have some deviation from normality. Specifically, when the dots bow consistently
below the line, or consistently above it, the problem is skewness, and when the dots trace
an S-shape around the line, the kurtosis (the weight of the tails) differs from that of a
normal distribution.

In both of the variables analysed we already know that the data are not normal, and these 
plots (see Figure 5.5) confirm this observation because the dots deviate substantially from 
the line. It is noteworthy that the deviation is greater for the numeracy scores, and this is
consistent with the smaller p-value for this variable on the Shapiro-Wilk test.
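
A Q-Q plot can also be drawn without any extra packages using the base R functions qqnorm() and qqline(). This is an alternative to the qplot() commands above, not a replacement for them, and the skewed scores below are simulated purely for illustration.

```r
# Illustration only: hypothetical positively skewed scores.
set.seed(7)
skewed <- rexp(100)

# qqnorm() plots sample values against theoretical normal quantiles and
# (invisibly) returns the plotted coordinates; qqline() adds the reference line.
qq <- qqnorm(skewed, main = "Normal Q-Q plot of simulated skewed scores")
qqline(skewed)
```

Try replacing rexp(100) with rnorm(100) to see how the dots behave when the scores really are drawn from a normal distribution.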


FIGURE 5.5

5.6.2. Reporting the Shapiro-Wilk test


The test statistic for the Shapiro-Wilk test is denoted by W; we can report the results in 
Output 5.5 in the following way: 

• The percentage on the R exam, W = 0.96, p = .005, and the numeracy scores, W =
0.92, p < .001, were both significantly non-normal.



CRAMMING SAM’S TIPS 


Normality tests 


• The Shapiro-Wilk test can be used to see if a distribution of scores significantly differs from a normal distribution. 

• If the Shapiro-Wilk test is significant (p-value less than .05) then the scores are significantly different from a normal 
distribution. 

• Otherwise, scores are approximately normally distributed. 

• Warning: In large samples this test can be significant even when the scores are only slightly different from a normal
distribution. Therefore, it should always be interpreted in conjunction with histograms, or Q-Q plots, and the values of skew
and kurtosis.


5.7. Testing for homogeneity of variance


So far I’ve concentrated on the assumption of normally distributed data; however, at the 
beginning of this chapter I mentioned another assumption: homogeneity of variance. This 
assumption means that as you go through levels of one variable, the variance of the other 
should not change. If you’ve collected groups of data then this means that the variance of 
your outcome variable or variables should be the same in each of these groups. If you’ve 
collected continuous data (such as in correlational designs), this assumption means that the 
variance of one variable should be stable at all levels of the other variable. Let’s illustrate 
this with an example. An audiologist was interested in the effects of loud concerts on people’s
hearing. So, she decided to send 10 people on tour with the loudest band she could
find, Motorhead. These people went to concerts in Brixton (London), Brighton, Bristol, 
Edinburgh, Newcastle, Cardiff and Dublin and after each concert the audiologist measured 
the number of hours after the concert that these people had ringing in their ears. 

FIGURE 5.6 Graphs illustrating data with homogeneous (left) and heterogeneous (right) variances

Figure 5.6 shows the number of hours that each person had ringing in his or her ears
after each concert (each person is represented by a circle). The horizontal lines represent
the average number of hours that there was ringing in the ears after each concert and these
means are connected by a line so that we can see the general trend of the data. Remember
that for each concert, the circles are the scores from which the mean is calculated. Now, we
can see in both graphs that the means increase as the people go to more concerts. So, after
the first concert their ears ring for about 12 hours, but after the second they ring for about
15-20 hours, and by the final night of the tour, they ring for about 45-50 hours (2 days).
So, there is a cumulative effect of the concerts on ringing in the ears. This pattern is found
in both graphs; the difference between the graphs is not in terms of the means (which are
roughly the same), but in terms of the spread of scores around the mean. If you look at the
left-hand graph, the spread of scores around the mean stays the same after each concert
(the scores are fairly tightly packed around the mean). Put it another way, if you measured
the vertical distance between the lowest score and the highest score after the Brixton concert,
and then did the same after the other concerts, all of these distances would be fairly
similar. Although the means increase, the spread of scores for hearing loss is the same at 
each level of the concert variable (the spread of scores is the same after Brixton, Brighton, 
Bristol, Edinburgh, Newcastle, Cardiff and Dublin). This is what we mean by homogeneity 
of variance. The right-hand graph shows a different picture: if you look at the spread of 
scores after the Brixton concert, they are quite tightly packed around the mean (the vertical 
distance from the lowest score to the highest score is small), but after the Dublin show (for 
example) the scores are very spread out around the mean (the vertical distance from the 
lowest score to the highest score is large). This is an example of heterogeneity of variance: 
that is, at some levels of the concert variable the variance of scores is different than other 
levels (graphically, the vertical distance from the lowest to highest score is different after 
different concerts). 




Levene’s test


Hopefully you’ve got a grip of what homogeneity of variance actually means. Now, how 
do we test for it? Well, we could just look at the values of the variances and see whether 
they are similar. However, this approach would be very subjective and probably prone to 
academics thinking ‘Ooh look, the variance in one group is only 3000 times larger than the 
variance in the other: that’s roughly equal’. Instead, in correlational analysis such as regression
we tend to use graphs (see section 7.9.5) and for groups of data we tend to use a test
called Levene’s test (Levene, 1960). Levene’s test tests the null hypothesis that the variances
in different groups are equal (i.e., the difference between the variances is zero). It’s a very
simple and elegant test that works by doing a one-way ANOVA (see Chapter 10) conducted
on the deviation scores; that is, the absolute difference between each score and the mean of
the group from which it came (see Glass, 1966, for a very readable explanation).⁸ For now,
all we need to know is that if Levene’s test is significant at p < .05 then we can conclude that
the null hypothesis is incorrect and that the variances are significantly different - therefore,
the assumption of homogeneity of variances has been violated. If, however, Levene’s test
is non-significant (i.e., p > .05) then the variances are not significantly different and the
assumption is tenable.

⁸ We haven’t covered ANOVA yet, so this explanation won’t make much sense to you now, but in Chapter 10 we
will look in more detail at how Levene’s test works.
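
The description of Levene’s test as a one-way ANOVA on deviation scores can be sketched directly in base R. This is a minimal illustration under stated assumptions, not the code behind any particular package: the dataframe toy, its columns group, score and abs_dev, and the simulated values are all hypothetical.

```r
# Illustration only: two hypothetical groups with the same mean
# but different spreads.
set.seed(10)
toy <- data.frame(
  group = factor(rep(c("g1", "g2"), each = 50)),
  score = c(rnorm(50, mean = 50, sd = 5), rnorm(50, mean = 50, sd = 12))
)

# Deviation scores: the absolute difference between each score and
# the mean of the group from which it came
group_means <- ave(toy$score, toy$group, FUN = mean)
toy$abs_dev <- abs(toy$score - group_means)

# One-way ANOVA on the deviation scores
levene_by_hand <- anova(lm(abs_dev ~ group, data = toy))
levene_by_hand
```

The first row of the table holds the F-ratio and its p-value. Swapping FUN = mean for FUN = median in the call to ave() centres the scores on the group medians instead.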


5.7.1.1. Levene’s test with R Commander


First we’ll load the data into R Commander. Choose Data => Import data => from text file, 
clipboard, or URL... and then select the file RExam.dat (see section 3.7.3). Before we can 
conduct Levene’s test we need to convert uni to a factor because at the moment it is simply
0s and 1s so R doesn’t know that it’s a factor - see section 3.6.2 to remind yourself how to do
that. Once you have done this you should be able to select Statistics => Variances => Levene’s
test (you won’t be able to select it unless R can ‘see’ a factor in the dataframe). Choosing
this option in the menu opens the dialog box shown in Figure 5.7. You need to select a 
grouping variable. R Commander has realized that you only have one variable that could 
be the grouping variable - because it is the only factor - and that’s uni. Therefore, it has 
already selected this variable. 

Choose the variable on the right that you want to test for equality of variances across 
the groups defined by uni. You can choose median or mean for the centring - the median 
tends to be more accurate and is the default; I use this default throughout the book. Run 
the analysis for both exam and numeracy. Output 5.8 shows the results. 


FIGURE 5.7 Levene’s test in R Commander (a dialog box with a ‘Groups (pick one)’ list, a ‘Response Variable (pick one)’ list, a ‘Center’ option of median or mean, and OK, Cancel and Help buttons)


5.7.1.2. Levene’s test with R


To use Levene’s test, we use the leveneTest() function from the car package. This function 
takes the general form: 

leveneTest(outcome variable, group, center = median/mean)

Therefore, we enter two variables into the function: first the outcome variable of which we
want to test the variances; and second, the grouping variable, which must be a factor. We
can just enter these variables and Levene’s test will centre the variables using the median
(which is slightly preferable), but if we want to override this default and centre using the
mean then we can add the option center = mean. Therefore, for the exam scores we
could execute:

leveneTest(rexam$exam, rexam$uni) 
leveneTest(rexam$exam, rexam$uni, center = mean) 

For the numeracy scores we would execute (note that all we have changed is the outcome 
variable): 

leveneTest(rexam$numeracy, rexam$uni) 


5.7.1.3. Levene’s test output

Output 5.8 shows the output for Levene’s test for exam scores (using the median), exam 
scores (centring using the mean) and numeracy scores. The result is non-significant for the 
R exam scores (the value in the Pr(>F) column is more than .05) regardless of whether
we centre with the median or mean. This indicates that the variances are not significantly
different (i.e., they are similar and the homogeneity of variance assumption is tenable).
However, for the numeracy scores, Levene’s test is significant (the value in the Pr(>F)
column is less than .05) indicating that the variances are significantly different (i.e., they
are not the same and the homogeneity of variance assumption has been violated). 

> leveneTest(rexam$exam, rexam$uni) 

Levene's Test for Homogeneity of Variance (center = median)

      Df F value Pr(>F)
group  1  2.0886 0.1516
      98

> leveneTest(rexam$exam, rexam$uni, center = mean)

Levene's Test for Homogeneity of Variance (center = mean)

      Df F value Pr(>F)
group  1  2.5841 0.1112
      98

> leveneTest(rexam$numeracy, rexam$uni)

Levene's Test for Homogeneity of Variance (center = median)

      Df F value  Pr(>F)
group  1   5.366 0.02262 *
      98

Output 5.8 



Reporting Levene’s test


Levene’s test can be denoted with the letter F and there are two different degrees of freedom.
As such you can report it, in general form, as F(df1, df2) = value, Pr(>F). So, for the
results in Output 5.8 we could say:

• For the percentage on the R exam, the variances were similar for Duncetown and
Sussex University students, F(1, 98) = 2.09, ns, but for numeracy scores the variances
were significantly different in the two groups, F(1, 98) = 5.37, p = .023.



Hartley’s Fmax: the variance ratio


As with the Shapiro-Wilk test (and other tests of normality), when the sample size is
large, small differences in group variances can produce a Levene’s test that is significant
(because, as we saw in Chapter 1, the power of the test is improved). A useful double
check, therefore, is to look at Hartley’s Fmax - also known as the variance ratio (Pearson
& Hartley, 1954). This is the ratio of the variances between the group with the biggest
variance and the group with the smallest variance. This ratio was compared to critical
values in a table published by Hartley. Some of the critical values (for a .05 level of significance)
are shown in Figure 5.8 (see Oliver Twisted); as you can see, the critical values
depend on the number of cases per group (well, n - 1 actually), and the number of variances
being compared. From this graph you can see that with sample sizes (n) of 10 per
group, an Fmax of less than 10 is more or less always going to be non-significant, with
15-20 per group the ratio needs to be less than about 5, and with samples of 30-60 the
ratio should be below about 2 or 3.
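
The variance ratio itself takes only a couple of lines to compute. The sketch below uses simulated scores in a hypothetical dataframe called toy (the names, group sizes and standard deviations are assumptions); it works out the variance in each group and divides the largest by the smallest.

```r
# Illustration only: three hypothetical groups of 20 scores each.
set.seed(5)
toy <- data.frame(
  group = factor(rep(c("g1", "g2", "g3"), each = 20)),
  score = c(rnorm(20, sd = 4), rnorm(20, sd = 6), rnorm(20, sd = 9))
)

# Variance of the scores within each group
group_vars <- tapply(toy$score, toy$group, var)

# The variance ratio: largest group variance divided by the smallest
variance_ratio <- max(group_vars) / min(group_vars)

group_vars
variance_ratio
```

The resulting ratio is what you would compare against the critical value for the relevant n - 1 and number of variances.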


FIGURE 5.8 Selected critical values for Hartley’s Fmax test (x-axis: number of variances being compared)



OLIVER TWISTED

Please Sir, can I have some more ... Hartley’s Fmax?

Oliver thinks that my graph of critical values is stupid. ‘Look at that graph,’
he laughed, ‘it’s the most stupid thing I’ve ever seen since I was at Sussex
Uni and I saw my statistics lecturer, Andy Fie...’. Well, go choke on your
gruel you Dickensian bubo, because the full table of critical values is
in the additional material for this chapter on the companion website.























CRAMMING SAM’S TIPS 


Homogeneity of variance 


• Homogeneity of variance is the assumption that the spread of scores is roughly equal in different groups of cases, or more 
generally that the spread of scores is roughly equal at different points on the predictor variable. 

• When comparing groups, this assumption can be tested with Levene's test. 

• If Levene's test is significant (Pr(>F) in the R output is less than .05) then the variances are significantly different in different
groups. 

• Otherwise, homogeneity of variance can be assumed. 

• The variance ratio is the largest group variance divided by the smallest. This value needs to be smaller than the critical values 
in Figure 5.8. 

• Warning: In large samples Levene’s test can be significant even when group variances are not very different. Therefore, it 
should be interpreted in conjunction with the variance ratio. 


5.8. Correcting problems in the data


The previous section showed us various ways to explore our data; we saw how to look for 
problems with our distribution of scores and how to detect heterogeneity of variance. In 
Chapter 4 we also discovered how to spot outliers in the data. The next question is what 
to do about these problems. 


5.8.1. Dealing with outliers


If you detect outliers in the data there are several options for reducing the impact of these 
values. However, before you do any of these things, it’s worth checking that the data have 
been entered correctly for the problem cases. If the data are correct then the three main 
options you have are: 

1 Remove the case: This entails deleting the data from the person who contributed the 
outlier. However, this should be done only if you have good reason to believe that 
this case is not from the population that you intended to sample. For example, if 
you were investigating factors that affected how much cats purr and one cat didn’t 
purr at all, this would likely be an outlier (all cats purr). Upon inspection, if you discovered
that this cat was actually a dog wearing a cat costume (hence why it didn’t
purr), then you’d have grounds to exclude this case because it comes from a different 
population (dogs who like to dress as cats) than your target population (cats). 

2 Transform the data: Outliers tend to skew the distribution and, as we will see in the 
next section, this skew (and, therefore, the impact of the outliers) can sometimes be 
reduced by applying transformations to the data. 

3 Change the score: If transformation fails, then you can consider replacing the score. 
This on the face of it may seem like cheating (you’re changing the data from what
was actually collected); however, if the score you’re changing is very unrepresentative
and biases your statistical model anyway then changing the score is the lesser of
two evils! There are several options for how to change the score:

a The next highest score plus one: Change the score to be one unit above the next
highest score in the data set.

b Convert back from a z-score: A z-score of 3.29 constitutes an outlier (see Jane
Superbrain Box 4.1), so we can calculate what score would give rise to a z-score
of 3.29 (or perhaps 3) by rearranging the z-score equation in section 1.7.4, which
gives us X = (z × s) + X̄. All this means is that we calculate the mean (X̄) and standard
deviation (s) of the data; we know that z is 3 (or 3.29 if you want to be exact)
so we just add three times the standard deviation to the mean, and replace our
outliers with that score.

c The mean plus two standard deviations: A variation on the above method is to use
the mean plus two times the standard deviation (rather than three times the standard
deviation).
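
The three ways of changing a score can be sketched in R as follows. Everything here is hypothetical: the scores are simulated with one deliberately extreme value added, and flagging outliers as scores with an absolute z-score above 3.29 is one possible rule chosen for the illustration. Note also that the mean and standard deviation are computed from all of the scores, outlier included, which is an assumption of this sketch.

```r
# Illustration only: 50 hypothetical scores plus one extreme value.
set.seed(2)
scores <- c(round(rnorm(50, mean = 15, sd = 2)), 95)

# Flag outliers as scores whose absolute z-score exceeds 3.29
z <- as.vector(scale(scores))
is_outlier <- abs(z) > 3.29

# (a) The next highest score plus one
next_highest_plus_one <- max(scores[!is_outlier]) + 1

# (b) Convert back from a z-score of 3: X = (z * s) + mean
mean_plus_3sd <- mean(scores) + 3 * sd(scores)

# (c) The mean plus two standard deviations
mean_plus_2sd <- mean(scores) + 2 * sd(scores)

# Apply option (a) as an example
scores_fixed <- replace(scores, is_outlier, next_highest_plus_one)
```

Which of the three replacement values you use is a judgement call; the code simply shows that each one is a single line once the outliers have been identified.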


5.8.2. Dealing with non-normality and unequal variances


5.8.2.1. Transforming data

This section is quite hair raising so don’t worry if it doesn’t make much sense - many 
undergraduate courses won’t cover transforming data so feel free to ignore this section if 
you want to. 

We saw in the previous section that you can deal with outliers by transforming the data 
and that these transformations are also useful for correcting problems with normality and 
the assumption of homogeneity of variance. The idea behind transformations 
is that you do something to every score to correct for distributional problems, 
outliers or unequal variances. Although some students (understandably)
think that transforming data sounds dodgy (the phrase ‘fudging your results’
springs to some people’s minds!), in fact it isn’t because you do the same thing
to all of your scores.⁹ As such, transforming the data won’t change the relationships
between variables (the relative differences between people for a given
variable stay the same), but it does change the differences between different 
variables (because it changes the units of measurement). Therefore, if you are 
looking at relationships between variables (e.g., regression) it is alright just to 
transform the problematic variable, but if you are looking at differences within 
variables (e.g., change in a variable over time) then you need to transform all 
levels of those variables. 

Let’s return to our Download Festival data (DownloadFestival.dat) from earlier in the 
chapter. These data were not normal on days 2 and 3 of the festival (section 5.4). Now, we 
might want to look at how hygiene levels changed across the three days (i.e., compare the 
mean on day 1 to the means on days 2 and 3 to see if people got smellier). The data for 
days 2 and 3 were skewed and need to be transformed, but because we might later compare 
the data to scores on day 1, we would also have to transform the day 1 data (even though 
scores were not skewed). If we don’t change the day 1 data as well, then any differences in 
hygiene scores we find from day 1 to day 2 or 3 will be due to us transforming one variable 
and not the others. 
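
Applying one transformation to all three days can be done in a single step. The sketch below is only an illustration: dlf_toy is a hypothetical three-row dataframe (the first participant's day 1 and day 2 scores are the ones quoted in this chapter, the rest are invented), and the log(x + 1) transformation is just one of the options described in Table 5.1.

```r
# Illustration only: a tiny hypothetical version of the festival data.
dlf_toy <- data.frame(
  day1 = c(2.65, 0.97, 0.84),
  day2 = c(1.35, 1.41, NA),   # NA marks a missing score
  day3 = c(1.61, 0.29, NA)
)

# Apply the same transformation to every day so that the days
# remain comparable with each other
day_cols <- c("day1", "day2", "day3")
dlf_toy[paste0("log", day_cols)] <- lapply(
  dlf_toy[day_cols],
  function(x) log(x + 1)
)

dlf_toy
```

Missing scores stay missing after the transformation, so no extra handling is needed for people who were not measured on later days.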




⁹ Although there aren’t statistical consequences of transforming data, there may be empirical or scientific implications
that outweigh the statistical benefits (see Jane Superbrain Box 5.1).

Table 5.1 Data transformations and their uses

Each data transformation is followed by the problems that it can correct for.

Log transformation (log(X_i)): Taking the logarithm of a set of numbers
squashes the right tail of the distribution. As such it’s a good way to reduce
positive skew. However, you can’t take the log of zero or negative numbers,
so if your data tend to zero or produce negative numbers you need to add a
constant to all of the data before you do the transformation. For example, if you
have zeros in the data then do log(X_i + 1), or if you have negative numbers add
whatever value makes the smallest number in the data set positive.
Can correct for: positive skew, unequal variances

Square root transformation (√X_i): Taking the square root of large
values has more of an effect than taking the square root of small values.
Consequently, taking the square root of each of your scores will bring any
large scores closer to the centre - rather like the log transformation. As
such, this can be a useful way to reduce positive skew; however, you still
have the same problem with negative numbers (negative numbers don’t
have a square root).
Can correct for: positive skew, unequal variances

Reciprocal transformation (1/X_i): Dividing 1 by each score also reduces
the impact of large scores. The transformed variable will have a lower
limit of 0 (very large numbers will become close to 0). One thing to bear
in mind with this transformation is that it reverses the scores: scores that
were originally large in the data set become small (close to zero) after
the transformation, but scores that were originally small become big after
the transformation. For example, imagine two scores of 1 and 10; after
the transformation they become 1/1 = 1 and 1/10 = 0.1: the small score
becomes bigger than the large score after the transformation. However, you
can avoid this by reversing the scores before the transformation, by finding
the highest score and changing each score to the highest score minus the
score you’re looking at. So, you do a transformation 1/(X_highest - X_i).
Can correct for: positive skew, unequal variances

Reverse score transformations: Any one of the above transformations
can be used to correct negatively skewed data, but first you have to reverse
the scores. To do this, subtract each score from the highest score obtained,
or the highest score + 1 (depending on whether you want your lowest
score to be 0 or 1). If you do this, don’t forget to reverse the scores back
afterwards, or to remember that the interpretation of the variable is reversed:
big scores have become small and small scores have become big!
Can correct for: negative skew


There are various transformations that you can do to the data that are helpful in correcting
various problems.¹⁰ However, whether these transformations are necessary or useful is
quite a complex issue (see Jane Superbrain Box 5.1). Nevertheless, because they are used by
researchers, Table 5.1 shows some common transformations and their uses.
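
The transformations in Table 5.1 each amount to one line of R. The sketch below is an illustration under stated assumptions: the scores are simulated, the variable names are invented, and adding 1 before taking the reciprocal is my own precaution against dividing by zero rather than part of the table.

```r
# Illustration only: hypothetical positively skewed scores (zeros are possible).
set.seed(3)
pos_skewed <- round(rexp(200, rate = 1), 2)

log_scores   <- log(pos_skewed + 1)   # add 1 because log(0) is undefined
sqrt_scores  <- sqrt(pos_skewed)      # fine here because no score is negative
recip_scores <- 1 / (pos_skewed + 1)  # add 1 to avoid dividing by zero

# Reverse score transformation for negatively skewed data:
# subtract each score from the highest score, then transform as usual
neg_skewed   <- 10 - pos_skewed             # hypothetical negatively skewed scores
reversed     <- max(neg_skewed) - neg_skewed  # lowest reversed score is 0
log_reversed <- log(reversed + 1)
```

After reversing, remember that big values of log_reversed correspond to small values of the original scores.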


5.8.2.2. Choosing a transformation

Given that there are many transformations that you can do, how can you decide which one
is best? The simple answer is trial and error: try one out and see if it helps and if it doesn’t
then try a different one. If you are looking at differences between variables you must apply
the same transformation to all variables (you cannot, for example, apply a log transformation
to one variable and a square root transformation to another). This can be quite time
consuming.

¹⁰ You’ll notice in this section that I keep writing X_i. We saw in Chapter 1 that this refers to the observed score for
the ith person (so, the i could be replaced with the name of a particular person, thus for Graham, X_i = X_Graham =
Graham’s score, and for Carol, X_i = X_Carol = Carol’s score).



JANE SUPERBRAIN 5.1


To transform or not to transform, that is the question


Not everyone agrees that transforming data is a good
idea; for example, Glass, Peckham and Sanders (1972),
in a very extensive review, commented that ‘the payoff of
normalizing transformations in terms of more valid probability
statements is low, and they are seldom considered
to be worth the effort’ (p. 241). In which case, should we
bother?

The issue is quite complicated (especially for this
early in the book), but essentially we need to know
whether the statistical models we apply perform better
on transformed data than they do when applied to data
that violate the assumption that the transformation corrects.
If a statistical model is still accurate even when
its assumptions are broken it is said to be a robust test
(section 5.8.4). I’m not going to discuss whether particular
tests are robust here, but I will discuss the issue for
particular tests in their respective chapters. The question
of whether to transform is linked to this issue of robustness
(which in turn is linked to what test you are performing
on your data).

A good case in point is the F-test in ANOVA (see
Chapter 10), which is often claimed to be robust (Glass
et al., 1972). Early findings suggested that F performed
as it should in skewed distributions and that transforming
the data helped as often as it hindered the accuracy
of F (Games & Lucas, 1966). However, in a lively but
informative exchange, Levine and Dunlap (1982) showed
that transformations of skew did improve the performance
of F; however, in a response, Games (1983) argued
that their conclusion was incorrect, which Levine and
Dunlap (1983) contested in a response to the response.
Finally, in a response to the response to the response,
Games (1984) pointed out several important questions
to consider:

1. The central limit theorem (section 2.5.1) tells us
that in big samples the sampling distribution will
be normal regardless, and this is what’s actually
important, so the debate is academic in anything
other than small samples. Lots of early research
did indeed show that with samples of 40 the
sampling distribution was, as predicted, normal.
However, this research focused
on distributions with light tails and subsequent
work has shown that with heavy-tailed distributions
larger samples would be necessary to invoke the
central limit theorem (Wilcox, 2005). This research
suggests that transformations might be useful for
such distributions.

2. By transforming the data you change the hypothesis
being tested (when using a log transformation
and comparing means you change from comparing
arithmetic means to comparing geometric means).
Transformation also means that you’re now addressing
a different construct than the one originally
measured, and this has obvious implications for
interpreting that data (Gelman & Hill, 2007; Grayson,
2004).

3. In small samples it is tricky to determine normality one 
way or another (tests such as Shapiro-Wilk will have 
low power to detect deviations from normality and 
graphs will be hard to interpret with so few data points). 

4. The consequences for the statistical model of applying
the ‘wrong’ transformation could be worse than the
consequences of analysing the untransformed scores.

As we will see later in the book, there is an extensive
library of robust tests that can be used and which
have considerable benefits over transforming data. The
definitive guide to these is Wilcox’s (2005) outstanding
book.


5.8.3. Transforming the data using R


5.8.3.1. Computing new variables


Transformations are very easy using R. We use one of two general commands: 
newVariable <- function(oldVariable) 

in which function is the function we will use to transform the variable. Or possibly: 
newVariable <- arithmetic with oldVariable(s) 

Let’s first look at some of the simple arithmetic functions: 


+ Addition: We can add two variables together, or add a constant to our variables. For
example, with our hygiene data, ‘day1 + day2’ creates a column in which each row
contains the hygiene score from the column labelled day1 added to the score from the
column labelled day2 (e.g., for participant 1: 2.65 + 1.35 = 4). In R we would execute:

dlf$day1PlusDay2 <- dlf$day1 + dlf$day2

which creates a new variable day1PlusDay2 in the dlf dataframe based on adding the
variables day1 and day2.

- Subtraction: We can subtract one variable from another. For example, we could
subtract the day 1 hygiene score from the day 2 hygiene score. This creates a new
variable in our dataframe in which each row contains the score from the column
labelled day1 subtracted from the score from the column labelled day2 (e.g., for
participant 1: 1.35 - 2.65 = -1.30). Therefore, this person’s hygiene went down by
1.30 (on our 5-point scale) from day 1 to day 2 of the festival. In R we would execute:

dlf$day2MinusDay1 <- dlf$day2 - dlf$day1

which creates a new variable day2MinusDay1 in the dlf dataframe based on subtracting
the variable day1 from day2.

* Multiply: We can multiply two variables together, or we can multiply a variable by any
number. In R, we would execute:

dlf$day1Times5 <- dlf$day1 * 5

which creates a new variable day1Times5 in the dlf dataframe based on multiplying
day1 by 5.

** Exponentiation: Exponentiation is used to raise the preceding term by the power 
OR ~ of the succeeding term. So ‘dayl **2’ or 'dayl ~ 2’ (it doesn’t matter which you use) 
creates a column that contains the scores in the dayl column raised to the power of 
2 (i.e., the square of each number in the dayl column: for participant 1,2.65 2 =7.02). 
Likewise, ‘day1**3’ creates a column with values of dayl cubed. In R, we would 
execute either: 

dlf$day2Squared <- dlf$day2 ** 2 
or 

dlf$day2Squared <- dlf$day2 A 2 

both of which create a new variable day2Squared in the dlf dataframe based on 
squaring values of day2. 
< Less than: This is a logical operator - that means it gives the answer TRUE (or 1)
or FALSE (or 0). If you typed ‘day1 < 1’, R would give the answer TRUE to those
participants whose hygiene score on day 1 of the festival was less than 1 (i.e., if day1 was
0.9999 or less). So, we might use this if we wanted to look only at the people who were
already smelly on the first day of the festival. In R we would execute:

dlf$day1LessThanOne <- dlf$day1 < 1

to create a new variable day1LessThanOne in the dlf dataframe for which the values
are TRUE (or 1) if the value of day1 is less than 1, but FALSE (or 0) if the value of day1 is
1 or greater.

<= Less than or equal to: This is the same as above but returns a response of TRUE (or 1) if the
value of the original variable is equal to or less than the value specified. In R we would execute:

dlf$day1LessThanOrEqualOne <- dlf$day1 <= 1

to create a new variable day1LessThanOrEqualOne in the dlf dataframe for which the
values are TRUE (or 1) if the value of day1 is less than or equal to 1, but FALSE (or 0) if
the value of day1 is greater than 1.

> Greater than: This is the opposite of the less than operator above. It returns a response
of TRUE (or 1) if the value of the original variable is greater than the value specified. In R
we would execute:

dlf$day1GreaterThanOne <- dlf$day1 > 1

to create a new variable day1GreaterThanOne in the dlf dataframe for which the values
are TRUE (or 1) if the value of day1 is greater than 1, but FALSE (or 0) if the value of
day1 is 1 or less.

>= Greater than or equal to: This is the same as above but returns a response of TRUE (or
1) if the value of the original variable is equal to or greater than the value specified. In R
we would execute:

dlf$day1GreaterThanOrEqualOne <- dlf$day1 >= 1

to create a new variable day1GreaterThanOrEqualOne in the dlf dataframe for which
the values are TRUE (or 1) if the value of day1 is greater than or equal to 1, but FALSE
(or 0) if the value of day1 is less than 1.

== Double equals means ‘is equal to?’ It’s a question, rather than an assignment, like a single
equals (=). Therefore, if we write something like dlf$gender == "Male" we are asking ‘is the
value of the variable gender in the dlf dataframe equal to the word ‘Male’?’ In R, if we executed:

dlf$male <- dlf$gender == "Male"

we would create a variable male in the dlf dataframe that contains the value TRUE if the
variable gender was the word ‘Male’ (spelt as it is specified, including capital letters) and
FALSE in all other cases.

!= Not equal to. The opposite of ==. In R, if we executed:

dlf$notMale <- dlf$gender != "Male"

we would create a variable notMale in the dlf dataframe that contains the value TRUE
if the variable gender was not the word ‘Male’ (spelt as it is specified, including capital
letters) and FALSE otherwise.
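To see several of these operators working together, here is a minimal sketch. It uses a tiny made-up dataframe as a stand-in for the festival data (the three rows and their values are assumptions for illustration, not the real data), so you can run it without loading any file.

```r
# Toy stand-in for the festival data (values invented for illustration)
dlf <- data.frame(
  gender = c("Male", "Female", "Female"),
  day1   = c(2.65, 0.97, 3.02),
  day2   = c(1.35, 1.41, NA)
)

# Arithmetic operators work row by row
dlf$day2MinusDay1 <- dlf$day2 - dlf$day1
dlf$day1Squared   <- dlf$day1 ^ 2

# Logical operators return TRUE or FALSE for each row
dlf$day1LessThanOne <- dlf$day1 < 1
dlf$male            <- dlf$gender == "Male"

print(dlf)
```

Notice that the third person has no day 2 score, so their difference score is NA as well: arithmetic on a missing value gives a missing value.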


Some of the most useful functions are listed in Table 5.2, which shows the standard form 
of the function, the name of the function, an example of how the function can be used 
and what R would output if that example were used. There are several basic functions for 
calculating means, standard deviations and sums of columns. There are also functions such 
as the square root and logarithm that are useful for transforming data that are skewed, and 
we will use these functions now. 
Table 5.2 Some useful functions

Function: rowMeans()
Name: Mean for a row
Input example: rowMeans(cbind(dlf$day1, dlf$day2, dlf$day3), na.rm = TRUE)
Output: For each row, R calculates the mean hygiene score across the three days of the
festival. na.rm tells R whether to exclude missing values from the calculation (see R’s
Souls’ Tip 5.3).

Function: rowSums()
Name: Sums for a row
Input example: rowSums(cbind(dlf$day1, dlf$day2, dlf$day3), na.rm = TRUE)
Output: For each row, R calculates the sum of the hygiene scores across the three days of the
festival. na.rm tells R whether to exclude missing values from the calculation (see R’s
Souls’ Tip 5.4).

Function: sqrt()
Name: Square root
Input example: sqrt(dlf$day2)
Output: Produces a column containing the square root of each value in the column labelled day2.

Function: abs()
Name: Absolute value
Input example: abs(dlf$day1)
Output: Produces a variable that contains the absolute value of the values in the column
labelled day1 (absolute values are ones where the signs are ignored: so -5 becomes +5 and
+5 stays as +5).

Function: log10()
Name: Base 10 logarithm
Input example: log10(dlf$day1)
Output: Produces a variable that contains the logarithm (to base 10) values of the variable day1.

Function: log()
Name: Natural logarithm
Input example: log(dlf$day1)
Output: Produces a variable that contains the natural logarithm values of the variable day1.

Function: is.na()
Name: Is missing?
Input example: is.na(dlf$day1)
Output: This is used to determine if a variable is missing or not. If the variable is missing,
the case will be assigned TRUE (or 1); if the case is not missing, the case will be
assigned FALSE (or 0).



The is.na() function and missing data


If we want to count missing data, we can use is.na(). For example, if we want to know whether a person is missing
their day 2 hygiene score, we use:


dlf$missingDay2 <- is.na(dlf$day2)


But we can then use that variable in some clever ways. How many people were missing on day 2? Well, we know
that the variable we just created is TRUE (or 1) if they are missing, so we can just add them up:

sum(dlf$missingDay2)


If we want to be lazy, we can embed those functions in each other, and not bother to create a variable:

sum(is.na(dlf$day2))

which tells us that 546 scores are missing. What proportion of scores is that? Well, we have a 1 if they are missing,
and a zero if not. So the mean of that variable will be the proportion which are missing:

mean(is.na(dlf$day2))


This tells us that the mean is 0.674, so 67.4% of people are missing a hygiene score on day 2.
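The counting trick is easy to check on a vector small enough to count by eye. This is only a sketch: the six scores below are made up (an assumption for illustration), with NA marking a missing score.

```r
# Hypothetical day 2 scores; NA marks a missing score
day2 <- c(1.35, NA, 0.76, NA, NA, 2.10)

missing      <- is.na(day2)    # TRUE where the score is missing
n_missing    <- sum(missing)   # TRUE counts as 1, so this counts the missing scores
prop_missing <- mean(missing)  # the mean of TRUE/FALSE values is the proportion missing

print(n_missing)
print(prop_missing)
```

Three of the six scores are missing, so the sum is 3 and the mean is 0.5.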
5.8.3.2. The log transformation in R

Now we’ve found out some basic information about how to compute variables, let’s
use it to transform our data. To transform the variable day1, and create a new variable
logday1, we execute this command:

dlf$logday1 <- log(dlf$day1)

This command creates a variable called logday1 in the dlf dataframe, which contains values
that are the natural log of the values in the variable day1.

For the day 2 hygiene scores there is a value of 0 in the original data, and there is no
logarithm of the value 0. To overcome this we should add a constant to our original
scores before we take the log of those scores. Any constant will do, provided that it
makes all of the scores greater than 0. In this case our lowest score is 0 in the data set so
we can simply add 1 to all of the scores and that will ensure that all scores are greater
than zero.

The advantage of adding 1 is that the logarithm of 1 is equal to 0, so people who scored
a zero before the transformation score a zero after the transformation. To do this
transformation we would execute:

dlf$logday1 <- log(dlf$day1 + 1)

This command creates a variable called logday1 in the dlf dataframe, which contains values
that are the natural log of the values in the variable day1 after 1 has been added to them.
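A quick sketch shows why the constant matters. The scores below are invented for illustration (an assumption, not the festival data), and they include a zero on purpose.

```r
# Hypothetical hygiene scores that include a zero
scores <- c(0, 0.5, 1, 3)

log_raw   <- log(scores)      # R returns -Inf for log(0), which is no use in an analysis
log_plus1 <- log(scores + 1)  # adding 1 first keeps every value finite

print(log_raw)
print(log_plus1)
```

The person who scored zero now has a transformed score of zero, because log(1) is 0.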



SELF-TEST

Have a go at creating similar variables logday2
and logday3 for the day2 and day3 variables. Plot
histograms of the transformed scores for all three
days.


5.8.3.3. The square root transformation in R

To do a square root transformation, we run through the same process, by using a name
such as sqrtday1. Therefore, to create a variable called sqrtday1 that contains the square
root of the values in the variable day1, we would execute:

dlf$sqrtday1 <- sqrt(dlf$day1)



SELF-TEST

Repeat this process for day2 and day3 to create
variables called sqrtday2 and sqrtday3. Plot
histograms of the transformed scores for all three
days.


5.8.3.4. The reciprocal transformation in R

To do a reciprocal transformation on the data from day 1, we don’t use a function, we use
an arithmetic expression: 1/variable. However, the day 2 data contain a zero value and if
we try to divide 1 by 0 then we won’t get a usable score (you can’t divide by 0, so R returns
Inf, meaning infinity). As such we need to add a constant to our variable just as we did for
the log transformation. Any constant will do, but 1 is a convenient number for these data.
We could use a name such as recday1, and to create this variable we would execute:

dlf$recday1 <- 1/(dlf$day1 + 1)
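Here is a small sketch of the reciprocal at work, using three made-up scores (an assumption for illustration). It shows both what happens to a zero and how the transformation reverses the order of the scores.

```r
# Hypothetical scores, including a zero
scores <- c(0, 1, 3)

rec_raw <- 1 / scores        # the zero becomes Inf
rec     <- 1 / (scores + 1)  # adding 1 first gives 1, 0.5 and 0.25

print(rec_raw)
print(rec)

# The lowest original score now has the highest transformed score
print(order(scores))
print(order(rec))
```

The reversal is worth remembering when you interpret the transformed scores: after a reciprocal transformation, big values mean small original scores.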



SELF-TEST

Repeat this process for day2 and day3. Plot
histograms of the transformed scores for all three
days.


5.8.3.5. The ifelse() function in R


The ifelse() function is used to create a new variable, or change an old variable, depending
on some other values. This function takes the general form:

ifelse(a conditional argument, what happens if the argument is TRUE, what
happens if the argument is FALSE)

This function needs three arguments: a conditional argument to test, what to do if the test
is true, and what to do if the test is false. Let’s use the original data where there was an
outlier in the day1 hygiene score. We can detect this outlier because we know that the
highest score possible on the scale was 4. Therefore, we could set our conditional argument to
be dlf$day1 > 4, which means we’re saying ‘if the value of day1 is greater than 4 then ...’.
The rest of the function tells it what to do, for example, we might want to set it to missing
(NA) if the score is over 4, but keep it as the old score if the score is not over 4. In which
case we could execute this command:

dlf$day1NoOutlier <- ifelse(dlf$day1 > 4, NA, dlf$day1)

This command creates a new variable called day1NoOutlier which takes the value NA if
day1 is greater than 4, but is the value of day1 if day1 is 4 or less:


[Annotated command: dlf$day1NoOutlier <- ifelse(dlf$day1 > 4, NA, dlf$day1). If yes (the
test is TRUE), then the new variable is set to NA (missing).]
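To watch ifelse() make its decision score by score, here is a minimal sketch with four made-up day 1 scores (an assumption for illustration); one of them is impossible on a scale that stops at 4.

```r
# Hypothetical day 1 scores; 20.02 cannot be right on a scale with a maximum of 4
day1 <- c(2.65, 20.02, 4.00, 0.97)

# Test each score: over 4 becomes NA, anything else keeps its old value
day1NoOutlier <- ifelse(day1 > 4, NA, day1)

print(day1NoOutlier)
```

The score of exactly 4 survives, because the test is ‘greater than 4’, not ‘greater than or equal to 4’.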
Careful with missing data


If you have any missing data in your variables, you need to be careful when using functions such as rowMeans()
to get the answer that you want. The problem is what you do when you have some missing values. Here’s a
problem: I have 2 oranges and 3 apples. How many fruits do I have? Obviously, I have a total of 5 fruits.

You have 2 oranges, and we don’t know how many apples - this value is missing. How many fruits do you
have? We could say that you have 2. Or we could say that we don’t know: the answer is missing. If you add apples
and oranges in R, most functions will tell you that the answer is NA (unknown).


oranges <- 2
apples <- NA
apples + oranges


[1] NA

The rowSums() and rowMeans() functions will allow you to choose what to do with missing data, by using the na.rm
option, which asks ‘should missing values (na) be removed (rm)?’

To obtain the mean hygiene score across three days, removing anyone with any missing values, we would use:

dlf$meanHygiene <- rowMeans(cbind(dlf$day1, dlf$day2, dlf$day3))

But a lot of people would be missing. If we wanted to use everyone who had at least one score for the three days,
we would add na.rm = TRUE:

dlf$meanHygiene <- rowMeans(cbind(dlf$day1, dlf$day2, dlf$day3), na.rm = TRUE)

But what would we do if we had 100 days of hygiene scores? And if we didn’t mind if people were missing one or
two scores, but we didn’t want to calculate a mean for people who only had one score? Well, we’d use the is.na()
function first, to count the number of missing scores.

dlf$daysMissing <- rowSums(cbind(is.na(dlf$day1),
                                 is.na(dlf$day2),
                                 is.na(dlf$day3)))

(It’s OK to break a command across rows like that, and sometimes it makes it easier to see that you didn’t make
a mistake.) Then we can use the ifelse() function to calculate values only for those people who have a score on
at least two days, which means people who are missing no more than one score:

dlf$meanHygiene <- ifelse(dlf$daysMissing > 1, NA,
                          rowMeans(cbind(dlf$day1,
                                         dlf$day2,
                                         dlf$day3),
                                   na.rm = TRUE))

Notice how I’ve used spacing so it’s clear which arguments go with which function? That makes it (slightly) easier
to avoid making mistakes. 11
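Because the logic here is easy to get backwards, it helps to try it on data small enough to check by hand. The sketch below uses a made-up four-person dataframe (an assumption for illustration) and keeps a mean only for people who have a score on at least two of the three days.

```r
# Toy data: NA marks a missing hygiene score (values invented for illustration)
dlf <- data.frame(
  day1 = c(2.0, 3.0, 1.0, NA),
  day2 = c(1.0, NA,  NA,  NA),
  day3 = c(3.0, 2.0, NA,  1.5)
)

scores <- cbind(dlf$day1, dlf$day2, dlf$day3)

# Count how many of the three scores each person is missing
dlf$daysMissing <- rowSums(is.na(scores))

# Missing more than one score: set the mean to NA; otherwise average what is there
dlf$meanHygiene <- ifelse(dlf$daysMissing > 1, NA,
                          rowMeans(scores, na.rm = TRUE))

print(dlf)
```

The first two people get means of 2 and 2.5; the last two have only one score each, so their mean is set to NA.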


5.8.3.6. The effect of transformations


Figure 5.9 shows the distributions for days 1 and 2 of the festival after the three different
transformations. Compare these to the untransformed distributions in Figure 5.2. Now,
you can see that all three transformations have cleaned up the hygiene scores for day 2:
the positive skew is reduced (the square root transformation in particular has been useful).
However, because our hygiene scores on day 1 were more or less symmetrical to begin
with, they have now become slightly negatively skewed for the log and square root
transformation, and positively skewed for the reciprocal transformation! 12 If we’re using scores
from day 2 alone then we could use the transformed scores; however, if we wanted to look
at the change in scores then we’d have to weigh up whether the benefits of the
transformation for the day 2 scores outweigh the problems it creates in the day 1 scores - data analysis
can be frustrating sometimes!

[FIGURE 5.9: histograms for day 1 and day 2 of Download. Panels: Log Transformed Hygiene
Score; Square Root of Hygiene Score; Reciprocal of Hygiene Score.]

11 It still took me three tries to get this right.

12 The reversal of the skew for the reciprocal transformation is because, as I mentioned earlier, the reciprocal has
the effect of reversing the scores.


5.8.4. When it all goes horribly wrong


It’s very easy to think that transformations are the answers to all of your broken assumption
prayers. However, as we have seen, there are reasons to think that transformations
are not necessarily a good idea (see Jane Superbrain Box 5.1), and even if you think that
they are, they do not always solve the problem, and even when they do solve
the problem they often create different problems in the process. This happens
more frequently than you might imagine (messy data are the norm).

If you find yourself in the unenviable position of having irksome data then
there are some other options available to you (other than sticking a big samurai
sword through your head). The first is to use a test that does not rely on the
assumption of normally distributed data, and as you go through the various
chapters of this book I’ll point out these tests - there is also a whole chapter
dedicated to them later on. 13 One thing that you will quickly discover about
non-parametric tests is that they have been developed for only a fairly limited
range of situations. So, happy days if you want to compare two means, but sad
and lonely days listening to Joy Division if you have a complex experimental
design.

A much more promising approach is to use robust methods (which I mentioned in Jane
Superbrain Box 5.1). These tests have developed as computers have got more sophisticated
(doing these tests without computers would be only marginally less painful than ripping
off your skin and diving into a bath of salt). How these tests work is beyond the scope of
this book (and my brain), but two simple concepts will give you the general idea. Some
of these procedures use a trimmed mean. A trimmed mean is simply a mean based on the
distribution of scores after some percentage of scores has been removed from each extreme
of the distribution. So, a 10% trimmed mean will remove 10% of scores from the top and
bottom before the mean is calculated. With trimmed means you have to specify the amount
of trimming that you want; for example, you must decide to trim 5%, 10% or perhaps even
20% of scores. A similar robust measure of location is an M-estimator, which differs from a
trimmed mean in that the amount of trimming is determined empirically. In other words,
rather than the researcher deciding before the analysis how much of the data to trim, an
M-estimator determines the optimal amount of trimming necessary to give a robust
estimate of, say, the mean. This has the obvious advantage that you never over- or under-trim
your data; however, the disadvantage is that it is not always possible to reach a solution. In
other words, robust tests based on M-estimators don’t always give you an answer.
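You don’t need any special package to try a trimmed mean: the mean() function in R has a trim option. The sketch below uses ten made-up scores with one extreme value (an assumption for illustration) and compares the ordinary mean with a 10% trimmed mean.

```r
# Hypothetical scores with one extreme value
scores <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 40)

ordinary <- mean(scores)              # dragged upwards by the score of 40
trimmed  <- mean(scores, trim = 0.1)  # drops the lowest 10% and the highest 10% first

print(ordinary)
print(trimmed)
```

With ten scores, trimming 10% removes one score from each end (the 1 and the 40), so the trimmed mean of 3.25 is based on the middle eight scores, whereas the ordinary mean is 6.7.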

We saw in Chapter 2 that the accuracy of the mean depends on a symmetrical distribution,
but a trimmed mean (or M-estimator) produces accurate results even when the
distribution is not symmetrical, because by trimming the ends of the distribution we remove
outliers and skew that bias the mean. Some robust methods work by taking advantage of
the properties of the trimmed mean and M-estimator.




13 For convenience a lot of textbooks refer to these tests as non-parametric tests or assumption-free tests and
stick them in a separate chapter. Actually neither of these terms is particularly accurate (none of these tests is
assumption-free) but in keeping with tradition I’ve put them in a chapter on their own (Chapter 15), ostracized
from their ‘parametric’ counterparts and feeling lonely.
The second general procedure is the bootstrap (Efron & Tibshirani, 1993). The idea of
the bootstrap is really very simple and elegant. The problem that we have is that we don’t
know the shape of the sampling distribution, but normality in our data allows us to infer
that the sampling distribution is normal (and hence we can know the probability of a
particular test statistic occurring). Lack of normality prevents us from knowing the shape of
the sampling distribution unless we have big samples (but see Jane Superbrain Box 5.1).
Bootstrapping gets around this problem by estimating the properties of the sampling
distribution from the sample data. In effect, the sample data are treated as a population from
which smaller samples (called bootstrap samples) are taken (putting the data back before
a new case is drawn). The statistic of interest (e.g., the mean) is calculated in each
sample, and by taking many samples the sampling distribution can be estimated (rather
like in Figure 2.7). The standard error of the statistic is estimated from the standard
deviation of this sampling distribution created from the bootstrap samples. From this standard
error, confidence intervals and significance tests can be computed. This is a very neat way
of getting around the problem of not knowing the shape of the sampling distribution. The
bootstrap can be used in conjunction with trimmed means and M-estimators. For a fairly
gentle introduction to the concept of bootstrapping see Wright, London, and Field (2011).
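The steps of the bootstrap can be sketched in a few lines of base R. Treat this as an illustration under stated assumptions rather than a recipe: the ten scores, the seed and the choice of 2000 bootstrap samples are all made up, each bootstrap sample here is the same size as the original sample (a common convention), and the confidence interval uses the simple ‘estimate plus or minus 1.96 standard errors’ approach.

```r
set.seed(42)  # arbitrary seed so that the sketch gives the same answer each time

# Hypothetical, positively skewed sample of hygiene scores
scores <- c(0.2, 0.5, 0.6, 0.9, 1.1, 1.3, 1.4, 2.0, 2.7, 3.4)

# Draw many bootstrap samples (with replacement) and store the mean of each
n_boot     <- 2000
boot_means <- replicate(n_boot, mean(sample(scores, replace = TRUE)))

# The standard deviation of the bootstrap means estimates the standard error
boot_se <- sd(boot_means)

# A simple 95% confidence interval built from that standard error
boot_ci <- mean(scores) + c(-1.96, 1.96) * boot_se

print(boot_se)
print(boot_ci)
```

The sample() function with replace = TRUE does the ‘putting the data back before a new case is drawn’ part. Swapping mean() for a trimmed mean inside replicate() would bootstrap that statistic instead.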

There are numerous robust tests based on trimmed means, bootstrapping and 
M-estimators described by Rand Wilcox (Figure 5.10) in his definitive text (Wilcox, 2005). 
He has also written functions in R to do these tests (which, when you consider the number 
of tests in his book, is a feat worthy of anyone’s respect and admiration). We cover quite a 
few of these tests in this book. 

There are two ways to access these functions: from a package, and direct from Wilcox’s
website. The package version of the tests is called WRS (although it is what’s known as a
beta version, which means it is not complete). 14 To access this package in R we need to
execute:

install.packages("WRS", repos="http://R-Forge.R-project.org")
library(WRS)

This is a standard install procedure, but note that we have to include
repos="http://R-Forge.R-project.org" because it is not a full package and this instruction tells R where to
find the package. This package is not always implemented in the most recent versions of R
(because it is only a beta) and it is not kept as up to date as Wilcox’s webpage, so although
we tend to refer to the package, to be consistent with the general ethos of downloading
packages, you should also consider sourcing the functions from Wilcox’s website. One
advantage of the website is that he keeps the functions very up to date. To source the
functions from his website, execute:

source("http://www-rcf.usc.edu/~rwilcox/Rallfun-v14")

This command uses the source() function to access the webpage where Wilcox stores
the functions (as a text file). Rallfun-v14 is the name of the file (short for ‘R all functions -
version 14’). Without wishing to state the obvious, you need to be connected to the Internet
for this command to work. Depending on this book’s shelf-life, it is possible that the name
of the file might change (most likely to Rallfun-v15 or Rallfun-v16), so if you get an error
try replacing the v14 at the end with v15 and so on. It’s also possible that Rand might move
his webpage (http://www-rcf.usc.edu/~rwilcox/) in which case Google him, locate the
latest Rallfun file and replace the URL in the source function above with the new one. Having
either loaded the package or sourced the file from the web, you now have access to all of
the functions in Wilcox’s book.


14 Actually, all of the functions are there, but there is very little documentation about what they do, which is why 
it is only at the ‘beta’ stage rather than being a full release. 
FIGURE 5.10 The absolute legend that is Rand Wilcox, who is the man you almost
certainly ought to thank if you want to do a robust test in R




What have I discovered about statistics?


‘You promised us swans,’ I hear you cry, ‘and all we got was normality this,
homosomethingorother that, transform this, it’s all a waste of time that. Where were the bloody
swans?!’ Well, the Queen owns them all so I wasn’t allowed to have them. Nevertheless,
this chapter did negotiate Dante’s eighth circle of hell (Malebolge), where data of
deliberate and knowing evil dwell. That is, data that don’t conform to all of those pesky
assumptions that make statistical tests work properly. We began by seeing what assumptions
need to be met for parametric tests to work, but we mainly focused on the assumptions
of normality and homogeneity of variance. To look for normality we rediscovered the
joys of frequency distributions, but also encountered some other graphs that tell us about
deviations from normality (Q-Q plots). We saw how we can use skew and kurtosis values
to assess normality and that there are statistical tests that we can use (the Shapiro-Wilk
test). While negotiating these evildoers, we discovered what homogeneity of variance is,
and how to test it with Levene’s test and Hartley’s Fmax. Finally, we discovered
redemption for our data. We saw we can cure their sins, make them good, with transformations
(and on the way we discovered some of the uses of the by() function and the
transformation functions). Sadly, we also saw that some data are destined always to be evil.

We also discovered that I had started to read. However, reading was not my true
passion; it was music. One of my earliest memories is of listening to my dad’s rock and soul
records (back in the days of vinyl) while waiting for my older brother to come home
from school, so I must have been about 3 at the time. The first record I asked my parents
to buy me was ‘Take on the World’ by Judas Priest, which I’d heard on Top of the Pops (a
now defunct UK TV show) and liked. This record came out in 1978 when I was 5. Some
people think that this sort of music corrupts young minds. Let’s see if it did ...
R packages used in this chapter


car 

psych 

ggplot2 

Rcmdr 

pastecs 


R functions used in this chapter

abs()
by()
cbind()
describe()
dnorm()
ifelse()
is.na()
leveneTest()
log()
log10()
qplot()
rowMeans()
rowSums()
round()
shapiro.test()
source()
sqrt()
stat.desc()
stat_function()
tapply()

Key terms that I’ve discovered

Bootstrap
Hartley’s Fmax
Heterogeneity of variance
Homogeneity of variance
Independence
Interval data
Levene’s test
Log
M-estimator
Normally distributed
Parametric test
Q-Q plot
Quantile
Robust test
Shapiro-Wilk test
Transformation
Trimmed mean
Variance ratio




Smart Alex’s tasks


• Task 1: Using the ChickFlick.dat data from Chapter 4, check the assumptions of
normality and homogeneity of variance for the two films (ignore gender): are the
assumptions met?

• Task 2: Remember that the numeracy scores were positively skewed in the RExam.dat
data (see Figure 5.5)? Transform these data using one of the transformations
described in this chapter: do the data become normal?

Answers can be found on the companion website.

Further reading 


Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.). Boston: Allyn & 
Bacon. (Chapter 4 is the definitive guide to screening data!) 

Wilcox, R. R. (2005). Introduction to robust estimation and hypothesis testing (2nd ed.). Burlington, 
MA: Elsevier. (Quite technical, but this is the definitive book on robust methods.) 

Wright, D. B., London, K., & Field, A. P. (2011). Using bootstrap estimation and the plug-in principle
for clinical psychology data. Journal of Experimental Psychopathology, 2(2), 252-270. (A fairly
gentle introduction to bootstrapping in R.)











Correlation 




FIGURE 6.1 I don’t have a photo from Christmas 1981, but this was taken about that time at
my grandparents’ house. I’m trying to play an ‘E’ by the looks of it, no doubt because
it’s in ‘Take on the World’.


6.1. What will this chapter tell me?


When I was 8 years old, my parents bought me a guitar for Christmas. Even then, I’d
desperately wanted to play the guitar for years. I could not contain my excitement at getting
this gift (had it been an electric guitar I think I would have actually exploded with
excitement). The guitar came with a ‘learn to play’ book and, after a little while of trying to play
what was on page 1 of this book, I readied myself to unleash a riff of universe-crushing
power onto the world (well, ‘Skip to my Lou’ actually). But, I couldn’t do it. I burst into
tears and ran upstairs to hide. 1 My dad sat with me and said ‘Don’t worry, Andy, everything
is hard to begin with, but the more you practise the easier it gets.’ In his comforting words,
my dad was inadvertently teaching me about the relationship, or correlation, between two
variables. These two variables could be related in three ways: (1) positively related,
meaning that the more I practised my guitar, the better a guitar player I would become (i.e., my
dad was telling me the truth); (2) not related at all, meaning that as I practised the guitar my
playing ability would remain completely constant (i.e., my dad has fathered a cretin); or (3)
negatively related, which would mean that the more I practised my guitar the worse a
guitar player I would become (i.e., my dad has fathered an indescribably strange child). This
chapter looks first at how we can express the relationships between variables statistically by
looking at two measures: covariance and the correlation coefficient. We then discover how
to carry out and interpret correlations in R. The chapter ends by looking at more complex
measures of relationships; in doing so it acts as a precursor to multiple regression, which
we discuss in Chapter 7.


6.2. Looking at relationships



In Chapter 4 I stressed the importance of looking at your data graphically before
running any other analysis on them. I just want to begin by reminding you that our
starting point with a correlation analysis should be to look at some scatterplots
of the variables we have measured. I am not going to repeat how to get R to
produce these graphs, but I am going to urge you (if you haven’t done so already)
to read section 4.5 before embarking on the rest of this chapter.


6.3. How do we measure relationships?


A detour into the murky world of covariance


The simplest way to look at whether two variables are associated is to look at whether they
covary. To understand what covariance is, we first need to think back to the concept of
variance that we met in Chapter 2. Remember that the variance of a single variable
represents the average amount that the data vary from the mean. Numerically, it is described by:

$$\text{variance}(s^2) = \frac{\sum (x_i - \bar{x})^2}{N - 1} = \frac{\sum (x_i - \bar{x})(x_i - \bar{x})}{N - 1} \qquad (6.1)$$

The mean of the sample is represented by $\bar{x}$, $x_i$ is the data point in question and N is the
number of observations (see section 2.4.1). If we are interested in whether two variables
are related, then we are interested in whether changes in one variable are met with similar
changes in the other variable. Therefore, when one variable deviates from its mean we
would expect the other variable to deviate from its mean in a similar way. To illustrate what
I mean, imagine we took five people and subjected them to a certain number of
advertisements promoting toffee sweets, and then measured how many packets of those sweets each
person bought during the next week. The data are in Table 6.1 as well as the mean and
standard deviation (s) of each variable.

Table 6.1 Adverts watched and toffee purchases

Participant:       1    2    3    4    5    Mean    s
Adverts watched    5    4    4    6    8    5.4     1.67
Packets bought     8    9    10   13   15   11.0    2.92

1 This is not a dissimilar reaction to the one I have when publishers ask me for new editions of statistics textbooks.

If there were a relationship between these two variables, then as one variable deviates
from its mean, the other variable should deviate from its mean in the same or the directly
opposite way. Figure 6.2 shows the data for each participant (light blue circles represent the
number of packets bought and dark blue circles represent the number of adverts watched);
the grey line is the average number of packets bought and the blue line is the average
number of adverts watched. The vertical lines represent the differences (remember that these
differences are called deviations) between the observed values and the mean of the relevant
variable. The first thing to notice about Figure 6.2 is that there is a very similar pattern of
deviations for both variables. For the first three participants the observed values are below
the mean for both variables, for the last two people the observed values are above the mean
for both variables. This pattern is indicative of a potential relationship between the two
variables (because it seems that if a person’s score is below the mean for one variable then
their score for the other will also be below the mean).

FIGURE 6.2 Graphical display of the differences between the observed data and the means of two variables

So, how do we calculate the exact similarity between the patterns of differences of the
two variables displayed in Figure 6.2? One possibility is to calculate the total amount of
deviation but we would have the same problem as in the single variable case: the positive
and negative deviations would cancel out (see section 2.4.1). Also, by simply adding the
deviations, we would gain little insight into the relationship between the variables. Now, in
the single variable case, we squared the deviations to eliminate the problem of positive and
negative deviations cancelling out each other. When there are two variables, rather than
squaring each deviation, we can multiply the deviation for one variable by the corresponding
deviation for the second variable. If both deviations are positive or negative then this
will give us a positive value (indicative of the deviations being in the same direction), but
if one deviation is positive and one negative then the resulting product will be negative
(indicative of the deviations being opposite in direction). When we multiply the deviations 
of one variable by the corresponding deviations of a second variable, we get what is known 
as the cross-product deviations. As with the variance, if we want an average value of the 
combined deviations for the two variables, we must divide by the number of observations 
(we actually divide by N − 1 for reasons explained in Jane Superbrain Box 2.2). This averaged
sum of combined deviations is known as the covariance. We can write the covariance
in equation form as in equation (6.2) - you will notice that the equation is the same as the 
equation for variance, except that instead of squaring the differences, we multiply them by 
the corresponding difference of the second variable: 


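$$
\mathrm{cov}(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{N - 1} \tag{6.2}
$$

For the data in Table 6.1 and Figure 6.2 we reach the following value:

$$
\begin{aligned}
\mathrm{cov}(x, y) &= \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{N - 1} \\
&= \frac{(-0.4)(-3) + (-1.4)(-2) + (-1.4)(-1) + (0.6)(2) + (2.6)(4)}{4} \\
&= \frac{1.2 + 2.8 + 1.4 + 1.2 + 10.4}{4} \\
&= \frac{17}{4} \\
&= 4.25
\end{aligned}
$$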


Calculating the covariance is a good way to assess whether two variables are related to 
each other. A positive covariance indicates that as one variable deviates from the mean, 
the other variable deviates in the same direction. On the other hand, a negative covariance 
indicates that as one variable deviates from the mean (e.g., increases), the other deviates 
from the mean in the opposite direction (e.g., decreases). 
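
If you would like to check this arithmetic in R, the sketch below rebuilds the covariance from the cross-product deviations and compares it with R's built-in cov() function. The data are the five pairs of scores used in the calculation above; the object names crossProducts and covByHand are labels made up for this illustration.

```r
# Advert data from Table 6.1
adverts <- c(5, 4, 4, 6, 8)
packets <- c(8, 9, 10, 13, 15)

# Cross-product deviations: each person's deviation on one variable
# multiplied by their deviation on the other
crossProducts <- (adverts - mean(adverts)) * (packets - mean(packets))

# Covariance: the sum of the cross-products divided by N - 1
covByHand <- sum(crossProducts) / (length(adverts) - 1)

covByHand              # the hand calculation
cov(adverts, packets)  # the built-in function, which should agree
```

Both lines should report the same covariance as the worked calculation.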

There is, however, one problem with covariance as a measure of the relationship between 
variables and that is that it depends upon the scales of measurement used. So, covariance is 
not a standardized measure. For example, if we use the data above and assume that they represented
two variables measured in miles then the covariance is 4.25 (as calculated above). If
we then convert these data into kilometres (by multiplying all values by 1.609) and calculate 
the covariance again then we should find that it increases to 11. This dependence on the 
scale of measurement is a problem because it means that we cannot compare covariances 
in an objective way - so, we cannot say whether a covariance is particularly large or small 
relative to another data set unless both data sets were measured in the same units. 
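
To see the scale dependence for yourself, here is a small sketch that pretends the advert data are distances in miles and converts both variables to kilometres. The object names covMiles and covKm are invented for the example. Because both variables are multiplied by 1.609, the covariance is multiplied by 1.609 squared.

```r
adverts <- c(5, 4, 4, 6, 8)
packets <- c(8, 9, 10, 13, 15)

# Covariance in the original units (pretend these are miles)
covMiles <- cov(adverts, packets)

# Convert both variables to kilometres and recompute
covKm <- cov(adverts * 1.609, packets * 1.609)

covMiles          # 4.25
covKm             # roughly 11
covKm / covMiles  # the square of the conversion factor, 1.609^2
```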


6.3.2. Standardization and the correlation coefficient


To overcome the problem of dependence on the measurement scale, we need to convert 
the covariance into a standard set of units. This process is known as standardization. A very 
basic form of standardization would be to insist that all experiments use the same units 
of measurement, say metres - that way, all results could be easily compared. However, 
what happens if you want to measure attitudes - you’d be hard pushed to measure them 
in metres. Therefore, we need a unit of measurement into which any scale of measurement
can be converted. The unit of measurement we use is the standard deviation. We came 
across this measure in section 2.4.1 and saw that, like the variance, it is a measure of the 
average deviation from the mean. If we divide any distance from the mean by the standard 
deviation, it gives us that distance in standard deviation units. For example, for the data in 
Table 6.1, the standard deviation for the number of packets bought is approximately 3.0 
(the exact value is 2.92). In Figure 6.2 we can see that the observed value for participant 
1 was 3 packets less than the mean (so there was an error of −3 packets of sweets). If we
divide this deviation, −3, by the standard deviation, which is approximately 3, then we get
a value of −1. This tells us that the difference between participant 1’s score and the mean
was −1 standard deviation. So, we can express the deviation from the mean for a participant
in standard units by dividing the observed deviation by the standard deviation.
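
As a quick illustration, the sketch below converts every participant's deviation on the packets variable into standard deviation units. The names deviations and zPackets are made up for this example. Note that the text rounds the standard deviation to 3, whereas R uses the exact value (about 2.92), so the first participant comes out at about −1.03 rather than exactly −1.

```r
packets <- c(8, 9, 10, 13, 15)

# Deviation of each score from the mean
deviations <- packets - mean(packets)

# Express each deviation in standard deviation units
zPackets <- deviations / sd(packets)
zPackets

# scale() performs the same centring and dividing in one step
as.vector(scale(packets))
```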

It follows from this logic that if we want to express the covariance in a standard unit of 
measurement we can simply divide by the standard deviation. However, there are two variables
and, hence, two standard deviations. Now, when we calculate the covariance we actually
calculate two deviations (one for each variable) and then multiply them. Therefore,
we do the same for the standard deviations: we multiply them and divide by the product
of this multiplication. The standardized covariance is known as a correlation coefficient and
is defined by equation (6.3), in which s_x is the standard deviation of the first variable and
s_y is the standard deviation of the second variable (all other letters are the same as in the
equation defining covariance):

$$
r = \frac{\mathrm{cov}_{xy}}{s_x s_y} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(N - 1) s_x s_y} \tag{6.3}
$$

The coefficient in equation (6.3) is known as the Pearson product-moment correlation coefficient
or Pearson correlation coefficient (for a really nice explanation of why it was originally
called the ‘product-moment’ correlation, see Miles & Banyard, 2007) and was invented by
Karl Pearson (see Jane Superbrain Box 6.1).² If we look back at Table 6.1 we see that the
standard deviation for the number of adverts watched (s_x) was 1.67, and for the number
of packets of crisps bought (s_y) was 2.92. If we multiply these together we get 1.67 × 2.92
= 4.88. Now, all we need to do is take the covariance, which we calculated a few pages
ago as being 4.25, and divide by these multiplied standard deviations. This gives us r =
4.25/4.88 = .87.
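
The same calculation can be sketched in R by dividing the covariance by the product of the two standard deviations and comparing the answer with the built-in cor() function. The name rByHand is just a label for this illustration.

```r
adverts <- c(5, 4, 4, 6, 8)
packets <- c(8, 9, 10, 13, 15)

# Standardize the covariance by the product of the standard deviations
rByHand <- cov(adverts, packets) / (sd(adverts) * sd(packets))

round(rByHand, 2)      # .87, as in the text
cor(adverts, packets)  # the built-in function gives the same value
```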

By standardizing the covariance we end up with a value that has to lie between −1
and +1 (if you find a correlation coefficient less than −1 or more than +1 you can be
sure that something has gone hideously wrong!). A coefficient of +1 indicates that the
two variables are perfectly positively correlated, so as one variable increases, the other
increases by a proportionate amount. Conversely, a coefficient of −1 indicates a perfect
negative relationship: if one variable increases, the other decreases by a proportionate
amount. A coefficient of zero indicates no linear relationship at all: as one variable
changes, the other shows no consistent linear tendency to increase or decrease. We also saw
in section 2.6.4 that because the correlation
coefficient is a standardized measure of an observed effect, it is a commonly used
measure of the size of an effect and that values of ±.1 represent a small effect, ±.3 is a
medium effect and ±.5 is a large effect (although I re-emphasize my caveat that these 
canned effect sizes are no substitute for interpreting the effect size within the context of 
the research literature). 
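
A few made-up numbers show what the extreme values look like. In the sketch below (the vector x is invented for the illustration) the first two correlations are exactly +1 and −1 because one variable is a straight-line function of the other. The third is zero even though the second variable is completely determined by the first, which is a reminder that a coefficient of zero rules out only a linear relationship.

```r
x <- c(-2, -1, 0, 1, 2)

cor(x, 2 * x + 3)  # perfect positive relationship
cor(x, -x)         # perfect negative relationship
cor(x, x^2)        # no linear relationship, although x^2 depends on x
```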


2 You will find Pearson’s product-moment correlation coefficient denoted by both r and R. Typically, the upper-case
form is used in the context of regression because it represents the multiple correlation coefficient; however,
for some reason, when we square r (as in section 6.5.4.3) an upper case R is used. Don’t ask me why - it’s just 
to confuse me, I suspect. 
JANE SUPERBRAIN 6.1

Who said statistics was dull?

Students often think that statistics is dull, but back in the 
early 1900s it was anything but dull, with various prominent
figures entering into feuds on a soap opera scale.
One of the most famous was between Karl Pearson and 
Ronald Fisher (whom we met in Chapter 2). It began 
when Pearson published a paper of Fisher’s in his journal 
but made comments in his editorial that, to the casual 
reader, belittled Fisher’s work. Two years later Pearson’s 
group published work following on from Fisher’s paper 
without consulting him. The antagonism persisted with 
Fisher turning down a job to work in Pearson’s group and 
publishing ‘improvements’ on Pearson’s ideas. Pearson 
for his part wrote in his own journal about apparent errors 
made by Fisher. 


Another prominent statistician, Jerzy Neyman, criticized
some of Fisher’s most important work in a paper
delivered to the Royal Statistical Society on 28 March 
1935 at which Fisher was present. Fisher’s discussion 
of the paper at that meeting directly attacked Neyman. 
Fisher more or less said that Neyman didn’t know 
what he was talking about and didn’t understand the 
background material on which his work was based. 
Relations soured so much that while they both worked 
at University College London, Neyman openly attacked 
many of Fisher’s ideas in lectures to his students. The 
two feuding groups even took afternoon tea (a common
practice in the British academic community of the
time) in the same room but at different times! The truth 
behind who fuelled these feuds is, perhaps, lost in the 
mists of time, but Zabell (1992) makes a sterling effort 
to unearth it. 

Basically, then, the founders of modern statistical
methods were a bunch of squabbling children.
Nevertheless, these three men were astonishingly gifted 
individuals. Fisher, in particular, was a world leader in 
genetics, biology and medicine as well as possibly the 
most original mathematical thinker ever (Barnard, 1963; 
Field, 2005c; Savage, 1976). 


6.3.3. The significance of the correlation coefficient


Although we can directly interpret the size of a correlation coefficient, we have seen in 
Chapter 2 that scientists like to test hypotheses using probabilities. In the case of a correlation
coefficient we can test the hypothesis that the correlation is different from zero (i.e.,
different from ‘no relationship’). If we find that our observed coefficient was very unlikely
to happen if there was no effect in the population, then we can gain confidence that the 
relationship that we have observed is statistically meaningful. 

There are two ways that we can go about testing this hypothesis. The first is to use our 
trusty z-scores that keep cropping up in this book. As we have seen, z-scores are useful 
because we know the probability of a given value of z occurring, if the distribution from 
which it comes is normal. There is one problem with Pearson’s r, which is that it is known 
to have a sampling distribution that is not normally distributed. This is a bit of a nuisance, 
but luckily, thanks to our friend Fisher, we can adjust r so that its sampling distribution is 
normal as follows (Fisher, 1921): 


$$
z_r = \frac{1}{2} \log_e\left(\frac{1 + r}{1 - r}\right) \tag{6.4}
$$

The resulting z_r has a standard error of:

$$
SE_{z_r} = \frac{1}{\sqrt{N - 3}} \tag{6.5}
$$
For our advert example, our r = .87 becomes 1.33 with a standard error of .71.
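
Here is a minimal sketch of that transformation in R, starting from the rounded value r = .87 and N = 5 used in the text. The object names zr and seZr are made up for the example. R's built-in atanh() function is the same transformation, so it makes a handy check.

```r
r <- 0.87  # correlation for the advert data, rounded as in the text
N <- 5     # number of participants

# Fisher's transformation of r
zr <- 0.5 * log((1 + r) / (1 - r))

# Standard error of the transformed value
seZr <- 1 / sqrt(N - 3)

round(zr, 2)    # 1.33
round(seZr, 2)  # 0.71
atanh(r)        # the same transformation, built into R
```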

We can then transform this adjusted r into a z-score just as we have done for raw scores, 
and for skewness and kurtosis values in previous chapters. If we want a z-score that represents
the size of the correlation relative to a particular value, then we simply compute
a z-score using the value that we want to test against and the standard error. Normally
we want to see whether the correlation is different from 0, in which case we can subtract
0 from the observed value of z_r and divide by the standard error (in other words, we just
divide z_r by its standard error):


$$
z = \frac{z_r}{SE_{z_r}} \tag{6.6}
$$


For our advert data this gives us 1.33/.71 = 1.87. We can look up this value of z (1.87) 
in the table for the normal distribution in the Appendix and get the one-tailed probability 
from the column labelled ‘Smaller Portion’. In this case the value is .0307. To get the two-tailed
probability we simply multiply the one-tailed probability value by 2, which gives us
.0614. As such the correlation is significant, p < .05, one-tailed, but not two-tailed. 
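
Instead of using the table in the Appendix, you can get the same probabilities from R's pnorm() function. The sketch below starts from the rounded values in the text (1.33 and .71) and rounds z to two decimal places to mimic a table lookup; the object names are invented for the illustration. Because R does not round the one-tailed probability before doubling it, the two-tailed value may differ from the hand calculation in the last decimal place.

```r
zr   <- 1.33  # transformed correlation, rounded as in the text
seZr <- 0.71  # its standard error, rounded as in the text

# z-score for testing the correlation against zero
z <- round(zr / seZr, 2)
z  # 1.87

# Probability in the upper tail of the standard normal distribution
pOneTailed <- pnorm(z, lower.tail = FALSE)
pTwoTailed <- 2 * pOneTailed

round(pOneTailed, 4)
pTwoTailed
```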

In fact, the hypothesis that the correlation coefficient is different from 0 is usually tested
(R, for example, does this) not using a z-score, but using a t-statistic with N − 2 degrees
of freedom, which can be directly obtained from r:


$$
t_r = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}} \tag{6.7}
$$


You might wonder, then, why I told you about z-scores. Partly it was to keep the discussion
framed in concepts with which you are already familiar (we don’t encounter the
t-test properly for a few chapters), but also it is useful background information for the next 
section. 
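
To connect the t-statistic to what R reports, the sketch below computes it by hand for the advert data and compares it with the output of cor.test(). The names tByHand and test are labels made up for this example. The hand calculation uses the unrounded correlation, so it will not exactly match a value worked out from r = .87.

```r
adverts <- c(5, 4, 4, 6, 8)
packets <- c(8, 9, 10, 13, 15)

r <- cor(adverts, packets)
N <- length(adverts)

# t-statistic computed directly from r
tByHand <- r * sqrt(N - 2) / sqrt(1 - r^2)

# The same test as carried out by R
test <- cor.test(adverts, packets)

tByHand
test$statistic  # the t-statistic reported by cor.test()
test$parameter  # degrees of freedom, N - 2
test$p.value    # two-tailed p-value based on t
```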


6.3.4. Confidence intervals for r


Confidence intervals tell us something about the likely value (in this case of the correlation) 
in the population. To understand how confidence intervals are computed for r, we need to 
take advantage of what we learnt in the previous section about converting r to z (to make 
the sampling distribution normal), and using the associated standard errors. We can then 
construct a confidence interval in the usual way. For a 95% confidence interval we have 
(see section 2.5.2.1): 


lower boundary of confidence interval = X̄ − (1.96 × SE)

upper boundary of confidence interval = X̄ + (1.96 × SE)

In the case of our transformed correlation coefficients these equations become:

lower boundary of confidence interval = z_r − (1.96 × SE_{z_r})

upper boundary of confidence interval = z_r + (1.96 × SE_{z_r})









DISCOVERING STATISTICS USING R 


For our advert data this gives us 1.33 − (1.96 × .71) = −0.062, and 1.33 + (1.96 × .71)
= 2.72. Remember that these values are in the z_r metric and so we have to convert back to
correlation coefficients using: 




$$
r = \frac{e^{2 z_r} - 1}{e^{2 z_r} + 1} \tag{6.8}
$$


This gives us an upper bound of r = .991 and a lower bound of −0.062 (because this value
is so close to zero the transformation to z has no impact). 
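
The whole confidence interval calculation can be sketched in a few lines of R. The code below starts from the rounded values in the text (1.33 and .71); zToR() is a helper function written for this illustration, not a built-in R function. The built-in tanh() function does the same back-transformation, so it is used as a check.

```r
zr   <- 1.33  # transformed correlation, rounded as in the text
seZr <- 0.71  # its standard error, rounded as in the text

# 95% confidence limits in the z_r metric
lowerZ <- zr - 1.96 * seZr
upperZ <- zr + 1.96 * seZr

# Convert a value in the z_r metric back to a correlation coefficient
zToR <- function(z) (exp(2 * z) - 1) / (exp(2 * z) + 1)

round(c(lowerZ, upperZ), 3)        # limits in the z_r metric
round(zToR(c(lowerZ, upperZ)), 3)  # limits as correlation coefficients
tanh(c(lowerZ, upperZ))            # built-in equivalent of zToR()
```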



CRAMMING SAM’S TIPS 


Correlation 


• A crude measure of the relationship between variables is the covariance. 

• If we standardize this value we get Pearson’s correlation coefficient, r. 

• The correlation coefficient has to lie between -1 and +1. 

• A coefficient of +1 indicates a perfect positive relationship, a coefficient of -1 indicates a perfect negative relationship, and 
a coefficient of 0 indicates no linear relationship at all. 

• The correlation coefficient is a commonly used measure of the size of an effect: values of ±.1 represent a small effect, ±.3
is a medium effect and ±.5 is a large effect. However, if you can, try to interpret the size of correlation within the context of
the research you’ve done rather than blindly following these benchmarks. 


6.3.5. A word of warning about interpretation: causality


Considerable caution must be taken when interpreting correlation coefficients because 
they give no indication of the direction of causality. So, in our example, although we can 
conclude that as the number of adverts watched increases, the number of packets of toffees 
bought increases also, we cannot say that watching adverts causes you to buy packets of 
toffees. This caution is for two reasons: 

• The third-variable problem: We came across this problem in section 1.6.2. To recap, 
in any correlation, causality between two variables cannot be assumed because there 
may be other measured or unmeasured variables affecting the results. This is known as 
the third-variable problem or the tertium quid (see section 1.6.2 and Jane Superbrain 
Box 1.1). 

• Direction of causality: Correlation coefficients say nothing about which variable 
causes the other to change. Even if we could ignore the third-variable problem 
described above, and we could assume that the two correlated variables were the only 
important ones, the correlation coefficient doesn’t indicate in which direction causality
operates. So, although it is intuitively appealing to conclude that watching adverts
causes us to buy packets of toffees, there is no statistical reason why buying packets 
of toffees cannot cause us to watch more adverts. Although the latter conclusion 
makes less intuitive sense, the correlation coefficient does not tell us that it isn’t true. 
6.4. Data entry for correlation analysis


Data entry for correlation, regression and multiple regression is straightforward because 
each variable is entered in a separate column. If you are preparing your data in software 
other than R then this means that, for each variable you have measured, you create a variable
in the spreadsheet with an appropriate name, and enter a participant’s scores across
one row of the spreadsheet. There may be occasions on which you have one or more categorical
variables (such as gender) and these variables can also be entered in a column - see
section 3.7 for more detail. 

As an example, if we wanted to calculate the correlation between the two variables in 
Table 6.1 we would enter these data as in Figure 6.3. You can see that each variable is 
entered in a separate column, and each row represents a single individual’s data (so the first 
consumer saw 5 adverts and bought 8 packets). 

If you have a small data set you might want to enter the variables directly into R and 
then create a dataframe from them. For the advert data this can be done by executing the 
following commands (see section 3.5): 

adverts<-c(5,4,4,6,8) 
packets<-c(8,9,10,13,15) 
advertData<-data.frame(adverts, packets) 



FIGURE 6.3 Data entry for correlation using Excel



SELF-TEST 

Enter the advert data and use ggplot2 to produce a
scatterplot (number of packets bought on the y-axis,
and adverts watched on the x-axis) of the data.


6.5. Bivariate correlation


There are two types of correlation: bivariate and partial. A bivariate correlation is a correlation
between two variables (as described at the beginning of this chapter) whereas a
partial correlation (see section 6.6) looks at the relationship between two variables while
‘controlling’ the effect of one or more additional variables. Pearson’s product-moment correlation
coefficient (described earlier), Spearman’s rho (see section 6.5.5) and Kendall’s tau
(see section 6.5.6) are examples of bivariate correlation coefficients. 

Let’s return to the example from Chapter 4 about exam scores. Remember that a psychologist
was interested in the effects of exam stress and revision on exam performance.
She had devised and validated a questionnaire to assess state anxiety relating to exams
(called the Exam Anxiety Questionnaire, or EAQ). This scale produced a measure of anxiety
scored out of 100. Anxiety was measured before an exam, and the percentage mark
of each student on the exam was used to assess the exam performance. She also measured 
the number of hours spent revising. These data are in Exam Anxiety.dat on the companion 
website. We already created scatterplots for these data (section 4.5) so we don’t need to 
do that again. 


6.5.1. Packages for correlation analysis in R


There are several packages that we will use in this chapter. Some of them can be accessed 
through R Commander (see the next section) but others can’t. For the examples in this 
chapter you will need the packages Hmisc, polycor, boot, ggplot2 and ggm. If you do not
have these packages installed (some should be installed from previous chapters), you can
install them by executing the following commands (boot comes with the standard R installation
and doesn’t need to be installed):

install.packages("Hmisc"); install.packages("ggm");
install.packages("ggplot2"); install.packages("polycor")

You then need to load these packages by executing the commands: 

library(boot); library(ggm); library(ggplot2); library(Hmisc); 
library(polycor) 


6.5.2. General procedure for correlations using R Commander


To conduct a bivariate correlation using R Commander, first initiate the package by executing
(and install it if you haven’t - see section 3.6):

library(Rcmdr) 

You then need to load the data file into R Commander by using the Data ⇒ Import
data ⇒ from text file, clipboard, or URL... menu (see section 3.7.3). Once the data are
loaded in a dataframe (I have called the dataframe examData), you can use either the
Statistics ⇒ Summaries ⇒ Correlation matrix... or the Statistics ⇒ Summaries ⇒ Correlation test...
menu to get the correlation coefficients. These menus and their dialog boxes are shown in 
Figure 6.4. 

The correlation matrix menu should be selected if you want to get correlation 
coefficients for more than two variables (in other words, produce a grid of correlation 
coefficients); the correlation test menu should be used when you want only a single correlation
coefficient. Both menus enable you to compute Pearson’s product-moment correlation
and Spearman’s correlation, and both can be used to produce p-values associated with
these correlations. However, there are some important differences too: the correlation test 
menu enables you to compute Kendall’s correlation, produces a confidence interval and
allows you to select both two-tailed and one-tailed tests, but can be used to compute only 
one correlation coefficient at a time; in contrast, the correlation matrix cannot produce 
Kendall’s correlation but can compute partial correlations, and can also compute multiple 
correlations from a single command. 

Let’s look at the Correlation Matrix dialog box first. Having accessed the main dialog box,
you should find that the variables in the dataframe are listed on the left-hand side of the
dialog box (Figure 6.4). You can select the variables that you want from the list by clicking
with the mouse while holding down the Ctrl key. R will create a grid of correlation coefficients
for all of the combinations of variables that you have selected. This table is called a
correlation matrix. For our current example, select the variables Exam, Anxiety and Revise.
Having selected the variables of interest you can choose between three correlation coefficients:
Pearson’s product-moment correlation coefficient (Pearson product-moment), Spearman’s
rho (Spearman rank-order) and a partial correlation (Partial). Any of these can be selected by
clicking on the appropriate tick-box with a mouse. Finally, if you would like p-values for the
correlation coefficients then select the p-values option for Pearson or Spearman correlations.³

For the correlation test dialog box you will again find that the variables in the dataframe
are listed on the left-hand side of the dialog box (Figure 6.4). You can select only two by
clicking with the mouse while holding down the Ctrl key. Having selected the two variables
of interest, choose between three correlation coefficients: Pearson’s product-moment correlation
coefficient (Pearson product-moment), Spearman’s rho (Spearman rank-order) and Kendall’s
tau (Kendall’s tau). In addition, it is possible to specify whether or not the test is one- or
two-tailed (see section 2.6.2). To recap, a two-tailed test (the default) should be used when you
cannot predict the nature of the relationship (i.e., ‘I’m not sure whether exam anxiety will
improve or reduce exam marks’). If you have a non-directional hypothesis like this, click
on Two-sided. A one-tailed test should be selected when you have a directional hypothesis.
With correlations, the direction of the relationship can be positive (e.g., ‘the more anxious
someone is about an exam, the better their mark will be’) or negative (e.g., ‘the more
anxious someone is about an exam, the worse their mark will be’). A positive relationship
means that the correlation coefficient will be greater than 0; therefore, if you predict a
positive correlation coefficient then select Correlation > 0. However, if you predict a negative
relationship then the correlation coefficient will be less than 0, so select Correlation < 0. For
both the correlation matrix and correlation test dialog boxes click on OK to generate the
output.

FIGURE 6.4 Conducting a bivariate correlation using R Commander

3 Selecting this option changes the function that R Commander uses to generate the output. If this option is not
selected then the function cor() is used, but if it is selected rcorr() is used (which is part of the Hmisc package).
The main implication is that rcorr() reports the results to only 2 decimal places (see the next section for a full
description of these functions).


6.5.3. General procedure for correlations using R


To compute basic correlation coefficients there are three main functions that can be used: 
cor(), cor.test() and rcorr(). Table 6.2 shows the main differences between the three functions.
The functions cor() and cor.test() are part of the base system in R, but rcorr() is part
of the Hmisc package, so make sure you have it loaded.

Table 6.2 should help you to decide which function is best in a particular situation: if you
want a confidence interval then you will have to use cor.test(), and if you want correlation
coefficients for multiple pairs of variables then you cannot use cor.test(); similarly, if you
want p-values then cor() won’t help you. You get the gist. 


Table 6.2 Attributes of different functions for obtaining correlations

Function     Pearson  Spearman  Kendall  p-values  CI   Multiple correlations?  Comments
cor()        ✓        ✓         ✓        -         -    ✓
cor.test()   ✓        ✓         ✓        ✓         ✓    -
rcorr()      ✓        ✓         -        ✓         -    ✓                       2 d.p. only


We will look at each function in turn and see what parameters it uses. Let’s start with 
cor(), which takes the general form: 

cor(x,y, use = "string", method = "correlation type") 
in which: 

• x is a numeric variable or dataframe. 

• y is another numeric variable (does not need to be specified if x above is a dataframe). 

• use is set equal to a character string that specifies how missing values are handled. 
The strings can be: (1) “everything”, which will mean that R will output an NA 
instead of a correlation coefficient for any correlations involving variables containing 
missing values; (2) “all.obs”, which will use all observations and, therefore, returns 
an error message if there are any missing values in the data; (3) “complete.obs”, in 
which correlations are computed from only cases that are complete for all variables - 
sometimes known as excluding cases listwise (see R’s Souls’ Tip 6.1); or (4) “pairwise.complete.obs”,
in which correlations between pairs of variables are computed for
cases that are complete for those two variables - sometimes known as excluding cases 
pairwise (see R’s Souls’ Tip 6.1). 
• method enables you to specify whether you want “pearson”, “spearman” or “kendall”
correlations (note that all are written in lower case). Only one type can be requested
in each call, so if you want more than one type of correlation coefficient, run cor()
once for each method.

If we stick with our exam anxiety data, then we could get Pearson correlations between all 
variables by specifying the dataframe (examData):

cor(examData, use = "complete.obs", method = "pearson") 

If we want a single correlation between a pair of variables (e.g., Exam and Anxiety) then 
we’d specify both variables instead of the dataframe: 

cor(examData$Exam, examData$Anxiety, use = "complete.obs", method = "pearson") 

We can get a different type of correlation (e.g., Kendall’s tau) by changing the method 
command: 

cor(examData$Exam, examData$Anxiety, use = "complete.obs", method = "kendall") 

We can also change how we deal with missing values, for example, by asking for pairwise 
exclusion: 

cor(examData$Exam, examData$Anxiety, use = "pairwise.complete.obs", method = "kendall")


R’s Souls’ Tip 6.1

Exclude cases pairwise or listwise?


As we discover various functions in this book, many of them have options that determine how missing data are 
handled. Sometimes we can decide to exclude cases ‘pairwise’ or ‘listwise’. Listwise means that if a case has
a missing value for any variable, then they are excluded from the whole analysis. So, for example, in our exam 
anxiety data if one of our students had reported their anxiety and we knew their exam performance but we didn’t 
have data about their revision time, then their data would not be used to calculate any of the correlations: they 
would be completely excluded from the analysis. Another option is to exclude cases on a pairwise basis, which 
means that if a participant has a score missing for a particular variable or analysis, then their data are excluded 
only from calculations involving the variable for which they have no score. For our student about whom we don’t 
have any revision data, this means that their data would be excluded when calculating the correlation between 
exam scores and revision time, and when calculating the correlation between exam anxiety and revision time; 
however, the student’s scores would be included when calculating the correlation between exam anxiety and 
exam performance because for this pair of variables we have both of their scores. 
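
To make the difference concrete, here is a small sketch using an invented dataframe (toyData is hypothetical and is not the real exam anxiety data) in which one student has no revision score. It runs cor() with three different settings of the use option.

```r
# Hypothetical scores for six students; the fourth has no revision data
toyData <- data.frame(
  Exam    = c(40, 65, 80, 52, 91, 70),
  Anxiety = c(86, 88, 60, 80, 50, 72),
  Revise  = c(4, 11, 27, NA, 45, 20)
)

# Default ("everything"): any correlation involving Revise is NA
everything <- cor(toyData)

# Listwise: the fourth student is dropped from every correlation
listwise <- cor(toyData, use = "complete.obs")

# Pairwise: the fourth student is dropped only where Revise is involved
pairwise <- cor(toyData, use = "pairwise.complete.obs")

everything
listwise
pairwise
```

In the pairwise matrix the correlation between Exam and Anxiety is based on all six students, whereas in the listwise matrix it is based on only the five students with complete data.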


The function rcorr() is fairly similar to cor(). It takes the general form:

rcorr(x,y, type = "correlation type")

in which:

• x is a numeric variable or matrix. 

• y is another numeric variable (does not need to be specified if x above is a matrix). 

• type enables you to specify whether you want “pearson” or “spearman” correlations.
Only one type is computed in each call, so run rcorr() twice if you want both.

A couple of things to note: first, this function does not work on dataframes, so you have to
convert your dataframe to a matrix first (see section 3.9.2); second, this function excludes 
cases pairwise (see R’s Souls’ Tip 6.1) and there is no way to change this setting. Therefore, if 
you have two numeric variables (that are not part of a dataframe) called Exam and Anxiety 
then you could compute the Pearson correlation coefficient and its p-value by executing: 

rcorr(Exam, Anxiety, type = "pearson") 

Similarly, you could compute Pearson correlations (and their p-values) between all variables
in a matrix called examData by executing:

rcorr(examData, type = "pearson")
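
As a rough sketch of the whole workflow, the code below converts an invented dataframe (toyData is hypothetical, not the real exam anxiety data) to a matrix and passes it to rcorr(). It assumes that the Hmisc package is installed, and the requireNamespace() check simply skips the example if it is not.

```r
# Hypothetical scores for six students
toyData <- data.frame(
  Exam    = c(40, 65, 80, 52, 91, 70),
  Anxiety = c(86, 88, 60, 80, 50, 72),
  Revise  = c(4, 11, 27, 9, 45, 20)
)

# rcorr() needs a matrix rather than a dataframe
toyMatrix <- as.matrix(toyData)

if (requireNamespace("Hmisc", quietly = TRUE)) {
  result <- Hmisc::rcorr(toyMatrix, type = "pearson")
  print(result$r)  # matrix of correlation coefficients
  print(result$n)  # number of cases used for each pair
  print(result$P)  # matrix of p-values
}
```

Printing the result object itself shows the coefficients rounded to 2 decimal places, whereas the r element holds the unrounded values.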

The function cor.test() can be used only on pairs of variables (not a whole dataframe) and 
takes the general form: 

cor.test(x, y, alternative = "string", method = "correlation type", conf.level = 0.95)

in which: 


• x is a numeric variable. 

• y is another numeric variable. 

• alternative specifies whether you want to do a two-tailed test (alternative = “two.sided”),
which is the default, or whether you predict that the correlation will be less
than zero (i.e., negative) or more than zero (i.e., positive), in which case you can use
alternative = “less” and alternative = “greater”, respectively.

• method is the same as for cor() described above. 

• conf.level allows you to specify the width of the confidence interval computed for
the correlation. The default is 0.95 (conf.level = 0.95) and if this is what you want
then you don’t need to use this command, but if you wanted a 90% or 99% confidence
interval you could use conf.level = 0.9 and conf.level = 0.99, respectively.
Confidence intervals are produced only for Pearson’s correlation coefficient. 


Using our exam anxiety data, if we want a single correlation coefficient, its two-tailed 
p-value and 95% confidence interval between a pair of variables (for example, Exam and 
Anxiety) then we’d specify it much like we did for cor(): 

cor.test(examData$Exam, examData$Anxiety, method = "pearson") 

If we predicted a negative correlation then we could add in the alternative command: 

cor.test(examData$Exam, examData$Anxiety, alternative = "less", method = "pearson")

We could also specify a different confidence interval than 95%: 

cor.test(examData$Exam, examData$Anxiety, alternative = "less", method = "pearson",
conf.level = 0.99)

Hopefully you get the general idea. We will now move on to look at some examples of 
specific types of correlation coefficients. 



OLIVER TWISTED 

Please Sir, can I have 
some more ... variance 
and covariance? 


Oliver is so excited to get onto analysing his data that he doesn’t want
me to spend pages waffling on about variance and covariance. ‘Stop
writing, you waffling fool,’ he says. ‘I want to analyse my data.’ Well, he’s
got a point. If you want to find out more about two functions for calculating
variances and covariances that are part of the cor() family, then the
additional material for this chapter on the companion website will tell you.

6.5.4. Pearson’s correlation coefficient


6.5.4.1. Assumptions of Pearson’s r

Pearson’s (Figure 6.5) correlation coefficient was described in full at the beginning of this 
chapter. Pearson’s correlation requires only that data are interval (see section 1.5.1.2) for it 
to be an accurate measure of the linear relationship between two variables. However, if you 
want to establish whether the correlation coefficient is significant, then more assumptions 
are required: for the test statistic to be valid the sampling distribution has to be normally 
distributed and as we saw in Chapter 5 we assume that it is if our sample data are normally 
distributed (or if we have a large sample). Although typically, to assume that the sampling 
distribution is normal, we would want both variables to be normally distributed, there is 
one exception to this rule: one of the variables can be a categorical variable provided there 
are only two categories (in fact, if you look at section 6.5.7 you’ll see that this is the same 
as doing a t-test, but I’m jumping the gun a bit). In any case, if your data are non-normal 
(see Chapter 5) or are not measured at the interval level then you should use a different 
kind of correlation coefficient or use bootstrapping. 



FIGURE 6.5 

Karl Pearson 


6.5.4.2. Computing Pearson’s r using R

That’s a confusing title. We have already gone through the nuts and bolts of using R 
Commander and the command line to calculate Pearson’s r. We’re going to use the exam 
anxiety data to get some hands-on practice. 



SELF-TEST 

Load the Exam Anxiety.dat file into a dataframe called examData.

Let’s look at a sample of this dataframe:

   Code Revise Exam Anxiety Gender
1     1      4   40  86.298   Male
2     2     11   65  88.716 Female
3     3     27   80  70.178   Male
4     4     53   80  61.312   Male
5     5      4   40  89.522   Male
6     6     22   70  60.506 Female
7     7     16   20  81.462 Female
8     8     21   55  75.820 Female
9     9     25   50  69.372 Female
10   10     18   40  82.268 Female

The first issue we have is that some of the variables are not numeric (Gender) and others
are not meaningful numerically (code). We have two choices here. The first is to make a
new dataframe by selecting only the variables of interest - we discovered how to do this
in section 3.9.1. The second is to specify this subset within the cor() command itself. If we 
choose the first method then we should execute: 


examData2 <- examData[, c("Exam", "Anxiety", "Revise")] 
cor(examData2) 

The first line creates a dataframe (examData2) that contains all of the cases, but only the
variables Exam, Anxiety and Revise. The second command creates a table of Pearson correlations
between these three variables (note that Pearson is the default so we don’t need to
specify it and because there are no missing cases we do not need the use command).

Alternatively, we could specify the subset of variables in the examData dataframe as part 
of the cor() function: 

cor(examData[, c("Exam", "Anxiety", "Revise")]) 

The end result is the same, so it’s purely down to preference. With the first method it is a 
little easier to see what’s going on, but as you gain confidence and experience you might 
find that you prefer to save time and use the second method. 
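
To see that the two approaches really are interchangeable, here is a small sketch using a made-up dataframe. The name toyData and its values are hypothetical and chosen only for illustration; they are not the full exam anxiety data.

```r
# A small made-up dataframe (hypothetical values, for illustration only)
toyData <- data.frame(
  Code    = 1:6,
  Revise  = c(4, 11, 27, 53, 22, 16),
  Exam    = c(40, 65, 80, 80, 70, 20),
  Anxiety = c(86.3, 88.7, 70.2, 61.3, 60.5, 81.5),
  Gender  = c("Male", "Female", "Male", "Male", "Female", "Female")
)

# Method 1: build a new dataframe first, then correlate
toyData2 <- toyData[, c("Exam", "Anxiety", "Revise")]
method1 <- cor(toyData2)

# Method 2: select the variables inside cor() itself
method2 <- cor(toyData[, c("Exam", "Anxiety", "Revise")])

# Compare the two correlation matrices
identical(method1, method2)
```

Both routes hand cor() exactly the same three columns, so the comparison in the last line should come back TRUE.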

Exam Anxiety Revise 

Exam 1.0000000 -0.4409934 0.3967207 

Anxiety -0.4409934 1.0000000 -0.7092493 

Revise 0.3967207 -0.7092493 1.0000000 


Output 6.1: Output for a Pearson's correlation 


Output 6.1 provides a matrix of the correlation coefficients for the three variables.
Each variable is perfectly correlated with itself (obviously) and so r = 1 along the diagonal
of the table. Exam performance is negatively related to exam anxiety with a Pearson
correlation coefficient of r = -.441. This is a reasonably big effect. Exam performance
is positively related to the amount of time spent revising, with a coefficient of r = .397,
which is also a reasonably big effect. Finally, exam anxiety appears to be negatively related
to the time spent revising, r = -.709, which is a substantial effect size. In psychological
terms, this all means that as anxiety about an exam increases, the percentage mark
obtained in that exam decreases. Conversely, as the amount of time revising increases, the
percentage obtained in the exam increases. Finally, as revision time increases, the student’s
anxiety about the exam decreases. So there is a complex interrelationship between the
three variables.

Correlation coefficients are effect sizes, so we can interpret these values without really 
needing to worry about p-values (and as I have tried to drum into you, because p-values 
are related to sample size, there is a lot to be said for not obsessing about them). However, 
if you are the type of person who obsesses about p-values, then you can use the rcorr()
function instead and p yourself with excitement at the output it produces. First, make sure
you have loaded the Hmisc package by executing:

library(Hmisc)

Next, we need to convert our dataframe into a matrix using the as.matrix() command. 
We can include only numeric variables so, just as we did above, we need to select only the 
numeric variables within the examData dataframe. To do this, execute: 

examMatrix<-as.matrix(examData[, c("Exam", "Anxiety", "Revise")]) 

Which creates a matrix called examMatrix that contains only the variables Exam, Anxiety, 
and Revise from the examData dataframe. To get the correlation matrix we simply input 
this matrix into the rcorr() function: 4 

rcorr(examMatrix)

As before, I think that the method above makes it clear what we’re doing, but more experienced
users could combine the previous two commands into a single one:

rcorr(as.matrix(examData[, c("Exam", "Anxiety", "Revise")]))

Output 6.2 shows the same correlation matrix as Output 6.1, except rounded to 2 decimal
places. In addition, we are given the sample size on which these correlations are based, and
also a matrix of p-values that corresponds to the matrix of correlation coefficients above.
Exam performance is negatively related to exam anxiety with a Pearson correlation coefficient
of r = -.44 and the significance value is less than .001 (it is approximately zero). This significance
value tells us that the probability of getting a correlation coefficient this big in a sample
of 103 people if the null hypothesis were true (there was no relationship between these variables)
is very low (close to zero in fact). Hence, we can gain confidence that there is a genuine
relationship between exam performance and anxiety. Our criterion for significance is usually
.05 (see section 2.6.1) so we can say that all of the correlation coefficients are significant.


         Exam Anxiety Revise
Exam     1.00   -0.44   0.40
Anxiety -0.44    1.00  -0.71
Revise   0.40   -0.71   1.00

n= 103

        Exam Anxiety Revise
Exam               0      0
Anxiety    0              0
Revise     0       0


Output 6.2 

It can also be very useful to look at confidence intervals for correlation coefficients. Sadly, 
we have to do this one at a time (we can’t do it for a whole dataframe or matrix). Let’s look 
at the correlation between exam performance (Exam) and exam anxiety (Anxiety). We can 
compute the confidence interval using cor.test() by executing: 

cor.test(examData$Anxiety, examData$Exam) 

4 The ggm package also has a function called rcorr(), so if you have this package installed, R might use that function
instead, which will produce something very unpleasant on your screen. If so, you need to put Hmisc:: in front
of the commands to make sure R uses rcorr() from the Hmisc package (R’s Souls’ Tip 3.4):

Hmisc::rcorr(examMatrix) 

Hmisc::rcorr(as.matrix(examData[, c("Exam", "Anxiety", "Revise")]))

Note that we have specified only the variables because by default this function produces
Pearson’s r and a 95% confidence interval. Output 6.3 shows the resulting output; it reiterates
that the Pearson correlation between exam performance and anxiety was -.441, but
tells us that this was highly significantly different from zero, t(101) = -4.94, p < .001.
Most important, the 95% confidence interval ranged from -.585 to -.271, which does not cross
zero. This tells us that in all likelihood, the population or actual value of the correlation
is negative, so we can be pretty content that exam anxiety and exam performance are, in
reality, negatively related.

Pearson's product-moment correlation 

data: examData$Anxiety and examData$Exam 

t = -4.938, df = 101, p-value = 3.128e-06 

alternative hypothesis: true correlation is not equal to 0 
95 percent confidence interval: 

-0.5846244 -0.2705591 
sample estimates: 
cor 

-0.4409934 

Output 6.3 
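
It can help to see where the numbers in Output 6.3 come from. The sketch below starts from nothing but the reported correlation and the sample size, and applies the standard textbook formulas: a t-statistic with n - 2 degrees of freedom, and a confidence interval built on Fisher's z transformation. The object names (tValue, pValue, zr, se, ci) are just labels chosen here, and this is a by-hand illustration rather than a look inside cor.test() itself.

```r
r <- -0.4409934   # Pearson correlation reported in Output 6.3
n <- 103          # sample size, so df = n - 2 = 101

# Test statistic: t = r * sqrt(n - 2) / sqrt(1 - r^2)
tValue <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-tailed p-value from the t-distribution with n - 2 degrees of freedom
pValue <- 2 * pt(-abs(tValue), df = n - 2)

# 95% confidence interval: transform r to z, build the interval, transform back
zr <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(zr + c(-1, 1) * qnorm(0.975) * se)

tValue
pValue
ci
```

Up to rounding, these should agree with the t, p-value and confidence limits printed in Output 6.3.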


SELF-TEST 

Compute the confidence intervals for the
relationships between the time spent revising
(Revise) and both exam performance (Exam) and
exam anxiety (Anxiety).


6.5.4.3. Using R² for interpretation


Although we cannot make direct conclusions about causality from a correlation, we can
take the correlation coefficient a step further by squaring it. The correlation coefficient
squared (known as the coefficient of determination, R²) is a measure of the amount of variability
in one variable that is shared by the other. For example, we may look at the relationship
between exam anxiety and exam performance. Exam performances vary from person
to person because of any number of factors (different ability, different levels of preparation
and so on). If we add up all of this variability (rather like when we calculated the sum of
squares in section 2.4.1) then we would have an estimate of how much variability exists
in exam performances. We can then use R² to tell us how much of this variability is shared
by exam anxiety. These two variables had a correlation of -0.4410 and so the value of R²
will be (-0.4410)² = 0.194. This value tells us how much of the variability in exam performance
is shared by exam anxiety.

If we convert this value into a percentage (multiply by 100) we can say that exam anxiety
shares 19.4% of the variability in exam performance. So, although exam anxiety was
highly correlated with exam performance, it can account for only 19.4% of variation in 
exam scores. To put this value into perspective, this leaves 80.6% of the variability still to 
be accounted for by other variables. 
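
The arithmetic in the last two paragraphs can be checked in a couple of lines. This is only a sketch of the calculation described in the text; the object names are arbitrary.

```r
r <- -0.4410                           # correlation between exam anxiety and exam performance
rSquared <- r^2                        # coefficient of determination
shared <- round(100 * rSquared, 1)     # percentage of variability shared
unshared <- 100 - shared               # percentage left for other variables

rSquared   # 0.194481
shared     # 19.4
unshared   # 80.6
```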

You’ll often see people write things about R² that imply causality: they might write ‘the
variance in y accounted for by x’, or ‘the variation in one variable explained by the other’.
However, although R² is an extremely useful measure of the substantive importance of an
effect, it cannot be used to infer causal relationships. Exam anxiety might well share 19.4%
of the variation in exam scores, but it does not necessarily cause this variation.

We can get R to compute the coefficient of determination by remembering that "^2"
means ‘squared’ in R-speak. Therefore, for our examData2 dataframe (see earlier) if we
execute:

cor(examData2)^2

instead of:

cor(examData2)

then you will see a matrix containing r² instead of r (Output 6.4).

Exam Anxiety Revise 
Exam 1.0000000 0.1944752 0.1573873 
Anxiety 0.1944752 1.0000000 0.5030345 
Revise 0.1573873 0.5030345 1.0000000 

Output 6.4 

Note that for exam performance and anxiety the value is 0.194, which is what we calcu¬ 
lated above. If you want these values expressed as a percentage then simply multiply by 
100, so the command would become: 

cor(examData2)^2 * 100


6.5.5. Spearman’s correlation coefficient


Spearman’s correlation coefficient (Spearman, 1910), r_s, is a non-parametric statistic
and so can be used when the data have violated parametric assumptions such
as non-normally distributed data (see Chapter 5). You’ll sometimes hear the test
referred to as Spearman’s rho (pronounced ‘row’, as in ‘row your boat gently
down the stream’). Spearman’s test works by first ranking the data (see section
15.4.1), and then applying Pearson’s equation (equation (6.3)) to those ranks.
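
The 'rank first, then apply Pearson' idea is easy to check for yourself. The sketch below uses made-up scores (the values, and the object names spearmanDirect and spearmanByHand, are hypothetical and not taken from any data file in this chapter).

```r
# Made-up scores for illustration (with some tied positions)
creativity <- c(53, 36, 31, 43, 30, 41, 32, 54, 47, 50)
position   <- c(1, 3, 4, 2, 4, 1, 4, 1, 2, 2)

# Spearman's correlation requested directly
spearmanDirect <- cor(creativity, position, method = "spearman")

# The same thing by hand: rank each variable, then compute Pearson's r on the ranks
spearmanByHand <- cor(rank(creativity), rank(position), method = "pearson")

all.equal(spearmanDirect, spearmanByHand)
```

The rank() function gives tied scores the average of the ranks they would have occupied, which is its default behaviour, and the two results should match.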

I was born in England, which has some bizarre traditions. One such oddity is 
the World’s Biggest Liar competition held annually at the Santon Bridge Inn in 
Wasdale (in the Lake District). The contest honours a local publican, ‘Auld Will 
Ritson’, who in the nineteenth century was famous in the area for his far-fetched 
stories (one such tale being that Wasdale turnips were big enough to be hollowed out and 
used as garden sheds). Each year locals are encouraged to attempt to tell the biggest lie in the 
world (lawyers and politicians are apparently banned from the competition). Over the years 
there have been tales of mermaid farms, giant moles, and farting sheep blowing holes in the 
ozone layer. (I am thinking of entering next year and reading out some sections of this book.) 

Imagine I wanted to test a theory that more creative people will be able to create taller 
tales. I gathered together 68 past contestants from this competition and asked them where 
they were placed in the competition (first, second, third, etc.) and also gave them a creativity 
questionnaire (maximum score 60). The position in the competition is an ordinal variable 
(see section 1.5.1.2) because the places are categories but have a meaningful order (first place 
is better than second place and so on). Therefore, Spearman’s correlation coefficient should 
be used (Pearson’s r requires interval or ratio data). The data for this study are in the file 
The Biggest Liar.dat. The data are in two columns: one labelled Creativity and one labelled 
Position (there’s actually a third variable in there but we will ignore it for the time being). For
the Position variable, each of the categories described above has been coded with a numerical
value. First place has been coded with the value 1, with positions being labelled 2, 3 and so on.

The procedure for doing a Spearman correlation is the same as for a Pearson correlation
except that we need to specify that we want a Spearman correlation instead of Pearson,
which is done using method = "spearman" for cor() and cor.test(), and type = "spearman"
for rcorr(). Let’s load the data into a dataframe by executing:

liarData = read.delim("The Biggest Liar.dat", header = TRUE)

or if you haven’t set your working directory, execute this command and use the dialog box 
to select the file: 

liarData = read.delim(file.choose(), header = TRUE)



SELF-TEST 

See whether you can use what you have learned so
far to compute a Spearman’s correlation between
Position and Creativity.


To obtain the correlation coefficient for a pair of variables we can execute: 

cor(liarData$Position, liarData$Creativity, method = "spearman") 

Note that we have simply specified the two variables of interest, and then set the method 
to be a Spearman correlation. The output of this command will be: 

[1] -0.3732184 

If we want a significance value for this correlation we could either use rcorr() by executing 
(remembering that we have to first convert the dataframe to a matrix): 

liarMatrix<-as.matrix(liarData[, c("Position", "Creativity")]) 
rcorr(liarMatrix) 

or simply use cor.test(), which has the advantage that we can set a directional hypothesis. 
I predicted that more creative people would tell better lies. Doing well in the competition 
(i.e., telling better lies) actually equates to a lower number for the variable Position (first 
place = 1, second place = 2 etc.), so we’re predicting a negative relationship. High scores 
on Creativity should equate to a lower value of Position (because a low value means you 
did well!). Therefore, we predict that the correlation will be less than zero, and we can 
reflect this prediction by using alternative = “less” in the command: 

cor.test(liarData$Position, liarData$Creativity, alternative = "less", 
method = "spearman") 


FIGURE 6.6 Charles Spearman, ranking furiously

Spearman's rank correlation rho
data: liarData$Position and liarData$Creativity 

S = 71948.4, p-value = 0.0008602 
alternative hypothesis: true rho is less than 0 
sample estimates: 
rho 

-0.3732184 

Output 6.5 

Output 6.5 shows the output for a Spearman correlation on the variables Creativity and
Position. The output is very similar to that of the Pearson correlation (except that confidence
intervals are not produced - if you want one see the section on bootstrapping): the correlation
coefficient between the two variables is fairly large (-.373), and the significance value of
this coefficient is very small (p < .001). The significance value for this correlation coefficient
is less than .05; therefore, it can be concluded that there is a significant relationship between
creativity scores and how well someone did in the World’s Biggest Liar competition. Note
that the relationship is negative: as creativity increased, position decreased. Remember that a
low number means that you did well in the competition (a low number such as 1 means you
came first, and a high number like 4 means you came fourth). Therefore, our hypothesis is
supported: as creativity increased, so did success in the competition.



SELF-TEST 

Did creativity cause success in the World’s Biggest
Liar competition?


6.5.6. Kendall’s tau (non-parametric)


Kendall’s tau, τ, is another non-parametric correlation and it should be used rather than
Spearman’s coefficient when you have a small data set with a large number of tied ranks.
This means that if you rank all of the scores and many scores have the same rank, then
Kendall’s tau should be used. Although Spearman’s statistic is the more popular of the
two coefficients, there is much to suggest that Kendall’s statistic is actually a better estimate
of the correlation in the population (see Howell, 1997: 293). As such, we can draw
more accurate generalizations from Kendall’s statistic than from Spearman’s. To carry out
Kendall’s correlation on the World’s Biggest Liar data simply follow the same steps as for
Pearson and Spearman correlations but use method = "kendall":

cor(liarData$Position, liarData$Creativity, method = "kendall") 

cor.test(liarData$Position, liarData$Creativity, alternative = "less", 
method = "kendall") 

The output is much the same as for Spearman’s correlation. 

Kendall's rank correlation tau 

data: liarData$Position and liarData$Creativity 

z = -3.2252, p-value = 0.0006294 
alternative hypothesis: true tau is less than 0 
sample estimates: 
tau 

-0.3002413 


Output 6.6

You’ll notice from Output 6.6 that the actual value of the correlation coefficient is closer
to zero than the Spearman correlation (it has increased from -.373 to -.300). Despite the
difference in the correlation coefficients we can still interpret this result as being a highly
significant relationship (because the significance value of .001 is less than .05). However,
Kendall’s value is a more accurate gauge of what the correlation in the population would
be. As with the Pearson correlation, we cannot assume that creativity caused success in the
World’s Biggest Liar competition.
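
If you are curious about what Kendall's tau is actually counting, the sketch below works it out by hand for a tiny made-up data set with no ties. The vectors x and y are hypothetical. In the no-ties case tau is simply the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs; with ties the calculation needs a correction that is not shown here.

```r
# Made-up scores with no tied values (for illustration only)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 1, 4, 3, 6, 5)

# Every possible pair of cases (each column holds the two case numbers of one pair)
pairIndex <- combn(length(x), 2)

# +1 if a pair is ordered the same way on both variables (concordant),
# -1 if it is ordered in opposite ways (discordant)
pairSigns <- sign(x[pairIndex[1, ]] - x[pairIndex[2, ]]) *
             sign(y[pairIndex[1, ]] - y[pairIndex[2, ]])

tauByHand <- sum(pairSigns) / ncol(pairIndex)

tauByHand                        # 12 concordant and 3 discordant pairs out of 15: 0.6
cor(x, y, method = "kendall")    # should give the same value
```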



SELF-TEST 

Conduct a Pearson correlation analysis of the advert
data from the beginning of the chapter.


6.5.7. Bootstrapping correlations


Another way to deal with data that do not meet the assumptions of Pearson’s r is to use 
bootstrapping. The boot() function takes the general form: 

object<-boot(data, function, replications) 

in which data specifies the dataframe to be used, function is a function that you write to
tell boot() what you want to bootstrap, and replications is a number specifying how many
bootstrap samples you want to take (I usually set this value to 2000). Executing this command
creates an object that has various properties. We can view an estimate of bias, and
an empirically derived standard error by viewing object, and we can display confidence
intervals based on the bootstrap by executing boot.ci(object).

When using the boot() function with correlations (and anything else for that matter) the 
tricky bit is writing the function (R’s Souls’ Tip 6.2). If we stick with our biggest liar data 
and want to bootstrap Kendall tau, then our function will be: 

bootTau<-function(liarData,i)cor(liarData$Position[i], liarData$Creativity[i],
use = "complete.obs", method = "kendall")

Executing this command creates an object called bootTau. The first bit of the function tells 
R what input to expect in the function: in this case we need to feed a dataframe ( liarData) 
into the function and a variable that has been called i (which refers to a particular bootstrap 
sample). The second part of the function specifies the cor() function, which is the thing we 
want to bootstrap. Notice that cor() is specified in exactly the same way as when we did the 
original Kendall correlation except that for each variable we have added [i], which again 
just refers to a particular bootstrap sample. If you want to bootstrap a Pearson or Spearman 
correlation you do it in exactly the same way except that you specify method = “pearson” 
or method = “spearman” when you define the function. 
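
To make the role of i less mysterious, here is a sketch that does the resampling by hand in base R, without the boot package. Everything in it is an assumption made for illustration: toyLiar is simulated stand-in data (not The Biggest Liar.dat), and tauFor, bootTaus and percentileCI are names invented here. The point is only that i is a set of row numbers drawn with replacement, and that the statistic is recomputed on those rows each time.

```r
set.seed(123)
n <- 68

# Simulated stand-in for the liar data (hypothetical values)
toyLiar <- data.frame(Creativity = round(rnorm(n, mean = 40, sd = 8)))
toyLiar$Position <- pmin(6, pmax(1, round(7 - toyLiar$Creativity / 8 + rnorm(n))))

# Same shape as bootTau: a dataframe plus the row numbers for one bootstrap sample
tauFor <- function(data, i) {
  cor(data$Position[i], data$Creativity[i], method = "kendall")
}

replications <- 2000
bootTaus <- numeric(replications)
for (b in seq_len(replications)) {
  i <- sample(n, size = n, replace = TRUE)   # row numbers drawn with replacement
  bootTaus[b] <- tauFor(toyLiar, i)          # statistic for this bootstrap sample
}

originalTau  <- tauFor(toyLiar, seq_len(n))                    # all rows, once each
bootSE       <- sd(bootTaus)                                   # bootstrap standard error
percentileCI <- quantile(bootTaus, probs = c(0.025, 0.975))    # percentile interval
```

In practice you would let boot() do this work, not least because boot.ci() then offers several kinds of interval; the hand-rolled loop is only meant to show what is happening behind the scenes.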

To create the bootstrap object, we execute: 

library(boot) 

boot_kendall<-boot(liarData, bootTau, 2000)
boot_kendall 

The first command loads the boot package (in case you haven’t already initiated it). The
second command creates an object (boot_kendall) based on bootstrapping the liarData
dataframe using the bootTau function that we previously defined and executed. The third
command displays a summary of the boot_kendall object. To get the 95% confidence interval for
the boot_kendall object we execute: 5

boot.ci(boot_kendall) 

Output 6.7 shows the contents of both boot_kendall and also the output of the boot.ci()
function. First, we get the original value of Kendall’s tau (-.300), which we computed in
the previous section. We also get an estimate of the bias in that value (which in this case
is very small) and the standard error (0.098) based on the bootstrap samples. The output
from boot.ci() gives us four different confidence intervals (the normal, basic bootstrapped CI,
percentile and BCa). The good news is that none of these confidence intervals cross zero,
which gives us good reason to think that the population value of this relationship between
creativity and success at being a liar is in the same direction as the sample value. In other
words, our original conclusions stand.

ORDINARY NONPARAMETRIC BOOTSTRAP 


Call: 

boot(data = liarData, statistic = bootTau, R = 2000) 

Bootstrap Statistics : 

original bias std. error 

t1* -0.3002413 0.001058191 0.097663


> boot.ci(boot_kendall) 

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS 
Based on 2000 bootstrap replicates 

CALL : 

boot.ci(boot.out = boot_kendall) 

Intervals : 

Level Normal Basic 

95% (-0.4927, -0.1099 ) (-0.4956, -0.1126 ) 

Level Percentile BCa 

95% (-0.4879, -0.1049 ) (-0.4777, -0.0941 ) 

Calculations and Intervals on Original Scale 
Warning message: 

In boot.ci(boot_kendall) : 

bootstrap variances needed for studentized intervals 

Output 6.7 



SELF-TEST 

Conduct bootstrap analysis of the Pearson and
Spearman correlations for the examData2 dataframe.


5 If we want something other than a 95% confidence interval we can add conf = x, in which x is the value of the 
confidence interval as a proportion. For example, we can get a 99% confidence interval by executing: 


boot.ci(boot_kendall, conf = 0.99)

Writing functions


What happens if there is not a function available in R to do what you want to do? Simple, write your own function. 
The ability to write your own functions is a very powerful feature of R. With a sufficient grasp of the R environment 
(and the maths behind whatever you’re trying to do) you can write a function to do virtually anything for you (apart 
from making coffee). To write a function you need to execute a string of commands that define the function. They 
take this general format: 


nameofFunction<-function(inputObject1, inputObject2, etc.)

{ 

a set of commands that do things to the input object(s) 
a set of commands that specify the output of the function 

} 


Basically, you name the function (any name you like, but obviously one that tells you what the function does is 
helpful). The function() tells R that you’re writing a function, and you need to place within the brackets anything 
you want as input to the function: this can be any object in R (a model, a dataframe, a numeric value, text, etc.). A 
function might just accept one object, or there might be many. The names you list in the brackets can be whatever 
you like, but again it makes sense to label them based on what they are (e.g., if you need to input a dataframe 
then it makes sense to give the input a label of dataframe so that you remember what it is that the function needs). 
You then use {} to contain a set of instructions that tell R what to do with the objects that have been input into the 
function. These are usually some kind of calculations followed by some kind of instruction about what to return 
from the function (the output). 

Imagine that R doesn’t have a function for computing the mean and we wanted to write one (this will keep 
things familiar). We could write this as:

meanOfVariable<-function(variable)
{
    mean<-sum(variable)/length(variable)
    cat("Mean = ", mean)
}

Executing this command creates a function called meanOfVariable that expects a variable to be entered into it. 
The bits in {} tell R what to do with the variable that is entered into the function. The first line computes the mean 
using the function sum() to add the values in the variable that was entered into the function, and the function 
length() counts how many scores are in the variable. Therefore, mean<-sum(variable)/length(variable) translates 
as mean = sum of scores/number of scores (which, of course, is the definition of the mean). The final line uses 
the cat() function to print the text “Mean =” and the value of mean that we have just computed. 

Remember the data about the number of friends that statistics lecturers had that we used to explore the mean 
in Chapter 2 (section 2.4.1). We could enter these data by executing: 

lecturerFriends = c(1,2,3,3,4)

Having executed our function, we can use it to find the mean. We simply execute: 
meanOfVariable(lecturerFriends) 

This tells R that we want to use the function meanOfVariable(), which we have just created, and that the variable 
we want to apply this function to is lecturerFriends. Executing this command gives us: 

Mean = 2.6 


In other words, R has printed the text ‘Mean =’ and the value of the mean computed by the function (just as we 
asked it to). This value is the same as the one we calculated in section 2.4.1, so the function has worked. The 
beauty of functions is that having executed the commands that define it, we can use this function over and over 
again within our session (which saves time).

As a final point, just to be clear, when we define our function we can name things anything we like. For
example, although I named the input to the function ‘variable’ to remind myself what the function needs, I could
have named it ‘HarryTheHungryHippo’ if I had wanted to. Provided that I carry this name through to the
commands within the function, it will work:

meanOfVariable<-function(HarryTheHungryHippo)
{
    mean<-sum(HarryTheHungryHippo)/length(HarryTheHungryHippo)
    cat("Mean = ", mean)
}

Note that within the function I now apply the sum() and length() functions to HarryTheHungryHippo because this
is the name that I gave to the input of the function. It will work, but people will probably be confused about what
HarryTheHungryHippo is when they read your code.
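
One limitation of the function as written is that it only prints the mean; it does not hand the number back in a form you can store and reuse. The variation below is a sketch of one way to do both. The name meanOfVariable2 is made up for this example, and invisible() is a base R function that returns a value without printing it a second time.

```r
meanOfVariable2 <- function(variable) {
  meanValue <- sum(variable) / length(variable)   # same calculation as before
  cat("Mean = ", meanValue, "\n")                 # print the message, then a new line
  invisible(meanValue)                            # also return the number itself
}

lecturerFriends <- c(1, 2, 3, 3, 4)

# The message is still printed, and the value is stored for later use
storedMean <- meanOfVariable2(lecturerFriends)
storedMean * 2
```

Returning the value matters as soon as you want to feed the result of one function into another calculation, which is exactly what boot() does with the function you give it.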


6.5.8. Biserial and point-biserial correlations


The biserial and point-biserial correlation coefficients are distinguished by only a conceptual
difference, yet their statistical calculation is quite different. These correlation coefficients
are used when one of the two variables is dichotomous (i.e., it is categorical with
only two categories). An example of a dichotomous variable is being pregnant, because a
woman can be either pregnant or not (she cannot be ‘a bit pregnant’). Often it is necessary
to investigate relationships between two variables when one of the variables is dichotomous.
The difference between the use of biserial and point-biserial correlations depends
on whether the dichotomous variable is discrete or continuous. This difference is very 
subtle. A discrete, or true, dichotomy is one for which there is no underlying continuum 
between the categories. An example of this is whether someone is dead or alive: a person 
can be only dead or alive, they can’t be ‘a bit dead’. Although you might describe a person 
as being ‘half-dead’ - especially after a heavy drinking session - they are clearly still alive 
if they are still breathing! Therefore, there is no continuum between the two categories. 
However, it is possible to have a dichotomy for which a continuum does exist. An example 
is passing or failing a statistics test: some people will only just fail while others will fail by 
a large margin; likewise some people will scrape a pass while others will excel. So although 
participants fall into only two categories there is an underlying continuum along which 
people lie. Hopefully, it is clear that in this case there is some kind of continuum underlying 
the dichotomy, because some people passed or failed more dramatically than others. The 
point-biserial correlation coefficient (r_pb) is used when one variable is a discrete dichotomy
(e.g., pregnancy), whereas the biserial correlation coefficient (r_b) is used when one variable
is a continuous dichotomy (e.g., passing or failing an exam). 

Imagine that I was interested in the relationship between the gender of a cat and how 
much time it spent away from home (what can I say? I love cats so these things interest me). 
I had heard that male cats disappeared for substantial amounts of time on long-distance 
roams around the neighbourhood (something about hormones driving them to find mates) 
whereas female cats tended to be more homebound. So, I used this as a purr-fect (sorry!)
excuse to go and visit lots of my friends and their cats. I took a note of the gender of the 
cat and then asked the owners to note down the number of hours that their cat was absent 
from home over a week. Clearly the time spent away from home is measured at an interval 
level - and let’s assume it meets the other assumptions of parametric data - while the gender 
of the cat is a discrete dichotomy. A point-biserial correlation has to be calculated and 
this is simply a Pearson correlation when the dichotomous variable is coded with 0 for one 
category and 1 for the other. 

Let’s load the data in the file pbcorr.csv and have a look at it. These data are in the CSV 
format, so we can load them as (assuming you have set the working directory correctly): 

catData = read.csv("pbcorr.csv", header = TRUE) 

Note that we have used the read.csv() function because the file is a .csv file. To look at the 
data execute: 

catData 

A sample of the data is as follows: 



   time gender recode
1    41      1      0
2    40      0      1
3    40      1      0
4    38      1      0
5    34      1      0
6    46      0      1
7    42      1      0
8    42      1      0
9    47      1      0
10   42      0      1
11   45      1      0
12   46      1      0
13   44      1      0
14   54      0      1


There are three variables: 


• time, which is the number of hours that the cat spent away from home (in a week). 

• gender, which is the gender of the cat, coded as 1 for male and 0 for female. 

• recode, which is the gender of the cat coded the opposite way around (i.e., 0 for male 
and 1 for female). We will come to this variable later, but for now ignore it. 



SELF-TEST 

Carry out a Pearson correlation on time and gender. 



Congratulations: if you did the self-test task then you have just conducted your first 
point-biserial correlation. See, despite the horrible name, it’s really quite easy to do. If you 
didn’t do the self-test then execute: 

cor.test(catData$time, catData$gender) 

You should find that you can see Output 6.8. The point-biserial correlation coefficient is 
r_pb = .378, which has a significance value of .003. The significance test for this correlation 
is actually the same as performing an independent-samples t-test on the data (see Chapter 
9). The sign of the correlation (i.e., whether the relationship was positive or negative) will 
depend entirely on which way round the coding of the dichotomous variable was made. To 
prove that this is the case, the data file pbcorr.csv has an extra variable called recode which 
is the same as the variable gender except that the coding is reversed (1 = female, 0 = male). 
If you repeat the Pearson correlation using recode instead of gender you will find that the 
correlation coefficient becomes -.378. The sign of the coefficient is completely dependent 
on which category you assign to which code and so we must ignore all information about 
the direction of the relationship. However, we can still interpret R^2 as before. So in this 
example, R^2 = .378^2 = .143. Hence, we can conclude that gender accounts for 14.3% of 
the variability in time spent away from home. 
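If you want to convince yourself of these claims without the data file, the following is a minimal sketch. It uses simulated data (not the values in pbcorr.csv), and the object names (rGender, rRecode, corTest, tTest) are made up for illustration.

```r
# Simulated data: NOT the cat data from the book, just an illustration
set.seed(42)
gender <- rep(c(0, 1), times = c(32, 28))      # 0 = female, 1 = male
time   <- 40 + 4 * gender + rnorm(60, sd = 6)  # hours away from home in a week
recode <- 1 - gender                           # same information, coding reversed

rGender <- cor(time, gender)
rRecode <- cor(time, recode)

# Reversing the coding flips the sign of the coefficient but not its size
all.equal(rGender, -rRecode)

# R^2 is therefore unaffected by which category gets which code
c(rGender^2, rRecode^2)

# The significance test matches an independent t-test that assumes equal variances
corTest <- cor.test(time, gender)
tTest   <- t.test(time ~ gender, var.equal = TRUE)
c(correlation = unname(corTest$p.value), tTest = tTest$p.value)
```

The two p-values printed at the end should be identical, which is the sense in which the significance test for a point-biserial correlation is the same as an independent t-test.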



SELF-TEST 

Carry out a Pearson correlation on time and recode. 


	Pearson's product-moment correlation

data:  catData$time and catData$gender
t = 3.1138, df = 58, p-value = 0.002868
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.137769 0.576936
sample estimates:
      cor
0.3784542

Output 6.8 

Imagine now that we wanted to convert the point-biserial correlation into the biserial 
correlation coefficient (r_b) (because some of the male cats were neutered and so there might 
be a continuum of maleness that underlies the gender variable). We must use equation (6.9) 
in which p is the proportion of cases that fell into the largest category and q is the proportion 
of cases that fell into the smallest category. Therefore, p and q are simply the proportions 
of female and male cats, respectively. In this equation y is the ordinate of the normal distribution at 
the point where a proportion p of the area lies on one side and q on the other (this will become 
clearer as we do an example): 

$$r_b = \frac{r_{pb}\sqrt{pq}}{y} \qquad (6.9)$$



To calculate p and q, we first need to use the table() function to compute the frequencies 
of male and female cats. We will store these frequencies in a new object called catFrequencies. 
We then use this object to compute the proportion of male and female cats using the 
prop.table() function. We execute these two commands as follows: 

catFrequencies <- table(catData$gender) 
prop.table(catFrequencies) 

The resulting output tells us that the proportion of male cats (1) was .467 (this is q because 
it is the smallest portion) and the proportion of females (0) was .533 (this is p because it is 
the largest portion): 

0 1 
0.5333333 0.4666667 

To calculate y, we use these values and the values of the normal distribution displayed in the 
Appendix. Figure 6.7 shows how to find the ordinate (the value in the column labelled y ) 
when the normal curve is split with .467 as the smaller portion and .533 as the larger portion. 
The figure shows which columns represent p and q and we look for our values in these columns 
(the exact values of 0.533 and 0.467 are not in the table so instead we use the nearest values 
that we can find, which are .5319 and .4681, respectively). The ordinate value is in the column 
y and is .3977. 


FIGURE 6.7 Getting the ‘ordinate’ of the normal distribution 


If we replace these values in equation (6.9) we get .475 (see below), which is quite a lot 
higher than the value of the point-biserial correlation (0.378). This finding just shows you 
that whether you assume an underlying continuum or not can make a big difference to the 
size of effect that you get: 


$$r_b = \frac{r_{pb}\sqrt{pq}}{y} = \frac{.378\sqrt{.533 \times .467}}{.3977} = .475$$


If this process freaks you out, then luckily you can get R to do it for you by installing the 
polycor package and using the polyserial() function. You can simply specify the two variables 
of interest within this function just as you have been doing for every other correlation 
in this chapter. Execute this command: 

polyserial(catData$time, catData$gender) 

and the resulting output: 

[1] 0.4749256 

confirms our earlier calculation. 
You might wonder, given that you can get R to calculate the biserial correlation in one 
line of code, why I got you to calculate it by hand. It’s entirely plausible that I’m just a nasty 
person who enjoys other people’s pain. An alternative explanation is that the values of p 
and q are about to come in handy so it was helpful to show you how to calculate them. I’ll 
leave you to decide which explanation is most likely. 
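There is a middle road between the Appendix table and polyserial(): the ordinate y can be computed directly with base R. The sketch below is only an illustration using the values reported in the text; the object names (r_pb, p, q, y, r_b) are my own. It assumes that y is the height of the standard normal curve at the point that cuts off a proportion p of the area, which is what dnorm(qnorm(p)) returns.

```r
# Values reported in the text
r_pb <- 0.3784542   # point-biserial correlation (Output 6.8)
p    <- 0.5333333   # proportion in the larger category (females)
q    <- 0.4666667   # proportion in the smaller category (males)

# Ordinate of the standard normal curve at the point that splits it into p and q
y <- dnorm(qnorm(p))
y

# Equation (6.9): convert the point-biserial into the biserial correlation
r_b <- r_pb * sqrt(p * q) / y
round(r_b, 3)
```

Because y is computed exactly rather than read from the nearest row of the table, it differs very slightly from .3977, but the biserial correlation still rounds to .475.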

To get the significance of the biserial correlation we need to first work out its standard 
error. If we assume the null hypothesis (that the biserial correlation in the population is 
zero) then the standard error is given by (Terrell, 1982): 


$$SE_{r_b} = \frac{\sqrt{pq}}{y\sqrt{N}} \qquad (6.10)$$


This equation is fairly straightforward because it uses the values of p, q and y that we 
already used to calculate the biserial r. The only additional value is the sample size (N), 
which in this example was 60. So our standard error is: 


$$SE_{r_b} = \frac{\sqrt{.533 \times .467}}{.3977 \times \sqrt{60}} = .162$$


The standard error helps us because we can create a z-score (see section 1.7.4). To get a 
z-score we take the biserial correlation, subtract the mean in the population and divide by 
the standard error. We have assumed that the mean in the population is 0 (the null hypothesis), 
so we can simply divide the biserial correlation by its standard error: 

$$z_{r_b} = \frac{r_b - 0}{SE_{r_b}} = \frac{.475}{.162} = 2.93$$


We can look up this value of z (2.93) in the table for the normal distribution in the Appendix 
and get the one-tailed probability from the column labelled ‘Smaller Portion’. In this case 
the value is .00169. To get the two-tailed probability we simply multiply the one-tailed 
probability value by 2, which gives us .00338. As such the correlation is significant, p < .01. 
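The same steps can be sketched in R so that you don’t need the Appendix table at all. This is a hedged illustration using the rounded values from the text; the object names (se_rb, z_rb, p_two) are hypothetical, and pnorm() stands in for looking up the ‘Smaller Portion’ column.

```r
# Rounded values used in the text
p   <- 0.533    # larger proportion
q   <- 0.467    # smaller proportion
y   <- 0.3977   # ordinate from the Appendix table
N   <- 60       # sample size
r_b <- 0.475    # biserial correlation

se_rb <- sqrt(p * q) / (y * sqrt(N))   # equation (6.10)
z_rb  <- (r_b - 0) / se_rb             # null hypothesis: population value is 0

# One-tailed probability (the 'smaller portion'), then doubled for two tails
p_one <- pnorm(abs(z_rb), lower.tail = FALSE)
p_two <- 2 * p_one

round(c(SE = se_rb, z = z_rb, p.one.tailed = p_one, p.two.tailed = p_two), 5)
```

The probabilities will differ a little from .00169 and .00338 because R works with the unrounded z rather than 2.93, but the conclusion (p < .01) is the same.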



CRAMMING SAM’S TIPS 


Correlation coefficients 


• We can measure the relationship between two variables using correlation coefficients. 

• These coefficients lie between -1 and +1. 

• Pearson’s correlation coefficient, r, is a parametric statistic and requires interval data for both variables. To test its 
significance we assume normality too. 

• Spearman’s correlation coefficient, r_s, is a non-parametric statistic and requires only ordinal data for both variables. 

• Kendall’s correlation coefficient, τ, is like Spearman’s r_s but probably better for small samples. 

• The point-biserial correlation coefficient, r_pb, quantifies the relationship between a continuous variable and a variable that is 
a discrete dichotomy (i.e., there is no continuum underlying the two categories, such as dead or alive). 

• The biserial correlation coefficient, r_b, quantifies the relationship between a continuous variable and a variable that is a 
continuous dichotomy (i.e., there is a continuum underlying the two categories, such as passing or failing an exam). 
6.6. Partial correlation 


6.6.1. The theory behind part and partial correlation 



I mentioned earlier that there is a type of correlation that can be done that allows you 
to look at the relationship between two variables when the effects of a third variable are 
held constant. For example, analyses of the exam anxiety data (in the file Exam Anxiety.dat) 
showed that exam performance was negatively related to exam anxiety, but positively 
related to revision time, and revision time itself was negatively related to exam anxiety. 
This scenario is complex, but given that we know that revision time is related to both 
exam anxiety and exam performance, then if we want a pure measure of the relationship 
between exam anxiety and exam performance we need to take account of the influence of 
revision time. Using the values of R^2 for these relationships (refer back to Output 6.4), we 
know that exam anxiety accounts for 19.4% of the variance in exam performance, that 
revision time accounts for 15.7% of the variance in exam performance, and that revision 
time accounts for 50.2% of the variance in exam anxiety. If revision time accounts for half 
of the variance in exam anxiety, then it seems feasible that at least some of the 19.4% of 
variance in exam performance that is accounted for by anxiety is the same variance that 
is accounted for by revision time. As such, some of the variance in exam performance 
explained by exam anxiety is not unique and can be accounted for by revision time. A 
correlation between two variables in which the effects of other variables are held constant is 
known as a partial correlation. 

Let’s return to our example of exam scores, revision time and exam anxiety to illustrate 
the principle behind partial correlation (Figure 6.8). In part 1 of the diagram there 
is a box for exam performance that represents the total variation in exam scores (this 
value would be the variance of exam performance). There is also a box that represents 
the variation in exam anxiety (again, this is the variance of that variable). We know 
already that exam anxiety and exam performance share 19.4% of their variation (this 
value is the correlation coefficient squared). Therefore, the variations of these two variables 
overlap (because they share variance) creating a third box (the blue cross-hatched 
box). The overlap of the boxes representing exam performance and exam anxiety is the 
common variance. Likewise, in part 2 of the diagram the shared variation between exam 
performance and revision time is illustrated. Revision time shares 15.7% of the variation 
in exam scores. This shared variation is represented by the area of overlap (the box with 
dotted blue lines). We know that revision time and exam anxiety also share 50% of their 
variation; therefore, it is very probable that some of the variation in exam performance 
shared by exam anxiety is the same as the variance shared by revision time. 

Part 3 of the diagram shows the complete picture. The first thing to note is that the boxes 
representing exam anxiety and revision time have a large overlap (this is because they share 
50% of their variation). More important, when we look at how revision time and anxiety 
contribute to exam performance we see that there is a portion of exam performance that 
is shared by both anxiety and revision time (the white area). However, there are still small 
chunks of the variance in exam performance that are unique to the other two variables. 
So, although in part 1 exam anxiety shared a large chunk of variation in exam performance, 
some of this overlap is also shared by revision time. If we remove the portion of 
variation that is also shared by revision time, we get a measure of the unique relationship 
between exam performance and exam anxiety. We use partial correlations to find out the 
size of the unique portion of variance. Therefore, we could conduct a partial correlation 
between exam anxiety and exam performance while ‘controlling’ for the effect of revision 
time. Likewise, we could carry out a partial correlation between revision time and exam 
performance while ‘controlling’ for the effects of exam anxiety. 
FIGURE 6.8 Diagram showing the principle of partial correlation (part 1: variance in exam performance accounted for by exam anxiety, 19.4%; part 2: variance accounted for by revision time, 15.7%; part 3: unique variance accounted for by exam anxiety, unique variance accounted for by revision time, and variance accounted for by both exam anxiety and revision time) 


6.6.2. Partial correlation using R 


We will use the examData2 dataframe again, so if you haven’t got this loaded then execute 
these commands: 

examData = read.delim("Exam Anxiety.dat", header = TRUE) 
examData2 <- examData[, c("Exam", "Anxiety", "Revise")] 

This will import the Exam Anxiety.dat file and create a dataframe containing only the 
three variables of interest. We will conduct a partial correlation between exam anxiety and 
exam performance while ‘controlling’ for the effect of revision time. To compute a partial 
correlation and its significance we will use the pcor() and pcor.test() functions respectively. 
These are part of the ggm package, so first load this: 

library(ggm) 

The general form of pcor() is: 

pcor(c("var1", "var2", "control1", "control2" etc.), var(dataframe)) 

Basically, you enter a list of variables as strings (note the variable names have to be in quotes) 
using the c() function. The first two variables should be those for which you want the partial 
correlation; any others listed should be variables for which you’d like to ‘control’. You can 
‘control’ for the effects of a single variable, in which case the resulting coefficient is known 
as a first-order partial correlation; it is also possible to control for the effects of two (a 
second-order partial correlation), three (a third-order partial correlation), or more variables 
at the same time. The second part of the function asks for the variances and covariances of the 
variables, which we get by placing the name of the dataframe (in this case examData2) inside var(). 
For the current example, we want the correlation between exam 
anxiety and exam performance (so we list these variables first) controlling for exam revision 
(so we list this variable afterwards). As such, we can execute the following command: 

pcor(c("Exam" , "Anxiety", "Revise"), var(examData2)) 

Executing this command will print the partial correlation to the console. However, I find it 
useful to create an object containing the partial correlation value so that we can use it in other 
commands. As such, I suggest that you execute this command to create an object called pc: 

pc<-pcor(c("Exam", "Anxiety", "Revise"), var(examData2)) 

We can then see the partial correlation and the value of R^2 in the console by executing: 

pc 

pc^2 

The general form of pcor.test() is: 

pcor.test(pcor object, number of control variables, sample size) 

Basically, you enter an object that you have created with pcor() (or you can put the pcor() 
command directly into the function). We created a partial correlation object called pc, had only 
one control variable (Revise) and there was a sample size of 103; therefore we can execute: 

pcor.test(pc, 1, 103) 

to see the significance of the partial correlation. 

> pc 
[1] -0.2466658 

> pc^2 
[1] 0.06084403 

> pcor.test(pc, 1, 103) 
$tval 
[1] -2.545307 

$df 
[1] 100 

$pvalue 
[1] 0.01244581 


Output 6.9 
Output 6.9 shows the output for the partial correlation of exam anxiety and exam 
performance controlling for revision time; it also shows the squared value that we calculated 
(pc^2), and the significance value obtained from pcor.test(). The output of pcor() is the 
partial correlation for the variables Anxiety and Exam but controlling for the effect of Revise. 
First, notice that the partial correlation between exam performance and exam anxiety is 
-.247, which is considerably less than the correlation when the effect of revision time is not 
controlled for (r = -.441). In fact, the correlation coefficient is nearly half what it was before. 
Although this correlation is still statistically significant (its p-value is .012, which is still below 
.05), the relationship is diminished. In terms of variance, the value of R^2 for the partial 
correlation is .06, which means that exam anxiety can now account for only 6% of the 
variance in exam performance. When the effects of revision time were not controlled for, exam 
anxiety shared 19.4% of the variation in exam scores and so the inclusion of revision time 
has severely diminished the amount of variation in exam scores shared by anxiety. As such, a 
truer measure of the role of exam anxiety has been obtained. Running this analysis has shown 
us that exam anxiety alone does explain some of the variation in exam scores, but there is a 
complex relationship between anxiety, revision and exam performance that might otherwise 
have been ignored. Although causality is still not certain, because relevant variables are being 
included, the third variable problem is, at least, being addressed in some form. 
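To see what ‘controlling’ for revision time means in practice, here is a minimal sketch in base R. It is not how pcor() is implemented; it simply illustrates the idea under my own assumptions, using simulated data (not the values in Exam Anxiety.dat) and hypothetical object names. The partial correlation is obtained by removing from both variables whatever can be predicted from the control variable and then correlating what is left over.

```r
# Simulated stand-ins for exam performance, anxiety and revision time
# (NOT the book's data; all names are hypothetical)
set.seed(1)
n       <- 103
revise  <- rnorm(n)
anxiety <- -0.7 * revise + rnorm(n, sd = 0.7)
exam    <- 0.3 * revise - 0.3 * anxiety + rnorm(n)

# Strip out of BOTH variables whatever can be predicted from revision time
examResid    <- resid(lm(exam ~ revise))
anxietyResid <- resid(lm(anxiety ~ revise))
partialR     <- cor(examResid, anxietyResid)

# The same value from the three ordinary correlations
rEA <- cor(exam, anxiety)
rER <- cor(exam, revise)
rAR <- cor(anxiety, revise)
partialFormula <- (rEA - rER * rAR) / sqrt((1 - rER^2) * (1 - rAR^2))
all.equal(partialR, partialFormula)

# Significance: t-test with N - 2 - k degrees of freedom (k = number of controls)
k    <- 1
tval <- partialR * sqrt((n - 2 - k) / (1 - partialR^2))
pval <- 2 * pt(abs(tval), df = n - 2 - k, lower.tail = FALSE)

# Applying the same t formula to the coefficient reported in Output 6.9
tBook <- -0.2466658 * sqrt(100 / (1 - 0.2466658^2))
tBook
```

The last line reproduces the $tval shown in Output 6.9 (about -2.545), which suggests that the test reported there is this t-test on 100 degrees of freedom.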

These partial correlations can be done when variables are dichotomous (including the 
‘third’ variable). So, for example, we could look at the relationship between bladder relaxation 
(did the person wet themselves or not?) and the number of large tarantulas crawling 
up your leg, controlling for fear of spiders (the first variable is dichotomous, but the second 
variable and ‘controlled for’ variables are continuous). Also, to use an earlier example, we 
could examine the relationship between creativity and success in the World’s Biggest Liar 
competition, controlling for whether someone had previous experience in the competition 
(and therefore had some idea of the type of tale that would win) or not. In this latter case 
the ‘controlled for’ variable is dichotomous. 6 


6.6.3. Semi-partial (or part) correlations 


In the next chapter, we will come across another form of correlation known as a semi-partial 
correlation (also referred to as a part correlation). While I’m babbling on about partial 
correlations it is worth my explaining the difference between this type of correlation and 
semi-partial correlation. When we do a partial correlation between two variables, we control 
for the effects of a third variable. Specifically, the effect that the third variable has on 
both variables in the correlation is controlled. In a semi-partial correlation we control for 
the effect that the third variable has on only one of the variables in the correlation. Figure 
6.9 illustrates this principle for the exam performance data. The partial correlation that we 
calculated took account not only of the effect of revision on exam performance, but also 
of the effect of revision on anxiety. If we were to calculate the semi-partial correlation for 
the same data, then this would control for only the effect of revision on exam performance 
(the effect of revision on exam anxiety is ignored). Partial correlations are most useful for 
looking at the unique relationship between two variables when other variables are ruled 
out. Semi-partial correlations are, therefore, useful when trying to explain the variance in 
one particular variable (an outcome) from a set of predictor variables. (Bear this in mind 
when you read Chapter 7.) 

FIGURE 6.9 The difference between a partial and a semi-partial correlation (panels: Partial Correlation; Semi-Partial Correlation) 

6 Both these examples are, in fact, simple cases of hierarchical regression (see the next chapter) and the first 
example is also an example of analysis of covariance. This may be confusing now, but as we progress through the 
book I hope it’ll become clearer that virtually all of the statistics that you use are actually the same things dressed 
up in different names. 



CRAMMING SAM’S TIPS 


Partial and semi-partial correlation 


A partial correlation quantifies the relationship between two variables while controlling for the effects of a third variable on 
both variables in the original correlation. 

A semi-partial correlation quantifies the relationship between two variables while controlling for the effects of a third variable 
on only one of the variables in the original correlation. 
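The difference between the two coefficients can be sketched in a few lines of base R. As before, this uses simulated data and hypothetical object names rather than the exam anxiety file, and it follows the description in this section by removing the effect of revision time from exam performance only; which variable you adjust in a real analysis is a decision for you, not something this sketch settles.

```r
# Simulated stand-ins for the three variables (NOT the book's data)
set.seed(2)
n       <- 103
revise  <- rnorm(n)
anxiety <- -0.7 * revise + rnorm(n, sd = 0.7)
exam    <- 0.3 * revise - 0.3 * anxiety + rnorm(n)

# Residuals: what is left of each variable once revision time is removed
examResid    <- resid(lm(exam ~ revise))
anxietyResid <- resid(lm(anxiety ~ revise))

# Partial: revision time removed from BOTH variables
partialR <- cor(examResid, anxietyResid)

# Semi-partial: revision time removed from ONE variable only (here exam performance)
semiPartialR <- cor(examResid, anxiety)

round(c(zeroOrder = cor(exam, anxiety), partial = partialR,
        semiPartial = semiPartialR), 3)
```

Because the unadjusted variable keeps all of its variance, the semi-partial correlation can never be larger in absolute size than the corresponding partial correlation.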


6.7. Comparing correlations 

6.7.1. Comparing independent rs 


Sometimes we want to know whether one correlation coefficient is bigger than another. 
For example, when we looked at the effect of exam anxiety on exam performance, we 
might have been interested to know whether this correlation was different in men and 
women. We could compute the correlation in these two samples, but then how would we 
assess whether the difference was meaningful? 



SELF-TEST 

Use the subset() function to compute the correlation 
coefficient between exam anxiety and exam 
performance in men and women. 


If we did this, we would find that the correlations were r_Male = -.506 and r_Female = -.381. 
These two samples are independent; that is, they contain different entities. To compare 
these correlations we can again use what we discovered in section 6.3.3 to convert these 
coefficients to z_r (just to remind you, we do this because it makes the sampling distribution 
normal and, therefore, we know the standard error). If we do the conversion, then we get 
z_r (males) = -.557 and z_r (females) = -.401. We can calculate a z-score of the differences 
between these correlations as: 

$$z_{\text{Difference}} = \frac{z_{r_1} - z_{r_2}}{\sqrt{\frac{1}{N_1 - 3} + \frac{1}{N_2 - 3}}} \qquad (6.11)$$
We had 52 men and 51 women so we would get: 

$$z_{\text{Difference}} = \frac{-.557 - (-.401)}{\sqrt{\frac{1}{49} + \frac{1}{48}}} = \frac{-.156}{0.203} = -0.768$$


We can look up this value of z (0.768; we can ignore the minus sign) in the table for the 
normal distribution in the Appendix and get the one-tailed probability from the column 
labelled ‘Smaller Portion’. In this case the value is .221. To get the two-tailed probability 
we simply multiply the one-tailed probability value by 2, which gives us .442. As such the 
correlation between exam anxiety and exam performance is not significantly different in 
men and women (see Oliver Twisted for how to do this using R). 
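For the impatient, the calculation can be sketched as a short function. To be clear, this is not the function from the companion website; zDifference() and its argument names are my own invention, written only to mirror equation (6.11). It relies on atanh(), which is the base R equivalent of converting r to z_r.

```r
# Hypothetical helper mirroring equation (6.11); not the companion-website function
zDifference <- function(r1, r2, n1, n2) {
  # atanh() converts each r to z_r (Fisher's transformation)
  zd <- (atanh(r1) - atanh(r2)) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
  # The 'smaller portion' of the normal distribution beyond |z|
  pOne <- pnorm(abs(zd), lower.tail = FALSE)
  c(z = zd, p.one.tailed = pOne, p.two.tailed = 2 * pOne)
}

# Men: r = -.506, N = 52; women: r = -.381, N = 51
result <- zDifference(-0.506, -0.381, 52, 51)
round(result, 3)
```

The values should agree with the hand calculation to within rounding: z of about -0.77 and a two-tailed probability of about .44.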



OLIVER TWISTED 

Please Sir, can I 
have some more ... 
functions? 


‘These equations are rubbish,’ says Oliver, ‘they’re too confusing and I 
hate them. Can’t we get R to do it for us while we check Facebook?’ 
Well, no, you can’t. Except you sort of can by writing your own function. 
‘Write my own function!!’ screams Oliver whilst trying to ram his computer 
keyboard into his mouth. ‘You’ve got to be joking, you steaming dog 
colon, I can barely write my own name.’ Luckily for you Oliver, I’ve done 
it for you. To find out more, read the additional material for this chapter 
on the companion website. Or check Facebook, the choice is yours. 



6.7.2. Comparing dependent rs 


If you want to compare correlation coefficients that come from the same entities then 
things are a little more complicated. You can use a t-statistic to test whether a difference 
between two dependent correlations from the same sample is significant. For example, 
in our exam anxiety data we might want to see whether the relationship between exam 
anxiety (x) and exam performance (y) is stronger than the relationship between revision 
(z) and exam performance. To calculate this, all we need are the three rs that quantify the 
relationships between these variables: r_xy, the relationship between exam anxiety and exam 
performance (-.441); r_zy, the relationship between revision and exam performance (.397); 
and r_xz, the relationship between exam anxiety and revision (-.709). The t-statistic is 
computed as (Chen & Popovich, 2002): 

$$t_{\text{Difference}} = (r_{xy} - r_{zy})\sqrt{\frac{(N - 3)(1 + r_{xz})}{2\left(1 - r_{xy}^2 - r_{xz}^2 - r_{zy}^2 + 2r_{xy}r_{xz}r_{zy}\right)}} \qquad (6.12)$$


Admittedly that equation looks hideous, but really it’s not too bad: it just uses the three 
correlation coefficients and the sample size N. 

Put in the numbers from the exam anxiety example (N was 103) and you should end up 
with: 


$$t_{\text{Difference}} = (-.838)\sqrt{\frac{29.1}{2(1 - .194 - .503 - .158 + 0.248)}} = -5.09$$
This value can be checked against the appropriate critical value in the Appendix with N - 3 
degrees of freedom (in this case 100). The critical values in the table are 1.98 (p < .05) and 
2.63 (p < .01), two-tailed. As such we can say that the correlation between exam anxiety 
and exam performance was significantly higher than the correlation between revision time 
and exam performance (this isn’t a massive surprise, given that these relationships went in 
the opposite directions to each other). 
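Again, the arithmetic can be sketched as a small function. This is not the function from the companion website; tDifference() and its argument names are hypothetical, written only to mirror equation (6.12), with the two-tailed probability taken from the t-distribution with N - 3 degrees of freedom.

```r
# Hypothetical helper mirroring equation (6.12); not the companion-website function
tDifference <- function(rxy, rxz, rzy, n) {
  df <- n - 3
  td <- (rxy - rzy) * sqrt((df * (1 + rxz)) /
          (2 * (1 - rxy^2 - rxz^2 - rzy^2 + 2 * rxy * rxz * rzy)))
  pTwo <- 2 * pt(abs(td), df, lower.tail = FALSE)
  c(t = td, df = df, p.two.tailed = pTwo)
}

# r_xy = -.441 (anxiety, exam), r_xz = -.709 (anxiety, revision),
# r_zy = .397 (revision, exam), N = 103
result <- tDifference(-0.441, -0.709, 0.397, 103)
result
```

The t-value comes out very close to the -5.09 obtained by hand (any small discrepancy is down to rounding in the intermediate steps), and it comfortably exceeds the critical value of 2.63.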



OLIVER TWISTED 

Please Sir, can I have 
some more ... comparing 
of correlations? 


‘Are you having a bloody laugh with that equation?’ yelps Oliver. 
‘I’d rather smother myself with cheese sauce and lock myself 
in a room of hungry mice.’ Yes, yes, Oliver, enough of your sexual 
habits. To spare the poor mice I have written another R function 
to run the comparison mentioned in this section. For a guide 
on how to use it read the additional material for this chapter 
on the companion website. Go on, be kind to the mice! 


6.8. Calculating the effect size 


Calculating effect sizes for correlation coefficients couldn’t be easier because, as we saw 
earlier in the book, correlation coefficients are effect sizes! So, no calculations (other than 
those you have already done) necessary! However, I do want to point out one caveat when 
using non-parametric correlation coefficients as effect sizes. Although the Spearman and 
Kendall correlations are comparable in many respects (their power, for example, is similar 
under parametric conditions), there are two important differences (Strahan, 1982). 

First, we saw for Pearson’s r that we can square this value to get the proportion of shared 
variance, R^2. For Spearman’s r_s we can do this too because it uses the same equation as 
Pearson’s r. However, the resulting R_s^2 needs to be interpreted slightly differently: 
it is the proportion of variance in the ranks that two variables share. 
Having said this, R_s^2 is usually a good approximation for R^2 (especially in conditions 
of near-normal distributions). Kendall’s τ, however, is not numerically 
similar to either r or r_s and so τ^2 does not tell us about the proportion of variance 
shared by two variables (or the ranks of those two variables). 

Second, Kendall’s τ is 66-75% smaller than both Spearman’s r_s and Pearson’s 
r, but r and r_s are generally similar sizes (Strahan, 1982). As such, if τ is used 
as an effect size it should be borne in mind that it is not comparable to r and r_s 
and should not be squared. A related issue is that the point-biserial and biserial 
correlations differ in size too (as we saw in this chapter, the biserial correlation 
was bigger than the point-biserial). In this instance you should be careful to decide whether 
your dichotomous variable has an underlying continuum, or whether it is a truly discrete 
variable. More generally, when using correlations as effect sizes you should remember 
(both when reporting your own analysis and when interpreting others) that the choice of 
correlation coefficient can make a substantial difference to the apparent size of the effect. 

6.9. How to report correlation coefficients 


Reporting correlation coefficients is pretty easy: you just have to say how big they are and 
what their significance value was (although the significance value isn’t that important because 
the correlation coefficient is an effect size in its own right!). Five things to note are that: (1) 
if you follow the conventions of the American Psychological Association, there should be no 
zero before the decimal point for the correlation coefficient or the probability value (because 
neither can exceed 1); (2) coefficients are reported to 2 decimal places; (3) if you are quoting 
a one-tailed probability, you should say so; (4) each correlation coefficient is represented by 
a different letter (and some of them are Greek); and (5) there are standard criteria of 
probabilities that we use (.05, .01 and .001). Let’s take a few examples from this chapter: 

• There was a significant relationship between the number of adverts watched and the 
number of packets of sweets purchased, r = .87, p (one-tailed) < .05. 

• Exam performance was significantly correlated with exam anxiety, r = -.44, and time 
spent revising, r = .40; the time spent revising was also correlated with exam anxiety, 
r = -.71 (all ps < .001). 

• Creativity was significantly related to how well people did in the World’s Biggest Liar 
competition, r_s = -.37, p < .001. 

• Creativity was significantly related to how well people did in the World’s Biggest Liar 
competition, τ = -.30, p < .001. (Note that I’ve quoted Kendall’s τ here.) 

• The gender of the cat was significantly related to the time the cat spent away from 
home, r_pb = .38, p < .01. 

• The gender of the cat was significantly related to the time the cat spent away from 
home, r_b = .48, p < .01. 
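If you report a lot of coefficients, the first two conventions (no leading zero, 2 decimal places) are easy to apply in R. The helper below is a hypothetical convenience, not part of any package: formatR() is a name I have made up, and it does nothing more than round to 2 decimal places and strip the zero before the decimal point.

```r
# Hypothetical helper: format correlations in APA style (2 decimals, no leading zero)
formatR <- function(r) {
  rounded <- sprintf("%.2f", r)        # e.g. -0.441 becomes "-0.44"
  sub("^(-?)0\\.", "\\1.", rounded)    # drop the zero before the decimal point
}

# The three exam anxiety correlations from this chapter
formatted <- formatR(c(-0.441, 0.397, -0.709))
formatted
# [1] "-.44" ".40"  "-.71"
```

The result is a character vector, so it is ready to paste into a sentence or a table but can no longer be used in calculations.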

Scientists, rightly or wrongly, tend to use several standard levels of statistical significance. 
Primarily, the most important criterion is that the significance value is less than .05; however, 
if the exact significance value is much lower then we can be much more confident about the 
strength of the effect. In these circumstances we like to make a big song and dance about the 
fact that our result isn’t just significant at .05, but is significant at a much lower level as well 
(hooray!). The values we use are .05, .01, .001 and .0001. You are rarely going to be in the 
fortunate position of being able to report an effect that is significant at a level less than .0001! 

When we have lots of correlations we sometimes put them into a table. For example, our 
exam anxiety correlations could be reported as in Table 6.3. Note that above the diagonal 
I have reported the correlation coefficients and used symbols to represent different levels 
of significance. Under the table there is a legend to tell readers what symbols represent. 
(Actually, none of the correlations were non-significant or had p bigger than .001, so most 
of these are here simply to give you a reference point - you would normally include only
symbols that you had actually used in the table in your legend.) Finally, in the lower part of the
table I have reported the sample sizes. These are all the same (103), but sometimes when 
you have missing data it is useful to report the sample sizes in this way because different 
values of the correlation will be based on different sample sizes. For some more ideas on 
how to report correlations have a look at Labcoat Leni’s Real Research 6.1. 
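A table laid out as described above (coefficients and significance symbols above the diagonal, sample sizes below it) could be assembled along the following lines. This is only a sketch: the data are simulated rather than taken from the exam anxiety file, and the helper names fmt_r and sig_stars are made up for this illustration.

```r
# Simulated stand-in data (NOT the real exam anxiety data)
set.seed(1)
n <- 103
revise  <- rnorm(n)
anxiety <- -0.7 * revise + rnorm(n, sd = 0.7)
exam    <- 0.4 * revise - 0.2 * anxiety + rnorm(n)
dat <- data.frame(exam, anxiety, revise)

# Hypothetical helper: drop the leading zero, e.g. -0.44 becomes -.44
fmt_r <- function(r) sub("^(-?)0\\.", "\\1.", sprintf("%.2f", r))

# Hypothetical helper: map a p-value to the symbols used in the legend
sig_stars <- function(p) {
  if (p < .001) "***" else if (p < .01) "**" else if (p < .05) "*" else " ns"
}

vars <- names(dat)
tab <- matrix("", nrow = 3, ncol = 3, dimnames = list(vars, vars))
for (i in 1:3) {
  for (j in 1:3) {
    if (i == j) {
      tab[i, j] <- "1"                       # diagonal
    } else if (i < j) {                      # above the diagonal: r and symbols
      ct <- cor.test(dat[[i]], dat[[j]])
      tab[i, j] <- paste0(fmt_r(ct$estimate), sig_stars(ct$p.value))
    } else {                                 # below the diagonal: sample size
      tab[i, j] <- as.character(sum(complete.cases(dat[, c(i, j)])))
    }
  }
}
print(tab, quote = FALSE)
```

Counting complete cases pair by pair is what makes the lower triangle useful when there are missing data, because each correlation may then rest on a different sample size.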


Table 6.3 An example of reporting a table of correlations 



                    Exam Performance    Exam Anxiety    Revision Time
Exam Performance           1              -.44***          .40***
Exam Anxiety             103                1             -.71***
Revision Time            103              103                1

ns = not significant (p > .05), *p < .05, **p < .01, ***p < .001


Labcoat Leni’s Real Research 6.1

Why do you like your lecturers?

Chamorro-Premuzic, T., et al. (2008). Personality and Individual Differences, 44, 965-976.


As students you probably have to rate your lecturers at the end of the course. There will be some lecturers you like 
and others that you hate. As a lecturer I find this process horribly depressing (although this has a lot to do with 
the fact that I tend to focus on negative feedback and ignore the good stuff). There is some evidence that students
tend to pick courses of lecturers whom they perceive to be enthusiastic and good communicators. In a fascinating
study, Tomas Chamorro-Premuzic and his colleagues (Chamorro-Premuzic, Furnham, Christopher, Garwood,
& Martin, 2008) tested a slightly different hypothesis, which was that students tend to like lecturers who are like 
themselves. (This hypothesis will have the students on my course who like my lectures screaming in horror.) 

First of all, the authors measured students’ own personalities using a very well-established measure (the 
NEO-FFI) which gives rise to scores on five fundamental personality traits: Neuroticism, Extroversion, Openness 
to experience, Agreeableness and Conscientiousness. They also gave students a questionnaire that asked them 
to rate how much they wanted their lecturer to have each of a list of characteristics. For example, they would 
be given the description ‘warm: friendly, warm, sociable, cheerful, affectionate, outgoing’ and asked to rate 
how much they wanted to see this in a lecturer from -5 (they don’t want this characteristic at all) through 0 (the 
characteristic is not important) to +5 (I really want this characteristic in my lecturer). The characteristics on the 
questionnaire all related to personality characteristics measured by the NEO-FFI. As such, the authors had a 
measure of how much a student had each of the five core personality characteristics, but also a measure of how 
much they wanted to see those same characteristics in their lecturer. 

In doing so, Tomas and his colleagues could test whether, for instance, extroverted students want extrovert 
lecturers. The data from this study (well, for the variables that I’ve mentioned) are in the file Chamorro-Premuzic.dat.
Run some Pearson correlations on these variables to see if students with certain personality characteristics
want to see those characteristics in their lecturers. What conclusions can you draw? 

Answers are in the additional material on the companion website (or look at Table 3 in the original article, which 
will also show you how to report a large number of correlations). 



What have I discovered about statistics?


This chapter has looked at ways to study relationships between variables. We began 
by looking at how we might measure relationships statistically by developing what 
we already know about variance (from Chapter 1) to look at variance shared between 
variables. This shared variance is known as covariance. We then discovered that when 
data are parametric we can measure the strength of a relationship using Pearson’s correlation
coefficient, r. When data violate the assumptions of parametric tests we can
use Spearman’s $r_s$, or for small data sets Kendall’s $\tau$ may be more accurate. We also saw
that correlations can be calculated between two variables when one of those variables is
a dichotomy (i.e., composed of two categories); when the categories have no underlying
continuum then we use the point-biserial correlation, $r_{pb}$, but when the categories
do have an underlying continuum we use the biserial correlation, $r_b$. Finally, we looked
at the difference between partial correlations, in which the relationship between two 
variables is measured controlling for the effect that one or more variables has on both 
of those variables, and semi-partial correlations, in which the relationship between two 
variables is measured controlling for the effect that one or more variables has on only 
one of those variables. We also discovered that I had a guitar and, like my favourite 
record of the time, I was ready to ‘Take on the World’. Well, Wales at any rate ...


R packages used in this chapter

boot
ggm
ggplot2
Hmisc
polycor
Rcmdr


R functions used in this chapter 


boot()
boot.ci()
cor()
cor.test()
pcor()
pcor.test()
polyserial()
prop.table()
rcorr()
read.csv()
read.delim()
table()


Key terms that I’ve discovered 


Biserial correlation 
Bivariate correlation 
Coefficient of determination 
Correlation coefficient 
Covariance 

Cross-product deviations 
Dichotomous 


Kendall's tau 

Partial correlation 

Pearson correlation coefficient 

Point-biserial correlation 

Semi-partial correlation 

Spearman’s correlation coefficient 

Standardization 


Smart Alex’s tasks


• Task 1: A student was interested in whether there was a positive relationship between 
the time spent doing an essay and the mark received. He got 45 of his friends and 
timed how long they spent writing an essay (hours) and the percentage they got 
in the essay (essay). He also translated these grades into their degree classifications 
(grade): in the UK, a student can get a first-class mark (the best), an upper-second- 
class mark, a lower second, a third, a pass or a fail (the worst). Using the data in the 
file EssayMarks.dat find out what the relationship was between the time spent doing 
an essay and the eventual mark in terms of percentage and degree class (draw a
scatterplot too!).



• Task 2: Using the ChickFlick.dat data from Chapter 3, is there a relationship between 
gender and arousal? Using the same data, is there a relationship between the film 
watched and arousal?


• Task 3: As a statistics lecturer I am always interested in the factors that determine 
whether a student will do well on a statistics course. One potentially important factor 
is their previous expertise with mathematics. Imagine I took 25 students and looked
at their degree grades for my statistics course at the end of their first year at university:
first, upper second, lower second or third class. I also asked these students what
grade they got in their GCSE maths exams. In the UK, GCSEs are school exams taken
at age 16 that are graded A, B, C, D, E or F (an A grade is better than all of the lower
grades). The data for this study are in the file grades.csv. Carry out the appropriate
analysis to see if GCSE maths grades correlate with first-year statistics grades.

Answers can be found on the companion website. 


Further reading 


Chen, P. Y., & Popovich, P. M. (2002). Correlation: Parametric and nonparametric measures. 
Thousand Oaks, CA: Sage. 

Howell, D. C. (2006). Statistical methods for psychology (6th ed.). Belmont, CA: Duxbury. (Or you
might prefer his Fundamental Statistics for the Behavioral Sciences, also in its 6th edition, 2007.
Both are excellent texts that are a bit more technical than this book, so they are a useful next step.)

Miles, J. N. V., & Banyard, P. (2007). Understanding and using statistics in psychology: A practical
introduction. London: Sage. (A fantastic and amusing introduction to statistical theory.)

Wright, D. B., & London, K. (2009). First steps in statistics (2nd ed.). London: Sage. (This book is a
very gentle introduction to statistical theory.)


Interesting real research 


Chamorro-Premuzic, T., Furnham, A., Christopher, A. N., Garwood, J., & Martin, N. (2008). Birds 
of a feather: Students’ preferences for lecturers’ personalities as predicted by their own personality
and learning approaches. Personality and Individual Differences, 44, 965-976.





Regression 





FIGURE 7.1 Me playing with my ding-a-ling in the Holimarine Talent Show. Note the groupies queuing up at the front

7.1. What will this chapter tell me?


Although none of us can know the future, predicting it is so important that organisms 
are hard-wired to learn about predictable events in their environment. We saw in the 
previous chapter that I received a guitar for Christmas when I was 8. My first foray into 
public performance was a weekly talent show at a holiday camp called ‘Holimarine’ 
in Wales (it doesn’t exist any more because I am old and this was 1981). I sang a 
Chuck Berry song called ‘My Ding-a-ling’¹ and to my absolute amazement I won the
competition.² Suddenly other 8-year-olds across the land (well, a ballroom in Wales)
worshipped me (I made lots of friends after the competition). I had tasted success, it tasted
like praline chocolate, and so I wanted to enter the competition in the second week of
our holiday. To ensure success, I needed to know why I had won in the first week. One
way to do this would have been to collect data and to use these data to predict people’s
evaluations of children’s performances in the contest from certain variables: the age of
the performer, what type of performance they gave (singing, telling a joke, magic tricks),
and maybe how cute they looked. A regression analysis on these data would enable us
to predict future evaluations (success in next week’s competition) based on values of
the predictor variables. If, for example, singing was an important factor in getting a
good audience evaluation, then I could sing again the following week; however, if jokers
tended to do better then I could switch to a comedy routine. When I was 8 I wasn’t the
sad geek that I am today, so I didn’t know about regression analysis (nor did I wish to
know); however, my dad thought that success was due to the winning combination of a
cherub-looking 8-year-old singing songs that can be interpreted in a filthy way. He wrote
me a song to sing in the competition about the keyboard player in the Holimarine Band
‘messing about with his organ’, and first place was mine again. There’s no accounting
for taste.

1 It appears that even then I had a passion for lowering the tone of things that should be taken seriously.


7.2. An introduction to regression


In the previous chapter we looked at how to measure relationships between 
two variables. These correlations can be very useful, but we can take this process
a step further and predict one variable from another. A simple example
might be to try to predict levels of stress from the amount of time until you 
have to give a talk. You’d expect this to be a negative relationship (the smaller 
the amount of time until the talk, the larger the anxiety). We could then extend 
this basic relationship to answer a question such as ‘if there’s 10 minutes to go 
until someone has to give a talk, how anxious will they be?’ This is the essence 
of regression analysis: we fit a model to our data and use it to predict values 
of the dependent variable (DV) from one or more independent variables (IVs). 
Regression analysis is a way of predicting an outcome variable from one predictor variable 
(simple regression) or several predictor variables (multiple regression). This tool is incredibly
useful because it allows us to go a step beyond the data that we collected.
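To make the ‘10 minutes to go’ question concrete, here is a minimal sketch in R. The numbers are invented purely for illustration (they are not data from this book), and the variable names minutes and anxiety are assumptions.

```r
# Made-up data: minutes until the talk and a self-rated anxiety score
talk <- data.frame(
  minutes = c(5, 10, 15, 20, 30, 45, 60),
  anxiety = c(92, 85, 80, 71, 60, 41, 25)
)

# Simple regression: one outcome (anxiety), one predictor (minutes)
fit <- lm(anxiety ~ minutes, data = talk)
coef(fit)  # intercept and gradient; the gradient is negative for these data

# Use the fitted model to predict anxiety with 10 minutes to go
predicted <- predict(fit, newdata = data.frame(minutes = 10))
predicted
```

The call to predict() is what takes us a step beyond the data that were collected: it returns a value of the outcome for whatever value of the predictor we supply.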

In section 2.4.3 I introduced you to the idea that we can predict any data using the following
general equation:

$\text{outcome}_i = (\text{model}) + \text{error}_i$ (7.1)

This just means that the outcome we’re trying to predict for a particular person can be 
predicted by whatever model we fit to the data plus some kind of error. In regression, the 
model we fit is linear, which means that we summarize a data set with a straight line (think 
back to Jane Superbrain Box 2.1). As such, the word ‘model’ in the equation above simply 
gets replaced by ‘things’ that define the line that we fit to the data (see the next section). 

With any data set there are several lines that could be used to summarize the general
trend, and so we need a way to decide which of many possible lines to choose. For the sake
of making accurate predictions we want to fit a model that best describes the data. The
simplest way to do this would be to use your eye to gauge a line that looks as though it
summarizes the data well. You don’t need to be a genius to realize that the ‘eyeball’ method
is very subjective and so offers no assurance that the model is the best one that could have
been chosen. Instead, we use a mathematical technique called the method of least squares
to establish the line that best describes the data collected.

2 I have a very grainy video of this performance recorded by my dad’s friend on a video camera the size of a
medium-sized dog that had to be accompanied at all times by a battery pack the size and weight of a tank. Maybe
I’ll put it up on the companion website ...



7.2.1. Some important information about straight lines


I mentioned above that in our general equation the word ‘model’ gets replaced by ‘things
that define the line that we fit to the data’. In fact, any straight line can be defined by two
things: (1) the slope (or gradient) of the line (usually denoted by $b_1$); and (2) the point at
which the line crosses the vertical axis of the graph (known as the intercept of the line, $b_0$).
In fact, our general model becomes equation (7.2) below in which $Y_i$ is the outcome that
we want to predict and $X_i$ is the ith participant’s score on the predictor variable.³ Here $b_1$ is
the gradient of the straight line fitted to the data and $b_0$ is the intercept of that line. These
parameters $b_1$ and $b_0$ are known as the regression coefficients and will crop up time and
time again in this book, where you may see them referred to generally as b (without any
subscript) or $b_i$ (meaning the b associated with variable i). There is a residual term, $\varepsilon_i$, which
represents the difference between the score predicted by the line for participant i and the
score that participant i actually obtained. The equation is often conceptualized without this
residual term (so ignore it if it’s upsetting you); however, it is worth knowing that this term
represents the fact that our model will not fit the data collected perfectly:

$Y_i = (b_0 + b_1 X_i) + \varepsilon_i$ (7.2)

A particular line has a specific intercept and gradient. Figure 7.2 shows a set of lines that
have the same intercept but different gradients, and a set of lines that have the same gradient
but different intercepts. Figure 7.2 also illustrates another useful point: the gradient of
the line tells us something about the nature of the relationship being described. In Chapter
6 we saw how relationships can be either positive or negative (and I don’t mean the difference
between getting on well with your girlfriend and arguing all the time!). A line that
has a gradient with a positive value describes a positive relationship, whereas a line with a
negative gradient describes a negative relationship. So, if you look at the graph in Figure
7.2 in which the gradients differ but the intercepts are the same, then the red line describes
a positive relationship whereas the green line describes a negative relationship. Basically
then, the gradient ($b_1$) tells us what the model looks like (its shape) and the intercept ($b_0$)
tells us where the model is (its location in geometric space).

If it is possible to describe a line knowing only the gradient and the intercept of that
line, then we can use these values to describe our model (because in linear regression the
model we use is a straight line). So, the model that we fit to our data in linear regression
can be conceptualized as a straight line that can be described mathematically by equation
(7.2). With regression we strive to find the line that best describes the data collected, then
estimate the gradient and intercept of that line. Having defined these values, we can insert
different values of our predictor variable into the model to estimate the value of the outcome
variable.

3 You’ll sometimes see this equation written as:

$Y_i = (\beta_0 + \beta_1 X_i) + \varepsilon_i$

The only difference is that this equation has got $\beta$s in it instead of bs; both versions are the same thing,
they just use different letters to represent the coefficients.

FIGURE 7.2 Lines with the same gradients but different intercepts, and lines that share the same intercept but have different gradients (surviving panel title: ‘Same intercept, different gradient’; horizontal axis: Predictor)
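Inserting values of the predictor into the model is nothing more than evaluating the straight line. A minimal sketch follows; the intercept and gradient are arbitrary numbers chosen for illustration, and predict_outcome is a made-up helper name.

```r
# Hypothetical coefficients, chosen only to illustrate the arithmetic
b0 <- 50    # intercept: predicted outcome when the predictor is 0
b1 <- 2.5   # gradient: change in the outcome per unit change in the predictor

# The model part of equation (7.2), without the residual term
predict_outcome <- function(x, b0, b1) b0 + b1 * x

predict_outcome(c(0, 10, 20), b0, b1)
# 50 75 100
```

Changing b0 shifts every prediction by the same amount (the line moves up or down), whereas changing b1 alters how quickly the predictions change as the predictor increases.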


7.2.2. The method of least squares


I have already mentioned that the method of least squares is a way of finding the line that 
best fits the data (i.e., finding a line that goes through, or as close to, as many of the data 
points as possible). This ‘line of best fit’ is found by ascertaining which line, of all of the 
possible lines that could be drawn, results in the least amount of difference between the 
observed data points and the line. Figure 7.3 shows that when any line is fitted to a set of 
data, there will be small differences between the values predicted by the line and the data 
that were actually observed. 

Back in Chapter 2 we saw that we could assess the fit of a model (the example we used 
was the mean) by looking at the deviations between the model and the actual data collected.
These deviations were the vertical distances between what the model predicted and
each data point that was actually observed. We can do exactly the same to assess the fit of 
a regression line (which, like the mean, is a statistical model). So, again we are interested in 
the vertical differences between the line and the actual data because the line is our model: 
we use it to predict values of Y from values of the X variable. In regression these differences 
are usually called residuals rather than deviations, but they are the same thing. As with 
the mean, data points fall both above (the model underestimates their value) and below 
(the model overestimates their value) the line, yielding both positive and negative differences.
In the discussion of variance in section 2.4.2 I explained that if we sum positive and
negative differences then they tend to cancel each other out and that to circumvent this 
problem we square the differences before adding them up. We do the same thing here. The 
resulting squared differences provide a gauge of how well a particular line fits the data: if 
the squared differences are large, the line is not representative of the data; if the squared 
differences are small, the line is representative. 
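The idea of using squared differences as a gauge of fit can be tried on a toy example. In this sketch the data are invented (advertising budget and sales, both in thousands), the two candidate lines are arbitrary, and sum_sq is a made-up helper name.

```r
# Invented data: advertising budget and album sales (both in thousands)
adverts <- c(10, 20, 30, 40, 50, 60, 70, 80)
sales   <- c(120, 150, 140, 190, 210, 200, 260, 250)

# Sum of squared vertical differences between the data and a given line
sum_sq <- function(b0, b1) {
  residuals <- sales - (b0 + b1 * adverts)  # observed minus predicted
  sum(residuals^2)                          # squaring stops +/- cancelling out
}

ss_line_a <- sum_sq(b0 = 100, b1 = 2)    # 1600
ss_line_b <- sum_sq(b0 = 150, b1 = 0.5)  # 13200
ss_line_a < ss_line_b                    # TRUE: line A represents the data better
```

Neither line is claimed to be the best possible one; the comparison only shows that the smaller total belongs to the line that sits closer to the points.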

You could, if you were particularly bored, calculate the sum of squared differences (or
SS for short) for every possible line that is fitted to your data and then compare these
‘goodness-of-fit’ measures. The one with the lowest SS is the line of best fit. Fortunately
we don’t have to do this because the method of least squares does it for us: it selects the
line that has the lowest sum of squared differences (i.e., the line that best represents the
observed data). It does this using a mathematical technique for finding maxima and
minima, which here is used to find the line that minimizes the sum of
squared differences. I don’t really know much more about it than that to be honest, so I
tend to think of the process as a little bearded wizard called Nephwick the Line Finder who
just magically finds lines of best fit. Yes, he lives inside your computer. The end result is
that Nephwick estimates the value of the slope and intercept of the ‘line of best fit’ for you.
We tend to call this line of best fit a regression line (or more generally a regression model).

FIGURE 7.3 This graph shows a scatterplot of some data with a line representing the general trend. The vertical lines (dotted) represent the differences (or residuals) between the line and the actual data
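For the curious, Nephwick’s trick can be sketched for simple regression. The closed-form expressions used below (gradient = covariance of X and Y divided by the variance of X; intercept = mean of Y minus gradient times mean of X) are the standard least-squares solution rather than something derived in this chapter, and the data are invented.

```r
# Invented data: advertising budget and album sales (both in thousands)
adverts <- c(10, 20, 30, 40, 50, 60, 70, 80)
sales   <- c(120, 150, 140, 190, 210, 200, 260, 250)

# Standard closed-form least-squares estimates for one predictor
b1 <- cov(adverts, sales) / var(adverts)
b0 <- mean(sales) - b1 * mean(adverts)

# lm() finds the same line
fit <- lm(sales ~ adverts)
coef(fit)

# Nudging either coefficient away from the estimates increases the SS
sum_sq <- function(b0, b1) sum((sales - (b0 + b1 * adverts))^2)
sum_sq(b0, b1) < sum_sq(b0 + 1, b1)    # TRUE
sum_sq(b0, b1) < sum_sq(b0, b1 + 0.1)  # TRUE
```

The last two lines are a small sanity check that the fitted line really does sit at the minimum: moving the intercept or the gradient in any direction makes the sum of squared differences larger.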


7.2.3. Assessing the goodness of fit: sums of squares, R and R²


Once Nephwick the Line Finder has found the line of best fit it is important that we assess 
how well this line fits the actual data (we assess the goodness of fit of the model). We do this 
because even though this line is the best one available, it can still be a lousy fit to the data. 
In section 2.4.2 we saw that one measure of the adequacy of a model is the sum of squared 
differences (or more generally we assess models using equation (7.3) below). If we want to 
assess the line of best fit, we need to compare it against something, and the thing we choose 
is the most basic model we can find. So we use equation (7.3) to calculate the fit of the most 
basic model, and then the fit of the best model (the line of best fit), and basically if the best 
model is any good then it should fit the data significantly better than our basic model: 

$\text{deviation} = \sum (\text{observed} - \text{model})^2$ (7.3)

This is all quite abstract so let’s look at an example. Imagine that I was interested in
predicting physical and downloaded album sales (Y) from the amount of money spent
advertising that album (X). One day my boss came into my office and said ‘Andy, I know
you wanted to be a rock star and you’ve ended up working as my stats-monkey, but how
many albums will we sell if we spend £100,000 on advertising?’ If I didn’t
have an accurate model of the relationship between album sales and advertising,
what would my best guess be? Well, probably the best answer I could give
would be the mean number of album sales (say, 200,000) because on average
that’s how many albums we expect to sell. This response might well satisfy a
brainless record company executive (who didn’t offer my band a recording
contract). However, what if he had asked ‘How many albums will we sell if we
spend £1 on advertising?’ Again, in the absence of any accurate information,
my best guess would be to give the average number of sales (200,000). There
is a problem: whatever amount of money is spent on advertising I always predict the same
levels of sales. As such, the mean is a model of ‘no relationship’ at all between the variables.
It should be pretty clear then that the mean is fairly useless as a model of a relationship
between two variables - but it is the simplest model available.

So, as a basic strategy for predicting the outcome, we might choose to use the mean, because
on average it will be a fairly good guess of an outcome. Using the mean as a model, we can
calculate the difference between the observed values, and the values predicted by the mean
(equation (7.3)). We saw in section 2.4.1 that we square all of these differences to give us the sum
of squared differences. This sum of squared differences is known as the total sum of squares
(denoted $SS_T$) because it is the total amount of differences present when the most basic model
is applied to the data. This value represents how good the mean is as a model of the observed
data. Now, if we fit the more sophisticated model to the data, such as a line of best fit, we can
again work out the differences between this new model and the observed data (again using
equation (7.3)). In the previous section we saw that the method of least squares finds the best
possible line to describe a set of data by minimizing the difference between the model fitted
to the data and the data themselves. However, even with this optimal model there is still some
inaccuracy, which is represented by the differences between each observed data point and the
value predicted by the regression line. As before, these differences are squared before they are
added up so that the directions of the differences do not cancel out. The result is known as the
sum of squared residuals or residual sum of squares ($SS_R$). This value represents the degree of
inaccuracy when the best model is fitted to the data. We can use these two values to calculate
how much better the regression line (the line of best fit) is than just using the mean as a model
(i.e., how much better is the best possible model than the worst model?). The improvement
in prediction resulting from using the regression model rather than the mean is obtained by
calculating the difference between $SS_T$ and $SS_R$. This difference shows us the reduction in the
inaccuracy of the model resulting from fitting the regression model to the data. This improvement
is the model sum of squares ($SS_M$). Figure 7.4 shows each sum of squares graphically.
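The three sums of squares can be computed directly from a fitted model. A minimal sketch with invented advertising and sales figures (in thousands); the object names ss_t, ss_r and ss_m are just labels chosen here.

```r
# Invented data: advertising budget and album sales (both in thousands)
adverts <- c(10, 20, 30, 40, 50, 60, 70, 80)
sales   <- c(120, 150, 140, 190, 210, 200, 260, 250)

fit <- lm(sales ~ adverts)

# Total SS: observed data vs. the most basic model (the mean)
ss_t <- sum((sales - mean(sales))^2)

# Residual SS: observed data vs. the regression line
ss_r <- sum((sales - fitted(fit))^2)

# Model SS: the improvement from using the line instead of the mean
ss_m <- ss_t - ss_r

# Equivalently, SS_M compares the regression line with the mean
all.equal(ss_m, sum((fitted(fit) - mean(sales))^2))  # TRUE
c(SS_T = ss_t, SS_R = ss_r, SS_M = ss_m)
```

The check on the second-to-last line mirrors the three panels of Figure 7.4: the model sum of squares is the same whether you get it by subtraction or by comparing the line with the mean directly.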

If the value of $SS_M$ is large then the regression model is very different from using the
mean to predict the outcome variable. This implies that the regression model has made a big
improvement to how well the outcome variable can be predicted. However, if $SS_M$ is small
then using the regression model is little better than using the mean (i.e., the regression model
is no better than taking our ‘best guess’). A useful measure arising from these sums of squares
is the proportion of improvement due to the model. This is easily calculated by dividing the
sum of squares for the model by the total sum of squares. The resulting value is called $R^2$ and
to express this value as a percentage you should multiply it by 100. $R^2$ represents the amount
of variance in the outcome explained by the model ($SS_M$) relative to how much variation
there was to explain in the first place ($SS_T$). Therefore, as a percentage, it represents the
percentage of the variation in the outcome that can be explained by the model:

$R^2 = \frac{SS_M}{SS_T}$ (7.4)


This $R^2$ is the same as the one we met in Chapter 6 (section 6.5.4.3) and you might have
noticed that it is interpreted in the same way. Therefore, in simple regression we can take
the square root of this value to obtain Pearson’s correlation coefficient. As such, the correlation
coefficient provides us with a good estimate of the overall fit of the regression
model, and $R^2$ provides us with a good gauge of the substantive size of the relationship.

FIGURE 7.4 Diagram showing from where the regression sums of squares derive. Panels: $SS_T$ uses the differences between the observed data and the mean value of Y; $SS_R$ uses the differences between the observed data and the regression line; $SS_M$ uses the differences between the mean value of Y and the regression line
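As a quick check of these relationships, the following sketch computes $R^2$ from the sums of squares and compares it with what R reports. The data are invented for illustration.

```r
# Invented data: advertising budget and album sales (both in thousands)
adverts <- c(10, 20, 30, 40, 50, 60, 70, 80)
sales   <- c(120, 150, 140, 190, 210, 200, 260, 250)

fit  <- lm(sales ~ adverts)
ss_t <- sum((sales - mean(sales))^2)
ss_m <- sum((fitted(fit) - mean(sales))^2)

# Equation (7.4): proportion of variation explained by the model
r_squared <- ss_m / ss_t
all.equal(r_squared, summary(fit)$r.squared)         # TRUE

# In simple regression the square root gives the size of Pearson's r
all.equal(sqrt(r_squared), abs(cor(adverts, sales))) # TRUE

r_squared * 100  # the same value expressed as a percentage
```

One caveat worth noting: a square root is never negative, so it recovers only the magnitude of r; the direction of the relationship comes from the sign of the gradient.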

A second use of the sums of squares in assessing the model is through the F-test. I
mentioned way back in Chapter 2 that test statistics (like F) are usually the amount of
systematic variance divided by the amount of unsystematic variance, or, put another way,
the model compared against the error in the model. This is true here: F is based upon the
ratio of the improvement due to the model ($SS_M$) and the difference between the model
and the observed data ($SS_R$). Actually, because the sums of squares depend on the number
of differences that we have added up, we use the average sums of squares (referred to as
the mean squares or MS). To work out the mean sums of squares we divide by the degrees
of freedom (this is comparable to calculating the variance from the sums of squares - see
section 2.4.2). For $SS_M$ the degrees of freedom are simply the number of variables in the
model, and for $SS_R$ they are the number of observations minus the number of parameters
being estimated (i.e., the number of beta coefficients including the constant). The result is
the mean squares for the model ($MS_M$) and the residual mean squares ($MS_R$). At this stage
it isn’t essential that you understand how the mean squares are derived (it is explained in
Chapter 10). However, it is important that you understand that the F-ratio (equation (7.5))
is a measure of how much the model has improved the prediction of the outcome compared
to the level of inaccuracy of the model.


$F = \frac{MS_M}{MS_R}$ (7.5)


If a model is good, then we expect the improvement in prediction due to the model to be
large (so $MS_M$ will be large) and the difference between the model and the observed data to
be small (so $MS_R$ will be small). In short, a good model should have a large F-ratio (greater
than 1 at least) because the top of equation (7.5) will be bigger than the bottom. The exact
magnitude of this F-ratio can be assessed using critical values for the corresponding degrees
of freedom (as in the Appendix).
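The F-ratio can be assembled by hand from the sums of squares and their degrees of freedom. This is a sketch with invented data; instead of looking up a critical value in a table, it uses pf() to get the probability of an F at least this large under the null hypothesis.

```r
# Invented data: advertising budget and album sales (both in thousands)
adverts <- c(10, 20, 30, 40, 50, 60, 70, 80)
sales   <- c(120, 150, 140, 190, 210, 200, 260, 250)

fit  <- lm(sales ~ adverts)
ss_m <- sum((fitted(fit) - mean(sales))^2)
ss_r <- sum(residuals(fit)^2)

df_m <- 1                  # number of predictors in the model
df_r <- length(sales) - 2  # observations minus parameters (b0 and b1)

ms_m <- ss_m / df_m        # mean squares for the model
ms_r <- ss_r / df_r        # residual mean squares

f_ratio <- ms_m / ms_r     # equation (7.5)
p_value <- pf(f_ratio, df_m, df_r, lower.tail = FALSE)

# Same F as reported by summary()
all.equal(f_ratio, summary(fit)$fstatistic[["value"]])  # TRUE
```

Working through the pieces this way makes it clear that F is large only when the improvement per model degree of freedom dwarfs the leftover error per residual degree of freedom.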


7.2.4. Assessing individual predictors


We’ve seen that the predictor in a regression model has a coefficient ($b_1$), which in simple
regression represents the gradient of the regression line. The value of b represents the
change in the outcome resulting from a unit change in the predictor. If the model was
useless at predicting the outcome, then if the value of the predictor changes, what might
we expect the change in the outcome to be? Well, if the model is very bad then we would
expect the change in the outcome to be zero. Think back to Figure 7.4 (see the panel
representing $SS_T$) in which we saw that using the mean was a very bad way of predicting
the outcome. In fact, the line representing the mean is flat, which means that as the
predictor variable changes, the value of the outcome does not change (because for each
level of the predictor variable, we predict that the outcome will equal the mean value).
The important point here is that a bad model (such as the mean) will have regression
coefficients of 0 for the predictors. A regression coefficient of 0 means: (1) a unit change
in the predictor variable results in no change in the predicted value of the outcome (the
predicted value of the outcome does not change at all); and (2) the gradient of the regression
line is 0, meaning that the regression line is flat. Hopefully, you’ll see that it logically
follows that if a variable significantly predicts an outcome, then it should have a b-value
significantly different from zero. This hypothesis is tested using a t-test (see Chapter 9).
The t-statistic tests the null hypothesis that the value of b is 0: therefore, if it is significant
we gain confidence in the hypothesis that the b-value is significantly different from 0 and
that the predictor variable contributes significantly to our ability to estimate values of
the outcome.

Like F, the t-statistic is also based on the ratio of explained variance against unexplained
variance or error. Well, actually, what we’re interested in here is not so much
variance but whether the b we have is big compared to the amount of error in that
estimate. To estimate how much error we could expect to find in b we use the standard
error. The standard error tells us something about how different b-values would be
across different samples. We could take lots and lots of samples of data regarding
album sales and advertising budgets and calculate the b-values for each sample. We
could plot a frequency distribution of these samples to discover whether the b-values
from all samples would be relatively similar, or whether they would be very different
(think back to section 2.5.1). We can use the standard deviation of this distribution
(known as the standard error) as a measure of the similarity of b-values across samples.
If the standard error is very small, then it means that most samples are likely to have
a b-value similar to the one in our sample (because there is little variation across samples).
The t-test tells us whether the b-value is different from 0 relative to the variation
in b-values across samples. When the standard error is small even a small deviation
from zero can reflect a meaningful difference because b is representative of the majority
of possible samples.

Equation (7.6) shows how the t-test is calculated and you’ll find a general version of
this equation in Chapter 9 (equation (9.1)). The b_expected term is simply the value of b that
we would expect to obtain if the null hypothesis were true. I mentioned earlier that the
null hypothesis is that b is 0 and so this value can be replaced by 0. The equation simplifies
to become the observed value of b divided by the standard error with which it is
associated:

$$t = \frac{b_{\text{observed}} - b_{\text{expected}}}{SE_b} = \frac{b_{\text{observed}}}{SE_b} \qquad (7.6)$$



The values of t have a special distribution that differs according to the degrees of freedom
for the test. In regression, the degrees of freedom are N − p − 1, where N is the total
sample size and p is the number of predictors. In simple regression we have only
one predictor, so this gives N − 2. Having established which t-distribution needs to be
used, the observed value of t can then be compared to the values that we would expect to
find if there was no effect (i.e., b = 0): if t is very large then it is unlikely to have occurred
when there is no effect (these values can be found in the Appendix). R provides the exact
probability that the observed value (or a larger one) of t would occur if the value of b was,
in fact, 0. As a general rule, if this observed significance is less than .05, then scientists
assume that b is significantly different from 0; put another way, the predictor makes a
significant contribution to predicting the outcome.
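
As a rough check on equation (7.6), the sketch below divides a b-value by its standard error and looks up the two-tailed probability with pt(). The numbers are copied from the adverts row of Output 7.1 later in this chapter (b = 0.09612, standard error = 0.009632, N = 200); the object names are only illustrative.

```r
# Values copied from the adverts row of Output 7.1
b  <- 0.09612    # observed b-value
se <- 0.009632   # standard error of b
N  <- 200        # number of albums (rows in the data)
p  <- 1          # number of predictors

t_value <- (b - 0) / se                # b_expected is 0 under the null hypothesis
df      <- N - p - 1                   # degrees of freedom: 200 - 1 - 1 = 198
p_value <- 2 * pt(-abs(t_value), df)   # two-tailed probability of a t this extreme

round(t_value, 3)   # close to the t value that R reports for adverts
p_value < .001      # TRUE
```

Because the values in the output are rounded, a hand calculation like this should agree with R only to about three decimal places.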


7.3. Packages used in this chapter


There are several packages we will use in this chapter. Some, but not all, can be accessed 
through R Commander. You will need the packages boot (for bootstrapping), car (for 
regression diagnostics) and QuantPsyc (to get standardized regression coefficients). If you 
don’t have these packages installed you’ll need to install them (boot comes pre-installed) 
by executing: 

install.packages("car"); install.packages("QuantPsyc")

Then you need to load the packages by executing these commands (although boot is
installed with the base stats package, you still need to load it): 

library(boot); library(car); library(QuantPsyc) 


7.4. General procedure for regression in R

Doing simple regression using R Commander


So far, we have seen a little of the theory behind regression, albeit restricted to the situation 
in which there is only one predictor. To help clarify what we have learnt so far, we will go 
through an example of a simple regression using R. Earlier on I asked you to imagine that 
I worked for a record company and that my boss was interested in predicting album sales 
from advertising. There are some data for this example in the file Album Sales 1.dat.

To conduct a regression analysis using R Commander, first initiate the package by executing
the command:

library(Rcmdr) 

Once you have initiated the package, you need to load the data file into R. You can read
Album Sales 1.dat into R Commander by using Data => Import data => from text file,
clipboard, or URL... (see section 3.7.3). We can click on the View data set button to look at the
data and check they were read into R properly. Figure 7.5 shows the data: there are 200 rows,
each one representing a different album. There are also two columns, one representing
the sales of each album (in thousands) in the week after release and the other representing
the amount (in thousands of pounds) spent promoting the album before release.
This is the format for entering regression data: the outcome variable and any predictors
should be entered in different columns, and each row should represent independent values
of those variables.

FIGURE 7.5 [listing of the album sales data; not reproduced]

The pattern of the data is shown in Figure 7.6, and it should be clear that a positive 
relationship exists: so, the more money spent advertising the album, the more it is likely 
to sell. Of course there are some albums that sell well regardless of advertising (top left 
of scatterplot), but there are none that sell badly when advertising levels are high (bottom 
right of scatterplot). The scatterplot also shows the line of best fit for these data: bearing in 
mind that the mean would be represented by a flat line at around the 200,000 sales mark, 
the regression line is noticeably different. 



FIGURE 7.6
Scatterplot showing the relationship between album sales and the amount spent promoting the album [x-axis: Amount Spent on Adverts (thousands of pounds)]


To find out the parameters that describe the regression line, and to see whether this
line is a useful model, we need to run a regression analysis. In R Commander, choose
Statistics => Fit models => Linear regression to activate the linear regression dialog box
(Figure 7.7). On the left we choose a response variable - this is the outcome, or dependent
variable. On the right we choose an explanatory (predictor, or independent) variable.
In this case our outcome is sales so we have highlighted this variable in the list labelled
Response variable (pick one), and the predictor variable is adverts, so we have selected
this variable in the list labelled Explanatory variables (pick one or more). At the top of the
dialog box, there is a box labelled Enter name for model: by default R Commander has
named the model albumSales.1. By replacing the text in this box we can change the name
of the model, for example, to albumSalesModel or whatever makes sense to you. When you
have selected your variables and named the model, click on OK. The resulting output is
described in section 7.5.

FIGURE 7.7
Linear regression in R Commander

Regression in R


First load the data file by setting your working directory to the location of the file (see section
3.4.4) and executing:

album1<-read.delim("Album Sales 1.dat", header = TRUE)

We run a regression analysis using the lm() function - lm stands for ‘linear model’. This 
function takes the general form: 

newModel<-lm(outcome ~ predictor(s), data = dataFrame, na.action = an action)
in which: 

• newModel is an object created that contains information about the model. We can get
summary statistics for this model by executing summary(newModel) and
summary.lm(newModel) for specific parameters of the model.

• outcome is the variable that you’re trying to predict, also known as the dependent 
variable. In this example it will be the variable sales. 

• predictor(s) lists the variable or variables from which you’re trying to predict the 
outcome variable. In this example it will be the variable adverts. In more complex 
designs we can specify several predictors but we’ll come to that in due course. 

• dataFrame is the name of the dataframe from which your outcome and predictor 
variables come. 

• na.action is an optional command. If you have complete data (as we have here) you 
can ignore it, but if you have missing values (i.e., NAs in the dataframe) then it can 
be useful - see R’s Souls’ Tip 19.2.

The important part to note (especially important because many analyses in the rest of the
book use some variant of lm()) is that within the function we write a formula that specifies
the model that we want to estimate. This model takes the form:

outcome variable ~ predictor variable 
in which ~ (tilde) means ‘predicted from’. (We can’t write ‘=’ because that would confuse
R, plus we’re not saying the outcome is equal to the predictor, just that the outcome has 
something to do with the predictor.) 

As with functions we have come across before, we can reference variables in two ways:
we can either put the whole variable name, including the dataframe:

albumSales.1 <- lm(album1$sales ~ album1$adverts)

or we can tell R what dataframe to use (using data = nameOfDataFrame), and then specify
the variables without the dataFrameName$ before them:

albumSales.1 <- lm(sales ~ adverts, data = album1)

I prefer this second method, but both of these commands create an object called albumSales.1
that contains information about the model. (Note that the command we have just
written is the same as the command that R Commander generates for us using the menus.)
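
To see that the two ways of referencing variables lead to the same model, here is a minimal sketch that can be run without the album sales file. The dataframe demoData and its values are made up for illustration (they are not the real data), and the model names are hypothetical.

```r
# Made-up data standing in for the album sales file
set.seed(42)
demoData <- data.frame(adverts = runif(50, min = 0, max = 2000))
demoData$sales <- 130 + 0.1 * demoData$adverts + rnorm(50, mean = 0, sd = 60)

# Method 1: name the dataframe in front of every variable
demoModel.a <- lm(demoData$sales ~ demoData$adverts)

# Method 2: name the dataframe once with data =
demoModel.b <- lm(sales ~ adverts, data = demoData)

# Both methods estimate the same intercept and slope
# (only the label attached to the slope differs)
coef(demoModel.a)
coef(demoModel.b)
```

One practical reason to prefer the second form is that functions such as predict() can then be handed a new dataframe containing a column called adverts.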



Missing data


Often data sets have missing data, which might be denoted with a placeholder such as ‘NA’, ‘Missing’, or a 
number that denotes missing such as 9999. As we have seen before, when missing data are imported into R you 
typically get an NA in your dataframe to denote the missing value. 

If you try to estimate a model with dataframes that have missing values, lm() needs to know what to do with
the NAs that it finds in the data. You can add na.action = action to the function to tell it explicitly. There are
two main options:


1. na.action = na.fail: This means that if there are any missing values the model will fail to compute and you
will get an error.

2. na.action = na.omit or na.exclude: This estimates the model but excludes any case that has any missing
data on any variable in the model (this is sometimes known as casewise deletion). In a standard R installation
na.omit is what lm() uses if you don’t specify na.action. The differences between the two are subtle: they
estimate the same model, but na.exclude keeps a placeholder NA for each excluded case when you ask for
residuals or predicted values, so these stay lined up with the rows of the dataframe.


Therefore, if we had missing values in the data we could specify our album sales model as:

albumSales.1 <- lm(sales ~ adverts, data = album1, na.action = na.exclude)
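
One difference between na.omit and na.exclude that can be checked directly is the length of the residuals that come back from the model. The sketch below uses a tiny made-up dataframe (not the album sales data) with one missing value; all object names are illustrative.

```r
# Made-up data with one missing advertising value (row 3)
demoData <- data.frame(
  adverts = c(10, 50, NA, 120, 200, 310),
  sales   = c(120, 150, 160, 170, 210, 230)
)

modelOmit    <- lm(sales ~ adverts, data = demoData, na.action = na.omit)
modelExclude <- lm(sales ~ adverts, data = demoData, na.action = na.exclude)

# The estimated coefficients are identical ...
coef(modelOmit)
coef(modelExclude)

# ... but the residuals are returned differently
length(resid(modelOmit))      # 5: the incomplete case is dropped
length(resid(modelExclude))   # 6: an NA is kept in place for row 3
```

Keeping the NA in place is convenient if you later want to add residuals or predicted values back to the original dataframe as a new column.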


7.5. Interpreting a simple regression


We have created an object called albumSales.1 that contains the results of our analysis. We can
show the object by executing:

summary(albumSales.1)

which displays the information in Output 7.1. 


Call:
lm(formula = sales ~ adverts, data = album1)

Residuals:
     Min       1Q   Median       3Q      Max
-152.949  -43.796   -0.393   37.040  211.866

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.341e+02  7.537e+00  17.799   <2e-16 ***
adverts     9.612e-02  9.632e-03   9.979   <2e-16 ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 65.99 on 198 degrees of freedom
Multiple R-squared: 0.3346, Adjusted R-squared: 0.3313
F-statistic: 99.59 on 1 and 198 DF, p-value: < 2.2e-16

Output 7.1


Overall fit of the object model


Let’s start at the bottom of Output 7.1: 

Multiple R-squared: 0.3346, Adjusted R-squared: 0.3313 

This part of the output provides the value of R² and adjusted R² for the model that has
been derived. For these data, R² has a value of .335. Because there is only one predictor,
this value represents the square of the simple correlation between advertising and album
sales - we can find the square root of R² by running:

sqrt(0.3346) 

Which R tells us is: 

[1] 0.5784462 

The Pearson correlation coefficient is, therefore, 0.58. (You can confirm this by running
a correlation using what you were taught in Chapter 6.) The value of R² of .335 also tells
us that advertising expenditure can account for 33.5% of the variation in album sales.
In other words, if we are trying to explain why some albums sell more than others, we
can look at the variation in sales of different albums. There might be many factors that
can explain this variation, but our model, which includes only advertising expenditure, can
explain approximately 33% of it. This means that 67% of the variation in album sales cannot
be explained by advertising alone. Therefore, there must be other variables that have
an influence also.

The next part of Output 7.1 reports the results of an analysis of variance (ANOVA - see 
Chapter 10): 

F-statistic: 99.59 on 1 and 198 DF, p-value: < 2.2e-16 

It doesn’t give us all of the sums of squares, it just gives the important part: the F-ratio,
which is calculated using equation (7.5), and the associated significance value of that
F-ratio. For these data, F is 99.59, which is significant at p < .001⁴ (because the value
labelled p-value is less than .001). This result tells us that there is less than a 0.1% chance
that an F-ratio this large would happen if the null hypothesis were true. Therefore, we
can conclude that our regression model results in significantly better prediction of album
sales than if we used the mean value of album sales. In short, the regression model overall
predicts album sales significantly well.

⁴ Remember that when R wants to show small or large numbers it uses exponential notation. So 2.2e-16
means “2.2 with the decimal place moved 16 places to the left, and add zeros as necessary”, which means:
0.00000000000000022. That’s a very small number indeed.
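
The probability attached to the F-ratio can be reproduced from the numbers in Output 7.1 with pf(). The sketch below also illustrates a general property of regression with a single predictor: squaring the t for that predictor gives the F-ratio for the model, apart from rounding. The object names are illustrative.

```r
# Values copied from Output 7.1
F_ratio <- 99.59
df1 <- 1      # degrees of freedom for the model
df2 <- 198    # residual degrees of freedom

# Probability of an F-ratio at least this large if the null hypothesis is true
p_value <- pf(F_ratio, df1, df2, lower.tail = FALSE)
p_value < .001        # TRUE

# With a single predictor, squaring the t for that predictor gives
# (to rounding error) the F-ratio for the model
t_adverts <- 9.979
t_adverts^2
```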


7.5.2. Model parameters


The ANOVA tells us whether the model, overall, results in a significantly good degree of
prediction of the outcome variable. However, the ANOVA doesn’t tell us about the individual
contribution of variables in the model (although in this simple case there is only one
variable in the model and so we can infer that this variable is a good predictor).

The final part of Output 7.1 that we will look at (for now) is the part labelled Coefficients.
This part contains the model parameters (the beta values) and the significance of these values.
We saw in equation (7.2) that b₀ was the Y intercept and this value is the value in the Estimate
column for the (Intercept). (Notice that R puts Intercept in brackets, because it’s in a list of
variables, but it’s not a real variable.) So, from the table shown in Output 7.1, we can say that
b₀ is 134.1, and this can be interpreted as meaning that when no money is spent on advertising
(when X = 0), the model predicts that 134,100 albums will be sold (remember that our unit of
measurement was thousands of albums). We can also read off the value of b₁ from the
row labelled adverts and this value represents the gradient of the regression line. It is
0.096. Although this value is the slope of the regression line, it is more useful to think of
this value as representing the change in the outcome associated with a unit change in the
predictor. Therefore, if our predictor variable is increased by one unit (if the advertising
budget is increased by 1), then our model predicts that 0.096 units of extra albums will be
sold. Our units of measurement were thousands of pounds and thousands of albums sold,
so we can say that for an increase in advertising of £1000 the model predicts 96 (0.096 ×
1000 = 96) extra album sales. As you might imagine, this investment is pretty bad for the
record company: it invests £1000 and gets only 96 extra sales.

We saw earlier that, in general, values of the regression coefficient b represent the change
in the outcome resulting from a unit change in the predictor and that if a predictor is having a
significant impact on our ability to predict the outcome then this b should be different from 0
(and big relative to its standard error). We also saw that the t-test tells us whether the b-value
is different from 0. R provides the exact probability that the observed value of t would occur
if the value of b in the population were 0. If this observed significance is less than .05, then
scientists agree that the result reflects a genuine effect (see Chapter 2). For these two values,
the probabilities are <2e-16 (which means 15 zeros, followed by a 2) and so we can say
that the probability of these t-values (or larger) occurring if the values of b in the population
were 0 is less than .001. Therefore, the bs are different from 0 and we can conclude that the
advertising budget makes a significant contribution (p < .001) to predicting album sales.




SELF-TEST

How is the t in Output 7.1 calculated? Use the
values in the output to see if you can get the same
value as R.


Using the model


So far, we have discovered that we have a useful model, one that significantly improves our
ability to predict album sales. However, the next stage is often to use that model to make
some predictions. The first stage is to define the model by replacing the b-values in equation
(7.2) with the values from the output. In addition, we can replace the X and Y with the
variable names so that the model becomes:

$$\begin{aligned}\text{album sales}_i &= b_0 + b_1\,\text{advertising budget}_i \\ &= 134.14 + (0.096 \times \text{advertising budget}_i)\end{aligned} \qquad (7.7)$$

It is now possible to make a prediction about album sales, by replacing the advertising budget
with a value of interest. For example, imagine a recording company executive wanted
to spend £100,000 on advertising a new album. Remembering that our units are already
in thousands of pounds, we can simply replace the advertising budget with 100. He would
discover that album sales should be around 144,000 for the first week of sales:

$$\begin{aligned}\text{album sales}_i &= 134.14 + (0.096 \times \text{advertising budget}_i) \\ &= 134.14 + (0.096 \times 100) \\ &= 143.74\end{aligned} \qquad (7.8)$$
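
The same arithmetic can be wrapped in a small function so that any budget can be plugged in. This is only a sketch built from the rounded b-values in equation (7.7); predictSales is a hypothetical helper name, not something provided by R or by this chapter’s packages.

```r
# b-values taken from equation (7.7)
b0 <- 134.14   # intercept
b1 <- 0.096    # gradient for advertising budget

# Hypothetical helper: predicted sales (in thousands of albums)
# for an advertising budget given in thousands of pounds
predictSales <- function(advertsThousands) {
  b0 + b1 * advertsThousands
}

predictSales(100)             # 143.74, as in equation (7.8)
predictSales(c(0, 50, 100))   # the function works on several budgets at once
```

Because the function uses rounded coefficients, its answers can differ slightly from predictions made directly from the fitted model object.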



SELF-TEST

How many units would be sold if we spent £666,000
on advertising the latest album by black metal band
Abgott?



CRAMMING SAM’S TIPS

Simple regression

• Simple regression is a way of predicting values of one variable from another.

• We do this by fitting a statistical model to the data in the form of a straight line.

• This line is the line that best summarizes the pattern of the data.

• We have to assess how well the line fits the data using:

o R², which tells us how much variance is explained by the model compared to how much variance there is to explain in the
first place. It is the proportion of variance in the outcome variable that is shared by the predictor variable.

o F, which tells us how much variability the model can explain relative to how much it can’t explain (i.e., it’s the ratio of how
good the model is compared to how bad it is).

• The b-value tells us the gradient of the regression line and the strength of the relationship between a predictor and the outcome
variable. If it is significant (Pr(>|t|) < .05 in the R output) then the predictor variable significantly predicts the outcome
variable.


7.6. Multiple regression: the basics


To summarize what we have learnt so far, in simple linear regression the outcome
variable Y is predicted using the equation of a straight line (equation (7.2)).
Given that we have collected several values of Y and X, the unknown parameters
in the equation can be calculated. They are calculated by fitting a model to the
data (in this case a straight line) for which the sum of the squared differences
between the line and the actual data points is minimized. This method is called
the method of least squares. Multiple regression is a logical extension of these
principles to situations in which there are several predictors. Again, we still use
our basic equation:

$$\text{outcome}_i = (\text{model}) + \text{error}_i$$


but this time the model is slightly more complex. It is basically the same as for simple 
regression except that for every extra predictor you include, you have to add a coefficient; 
so, each predictor variable has its own coefficient, and the outcome variable is predicted 
from a combination of all the variables multiplied by their respective coefficients plus a 
residual term (see equation (7.9) - the brackets aren’t necessary, they’re just to make the 
connection to the general equation above): 


$$Y_i = (b_0 + b_1 X_{1i} + b_2 X_{2i} + \dots + b_n X_{ni}) + \varepsilon_i \qquad (7.9)$$

Y is the outcome variable, b₁ is the coefficient of the first predictor (X₁), b₂ is the coefficient
of the second predictor (X₂), bₙ is the coefficient of the nth predictor (Xₙ), and εᵢ is the difference
between the predicted and the observed value of Y for the ith participant. In this
case, the model fitted is more complicated, but the basic principle is the same as simple
regression. That is, we seek to find the linear combination of predictors that correlate
maximally with the outcome variable. Therefore, when we refer to the regression model in
multiple regression, we are talking about a model in the form of equation (7.9).
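
To make equation (7.9) concrete, the sketch below fits a model with two predictors and then rebuilds the predicted values by hand from the coefficients. The data are simulated purely for illustration (they are not the album sales data), and every object name here is an assumption of the example.

```r
# Made-up data with two predictors (not the album sales data)
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 5 + 2 * x1 - 1 * x2 + rnorm(n)
demoData <- data.frame(y, x1, x2)

demoModel <- lm(y ~ x1 + x2, data = demoData)
b <- coef(demoModel)     # b0, b1 and b2, in that order

# Predicted values built by hand from equation (7.9), without the error term
byHand <- b[1] + b[2] * demoData$x1 + b[3] * demoData$x2

# These match the predicted values that R stores in the model
all.equal(unname(byHand), unname(fitted(demoModel)))   # TRUE
```

The difference between each observed y and its predicted value is the residual, which plays the role of the error term in the equation.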



An example of a multiple regression model


Imagine that our recording company executive was interested in extending his model of 
album sales to incorporate another variable. We know already that advertising accounts for 
33% of variation in album sales, but a much larger 67% remains unexplained. The record 
executive could measure a new predictor in an attempt to explain some of the unexplained 
variation in album sales. He decides to measure the number of times the album is played 
on Radio 1 (the UK’s biggest national radio station) during the week prior to release. The 
existing model that we derived using R (see equation (7.7)) can now be extended to include 
this new variable (airplay): 

$$\text{album sales}_i = (b_0 + b_1\,\text{advertising budget}_i + b_2\,\text{airplay}_i) + \varepsilon_i \qquad (7.10)$$

The new model is based on equation (7.9) and includes a b-value for both predictors (and,
of course, the constant). If we calculate the b-values, we could make predictions about
album sales based not only on the amount spent on advertising but also in terms of radio
play. There are only two predictors in this model and so we could display this model
graphically in three dimensions (Figure 7.8).

FIGURE 7.8
Scatterplot of the relationship between album sales, advertising budget and radio play



Equation (7.9) describes the tinted trapezium in the diagram (this is known as the regression
plane) and the dots represent the observed data points. Like simple regression, the
plane fitted to the data aims to best predict the observed data. However, there are invariably
some differences between the model and the real-life data (this fact is evident because
some of the dots do not lie exactly on the tinted area of the graph). The b-value for advertising
describes the slope of the left and right sides of the regression plane, whereas the
b-value for airplay describes the slope of the top and bottom of the regression plane. Just
like simple regression, knowledge of these two slopes tells us about the shape of the model
(what it looks like) and the intercept locates the regression plane in space.

It is fairly easy to visualize a regression model with two predictors, because it is possible
to plot the regression plane using a 3-D scatterplot. However, multiple regression can be
used with three, four or even ten or more predictors. Although you can’t immediately
visualize what such complex models look like, or visualize what the b-values represent, you
should be able to apply the principles of these basic models to more complex scenarios.


7.6.2. Sums of squares, R and R²


When we have several predictors, the partitioning of sums of squares is the same as in the
single variable case except that the model we refer to takes the form of equation (7.9) rather
than simply being a 2-D straight line. Therefore, SS_T can be calculated that represents the
difference between the observed values and the mean value of the outcome variable. SS_R still
represents the difference between the values of Y predicted by the model and the observed
values. Finally, SS_M can still be calculated and represents the difference between the values
of Y predicted by the model and the mean value. Although the computation of these values
is much more complex than in simple regression, conceptually these values are the same.

When there are several predictors we can’t look at the simple R², and instead R produces
a multiple R². Multiple R² is the square of the correlation between the observed values of
Y and the values of Y predicted by the multiple regression model. Therefore, large values
of multiple R² represent a large correlation between the predicted and observed values
of the outcome. A multiple R² of 1 represents a situation in which the model perfectly
predicts the observed data. As such, multiple R² is a gauge of how well the model predicts
the observed data. It follows that the resulting R² can be interpreted in the same way as in
simple regression: it is the amount of variation in the outcome variable that is accounted
for by the model.
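
The two descriptions of multiple R² (a squared correlation between observed and predicted values, and a proportion of variation accounted for) can be checked against each other. The sketch below does this with simulated data; the data and all object names are made up for the example.

```r
# Made-up data with two predictors
set.seed(2)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.8 * x2 + rnorm(n)
demoModel <- lm(y ~ x1 + x2)

# Multiple R-squared as reported by summary()
r2Reported <- summary(demoModel)$r.squared

# Squared correlation between observed and predicted values
r2FromCor <- cor(y, fitted(demoModel))^2

# The same quantity from the sums of squares: SS_M / SS_T
ssT <- sum((y - mean(y))^2)        # total sum of squares
ssR <- sum(resid(demoModel)^2)     # residual sum of squares
ssM <- ssT - ssR                   # model sum of squares
r2FromSS <- ssM / ssT

c(r2Reported, r2FromCor, r2FromSS)   # three matching values
```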


7.6.3. Parsimony-adjusted measures of fit


The big problem with R² is that when you add more variables to the model, it will always
go up. If you are deciding which of two models fits the data better, the model with more
predictor variables in will always fit better. The Akaike information criterion (AIC)⁵ is a measure
of fit which penalizes the model for having more variables - a little like adjusted R².
The AIC is defined as:

$$\text{AIC} = n \ln\left(\frac{\text{SSE}}{n}\right) + 2k$$

in which n is the number of cases in the model, ln is the natural log, SSE is the sum of
square errors for the model, and k is the number of predictor variables. We are not going
to worry too much about this equation, other than to notice that the final part - the 2k - is
the part that does all the work.

Imagine we add a variable to the model; usually this would increase R², and hence SSE
would be reduced. But imagine that this variable does not change the fit of the model
at all. What will happen to the AIC? Well, the first part will be the same: n and SSE are
unchanged. What will change is k: it will be higher, by one (because we have added a variable).
Hence, when we add this variable to the model, the AIC will be higher by 2. A larger
value of the AIC indicates worse fit, corrected for the number of variables.
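
The thought experiment above can be run as a few lines of R. The function aicFromSSE is a hypothetical helper that simply implements the AIC formula as defined in this section, and the values of n and SSE are made up. R’s own AIC() function is based on the log-likelihood and will generally give different numbers from this formula, so only compare values that were produced in the same way.

```r
# Hypothetical helper implementing the AIC formula as defined in this section
aicFromSSE <- function(n, SSE, k) {
  n * log(SSE / n) + 2 * k    # log() is the natural log in R
}

# Made-up values: 200 cases, SSE of 860000, one predictor
aicOne <- aicFromSSE(n = 200, SSE = 860000, k = 1)

# Add a predictor that leaves SSE unchanged: only k changes
aicTwo <- aicFromSSE(n = 200, SSE = 860000, k = 2)

aicTwo - aicOne   # 2: the penalty for one extra variable
```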

There are a couple of strange things about the AIC. One of them is there are no guidelines
for how much larger is ‘a lot’ and how much larger is ‘not very much’: If the AIC is
bigger, the fit is worse; if the AIC is smaller, fit is better.

The second thing about the AIC is that it makes sense to compare the AIC only between 
models of the same data. The AIC doesn’t mean anything on its own: you cannot say that 
a value of the AIC of 10 is small, or that a value for the AIC of 1000 is large. The only 
thing you do with the AIC is compare it to other models with the same outcome variable. 

R also provides the option of a second measure of parsimony adjusted model fit, called 
the Bayesian information criterion (BIC), but that is rather beyond the level of this book. 


7.6.4. Methods of regression


If we are interested in constructing a complex model with several predictors, how do we
decide which predictors to use? A great deal of care should be taken in selecting predictors
for a model because the values of the regression coefficients depend upon the variables in
the model. Therefore, the predictors included and the way in which they are entered into
the model can have a great impact. In an ideal world, predictors should be selected based
on past research.⁶ If new predictors are being added to existing models then select these
new variables based on the substantive theoretical importance of these variables. One thing
not to do is select hundreds of random predictors, bung them all into a regression analysis
and hope for the best. In addition to the problem of selecting predictors, there are several
ways in which variables can be entered into a model. When predictors are all completely
uncorrelated, the order of variable entry has very little effect on the parameters calculated;
however, we rarely have uncorrelated predictors and so the method of predictor selection
is crucial.

⁵ Hirotsugu Akaike (pronounced A-Ka-Ee-Kay) was a Japanese statistician who gave his name to the AIC, which is
used in a huge range of different places. You get some idea of this range when you find out that the paper in which
the AIC was proposed was published in a journal called IEEE Transactions on Automatic Control.

7.6.4.1. Hierarchical


In hierarchical regression predictors are selected based on past work and the experimenter 
decides in which order to enter the predictors into the model. As a general rule, known 
predictors (from other research) should be entered into the model first in order of their 
importance in predicting the outcome. After known predictors have been entered, the 
experimenter can add any new predictors into the model. New predictors can be entered 
either all in one go, in a stepwise manner, or hierarchically (such that the new predictor 
suspected to be the most important is entered first). 

7.6.4.2. Forced entry

Forced entry is a method in which all predictors are forced into the model simultaneously.
Like hierarchical, this method relies on good theoretical reasons for including the
chosen predictors, but unlike hierarchical the experimenter makes no decision about the
order in which variables are entered. Some researchers believe that this method is the only
appropriate method for theory testing (Studenmund & Cassidy, 1987) because stepwise
techniques are influenced by random variation in the data and so seldom give replicable
results if the model is retested.

7.6.4.3. Stepwise methods

Stepwise regressions are generally frowned upon by statisticians, and R is not as good at
running automated stepwise regressions as some other statistics programs we could mention.
However, I’m still going to tell you how to do them, but be aware that if you can’t do
a stepwise regression in the same way in R that you can in another program, that’s because
the other program was written 40 years ago when people didn’t know better. In stepwise
regression decisions about the order in which predictors are entered into the model are
based on a purely mathematical criterion.

When you carry out a stepwise regression in R, you need to specify a direction. In
the forward direction, an initial model is defined that contains only the constant (b₀). The
computer then searches for the predictor (out of the ones available) that best predicts the
outcome variable - it does this by selecting the predictor that has the highest simple correlation
with the outcome. If this predictor improves the ability of the model to predict
the outcome, then this predictor is retained in the model and the computer searches for
a second predictor. The criterion used for selecting this second predictor is that it is the
variable that has the largest semi-partial correlation with the outcome. Let me explain this
in plain English. Imagine that the first predictor can explain 40% of the variation in the
outcome variable; then there is still 60% left unexplained. The computer searches for the
predictor that can explain the biggest part of the remaining 60% (so it is not interested in
the 40% that is already explained). As such, this semi-partial correlation gives a measure
of how much ‘new variance’ in the outcome can be explained by each remaining predictor
(see section 6.6). The predictor that accounts for the most new variance is added to the
model and, if it makes a contribution to the predictive power of the model, it is retained
and another predictor is considered.

⁶ I might cynically qualify this suggestion by proposing that predictors be chosen based on past research that has
utilized good methodology. If basing such decisions on regression analyses, select predictors based only on past
research that has used regression appropriately and yielded reliable, generalizable models.

R has to decide when to stop adding predictors to the model, and it does this based on 
the Akaike information criterion which was described above: a lower AIC indicates a better 
model. A variable is kept in the model only if it improves (i.e., lowers) the AIC, and if no
variable can lower the AIC further, the search stops.

The backward method is the opposite of the forward method in that the computer begins 
by placing all predictors in the model and then looks to see whether the AIC goes down
when each variable is removed. If a variable is removed, the contribution of the remaining 
predictors is then reassessed and the process continues until removing any variable causes 
AIC to increase. 

The final direction is called ‘both’ by R (and stepwise by some other programs). This 
method, as the name implies, goes in both directions. It starts in the same way as the forward
method, except that each time a predictor is added to the equation, a removal test
is made of the least useful predictor. As such the regression equation is constantly being 
reassessed to see whether any redundant predictors can be removed. 

If you do decide to use a stepwise method then the backward direction is preferable to 
the forward method. This is because of suppressor effects, which occur when a predictor 
has an effect but only when another variable is held constant. Forward selection is more 
likely than backward elimination to exclude predictors involved in suppressor effects. As 
such, the forward method runs a higher risk of making a Type II error (i.e., missing a predictor
that does in fact predict the outcome).
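
The three directions can be tried with the step() function from base R, which adds or drops terms according to AIC. What follows is only a sketch: the mtcars data and the particular set of candidate predictors are assumptions chosen for illustration, and the scope list simply tells step() the smallest and largest models it may consider.

```r
# Illustration only: mtcars and this candidate set of predictors are assumptions.
null_model <- lm(mpg ~ 1, data = mtcars)                      # constant only
full_model <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)  # every candidate
candidates <- list(lower = ~ 1, upper = ~ wt + hp + qsec + drat)

# Forward: start from the constant and add predictors while AIC falls.
forward <- step(null_model, scope = candidates, direction = "forward", trace = 0)

# Backward: start from the full model and remove predictors while AIC falls.
backward <- step(full_model, direction = "backward", trace = 0)

# Both: start from the constant, reconsidering removals after each addition.
both <- step(null_model, scope = candidates, direction = "both", trace = 0)

formula(forward)
formula(backward)
formula(both)
```

Setting trace = 0 suppresses the step-by-step log; leaving it at its default prints the AIC of every model considered, which is a useful way to follow the decisions described above.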


7.6.4.4. All-subsets methods


The problem with stepwise methods is that they assess the fit of a variable based on the 
other variables that were in the model. Some people use the analogy of getting dressed to 
describe this problem. If a stepwise regression method was selecting your clothes, it would 
decide what clothes you should wear, based on the clothes it has already selected. If, for 
example, it is a cold day, a stepwise selection method might choose a pair of trousers to put 
on first. But if you are wearing trousers already, it is difficult to get your underwear on: 
stepwise methods will decide that underwear does not fit, and you will therefore go without.
A better method is all-subsets regression. As the name implies, all-subsets regression
tries every combination of variables, to see which one gives the best fit (fit is determined by
a statistic called Mallows’ Cp, which we are not going to worry about). The problem with
all-subsets regression is that as the number of predictor variables increases, the number of 
possible subsets increases exponentially. If you have two predictor variables, A and B, then 
you have 4 possible subsets: none of them, A alone, B alone, or A and B. If you have three 
variables (A, B, C), the possible subsets are none, A, B, C, AB, AC, BC, ABC, making 8 subsets.
If you have 10 variables, there are 1024 possible subsets. In the days when computers
were slower and running a regression analysis might take a couple of minutes, running 
1024 regressions might take a day or so. Thankfully, computers aren’t slow any more, and 
so this method is feasible - it’s just that other programs have not yet caught up with R, so 
you tend to come across this method less. 
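
To see where the count of subsets comes from, the sketch below enumerates every subset of three candidate predictors in base R and computes Mallows’ Cp for each. The data (mtcars), the candidate predictors and the object names are all assumptions for illustration, and Cp is computed with its usual textbook definition (residual sum of squares divided by the full model's error variance, minus n, plus twice the number of parameters) rather than with any dedicated all-subsets routine.

```r
# Illustration only: mtcars and the candidate predictors are assumptions.
predictors <- c("wt", "hp", "qsec")
n <- nrow(mtcars)

# Error variance from the model containing every candidate predictor.
full <- lm(reformulate(predictors, response = "mpg"), data = mtcars)
s2 <- sum(resid(full)^2) / df.residual(full)

# Every subset: the empty set plus all combinations of size 1 to k.
subsets <- list(character(0))
for (m in seq_along(predictors)) {
  subsets <- c(subsets, combn(predictors, m, simplify = FALSE))
}
length(subsets)  # 2^3 subsets for three predictors

# Mallows' Cp for each subset (p counts the constant as a parameter).
cp <- sapply(subsets, function(vars) {
  rhs <- if (length(vars) == 0) "1" else vars
  fit <- lm(reformulate(rhs, response = "mpg"), data = mtcars)
  p <- length(coef(fit))
  sum(resid(fit)^2) / s2 - n + 2 * p
})

labels <- sapply(subsets, function(v) {
  if (length(v) == 0) "(constant only)" else paste(v, collapse = " + ")
})
results <- data.frame(model = labels, Cp = cp)
results[order(results$Cp), ]
```

With 10 candidates the same loop would fit 1024 models, which is why the approach only became practical once computers were fast.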

7.6.4.5. Choosing a method

R allows you to opt for any one of these methods and it is important to select 
an appropriate one. The three directions of stepwise selection (forward, backward
and both) and all-subsets regression all come under the general heading of
stepwise methods because they all rely on the computer selecting variables based
upon mathematical criteria. Many writers argue that this takes many important
methodological decisions out of the hands of the researcher. What’s more, the
models derived by computer often take advantage of random sampling variation
and so decisions about which variables should be included will be based
upon slight differences in their semi-partial correlation. However, these slight
statistical differences may contrast dramatically with the theoretical importance
of a predictor to the model. There is also the danger of over-fitting (having too many variables
in the model that essentially make little contribution to predicting the outcome) and under-fitting
(leaving out important predictors) the model. For this reason stepwise methods are best
avoided except for exploratory model building. If you must do a stepwise regression then it is 
advisable to cross-validate your model by splitting the data (see section 7.7.2.2). 

When there is a sound theoretical literature available, then base your model upon what 
past research tells you. Include any meaningful variables in the model in their order of 
importance. After this initial analysis, repeat the regression but exclude any variables that 
were statistically redundant the first time around. There are important considerations in 
deciding which predictors should be included. First, it is important not to include too many 
predictors. As a general rule, the fewer predictors the better, and certainly include only 
predictors for which you have a good theoretical grounding (it is meaningless to measure 
hundreds of variables and then put them all into a regression model). So, be selective and 
remember you should have a decent sample size - see section 7.7.2.3. 



7.7. How accurate is my regression model?



When we have produced a model based on a sample of data there are two 
important questions to ask. First, does the model fit the observed data well, 
or is it influenced by a small number of cases? Second, can my model generalize
to other samples? These questions are vital to ask because they affect
how we use the model that has been constructed. These questions are also, in 
some sense, hierarchical because we wouldn’t want to generalize a bad model. 
However, it is a mistake to think that because a model fits the observed data 
well we can draw conclusions beyond our sample. Generalization is a critical 
additional step, and if we find that our model is not generalizable, then we 
must restrict any conclusions based on the model to the sample used. First, 
we will look at how we establish whether a model is an accurate representation
of the actual data, and in section 7.7.2 we move on to look at how we assess whether
a model can be used to make inferences beyond the sample of data that has been collected. 


7.7.1. Assessing the regression model I: diagnostics


To answer the question of whether the model fits the observed data well, or if it is influenced
by a small number of cases, we can look for outliers and influential cases (the difference
is explained in Jane Superbrain Box 7.1). We will look at these in turn.

JANE SUPERBRAIN 7.1

The difference between residuals and influence statistics

In this section I describe two ways to look for cases that 
might bias the model: residual and influence statistics. 
To illustrate how these measures differ, imagine that 
the Mayor of London at the turn of the last century was 
interested in how drinking affected mortality. London is 
divided up into different regions called boroughs, and so 
he might measure the number of pubs and the number 
of deaths over a period of time in eight of his boroughs. 
The data are in a file called pubs.dat. 

The scatterplot of these data reveals that without the
last case there is a perfect linear relationship (the dashed
straight line). However, the presence of the last case
(case 8) changes the line of best fit dramatically (although
this line is still a significant fit to the data - do
the regression analysis and see for yourself).

What’s interesting about these data emerges
when we look at the residuals and influence
statistics. The residual for case 8 is the second smallest:
this outlier produces a very small residual (most of the
non-outliers have larger residuals) because it sits very
close to the line that has been fitted to the data. How can
this be? Look at the influence statistics below and you’ll
see that they’re massive for case 8: it exerts a huge influence
over the model.




Case   Residual   Cook's Distance   Leverage (Hat Value)   DFBeta (Intercept)   DFBeta (Pubs)
1      -2495.34          0.21                0.17                -509.62              1.39
2      -1638.73          0.09                0.16                -321.10              0.80
3       -782.12          0.02                0.15                -147.08              0.33
4         74.49          0.00                0.14                  13.47             -0.03
5        931.10          0.02                0.14                 161.47             -0.27
6       1787.71          0.08                0.13                 297.70             -0.41
7       2644.32          0.17                0.13                 422.68             -0.44
8       -521.42        227.14                0.99                3351.53            -85.65


As always when you see a statistical oddity, you
should ask what was happening in the real world. The
last data point represents the City of London, a tiny
area of only 1 square mile in the centre of London
where very few people lived but where thousands of
commuters (even then) came to work and had lunch
in the pubs. Hence the pubs didn’t rely on the resident
population for their business and the residents didn’t
consume all of their beer! Therefore, there was a massive
number of pubs.

This illustrates that a case exerting a massive influence
can produce a small residual - so look at both. (I’m
very grateful to David Hitchin for this example, and he in
turn got it from Dr Richard Roberts.)


7.7.1.1. Outliers and residuals


An outlier is a case that differs substantially from the main trend of the data (see Jane 
Superbrain Box 4.1). Figure 7.9 shows an example of such a case in regression. Outliers
can cause your model to be biased because they affect the values of the estimated regression
coefficients. For example, Figure 7.9 uses the same data as Figure 7.3 except that the 
score of one participant has been changed to be an outlier (in this case a person who was 
very calm in the presence of a very big spider). The change in this one point has had a 
dramatic effect on the regression model chosen to fit the data. With the outlier present, 
the regression model changes: its gradient is reduced (the line becomes flatter) and the 
intercept increases (the new line will cross the Y-axis at a higher point). It should be clear 
from this diagram that it is important to try to detect outliers to see whether the model is 
biased in this way. 


FIGURE 7.9 Graph demonstrating the effect of an outlier. The dashed line represents the
original regression line for these data (see Figure 7.3), whereas the solid line represents
the regression line when an outlier is present (x-axis: Size of Spider)


How do you think that you might detect an outlier? Well, we know that an outlier, by its 
nature, is very different from all of the other scores. This being true, do you think that the 
model will predict that person’s score very accurately? The answer is no: looking at Figure
7.9, it is evident that even though the outlier has biased the model, the model still predicts
that one value very badly (the regression line is a long way from the outlier). Therefore, if
we were to work out the differences between the data values that were collected, and the
values predicted by the model, we could detect an outlier by looking for large differences.
This process is the same as looking for cases that the model predicts inaccurately. The differences
between the values of the outcome predicted by the model and the values of the
outcome observed in the sample are known as residuals. These residuals represent the error 
present in the model. If a model fits the sample data well then all residuals will be small 
(if the model was a perfect fit to the sample data - all data points fall on the regression 
line - then all residuals would be zero). If a model is a poor fit to the sample data then 
the residuals will be large. Also, if any cases stand out as having a large residual, then they 
could be outliers. 
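
To make the definition of a residual concrete, here is a minimal sketch in R. It assumes the built-in mtcars data and an arbitrary model (mpg predicted from wt), neither of which comes from this chapter; resid() and fitted() are base R functions.

```r
# Illustration only: mtcars and the choice of variables are assumptions.
fit <- lm(mpg ~ wt, data = mtcars)

# A residual is the observed outcome minus the value predicted by the model.
by_hand <- mtcars$mpg - fitted(fit)

# resid() returns the same unstandardized residuals.
all.equal(unname(by_hand), unname(resid(fit)))

# Cases with the largest absolute residuals are candidate outliers.
head(sort(abs(resid(fit)), decreasing = TRUE), 3)
```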

The normal or unstandardized residuals described above are measured in the same units as 
the outcome variable and so are difficult to interpret across different models. What we can do 
is to look for residuals that stand out as being particularly large. However, we cannot define 
a universal cut-off point for what constitutes a large residual. To overcome this problem, we 
use standardized residuals, which are the residuals divided by an estimate of their standard 
deviation. We came across standardization in section 6.3.2 as a means of converting variables 
into a standard unit of measurement (the standard deviation); we also came across z-scores 
(see section 1.7.4) in which variables are converted into standard deviation units (i.e., they’re 
converted into scores that are distributed around a mean of 0 with a standard deviation of 1). 
By converting residuals into z-scores (standardized residuals) we can compare residuals from
different models and use what we know about the properties of z-scores to devise universal
guidelines for what constitutes an acceptable (or unacceptable) value. For example, we know
from Chapter 1 that in a normally distributed sample, 95% of z-scores should lie between
-1.96 and +1.96, 99% should lie between -2.58 and +2.58, and 99.9% (i.e., nearly all of
them) should lie between -3.29 and +3.29. Some general rules for standardized residuals are
derived from these facts: (1) standardized residuals with an absolute value greater than 3.29 
(we can use 3 as an approximation) are cause for concern because in an average sample a 
value this high is unlikely to happen by chance; (2) if more than 1% of our sample cases have 
standardized residuals with an absolute value greater than 2.58 (we usually just say 2.5) there 
is evidence that the level of error within our model is unacceptable (the model is a fairly poor 
fit of the sample data); and (3) if more than 5% of cases have standardized residuals with an 
absolute value greater than 1.96 (we can use 2 for convenience) then there is also evidence 
that the model is a poor representation of the actual data. 
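
The three rules above can be checked in a few lines. This is a sketch under assumptions: the model and the mtcars data are illustrative only, and rstandard() is the base R function that divides each residual by an estimate of its standard deviation.

```r
# Illustration only: mtcars and the model are assumptions.
fit <- lm(mpg ~ wt + hp, data = mtcars)
z <- rstandard(fit)   # standardized residuals

# The same values by hand: residual / (residual SE * sqrt(1 - leverage)).
by_hand <- resid(fit) / (summary(fit)$sigma * sqrt(1 - hatvalues(fit)))
all.equal(z, by_hand)

# Rule 1: any case with an absolute value above 3.29 is cause for concern.
names(z)[abs(z) > 3.29]

# Rule 2: percentage of cases beyond 2.58 (more than 1% suggests a poor fit).
mean(abs(z) > 2.58) * 100

# Rule 3: percentage of cases beyond 1.96 (more than 5% suggests a poor fit).
mean(abs(z) > 1.96) * 100
```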


7.7.1.2. Influential cases


As well as testing for outliers by looking at the error in the model, it is also possible to look 
at whether certain cases exert undue influence over the parameters of the model. So, if we 
were to delete a certain case, would we obtain different regression coefficients? This type 
of analysis can help to determine whether the regression model is stable across the sample, 
or whether it is biased by a few influential cases. Again, this process will unveil outliers. 

There are several residual statistics that can be used to assess the influence of a particular 
case. One statistic is the adjusted predicted value for a case when that case is excluded from the 
analysis. In effect, the computer calculates a new model without a particular case and then uses 
this new model to predict the value of the outcome variable for the case that was excluded. 
If a case does not exert a large influence over the model then we would expect the adjusted 
predicted value to be very similar to the predicted value when the case is included. Put simply,
if the model is stable then the predicted value of a case should be the same regardless of
whether or not that case was used to calculate the model. The difference between the adjusted 
predicted value and the original predicted value is known as DFFit (see below). We can also 
look at the residual based on the adjusted predicted value: that is, the difference between the 
adjusted predicted value and the original observed value. When this residual is divided by the 
standard error it gives a standardized value known as the studentized residual. This residual 
can be compared across different regression analyses because it is measured in standard units, 
and is called a studentized residual because it follows a Student’s t-distribution.

The studentized residuals are very useful to assess the influence of a case on the ability 
of the model to predict that case. However, they do not provide any information about 
how a case influences the model as a whole (i.e., the impact that a case has on the model’s 
ability to predict all cases). One statistic that does consider the effect of a single case on the 
model as a whole is Cook’s distance. Cook’s distance is a measure of the overall influence 
of a case on the model, and Cook and Weisberg (1982) have suggested that values greater 
than 1 may be cause for concern. 
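
The following sketch walks through these quantities for a single case and then for the whole sample. The mtcars data, the model and the choice of the first case are assumptions for illustration; predict(), rstudent() and cooks.distance() are base R functions.

```r
# Illustration only: mtcars, the model and the chosen case are assumptions.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Adjusted predicted value for case 1: refit without it, then predict it.
without_1 <- lm(mpg ~ wt + hp, data = mtcars[-1, ])
p <- predict(without_1, newdata = mtcars[1, ], se.fit = TRUE)

# Residual based on the adjusted predicted value, and its standard error.
deleted_resid <- mtcars$mpg[1] - p$fit
se <- sqrt(p$se.fit^2 + p$residual.scale^2)

# Their ratio is the studentized residual, which rstudent() also returns.
deleted_resid / se
rstudent(fit)[1]

# Cook's distance for every case; values above 1 may be cause for concern.
cook <- cooks.distance(fit)
names(cook)[cook > 1]
head(sort(cook, decreasing = TRUE), 3)
```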

A second measure of influence is hat values (sometimes called leverage), which gauge
the influence of the observed value of the outcome variable over the predicted values. The
average leverage value is defined as (k + 1)/n, in which k is the number of predictors in the
model and n is the number of participants.⁷ Leverage values can lie between 0 (indicating
that the case has no influence whatsoever) and 1 (indicating that the case has complete
influence over prediction). If no cases exert undue influence over the model then we would
expect all of the leverage values to be close to the average value ((k + 1)/n). Hoaglin and
Welsch (1978) recommend investigating cases with values greater than twice the average
(2(k + 1)/n) and Stevens (2002) recommends using three times the average (3(k + 1)/n) as
a cut-off point for identifying cases having undue influence. We will see how to use these
cut-off points later. However, cases with large leverage values will not necessarily have a
large influence on the regression coefficients because leverage is determined by a case’s
values on the predictors rather than on the outcome: a case with unusual predictor values
can still fall in line with the trend of the other cases.

⁷ You may come across the average leverage denoted as p/n, in which p is the number of parameters being
estimated. In multiple regression, we estimate parameters for each predictor and also for a constant and so p is
equivalent to the number of predictors plus one (k + 1).
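
As a sketch of how the leverage cut-offs could be applied, the code below uses the base R function hatvalues(). The data and model are assumptions for illustration only.

```r
# Illustration only: mtcars and the model are assumptions.
fit <- lm(mpg ~ wt + hp, data = mtcars)
h <- hatvalues(fit)

k <- length(coef(fit)) - 1   # number of predictors (excluding the constant)
n <- nrow(mtcars)            # number of cases
avg_leverage <- (k + 1) / n

# The mean of the hat values equals the average leverage (k + 1)/n.
all.equal(mean(h), avg_leverage)

# Cases exceeding twice and three times the average leverage.
names(h)[h > 2 * avg_leverage]
names(h)[h > 3 * avg_leverage]
```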

It is possible to run the regression analysis with a case included and then rerun the analysis
with that same case excluded. If we did this, undoubtedly there would be some difference
between the b coefficients in the two regression equations. This difference would tell
us how much influence a particular case has on the parameters of the regression model. To
take a hypothetical example, imagine two variables that had a perfect negative relationship
except for a single case (case 30). If a regression analysis was done on the 29 cases that were
perfectly linearly related then we would get a model in which the predictor variable X perfectly
predicts the outcome variable Y, and there are no errors. If we then ran the analysis
but this time included the case that didn’t conform (case 30), then the resulting model would
have different parameters. Some data are stored in the file dfbeta.dat that illustrate such 
a situation. Try running a simple regression first with all the cases included and then with 
case 30 deleted. The results are summarized in Table 7.1, which shows: (1) the parameters 
for the regression model when the extreme case is included or excluded; (2) the resulting 
regression equations; and (3) the value of Y predicted from participant 30’s score on the X 
variable (which is obtained by replacing the X in the regression equation with participant 
30’s score for X, which was 1). 


Table 7.1 The difference in the parameters of the regression model when one case is excluded

Parameter (b)              Case 30 Included     Case 30 Excluded     Difference
Constant (intercept)       29.00                31.00                -2.00
Predictor (gradient)       -0.90                -1.00                 0.10
Model (regression line)    Y = (-0.9)X + 29     Y = (-1)X + 31
Predicted Y                28.10                30.00                -1.90


When case 30 is excluded, these data have a perfect negative relationship; hence the
coefficient for the predictor (b₁) is -1 (remember that in simple regression this term is the
same as Pearson’s correlation coefficient), and the coefficient for the constant (the intercept,
b₀) is 31. However, when case 30 is included, both parameters are reduced⁸ and the
difference between the parameters is also displayed. The difference between a parameter
estimated using all cases and estimated when one case is excluded is known as the DFBeta
in R. DFBeta is calculated for every case and for each of the parameters in the model. So,
in our hypothetical example, the DFBeta for the constant is -2, and the DFBeta for the
predictor variable is 0.1. By looking at the values of DFBeta, it is possible to identify cases
that have a large influence on the parameters of the regression model.

A related statistic is the DFFit, which is the difference between the predicted value for
a case when the model is calculated including that case and when the model is calculated
excluding that case: in this example the value is -1.90 (see Table 7.1). If a case is not
influential then its DFFit should be zero - hence, we expect non-influential cases to have
small DFFit values.

⁸ The value of b₁ is reduced in absolute size because the data no longer have a perfect linear relationship and so
there is now variance that the model cannot explain.
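
The logic of DFBeta and DFFit can be sketched with made-up data. The values below are hypothetical (29 cases lying exactly on Y = 31 - X plus one non-conforming case with X = 1); they are not the contents of dfbeta.dat, so the numbers will not match Table 7.1. The function dfbeta() is part of base R.

```r
# Hypothetical data: 29 cases on a perfect line, plus case 30 that breaks it.
d <- data.frame(x = c(2:30, 1),
                y = c(31 - 2:30, 10))

with_30    <- lm(y ~ x, data = d)          # all cases included
without_30 <- lm(y ~ x, data = d[-30, ])   # case 30 excluded

# DFBeta by hand: each parameter with all cases minus without case 30.
coef(with_30) - coef(without_30)

# dfbeta() reports this change for every case; row 30 is the one of interest.
dfbeta(with_30)[30, ]

# DFFit by hand: predicted value for case 30 under each model, then the difference.
predict(with_30, newdata = d[30, ]) - predict(without_30, newdata = d[30, ])
```

Note that the related base function dffits() returns a standardized version of DFFit, so its values are not on the same scale as the difference computed by hand here.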


7.7.1.3. A final comment on diagnostic statistics

There are a lot of diagnostic statistics that should be examined after a regression analysis, 
and it is difficult to summarize this wealth of material into a concise conclusion. However,
one thing I would like to stress is a point made by Belsley, Kuh, and Welsch (1980) who
noted the dangers inherent in these procedures. The point is that diagnostics are tools that 
enable you to see how good or bad your model is in terms of fitting the sampled data. They 
are a way of assessing your model. They are not, however, a way of justifying the removal 
of data points to effect some desirable change in the regression parameters (e.g., deleting a 
case that changes a non-significant b-value into a significant one). Stevens (2002, p. 135), 
as ever, offers excellent advice: 

If a point is a significant outlier on Y, but its Cook’s distance is < 1, there is no real 
need to delete that point since it does not have a large effect on the regression analysis. 
However, one should still be interested in studying such points further to understand 
why they did not fit the model. 



7.7.2. Assessing the regression model II: generalization


When a regression analysis is done, an equation can be produced that is correct for the 
sample of observed values. However, in the social sciences we are usually interested in generalizing
our findings outside the sample. So, although it can be useful to draw conclusions
about a particular sample of people, it is usually more interesting if we can then assume 
that our conclusions are true for a wider population. For a regression model to generalize 
we must be sure that underlying assumptions have been met, and to test whether the model 
does generalize we can look at cross-validating it. 

7.7.2.1. Checking assumptions

To draw conclusions about a population based on a regression analysis done on a sample, 
several assumptions must be true (see Berry, 1993): 

• Variable types: All predictor variables must be quantitative or categorical (with 
two categories), and the outcome variable must be quantitative, continuous and 
unbounded. By ‘quantitative’ I mean that they should be measured at the interval 
level and by ‘unbounded’ I mean that there should be no constraints on the variability 
of the outcome. If the outcome is a measure ranging from 1 to 10 yet the data collected
vary between 3 and 7, then these data are constrained.

• Non-zero variance: The predictors should have some variation in value (i.e., they do 
not have variances of 0). 

• No perfect multicollinearity: There should be no perfect linear relationship between 
two or more of the predictors. So, the predictor variables should not correlate too 
highly (see section 7.7.2.4). 

• Predictors are uncorrelated with ‘external variables’: External variables are variables 
that haven’t been included in the regression model which influence the outcome
variable.⁹ These variables can be thought of as similar to the ‘third variable’ that was
discussed with reference to correlation. This assumption means that there should be
no external variables that correlate with any of the variables included in the regression
model. Obviously, if external variables do correlate with the predictors, then the
conclusions we draw from the model become unreliable (because other variables exist 
that can predict the outcome just as well). 

• Homoscedasticity: At each level of the predictor variable(s), the variance of the residual
terms should be constant. This just means that the residuals at each level of the
predictor(s) should have the same variance (homoscedasticity); when the variances 
are very unequal there is said to be heteroscedasticity (see section 5.7 as well). 

• Independent errors: For any two observations the residual terms should be uncorrelated
(or independent). This eventuality is sometimes described as a lack of autocorrelation.
This assumption can be tested with the Durbin-Watson test, which tests for
serial correlations between errors. Specifically, it tests whether adjacent residuals are
correlated. The test statistic can vary between 0 and 4, with a value of 2 meaning that
the residuals are uncorrelated. A value greater than 2 indicates a negative correlation
between adjacent residuals, whereas a value less than 2 indicates a positive correlation.
The size of the Durbin-Watson statistic depends upon the number of predictors
in the model and the number of observations. As a very conservative rule of thumb, 
values less than 1 or greater than 3 are definitely cause for concern; however, values 
closer to 2 may still be problematic depending on your sample and model. R also 
provides a p-value of the autocorrelation. Be very careful with the Durbin-Watson 
test, though, as it depends on the order of the data: if you reorder your data, you’ll 
get a different value. 

• Normally distributed errors: It is assumed that the residuals in the model are random, 
normally distributed variables with a mean of 0. This assumption simply means that 
the differences between the model and the observed data are most frequently zero or 
very close to zero, and that differences much greater than zero happen only occasionally.
Some people confuse this assumption with the idea that predictors have to be
normally distributed. Predictors do not need to be normally distributed (see section 
7.12). 

• Independence: It is assumed that all of the values of the outcome variable are independent
(in other words, each value of the outcome variable comes from a separate
entity). 

• Linearity: The mean values of the outcome variable for each increment of the 
predictor(s) lie along a straight line. In plain English this means that it is assumed that
the relationship we are modelling is a linear one. If we model a non-linear relationship
using a linear model then this obviously limits the generalizability of the findings.

This list of assumptions probably seems pretty daunting but, as we saw in Chapter 5,
assumptions are important. When the assumptions of regression are met, the model that
we get for a sample can be accurately applied to the population of interest (the coefficients
and parameters of the regression equation are said to be unbiased). Some people assume
that this means that when the assumptions are met the regression model from a sample is
always identical to the model that would have been obtained had we been able to test the
entire population. Unfortunately, this belief isn’t true. What an unbiased model does tell
us is that on average the regression model from the sample is the same as the population
model. However, you should be clear that even when the assumptions are met, it is possible
that a model obtained from a sample may not be the same as the population model - but
the likelihood of them being the same is increased.

⁹ Some authors choose to refer to these external variables as part of an error term that includes any random factor
in the way in which the outcome varies. However, to avoid confusion with the residual terms in the regression
equations I have chosen the label ‘external variables’. Although this term implicitly washes over any random
factors, I acknowledge their presence here.
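
The warning about the Durbin-Watson test depending on the order of the data can be illustrated by computing the statistic by hand. The sketch below uses its standard definition (the sum of squared differences between adjacent residuals divided by the sum of squared residuals); the mtcars data and the model are assumptions for illustration, and no p-value is computed here.

```r
# Illustration only: mtcars and the model are assumptions.
fit <- lm(mpg ~ wt + hp, data = mtcars)
e <- resid(fit)

# Durbin-Watson statistic: values near 2 suggest uncorrelated adjacent residuals.
dw <- sum(diff(e)^2) / sum(e^2)
dw

# Refit the same model after shuffling the rows: the coefficients are unchanged,
# but the statistic generally differs because 'adjacent' now means something else.
set.seed(1)
shuffled <- mtcars[sample(nrow(mtcars)), ]
fit_shuffled <- lm(mpg ~ wt + hp, data = shuffled)
e_shuffled <- resid(fit_shuffled)
dw_shuffled <- sum(diff(e_shuffled)^2) / sum(e_shuffled^2)
dw_shuffled
```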


7.7.2.2. Cross-validation of the model


Even if we can’t be confident that the model derived from our sample accurately represents
the entire population, there are ways in which we can assess how well our model can
predict the outcome in a different sample. Assessing the accuracy of a model across different
samples is known as cross-validation. If a model can be generalized, then it must be
capable of accurately predicting the same outcome variable from the same set of predictors 
in a different group of people. If the model is applied to a different sample and there is a 
severe drop in its predictive power, then the model clearly does not generalize. As a first 
rule of thumb, we should aim to collect enough data to obtain a reliable regression model 
(see the next section). Once we have a regression model there are two main methods of 
cross-validation: 

• Adjusted R²: In R, not only is the value of R² calculated, but also an adjusted R². This
adjusted value indicates the loss of predictive power or shrinkage. Whereas R² tells
us how much of the variance in Y is accounted for by the regression model from our
sample, the adjusted value tells us how much variance in Y would be accounted for if
the model had been derived from the population from which the sample was taken.
R derives the adjusted R² using Wherry’s equation. However, this equation has been
criticized because it tells us nothing about how well the regression model would
predict an entirely different set of data (how well can the model predict scores of a
different sample of data from the same population?). One version of R² that does tell
us how well the model cross-validates uses Stein’s formula (see Stevens, 2002):


adjusted R 2 = 1 — 


n — 1 
n — k — 1 


n— 2 
n — k — 2 



(7.11) 


In Stein’s equation, R 2 is the unadjusted value, n is the number of participants and k is 
the number of predictors in the model. For the more mathematically minded of you, 
it is worth using this equation to cross-validate a regression model. 

• Data splitting: This approach involves randomly splitting your data set, computing 
a regression equation on both halves of the data and then comparing the resulting 
models. When using stepwise methods, cross-validation is a good idea; you should 
run the stepwise regression on a random selection of about 80% of your cases. Then 
force this model on the remaining 20% of the data. By comparing values of the R 2 
and ^-values in the two samples you can tell how well the original model generalizes 
(see Tabachnick &C Fidell, 2007, for more detail). 
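
To make the data-splitting idea concrete, here is a minimal sketch in R. It is only an illustration: the data are simulated, the variable names (x1, x2, y) are hypothetical, and a forced-entry model stands in for the stepwise procedure described above. The model is fitted to a random 80% of cases, its predictions are then applied to the remaining 20%, and the R² and b-values from the two samples are placed side by side.

```r
# Simulated data standing in for a real data set (all names are hypothetical)
set.seed(42)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 + 1.5 * dat$x1 + 0.8 * dat$x2 + rnorm(n)

# Randomly assign about 80% of the cases to the training sample
train_rows <- sample(seq_len(n), size = round(0.8 * n))
train <- dat[train_rows, ]
test  <- dat[-train_rows, ]

# Fit the model on the training cases only
fit_train <- lm(y ~ x1 + x2, data = train)
r2_train  <- summary(fit_train)$r.squared

# 'Force' the trained model on the holdout cases: predict their outcomes and
# see how much of the holdout variance those predictions account for
pred_test <- predict(fit_train, newdata = test)
r2_test   <- cor(pred_test, test$y)^2

# Refit the same formula on the holdout cases to compare the b-values
fit_test   <- lm(y ~ x1 + x2, data = test)
comparison <- rbind(train = coef(fit_train), test = coef(fit_test))

round(c(train = r2_train, test = r2_test), 3)
round(comparison, 3)
```

A large drop from the training R² to the holdout R², or b-values that change a lot between the two rows of the comparison, would suggest that the model does not generalize well.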


7.7.2.3. Sample size in regression

In the previous section I said that it’s important to collect enough data to obtain a reliable
regression model. Well, how much is enough? You’ll find a lot of rules of thumb floating
about, the two most common being that you should have 10 cases of data for each
predictor in the model, or 15 cases of data per predictor. So, with five predictors, you’d
need 50 or 75 cases respectively (depending on the rule you use). These rules are very
pervasive (even I used the 15 cases per predictor rule in the first edition of this book) but
they oversimplify the issue considerably. In fact, the sample size required will depend on
the size of effect that we’re trying to detect (i.e., how strong the relationship is that we’re
trying to measure) and how much power we want to detect these effects. The simplest rule
of thumb is that the bigger the sample size, the better! The reason is that the estimate of
R that we get from regression is dependent on the number of predictors, k, and the sample
size, N. In fact the expected R for random data is k/(N - 1), and so with small sample sizes
random data can appear to show a strong effect: for example, with six predictors and 21
cases of data, R = 6/(21 - 1) = .3 (a medium effect size by Cohen’s criteria described in
section 6.3.2). Obviously for random data we’d want the expected R to be 0 (no effect) and
for this to be true we need large samples (to take the previous example, if we had 100
cases, not 21, then the expected R would be a more acceptable .06).
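
If you want to see this for yourself, a small simulation will do it. The sketch below is an assumption-laden illustration rather than anything from the original analysis: it generates pure noise (normally distributed, six predictors, 21 cases), fits a regression many times, and averages the fit. Note that the quantity averaged here is the squared multiple correlation, R² (what summary() stores as r.squared), which is the quantity whose average for normally distributed random data matches k/(N - 1). The number of replications and the seed are arbitrary choices.

```r
# Simulate regressions on pure noise: k predictors, N cases, no true effect
set.seed(123)
k <- 6
N <- 21
n_sims <- 2000   # arbitrary number of replications

r2_values <- replicate(n_sims, {
  noise <- as.data.frame(matrix(rnorm(N * (k + 1)), nrow = N))
  names(noise) <- c("y", paste0("x", 1:k))
  summary(lm(y ~ ., data = noise))$r.squared
})

mean_r2 <- mean(r2_values)   # average fit obtained from random data
rule    <- k / (N - 1)       # value given by the formula in the text

round(c(simulated = mean_r2, formula = rule), 3)
```

Changing N to 100 in the same sketch shows how quickly the spurious fit shrinks as the sample grows.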

It’s all very well knowing that larger is better, but researchers usually need some more
concrete guidelines (much as we’d all love to collect 1000 cases of data, it isn’t always
practical). Green (1991) makes two rules of thumb for the minimum acceptable sample size, the
first based on whether you want to test the overall fit of your regression model (i.e., test
the R²), and the second based on whether you want to test the individual predictors within
the model (i.e., test the b-values of the model). If you want to test the model overall, then
he recommends a minimum sample size of 50 + 8k, where k is the number of predictors.
So, with five predictors, you’d need a sample size of 50 + 40 = 90. If you want to test the
individual predictors then he suggests a minimum sample size of 104 + k, so again taking
the example of five predictors you’d need a sample size of 104 + 5 = 109. Of course, in
most cases we’re interested both in the overall fit and in the contribution of individual
predictors, and in this situation Green recommends you calculate both of the minimum sample
sizes I’ve just described, and use the one that has the largest value (so in the five-predictor
example, we’d use 109 because it is bigger than 90).
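
Green’s two rules are simple enough to wrap in a few lines of R. The function name green_min_n() below is made up for this illustration; the arithmetic is just the two formulas above and the larger of the two results.

```r
# Green's (1991) rules of thumb for the minimum sample size,
# given k predictors (function name is hypothetical)
green_min_n <- function(k) {
  overall    <- 50 + 8 * k   # testing the overall fit (R^2)
  individual <- 104 + k      # testing the individual predictors (b-values)
  c(overall = overall,
    individual = individual,
    recommended = max(overall, individual))
}

green_min_n(5)   # five predictors: 90, 109, and so 109 recommended
```

With eight or more predictors the overall-fit rule becomes the larger of the two (for example, 50 + 64 = 114 against 104 + 8 = 112), which is why it is worth computing both.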

Now, these guidelines are all right as a rough and ready guide, but they still oversimplify 
the problem. As I’ve mentioned, the sample size required actually depends on the size of 
the effect (i.e., how well our predictors predict the outcome) and how much statistical 
power we want to detect these effects. Miles and Shevlin (2001) produce some extremely 
useful graphs that illustrate the sample sizes needed to achieve different levels of power, for 
different effect sizes, as the number of predictors varies. For precise estimates of the sample
size you should be using, I recommend using these graphs. I’ve summarized some of the 
general findings in Figure 7.10. This diagram shows the sample size required to achieve a 
high level of power (I’ve taken Cohen’s, 1988, benchmark of .8) depending on the number 
of predictors and the size of expected effect. To summarize the graph very broadly: (1) if 
you expect to find a large effect then a sample size of 80 will always suffice (with up to 20 
predictors) and if there are fewer predictors then you can afford to have a smaller sample; 
(2) if you’re expecting a medium effect, then a sample size of 200 will always suffice (up 
to 20 predictors), you should always have a sample size above 60, and with six or fewer 
predictors you’ll be fine with a sample of 100; and (3) if you’re expecting a small effect size 
then just don’t bother unless you have the time and resources to collect at least 600 cases 
of data (and many more if you have six or more predictors). 



7.7.2.4. Multicollinearity

Multicollinearity exists when there is a strong correlation between two or more predictors
in a regression model. Multicollinearity poses a problem only for multiple regression
because (without wishing to state the obvious) simple regression requires only one
predictor. Perfect collinearity exists when at least one predictor is a perfect linear
combination of the others (the simplest example being two predictors that are perfectly
correlated - they have a correlation coefficient of 1). If there is perfect collinearity
between predictors it becomes impossible to obtain unique estimates of the regression
coefficients because there are an infinite number of combinations of coefficients that would
work equally well. Put simply, if we have two predictors that are perfectly correlated, then
the values of b for each variable are interchangeable. The good news is that perfect
collinearity is rare in real-life data. The bad news is that less than perfect collinearity is
virtually unavoidable. Low levels of collinearity pose little threat to the models generated
by R, but as collinearity increases there are three problems that arise:

• Untrustworthy bs: As collinearity increases so do the standard errors of the b
coefficients. If you think back to what the standard error represents, then big standard
errors for b coefficients means that these bs are more variable across samples.
Therefore, it means that the b coefficient in our sample is less likely to represent
the population. Crudely put, multicollinearity means that the b-values are less
trustworthy. Don’t lend them money and don’t let them go for dinner with your boy- or
girlfriend. Of course if the bs are variable from sample to sample then the resulting
predictor equations will be unstable across samples too.

• It limits the size of R: Remember that R is a measure of the multiple correlation
between the predictors and the outcome and that R² indicates the variance in the
outcome for which the predictors account. Imagine a situation in which a single
variable predicts the outcome variable fairly successfully (e.g., R = .80) and a second
predictor variable is then added to the model. This second variable might account
for a lot of the variance in the outcome (which is why it is included in the model),
but the variance it accounts for is the same variance accounted for by the first
variable. In other words, once the variance accounted for by the first predictor has been
removed, the second predictor accounts for very little of the remaining variance (the
second variable accounts for very little unique variance). Hence, the overall variance
in the outcome accounted for by the two predictors is little more than when only one
predictor is used (so R might increase from .80 to .82). This idea is connected to the
notion of partial correlation that was explained in Chapter 6. If, however, the two
predictors are completely uncorrelated, then the second predictor is likely to account
for different variance in the outcome to that accounted for by the first predictor. So,
although in itself the second predictor might account for only a little of the variance
in the outcome, the variance it does account for is different than that of the other
predictor (and so when both predictors are included, R is substantially larger, say
.95). Therefore, having uncorrelated predictors is beneficial.

• Importance of predictors: Multicollinearity between predictors makes it difficult to
assess the individual importance of a predictor. If the predictors are highly correlated,
and each accounts for similar variance in the outcome, then how can we know which
of the two variables is important? Quite simply, we can’t tell which variable is
important - the model could include either one, interchangeably.
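
The first of these problems (inflated standard errors) is easy to demonstrate with made-up data. The sketch below is an illustration under stated assumptions: the data are simulated, the helper make_fit() is hypothetical, and the correlation between the two predictors is only approximately the value requested. It fits the same two-predictor model twice, once with roughly uncorrelated predictors and once with predictors correlating at about .95, and compares the standard errors of the bs.

```r
set.seed(2024)
n  <- 200
x1 <- rnorm(n)
z  <- rnorm(n)

# Hypothetical helper: build a second predictor that correlates with x1 at
# approximately rho, simulate an outcome, and fit the two-predictor model
make_fit <- function(rho) {
  x2 <- rho * x1 + sqrt(1 - rho^2) * z
  y  <- 1 + 0.5 * x1 + 0.5 * x2 + rnorm(n)
  lm(y ~ x1 + x2)
}

fit_low  <- make_fit(0)      # roughly uncorrelated predictors
fit_high <- make_fit(0.95)   # highly collinear predictors

se_low  <- summary(fit_low)$coefficients[, "Std. Error"]
se_high <- summary(fit_high)$coefficients[, "Std. Error"]

round(rbind(low_collinearity = se_low, high_collinearity = se_high), 3)
```

The true bs are identical in both fits, so any difference in the standard errors for x1 and x2 comes from the overlap between the predictors rather than from the strength of their relationship with the outcome.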

One way of identifying multicollinearity is to scan a correlation matrix of all of the
predictor variables and see if any correlate very highly (by ‘very highly’ I mean
correlations of above .80 or .90). This is a good ‘ball park’ method but misses more subtle forms
of multicollinearity. Luckily, R can produce various collinearity diagnostics, one of which
is the variance inflation factor (VIF). The VIF indicates whether a predictor has a strong
linear relationship with the other predictor(s). Although there are no hard and fast rules
about what value of the VIF should cause concern, Myers (1990) suggests that a value of 10
is a good value at which to worry. What’s more, if the average VIF is greater than 1, then
multicollinearity may be biasing the regression model (Bowerman & O’Connell, 1990).
Related to the VIF is the tolerance statistic, which is its reciprocal (1/VIF). As such, values
below 0.1 indicate serious problems, although Menard (1995) suggests that values below
0.2 are worthy of concern.
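
To show where the VIF and tolerance come from, the following sketch computes them by hand in base R. It rests on a standard definition that is not spelled out above: the VIF for a predictor is 1/(1 - R²), where that R² comes from regressing the predictor on all of the other predictors. The data and variable names are simulated and hypothetical; add-on packages such as car provide a vif() function that reports this kind of diagnostic directly from a fitted model.

```r
set.seed(99)
n   <- 200
dat <- data.frame(x1 = rnorm(n))
dat$x2 <- 0.9 * dat$x1 + sqrt(1 - 0.9^2) * rnorm(n)  # strongly related to x1
dat$x3 <- rnorm(n)                                   # unrelated to the others

predictors <- c("x1", "x2", "x3")

# VIF for predictor j is 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all of the other predictors
vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2_j   <- summary(lm(reformulate(others, response = p), data = dat))$r.squared
  1 / (1 - r2_j)
})

tolerance <- 1 / vif_manual   # tolerance is the reciprocal of the VIF

round(rbind(VIF = vif_manual, tolerance = tolerance), 2)
mean(vif_manual)              # the average VIF
```

In this made-up example x1 and x2 should show clearly elevated VIFs while x3 stays close to 1, which is the pattern a scan of the correlation matrix would only hint at.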

If none of this has made any sense then have a look at Hutcheson and Sofroniou (1999,
pp. 78-85) who give a really clear explanation of multicollinearity.


7.8. How to do multiple regression using R Commander and R


7.8.1. Some things to think about before the analysis


A good strategy to adopt with regression is to measure predictor variables for which there
are sound theoretical reasons for expecting them to predict the outcome. Run a regression
analysis in which all predictors are entered into the model and examine the output to see
which predictors contribute substantially to the model’s ability to predict the outcome.
Once you have established which variables are important, rerun the analysis including only
the important predictors and use the resulting parameter estimates to define your
regression model. If the initial analysis reveals that there are two or more significant predictors,
then you could consider running a forward stepwise analysis (rather than forced entry) to
find out the individual contribution of each predictor.

I have spent a lot of time explaining the theory behind regression and some of the
diagnostic tools necessary to gauge the accuracy of a regression model. It is important
to remember that R may appear to be very clever, but in fact it is not. Admittedly, it can
do lots of complex calculations in a matter of seconds, but what it can’t do is control the
quality of the model that is generated - to do this requires a human brain (and preferably a
trained one). R will happily generate output based on any garbage you decide to feed into
it and will not judge the results or give any indication of whether the model can be
generalized or if it is valid. However, R provides the statistics necessary to judge these things,
and at this point our brains must take over the job - which is slightly worrying (especially if
your brain is as small as mine).


7.8.2. Multiple regression: running the basic model


7.8.2.1. Multiple regression using R Commander: the basic model


Imagine that the record company executive was now interested in extending the model of
album sales to incorporate other variables. He decides to measure two new variables: (1)
the number of times songs from the album are played on Radio 1 during the week prior
to release (airplay); and (2) the attractiveness of the band (attract). Before an album is
released, the executive notes the amount spent on advertising, the number of times songs
from the album are played on radio the week before release, and the attractiveness of the
band. He does this for 200 different albums (each made by a different band). Attractiveness
was measured by asking a random sample of the target audience to rate the attractiveness
of each band on a scale from 0 (hideous potato-heads) to 10 (gorgeous sex objects). The
mode attractiveness given by the sample was used in the regression (because he was
interested in what the majority of people thought, rather than the average of people’s opinions).
The data are in a file called Album Sales 2.dat.


FIGURE 7.11 The Album Sales 2.dat data

       adverts  sales  airplay  attract
 1      10.256    330       43       10
 2     985.685    120       28        7
 3    1445.563    360       35        7
 4    1188.193    270       33        7
 5     574.513    220       44        5
 6     568.954    170       19        5
 7     471.814     70       20        1
 8     537.352    210       22        9
 9     514.068    200       21        7
10     174.093    300       40        7
11    1720.806    290       32        7
12     611.479     70       20        2
13     251.192    150       24        8
14      97.972    190       38        6
15     406.814    240       24        7
16     265.398    100       25        5
17    1323.287    250       35        5
18     196.650    210       36        8
19    1326.598    280       27        8
20    1380.689    230       33        8
21     792.345    210       33        7
22     957.167    230       28        6
23    1789.659    320       30        9
24     656.137    210       34        7
25     613.697    230       49        7
26     313.362    250       40        8
27     336.510     60       20        4
28    1544.899    330       42        7
29      68.954    150       35        8
30     785.692    150        8        6


To conduct a multiple regression using R Commander, first initiate the package by
executing (and install it if you haven’t - see section 3.6):

library(Rcmdr) 

You can read the data file Album Sales 2.dat into R using the Data => Import data =>
from text file, clipboard, or URL... menu (see section 3.7.3). Then you can look at the
data by clicking on the View data set button. You should note that each variable has its own
column (the same layout as for correlation) and each row represents a different album. So,
the first album had £10,256 spent advertising it, sold 330,000 copies, received 43 plays on
Radio 1 the week before release, and was made by a band that the majority of people rated
as gorgeous sex objects (Figure 7.11).

The executive has past research indicating that advertising budget is a significant
predictor of album sales, and so he should include this variable in the model first. His new
variables (airplay and attract) should, therefore, be entered into the model after
advertising budget. This method is hierarchical (the researcher decides in which order to enter
variables into the model based on past research). The record executive needs to run two
models. In his first model, the predictor will be adverts. In the second model, the
predictors will be adverts, airplay and attract.


FIGURE 7.12 Dialog boxes for conducting multiple regression using R Commander


We can use R Commander to run the model by selecting Statistics => Fit models =>
Linear regression... (Figure 7.12). For the first model (left dialog box in Figure 7.12) we
select sales as the response variable, and adverts as the explanatory variable. We have
named this model albumSales.2. When you have selected your variables and named the
model, click on OK. The resulting output is described in section 7.8.3.1.

For the second model we choose three explanatory variables: adverts, airplay and attract.
To select multiple variables you can either ‘swipe’ over all the variables you are interested
in with the mouse (if they are next to each other), or hold down the Ctrl key (cmd on a
Mac) while you click on each one (if they are not next to each other). When you have
selected your variables and named the model, click on OK. The resulting output is also
described in section 7.8.3.1.


7.8.2.2. Multiple regression using R: the basic model


First load the data file by setting your working directory to the location of the file (see
section 3.4.4) and executing:

album2 <- read.delim("Album Sales 2.dat", header = TRUE)

We can again run the regression analysis using the lm() function. We need to create two 
models: the first, albumSales.2, will have adverts as a predictor. The second, albumSales.3, 
will have adverts, airplay and attract as predictors. 

The first model is the same as the one we created in section 7.4.2 and we can create it by 
executing the following command: 

albumSales.2 <- lm(sales ~ adverts, data = album2) 

To remind you, this creates a model called albumSales.2, in which the variable sales is
predicted from the variable adverts (sales ~ adverts). The data = simply tells R which
dataframe contains the variables we’re using in the model.

To create the second model, we need to specify additional predictors, and we can do
this in the same way that we added predictors to the regression equation itself: we simply
use ‘+’ to add them into the model. Therefore, if we want to predict sales from the
variables adverts, airplay and attract, then our model is specified as sales ~ adverts + airplay +
attract. It basically looks the same as the regression equation but without the bs. Therefore,
to create this model we would execute:


albumSales.3 <- lm(sales ~ adverts + airplay + attract, data = album2) 


This command creates a model, albumSales.3, in which the variable sales is predicted from
the variables adverts, airplay and attract. We could also have used the update() function
to do this because this model is simply adding new predictors to the previous model (R’s
Souls’ Tip 19.3).



The update() function


Writing out the models in full can be helpful to understand how the lm() function works: I think it’s useful to see
how the code relates to the equation that describes the model. However, the update() function is a quicker way
to add new things to old models. In our example our model albumSales.3 is the same as the previous model,
albumSales.2, except that we added two variables (attract and airplay). Look at the two model specifications:

albumSales.2 <- lm(sales ~ adverts, data = album2)

albumSales.3 <- lm(sales ~ adverts + airplay + attract, data = album2)

Note that they are identical except that the second model has two new variables added as predictors. Using the
update() function we can create the second model in less text:

albumSales.3 <- update(albumSales.2, .~. + airplay + attract)

This function, like the longhand one, creates a new model called albumSales.3, and it does this by updating an
existing model. The first part inside the parentheses tells R which model to update (in this case we want to update
the model called albumSales.2). The .~. means ‘keep the outcome and predictors the same as the baseline model’:
the dots mean ‘keep the same’ so the fact that we put dots on both sides of the ~ means that we want to keep
both the outcome and predictors the same as in the baseline model. The + airplay + attract means ‘add airplay
and attract as predictors’. Therefore, ‘.~. + airplay + attract’ can be interpreted as ‘keep the same outcomes and
predictors as the baseline model but add airplay and attract as predictors’.
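
If you would like to convince yourself that the longhand and update() routes really do give the same model, the sketch below checks it. To keep it self-contained it uses simulated data in a dataframe called demo (a hypothetical name); the values are invented and are not the contents of Album Sales 2.dat, so the coefficients will not match the outputs in this chapter.

```r
set.seed(1)
# Simulated stand-in data using the same variable names as the album example
n <- 100
demo <- data.frame(adverts = runif(n, 0, 2000),
                   airplay = rpois(n, 28),
                   attract = sample(1:10, n, replace = TRUE))
demo$sales <- 0.08 * demo$adverts + 3 * demo$airplay +
  10 * demo$attract + rnorm(n, sd = 40)

# Baseline model with a single predictor
base_model <- lm(sales ~ adverts, data = demo)

# Longhand specification of the larger model
longhand <- lm(sales ~ adverts + airplay + attract, data = demo)

# The same model built by updating the baseline:
# . ~ . keeps the existing outcome and predictors, then two more are added
updated <- update(base_model, . ~ . + airplay + attract)

formula(updated)
all.equal(coef(longhand), coef(updated))
```

The call to all.equal() compares the two sets of coefficients, and formula() shows the model specification that update() has built from the baseline.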


7.8.3. Interpreting the basic multiple regression


7.8.3.1. The model summary


To see the output of these models we need to use the summary() function (in which we
simply place the name of the model). To see the output of our models, execute:

summary(albumSales.2)
summary(albumSales.3)

The summary of the albumSales.2 model is shown in Output 7.2, whereas the summary of 
albumSales.3 is in Output 7.3. 

Call: lm(formula = sales ~ adverts, data = album2)

Residuals:
     Min       1Q   Median       3Q      Max
-152.949  -43.796   -0.393   37.040  211.866

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.341e+02  7.537e+00  17.799   <2e-16 ***
adverts     9.612e-02  9.632e-03   9.979   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 65.99 on 198 degrees of freedom
Multiple R-squared: 0.3346, Adjusted R-squared: 0.3313
F-statistic: 99.59 on 1 and 198 DF, p-value: < 2.2e-16

Output 7.2

Call: lm(formula = sales ~ adverts + airplay + attract, data = album2)

Residuals:
     Min       1Q   Median       3Q      Max
-121.324  -28.336   -0.451   28.967  144.132

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -26.612958  17.350001  -1.534    0.127
adverts       0.084885   0.006923  12.261  < 2e-16 ***
airplay       3.367425   0.277771  12.123  < 2e-16 ***
attract      11.086335   2.437849   4.548 9.49e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 47.09 on 196 degrees of freedom
Multiple R-squared: 0.6647, Adjusted R-squared: 0.6595
F-statistic: 129.5 on 3 and 196 DF, p-value: < 2.2e-16

Output 7.3


Let’s look first at the R² statistics at the bottom of each summary. This value describes the
overall model (so it tells us whether the model is successful in predicting album sales).
Remember that we ran two models: albumSales.2 refers to the first stage in the hierarchy when
only advertising budget is used as a predictor, albumSales.3 refers to when all three
predictors are used. At the beginning of each output, R reminds us of the command that we ran to
get each model.

When only advertising budget is used as a predictor, the R² statistic is the square of the
simple correlation between advertising and album sales (0.578²). In fact all of the statistics
for albumSales.2 are the same as the simple regression model earlier (see albumSales.1 in
section 7.5). The value of R², we already know, is a measure of how much of the variability
in the outcome is accounted for by the predictors. For the first model its value is .335,
which means that advertising budget accounts for 33.5% of the variation in album sales.
However, when the other two predictors are included as well (albumSales.3 in Output
7.3), this value increases to .665, or 66.5% of the variance in album sales. Therefore, if
advertising accounts for 33.5%, we can tell that attractiveness and radio play account for
an additional 33.0%.¹⁰ So the inclusion of the two new predictors has explained quite a
large amount of the variation in album sales.

The adjusted R² gives us some idea of how well our model generalizes, and ideally we would
like its value to be the same as, or very close to, the value of R². In this example the
difference for the final model (Output 7.3) is small (in fact the difference between the values
is .665 - .660 = .005, or about 0.5%). This shrinkage means that if the model were derived from
the population rather than a sample it would account for approximately 0.5% less variance in
the outcome. Advanced students might like to apply Stein’s formula to the R² to get some idea
of its likely value in different samples. Stein’s formula was given in equation (7.11) and can
be applied by replacing n with the sample size (200) and k with the number of predictors (3):

\[
\begin{aligned}
\text{adjusted } R^2 &= 1 - \left[ \frac{200-1}{200-3-1} \times \frac{200-2}{200-3-2} \times \frac{200+1}{200} \right] (1 - 0.665) \\
&= 1 - (1.015 \times 1.015 \times 1.005) \times 0.335 \\
&= 1 - 0.347 \\
&= 0.653
\end{aligned}
\]

This value is very similar to the observed value of R² (.665), indicating that the
cross-validity of this model is very good.
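
The same calculation can be handed over to R. The function below is a minimal sketch of Stein’s formula as given in equation (7.11); the name stein_r2() is invented for this illustration and is not a built-in R function.

```r
# Stein's formula for the cross-validated (adjusted) R^2
# r2 = unadjusted R^2, n = number of participants, k = number of predictors
stein_r2 <- function(r2, n, k) {
  shrink <- ((n - 1) / (n - k - 1)) *
            ((n - 2) / (n - k - 2)) *
            ((n + 1) / n)
  1 - shrink * (1 - r2)
}

# The album sales example: R^2 = .665, 200 albums, 3 predictors
stein_value <- stein_r2(r2 = 0.665, n = 200, k = 3)
round(stein_value, 3)
```

Because the function takes n and k as arguments, it is easy to see how much harsher the correction becomes with a smaller sample or more predictors, for example stein_r2(0.665, n = 50, k = 3).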



R² is the proportion of variance explained by the model. Multiply this value by 100 to give the percentage of variance explained by the model.


7.8.3.2. Model parameters


So far we have looked at the overall fit of the model. The next part of the output to consider
is the parameters of the model. Now, the first step in our hierarchy was to include
advertising budget (as we did for the simple regression earlier in this chapter) and so the parameters
for the first model are identical to the parameters obtained in Output 7.1. Therefore, we
will be concerned only with the parameters for the final model (in which all predictors were
included). Output 7.3 shows the estimate, standard error, t-value and p-value.


10 That is, 33% = 66.5% - 33.5% (this value is known as the R² change).

Remember that in multiple regression the model takes the form of equation (7.9) and in
that equation there are several unknown quantities (the b-values). The first column gives
us estimates for these b-values, and these values indicate the individual contribution of
each predictor to the model (notice that in the first model, R is using the slightly annoying
1.341e+02 notation, which means ‘move the decimal point two places to the right’, so this
value is equal to 134.1). If we replace the b-values in equation (7.9) we find that we can
define the model as follows:

\[
\begin{aligned}
\text{sales}_i &= b_0 + b_1\,\text{advertising}_i + b_2\,\text{airplay}_i + b_3\,\text{attractiveness}_i \\
&= -26.61 + (0.08\,\text{advertising}_i) + (3.37\,\text{airplay}_i) + (11.09\,\text{attractiveness}_i)
\end{aligned}
\]

The b-values tell us about the relationship between album sales and each predictor. If the
value is positive we can tell that there is a positive relationship between the predictor and
the outcome, whereas a negative coefficient represents a negative relationship. For these
data all three predictors have positive b-values indicating positive relationships. So, as
advertising budget increases, album sales increase; as plays on the radio increase, so do
album sales; and finally more attractive bands will sell more albums. The b-values tell us
more than this, though. They tell us to what degree each predictor affects the outcome if
the effects of all other predictors are held constant:

• Advertising budget (b = 0.085): This value indicates that as advertising budget
increases by one unit, album sales increase by 0.085 units. Both variables were
measured in thousands; therefore, for every £1000 more spent on advertising, an extra
0.085 thousand albums (85 albums) are sold. This interpretation is true only if the
effects of attractiveness of the band and airplay are held constant.

• Airplay (b = 3.367): This value indicates that as the number of plays on radio in the
week before release increases by one, album sales increase by 3.367 units. Therefore,
every additional play of a song on radio (in the week before release) is associated with
an extra 3.367 thousand albums (3367 albums) being sold. This interpretation is true
only if the effects of attractiveness of the band and advertising are held constant.

• Attractiveness (b = 11.086): This value indicates that a band rated one unit higher on
the attractiveness scale can expect additional album sales of 11.086 units. Therefore,
every unit increase in the attractiveness of the band is associated with an extra 11.086
thousand albums (11,086 albums) being sold. This interpretation is true only if the
effects of radio airplay and advertising are held constant.
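
One way to get a feel for these b-values is to plug numbers into the equation. The sketch below types in the estimates from Output 7.3 by hand (rather than extracting them from the fitted model) and defines a small helper, predict_sales(), whose name is made up for this illustration. It predicts sales for the first album in Figure 7.11 and then shows what one extra radio play is worth when everything else is held constant.

```r
# Estimates copied from Output 7.3 (sales and adverts are in thousands)
b <- c(intercept = -26.612958,
       adverts   =   0.084885,
       airplay   =   3.367425,
       attract   =  11.086335)

# Hypothetical helper: predicted album sales (in thousands) from the model
predict_sales <- function(adverts, airplay, attract) {
  unname(b["intercept"] +
         b["adverts"] * adverts +
         b["airplay"] * airplay +
         b["attract"] * attract)
}

# First album in Figure 7.11: adverts = 10.256, airplay = 43, attract = 10
album1 <- predict_sales(adverts = 10.256, airplay = 43, attract = 10)

# One extra play on the radio, advertising and attractiveness held constant
extra_play <- predict_sales(10.256, 44, 10) - album1

round(c(predicted = album1, one_more_play = extra_play), 3)
```

The difference between the two predictions is exactly the b for airplay, which is what ‘holding the other predictors constant’ means in practice. With the fitted model available, predict(albumSales.3, newdata = ...) would do the same job without retyping the estimates.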

Each of these beta values has an associated standard error indicating to what extent
these values would vary across different samples, and these standard errors are used to
determine whether or not the b-value differs significantly from zero. As we saw in section
7.5.2, a t-statistic can be derived that tests whether a b-value is significantly different from
0. In simple regression, a significant value of t indicates that the slope of the regression
line is significantly different from horizontal, but in multiple regression, it is not so easy to
visualize what the value tells us. Well, it is easiest to conceptualize the t-tests as measures of
whether the predictor is making a significant contribution to the model. Therefore, if the
t-test associated with a b-value is significant (if the value in the column labelled Pr(>|t|)
is less than .05) then the predictor is making a significant contribution to the model. The
smaller the value of Pr(>|t|) (and the larger the value of t), the greater the contribution of
that predictor. For this model, the advertising budget, t(196) = 12.26, p < .001, the amount
of radio play prior to release, t(196) = 12.12, p < .001, and attractiveness of the band,
t(196) = 4.55, p < .001, are all significant predictors of album sales.¹¹ From the magnitude
of the t-statistics we can see that the advertising budget and radio play had a similar impact,
whereas the attractiveness of the band had less impact.
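
The t-values and p-values in Output 7.3 can be reproduced from the first two columns alone, which makes the logic of the test explicit: each t is simply the estimate divided by its standard error, and the p-value is the two-tailed probability from a t-distribution with N - p - 1 = 196 degrees of freedom. In the sketch below the estimates and standard errors are typed in from the output; the object names are arbitrary.

```r
# Estimates and standard errors copied from Output 7.3
est <- c(intercept = -26.612958, adverts = 0.084885,
         airplay = 3.367425, attract = 11.086335)
se  <- c(intercept = 17.350001, adverts = 0.006923,
         airplay = 0.277771, attract = 2.437849)

df_resid <- 200 - 3 - 1   # N - p - 1 residual degrees of freedom

t_values <- est / se                                # t = b / SE
p_values <- 2 * pt(-abs(t_values), df = df_resid)   # two-tailed p-values

round(t_values, 3)
signif(p_values, 3)
```

The rounded t-values should agree with the t value column of Output 7.3, and the p-values for the three predictors all fall well below .001.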

The b-values and their significance are important statistics to look at; however, the
standardized versions of the b-values are in many ways easier to interpret (because they are
not dependent on the units of measurement of the variables). To obtain the standardized
beta estimates (usually denoted by βᵢ) we need to use a function called lm.beta(). This is
found in the QuantPsyc package, and so you need to install and load this package (see
section 7.3). All we need to do is to specify our model within this function and then execute
it. Therefore, to get standardized betas for the albumSales.3 model, we execute:

lm.beta(albumSales.3)

The resulting output is: 

adverts airplay attract 
0.5108462 0.5119881 0.1916834 

These estimates tell us the number of standard deviations by which the outcome will
change as a result of one standard deviation change in the predictor. The standardized
beta values are all measured in standard deviation units and so are directly comparable:
therefore, they provide a better insight into the ‘importance’ of a predictor in the model.
The standardized beta values for airplay and advertising budget are virtually identical
(0.512 and 0.511, respectively) indicating that both variables have a comparable degree of
importance in the model (this concurs with what the magnitude of the t-statistics told us).

• Advertising budget (standardized β = .511): This value indicates that as advertising
budget increases by one standard deviation (£485,655), album sales increase by 0.511
standard deviations. The standard deviation for album sales is 80,699 and so this
constitutes a change of 41,240 sales (0.511 x 80,699). Therefore, for every £485,655
more spent on advertising, an extra 41,240 albums are sold. This interpretation is
true only if the effects of attractiveness of the band and airplay are held constant.

• Airplay (standardized β = .512): This value indicates that as the number of plays on
radio in the week before release increases by 1 standard deviation (12.27), album
sales increase by 0.512 standard deviations. The standard deviation for album sales is
80,699 and so this constitutes a change of 41,320 sales (0.512 x 80,699). Therefore,
if Radio 1 plays the song an extra 12.27 times in the week before release, 41,320
extra album sales can be expected. This interpretation is true only if the effects of
attractiveness of the band and advertising are held constant.

• Attractiveness (standardized β = .192): This value indicates that a band rated one
standard deviation (1.40 units) higher on the attractiveness scale can expect
additional album sales of 0.192 standard deviation units. This constitutes a change of
15,490 sales (0.192 x 80,699). Therefore, a band with an attractiveness rating 1.40
higher than another band can expect 15,490 additional sales. This interpretation is
true only if the effects of radio airplay and advertising are held constant.

11 For all of these predictors I wrote t(196). The number in parentheses is the degrees of freedom. We saw in
section 7.2.4 that in regression the degrees of freedom are N − p − 1, where N is the total sample size (in this case
200) and p is the number of predictors (in this case 3). For these data we get 200 − 3 − 1 = 196.

Next we need to think about the confidence intervals. We know the estimate, the standard
error of the estimate, and the degrees of freedom, and so it would be relatively
straightforward to calculate the confidence intervals for each estimate. It would be even more
straightforward to make R do that, using the confint() function. Again, we simply put the
name of the regression model into the function and execute it; therefore, to get confidence
intervals for the parameters in the model albumSales.3, we execute:

confint(albumSales.3)

The results are shown in Output 7.4. Imagine that we collected 100 samples of data
measuring the same variables as our current model. For each sample we could create a
regression model to represent the data. If the model is reliable then we hope to find very similar
parameters in all samples. Therefore, each sample should produce approximately the same
b-values. The confidence intervals of the unstandardized beta values are boundaries
constructed such that in 95% of these samples these boundaries will contain the true value of b
(see section 2.5.2). Therefore, if we’d collected 100 samples, and calculated the confidence
intervals for b, we are saying that 95% of these confidence intervals would contain the
true value of b. Therefore, we can be fairly confident that the confidence interval we have
constructed for this sample will contain the true value of b in the population. This being
so, a good model will have a small confidence interval, indicating that the value of b in
this sample is close to the true value of b in the population. The sign (positive or negative)
of the b-values tells us about the direction of the relationship between the predictor and
the outcome. Therefore, we would expect a very bad model to have confidence intervals
that cross zero, indicating that in some samples the predictor has a negative relationship
to the outcome whereas in others it has a positive relationship. In this model, the two best
predictors (advertising and airplay) have very tight confidence intervals, indicating that
the estimates for the current model are likely to be representative of the true population
values. The interval for attractiveness is wider (but still does not cross zero), indicating that
the parameter for this variable is less representative, but nevertheless significant.


                   2.5 %      97.5 %
(Intercept) -60.82960967  7.60369295
adverts       0.07123166  0.09853799
airplay       2.81962186  3.91522848
attract       6.27855218 15.89411823

Output 7.4
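
The 'relatively straightforward' hand calculation of these intervals can be sketched as follows. This is an illustration on simulated stand-in data (the variables x and y are hypothetical), assuming the usual formula: estimate ± critical t × standard error, with the residual degrees of freedom of the model. The result is compared with what confint() reports.

```r
# Simulated stand-in data (hypothetical; not the album sales data)
set.seed(1)
sim <- data.frame(x = rnorm(50))
sim$y <- 2 + 0.5 * sim$x + rnorm(50)
model <- lm(y ~ x, data = sim)

# Pull the estimates and standard errors out of the model summary
est <- coef(summary(model))[, "Estimate"]
se  <- coef(summary(model))[, "Std. Error"]

# Critical t for a 95% interval with the model's residual degrees of freedom
t.crit <- qt(0.975, df = df.residual(model))

by.hand <- cbind(lower = est - t.crit * se, upper = est + t.crit * se)
by.hand
confint(model)   # should show the same boundaries
```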



output from the summary() of the model. If you have done a hierarchical regression then look at the values for the final model. For
each predictor variable, you can see if it has made a significant contribution to predicting the outcome by looking at the column
labelled Pr(>|t|): values less than .05 are significant. You should also look at the standardized beta values because these tell
you the importance of each predictor (bigger absolute value = more important).


7.8.4. Comparing models


We did a hierarchical regression, which means we need to compare the fit of the two
models, and see if the R² is significantly higher in the second model than in the first. The
significance of R² can be tested using an F-ratio, and this F is calculated from the following
equation (in which N is the number of cases or participants, and k is the number of
predictors in the model):

$$F = \frac{(N - k - 1)R^2}{k(1 - R^2)}$$

The first model (albumSales.2) causes R² to change from 0 to .335, and this change
in the amount of variance explained gives rise to an F-ratio of 99.59, which is
significant with a probability less than .001. Bearing in mind for this first model that we have
only one predictor (so k = 1) and 200 cases (N = 200), this F comes from the equation
above:

$$F = \frac{(200 - 1 - 1) \times 0.334648}{1 \times (1 - 0.334648)} = 99.59$$


The addition of the new predictors (albumSales.3) causes R² to increase by a further
.330 (see above). We can calculate the F-ratio for this change using the same equation,
but because we’re looking at the change in models we use the change in R², $R^2_{\text{change}}$,
and the R² in the new model (model 2 in this case, so I’ve called it $R_2^2$) and we also
use the change in the number of predictors, $k_{\text{change}}$ (model 1 had one predictor and
model 2 had three predictors, so the change in the number of predictors is 3 − 1 = 2),
and the number of predictors in the new model, $k_2$ (in this case 3 because we’re looking
at model 2):

$$F_{\text{change}} = \frac{(N - k_2 - 1)R^2_{\text{change}}}{k_{\text{change}}(1 - R_2^2)} = \frac{(200 - 3 - 1) \times 0.330}{2(1 - 0.664668)} = 96.44$$


The degrees of freedom for this change are $k_{\text{change}}$ (in this case 2) and $N - k_2 - 1$ (in this case
196). As such, the change in the amount of variance that can be explained is significant,
F(2, 196) = 96.44, p < .001. The change statistics therefore tell us about the difference
made by adding new predictors to the model.
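
The two hand calculations in this section can be reproduced in R as a quick check. The sketch below only plugs the values quoted in the text (N = 200, the two R² values and the R² change) into the F-ratio equations; the object names are hypothetical, and pf() is used to obtain the upper-tail probability for the change.

```r
N <- 200

# First model: one predictor, R-squared = 0.334648
k1 <- 1
R2.1 <- 0.334648
F.1 <- ((N - k1 - 1) * R2.1) / (k1 * (1 - R2.1))

# Change from model 1 to model 2: three predictors, R-squared = 0.664668
k2 <- 3
k.change <- 2
R2.2 <- 0.664668
R2.change <- 0.330
F.change <- ((N - k2 - 1) * R2.change) / (k.change * (1 - R2.2))

# Probability of an F at least this large with (k.change, N - k2 - 1) df
p.change <- pf(F.change, df1 = k.change, df2 = N - k2 - 1, lower.tail = FALSE)

round(c(F.1 = F.1, F.change = F.change), 2)
p.change
```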


7.8.4.1. Comparing models with R Commander

To compare two hierarchical models using R Commander we choose Models =>
Hypothesis tests => Compare two models... (Figure 7.13). Note that there are two
lists of models that we have previously created in the current session of R Commander.
Both lists contain the same models, which are the albumSales.1, albumSales.2, and
albumSales.3 models. We want to compare albumSales.1 with albumSales.3 so we need
to click on albumSales.1 in the list labelled First model (pick one) and then click on
albumSales.3 in the list labelled Second model (pick one). Once the two models are
selected, click on OK to make the comparison. The resulting output is described in
the following section.
FIGURE 7.13 Comparing regression models using R Commander (the Compare Models dialog box lists
albumSales.1, albumSales.2 and albumSales.3 under both First model (pick one) and Second model
(pick one), with OK, Cancel and Help buttons)


7.8.4.2. Comparing models using R

To compare models using R we use the anova() function, which takes the general form:

anova(model.1, model.2, ... , model.n)

which simply means that we list the models that we want to compare in the order in which we
want to compare them. It’s worth noting that we can only compare hierarchical models; that
is to say, the second model must contain everything that was in the first model plus something
new, and the third model must contain everything in the second model plus something new,
and so on. Using this principle, we can compare albumSales.2 with albumSales.3 by executing:

anova(albumSales.2, albumSales.3)

Output 7.5 shows the results of this comparison. Note that the value of F is 96.44, which is
the same as the value that we calculated by hand at the beginning of this section. The value
in the column labelled Pr(>F) is less than 2.2e-16 (i.e., 2.2 with the decimal place moved 16
places to the left, or a very small value indeed); we can say that albumSales.3 significantly
improved the fit of the model to the data compared to albumSales.2, F(2, 196) = 96.44, p < .001.

Analysis of Variance Table

Model 1: album2$sales ~ album2$adverts
Model 2: album2$sales ~ album2$adverts + album2$airplay + album2$attract
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)
1    198 862264
2    196 434575  2    427690 96.447 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Output 7.5
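
The F in this table can also be recovered directly from the residual sums of squares (RSS) and residual degrees of freedom that the table reports. The following sketch assumes the standard formula for comparing nested models (drop in RSS per extra parameter, divided by the residual variance of the larger model); the object names are hypothetical and the numbers are copied from Output 7.5.

```r
# Values copied from the analysis of variance table
rss.1 <- 862264; df.1 <- 198   # model with adverts only
rss.2 <- 434575; df.2 <- 196   # model with adverts, airplay and attract

# F = (drop in RSS / number of added predictors) / (RSS of larger model / its df)
F.ratio <- ((rss.1 - rss.2) / (df.1 - df.2)) / (rss.2 / df.2)
p.value <- pf(F.ratio, df1 = df.1 - df.2, df2 = df.2, lower.tail = FALSE)

round(F.ratio, 3)
p.value
```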


CRAMMING SAM’S TIPS 


Assessing model improvement in 
hierarchical regression 


If you have done a hierarchical regression then you can assess the improvement of the model at each stage of the analysis by
looking at the change in R² and testing whether this change is significant using the anova() function.


7.9. Testing the accuracy of your regression model


7.9.1. Diagnostic tests using R Commander


R Commander will allow you to run a range of diagnostic tests and other modifications to 
your model - these are found in the Models menu, shown in Figure 7.14. The menus are 
listed below with a brief description of what functions they enable you to access. We will 
not look at any particular function in detail because, in general, we think it is quicker to 
use commands, and we outline a general strategy for testing the accuracy of your regression 
model in the next two sections. 


FIGURE 7.14 Regression diagnostics using R Commander (the Models menu and its Numerical diagnostics
submenu: Variance-inflation factors, Breusch-Pagan test for heteroscedasticity..., Durbin-Watson test
for autocorrelation..., RESET test for nonlinearity..., Bonferroni outlier test)

• Select active model...: This menu allows you to choose a regression model that you 
would like to get more information on. 

• Summarize model: This command produces a summary of the model, by running the
summary() function.

• Add observation statistics to data...: If you run this command, it will create the
outlier detection statistics for each case, and then it will merge these into your original
dataframe, creating new variables in the dataframe called hatvalue, covratio, etc.

• Confidence intervals...: Produces the confidence intervals for the model. 

• Akaike Information Criterion (AIC): This command will display the AIC for the 
model, which is used to select between models (see section 7.6.3). 

• Bayesian Information Criterion (BIC): We have not discussed the BIC in detail, but 
it is a similar measure to the AIC. 

• Stepwise model selection...: Used for stepwise model selection to add and remove
variables from the model to try to obtain the best fit possible, with the fewest
variables. Usually not advised.

• Subset model selection...: Slightly (but only slightly) better than stepwise model 
selection, this tries combinations of variables to try to obtain the best fit, with various 
penalties for having too many variables. 

• Hypothesis tests: This menu has three options. The first (ANOVA table...) produces
sums of squares and F-statistics for each predictor variable in the model. You usually
want type III sums of squares (see Jane Superbrain Box 11.1). 12 The second option
(Compare two models...) allows you to do hierarchical regression by comparing the
fit of two models. Finally the third option (Linear hypothesis...) is involved in
analyses that go beyond the material in this chapter.

• Numerical diagnostics: This gives a number of other diagnostic tests, of which we 
have covered the Variance inflation factors (VIF) and Durbin-Watson test. 

• Graphs: Several diagnostic graphs are available. It might surprise you, given its length 
and how long it has taken you to read, that there is anything not covered in this chapter, 
but we do not cover these graphs. 


7.9.2. Outliers and influential cases


The diagnostics that we have examined so far all relate to either the overall model, 
or to a specific variable. The other type of diagnostic we can look at relates to cases: 
each case (making up a row in the data set) has a value, hence these are called casewise 
diagnostics. These diagnostics were described in section 7.7.1. There are various
functions that we can use to obtain different casewise diagnostics and in general they take
the form of:

function(regressionModel) 

In other words, all we need to do is place the name of our regression model (in this case 
albumSales.3) into the function and execute it. As we did earlier, we can distinguish these 
measures by whether they help us to identify outliers or influential cases: 

• Outliers : Residuals can be obtained with the resid() function, standardized residuals 
with the rstandard() function and studentized residuals with the rstudent() function. 

• Influential cases: Cook’s distances can be obtained with the cooks.distance()
function, DFBeta with the dfbeta() function, DFFit with the dffits() function, hat values
(leverage) with the hatvalues() function, and the covariance ratio with the covratio()
function.

If we merely execute these functions, R will print a long list of values to the console for
us, which isn’t very useful. Instead, we can store the values in the dataframe, which will
enable us to look at them more easily. We can store them in the dataframe by simply
creating a new variable within the dataframe and setting the value of this variable to be one of
the functions we’ve just discussed. Remember from section 3.5.2, that to add a variable to
a dataframe we execute a command with this general format:

dataFrameName$newVariableName<-newVariableData 

In other words, we create the variable by specifying a name for it and appending this to 
the name of the dataframe to which we want to add it, then on the right-hand side of the 
command we specify what the variable contains (with some arithmetic or a function, etc.). 
Therefore, to create a variable in our album2 dataframe that contains the residuals for each 
case, we would execute: 

album2$residuals<-resid(albumSales.3) 


12 Statisticians can get quite hot under the collar about the different types of sums of squares. However, if you 
ask for type III sums of squares, you’ll get the same p-values that you get in the model summary. That’s why we 
like them here. 
This creates a variable called residuals in the dataframe called album2 (album2$residuals)
that contains the residuals from the albumSales.3 model (resid(albumSales.3)). Similarly,
we can add all of the other residuals and casewise statistics by executing:

album2$standardized.residuals<-rstandard(albumSales.3)
album2$studentized.residuals<-rstudent(albumSales.3)
album2$cooks.distance<-cooks.distance(albumSales.3)
album2$dfbeta<-dfbeta(albumSales.3)
album2$dffit<-dffits(albumSales.3)
album2$leverage<-hatvalues(albumSales.3)
album2$covariance.ratios<-covratio(albumSales.3)

If you look at the data, you’ll see that as well as the original variables, the dataframe now 
contains variables containing the casewise statistics. For example, if you execute: 

album2

You will see the contents of the dataframe (I’ve edited the column names, suppressed some 
of the columns, and included only the first six rows of data): 13 


  adverts sales airplay attract    resid  stz.r  stu.r cooks dfbeta
   10.256   330      43      10  100.080  2.177  2.199 0.059 -5.422
  985.685   120      28       7 -108.949 -2.323 -2.350 0.011  0.216
 1445.563   360      35       7   68.442  1.469  1.473 0.011 -0.659
 1188.193   270      33       7    7.024  0.150  0.150 0.000 -0.045
  574.513   220      44       5   -5.753 -0.124 -0.123 0.000 -0.149
  568.954   170      19       5   28.905  0.618  0.617 0.001  1.143


Having created these new variables it might be a good idea to save the data (see section 
3.8), which we can do by executing: 

write.table(album2, "Album Sales With Diagnostics.dat", sep = "\t", row.names
= FALSE)

First, let’s look at the standardized residuals. I mentioned in section 7.7.1.1 that in an 
ordinary sample we would expect 95% of cases to have standardized residuals within about 
±2. We have a sample of 200, therefore it is reasonable to expect about 10 cases (5%) to 
have standardized residuals outside these limits. One of the nice things about R is that it 
automatically considers those standardized residuals to be data, so we can examine them 
just like we examine data. For example, if you execute the command: 

album2$standardized.residuals > 2 | album2$standardized.residuals < -2

then R will tell you for every case if the residual is less than -2 or greater than 2 (remember
that the ‘|’ symbol in the command means ‘or’, so the command asks ‘is the standardized
residual greater than 2 or smaller than -2?’). The command produces the following output:


TRUE 

TRUE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

TRUE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

TRUE 

FALSE 

FALSE 

FALSE 

FALSE 

TRUE 

FALSE 

FALSE 

TRUE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

TRUE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

TRUE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

FALSE 

etc. 



13 To save space, I wanted the values rounded to 3 decimal places so I executed: 
round(album2, digits = 3) 
For each case, it tells us whether it is TRUE that they have a residual more than 2 or
less than −2 (i.e., a large residual), or FALSE, that they do not (i.e., the residual falls
between ±2). As before, we can store this information as a new variable in our dataframe
by executing:

album2$large.residual <- album2$standardized.residuals > 2 | album2$standardized.residuals < -2

Now we have a variable that we can use. To use it, it is useful to remember that R stores 
‘TRUE’ as 1, and ‘FALSE’ as 0. Because of that, we can use the sum() function to get the 
sum of the variable large.residual, and this will be the number of cases with a large residual. 
To use the sum() function we simply enter into it the variable that we want to sum;
therefore, to find out how many large residuals there are we can execute:

sum(album2$large.residual) 

[1] 12
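
As a variation on the commands above, the same flag can be built with abs(), and which() and mean() give the row numbers and the proportion of flagged cases. This is a sketch on simulated stand-in data (the dataframe sim and its variables are hypothetical, not the album2 data), so the counts it prints will differ from those in the text.

```r
# Simulated stand-in data (hypothetical; not the album sales data)
set.seed(7)
sim <- data.frame(x = rnorm(200))
sim$y <- 1 + 2 * sim$x + rnorm(200)
model <- lm(y ~ x, data = sim)
sim$standardized.residuals <- rstandard(model)

# abs(z) > 2 is the same test as z > 2 | z < -2
sim$large.residual <- abs(sim$standardized.residuals) > 2
two.sided <- sim$standardized.residuals > 2 | sim$standardized.residuals < -2

sum(sim$large.residual)     # how many cases are flagged (TRUE counts as 1)
which(sim$large.residual)   # which rows they are
mean(sim$large.residual)    # proportion of cases flagged
```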

In other words, R will tell you that only 12 cases had a large residual (which we
defined as one bigger than 2 or smaller than -2). It might be better to know not just
how many cases there are, but also which cases they are. We can do this by selecting
only those cases for which the variable large.residual is TRUE. Remember from
section 3.9.1 that we can select parts of the data set by using dataFrame[rows, columns]
in which we can specify conditions for rows and columns that tell R what we want to
see. If we set rows to be album2$large.residual, then we will see only those rows for
which large.residual is TRUE. If we don’t want to see all of the columns, we could
also list the columns that we do want to see by providing a list of variable names. For
example, if we execute:

album2[album2$large.residual,c("sales", "airplay", "attract", "adverts", 
"standardized.residuals")] 

we will see the variables (or columns) labelled sales, airplay, attract, adverts and
standardized.residuals but only for cases for which large.residual is TRUE. Output 7.6 shows these
values. From this output we can see that we have 12 cases (6%) that are outside of the
limits: therefore, our sample is within 1% of what we would expect. In addition, 99% of cases
should lie within ±2.5 and so we would expect only 1% of cases to lie outside of these
limits. From the cases listed here, it is clear that two cases (1%) lie outside of the limits
(cases 164 and 169). Therefore, our sample appears to conform to what we would expect
for a fairly accurate model. These diagnostics give us no real cause for concern except that
case 169 has a standardized residual greater than 3, which is probably large enough for us
to investigate this case further.

We have saved a range of other casewise diagnostics from our model. One useful
strategy is to use the casewise diagnostics to identify cases that you want to investigate further.
Let’s continue to look at the diagnostics for the cases of interest. Let’s look now at the
leverage (hat value), Cook’s distance and covariance ratio for these 12 cases that have large
residuals. We can do this by using the same command as before, but listing different
variables (columns) in the data set:

album2[album2$large.residual , c("cooks.distance", "leverage", "covariance.ratios")]

Executing this command prints the variables (or columns) labelled cooks.distance,
leverage, and covariance.ratios but only for cases for which large.residual is TRUE.
Output 7.7 shows these values; none of them has a Cook’s distance greater than 1 (even
case 169 is well below this criterion), so none of the cases is having an undue influence
on the model. The average leverage can be calculated as 0.02 ((k + 1)/n = 4/200) and
so we are looking for values either twice as large as this (0.04) or three times as large
(0.06) depending on which statistician you trust most (see section 7.7.1.2). All cases
are within the boundary of three times the average and only case 1 is close to two times
the average.
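
The leverage cut-offs can be computed in a couple of lines. The sketch below uses k = 3 and n = 200 from the text; the second half is an illustration on simulated stand-in data (hypothetical variables x1, x2, x3 and y) of the fact that hat values from a linear model with an intercept always average (k + 1)/n.

```r
k <- 3     # number of predictors
n <- 200   # sample size

avg.leverage <- (k + 1) / n
cutoffs <- c(average = avg.leverage,
             twice   = 2 * avg.leverage,
             thrice  = 3 * avg.leverage)
cutoffs

# Simulated stand-in model (hypothetical data) with the same k and n
set.seed(9)
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 1 + sim$x1 + sim$x2 + sim$x3 + rnorm(n)
model <- lm(y ~ x1 + x2 + x3, data = sim)

mean(hatvalues(model))                 # equals (k + 1)/n
which(hatvalues(model) > cutoffs["twice"])   # cases above twice the average
```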



    sales airplay attract  adverts standardized.residuals
1     330      43      10   10.256               2.177404
2     120      28       7  985.685              -2.323083
10    300      40       7  174.093               2.130289
47     40      25       8  102.568              -2.460996
52    190      12       4  405.913               2.099446
55    190      33       8 1542.329              -2.455913
61    300      30       7  579.321               2.104079
68     70      37       7   56.895              -2.363549
100   250       5       7 1000.000               2.095399
164   120      53       8    9.104              -2.628814
169   360      42       8  145.585               3.093333
200   110      20       9  785.694              -2.088044

Output 7.6

    cooks.distance    leverage covariance.ratios
1      0.058703882 0.047190526         0.9712750
2      0.010889432 0.008006536         0.9201832
10     0.017756472 0.015409738         0.9439200
47     0.024115188 0.015677123         0.9145800
52     0.033159177 0.029213132         0.9599533
55     0.040415897 0.026103520         0.9248580
61     0.005948358 0.005345708         0.9365377
68     0.022288983 0.015708852         0.9236983
100    0.031364021 0.027779409         0.9588774
164    0.070765882 0.039348661         0.9203731
169    0.050867000 0.020821154         0.8532470
200    0.025134553 0.022539842         0.9543502

Output 7.7




There is also a column for the covariance ratio. We saw in section 7.7.1.2 that we need 
to use the following criteria: 

• CVR > 1 + [3(k + 1)/n] = 1 + [3(3 + 1)/200] = 1.06;

• CVR < 1 − [3(k + 1)/n] = 1 − [3(3 + 1)/200] = 0.94.

Therefore, we are looking for any cases that deviate substantially from these boundaries.
Most of our 12 potential outliers have CVR values within or just outside these
boundaries. The only case that causes concern is case 169 (again) whose CVR is some way below
the bottom limit. However, given the Cook’s distance for this case, there is probably little
cause for alarm.
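
The CVR boundaries and the comparison against them can be sketched in R as follows. The covariance ratios are typed in from Output 7.7 (with case numbers as names); the object names are hypothetical. In the analysis itself the same comparison would be applied to the stored covariance.ratios variable.

```r
k <- 3; n <- 200
upper <- 1 + 3 * (k + 1) / n
lower <- 1 - 3 * (k + 1) / n
c(lower = lower, upper = upper)

# Covariance ratios for the 12 cases with large residuals (from Output 7.7)
cvr <- c(`1` = 0.9712750, `2` = 0.9201832, `10` = 0.9439200, `47` = 0.9145800,
         `52` = 0.9599533, `55` = 0.9248580, `61` = 0.9365377, `68` = 0.9236983,
         `100` = 0.9588774, `164` = 0.9203731, `169` = 0.8532470, `200` = 0.9543502)

# Cases outside the limits, and how far below the lower limit they fall
outside <- cvr[cvr > upper | cvr < lower]
sort(lower - outside, decreasing = TRUE)
```

Sorting the distances from the lower limit makes it easy to see which case deviates most.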

You could have requested other diagnostic statistics and from what you know from 
the earlier discussion of them you would be well advised to glance over them in case 
of any unusual cases in the data. However, from this minimal set of diagnostics we 
appear to have a fairly reliable model that has not been unduly influenced by any subset 
of cases. 
CRAMMING SAM’S TIPS


Influential cases 


You need to look for cases that might be influencing the regression model: 


• Look at standardized residuals and check that no more than 5% of cases have absolute values above 2, and that no more
than about 1% have absolute values above 2.5. Any case with a value above about 3 could be an outlier.

• Look at the values of Cook's distance: any value above 1 indicates a case that might be influencing the model.

• Calculate the average leverage (the number of predictors plus 1, divided by the sample size) and then look for values
greater than twice or three times this average value.

• Calculate the upper and lower limit of acceptable values for the covariance ratio, CVR. The upper limit is 1 plus three times
the average leverage, whereas the lower limit is 1 minus three times the average leverage. Cases that have a CVR falling
outside these limits may be problematic.


7.9.3. Assessing the assumption of independence


In section 7.7.2.1 we discovered that we can test the assumption of independent errors
using the Durbin-Watson test. We can obtain this statistic along with a measure of
autocorrelation and a p-value in R using the durbinWatsonTest() (careful, that’s a lower case d at
the start, and upper case W and T) or, equivalently, dwt() function. All you need to do is to
name your regression model within the function and execute it. So, for example, to see the
Durbin-Watson test for our albumSales.3 model, we would execute:

durbinWatsonTest(albumSales.3) 

or 

dwt(albumSales.3)

both of which do the same thing. As a conservative rule I suggested that values less than 1 
or greater than 3 should definitely raise alarm bells. The closer to 2 that the value is, the 
better, and for these data (Output 7.8) the value is 1.950, which is so close to 2 that the 
assumption has almost certainly been met. The p-value of .7 confirms this conclusion (it is 
very much bigger than .05 and, therefore, not remotely significant). (The p-value is a little 
strange, because it is bootstrapped, and so, for complex reasons that we don’t want to go 
into here, it is not always the same every time you run the command.) 

 lag Autocorrelation D-W Statistic p-value
   1       0.0026951      1.949819     0.7
 Alternative hypothesis: rho != 0

Output 7.8
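
To see why values near 2 indicate uncorrelated errors, the Durbin-Watson statistic can be computed by hand from the residuals. The sketch below uses simulated stand-in data (hypothetical variables x and y) and base R only; it assumes the usual definition of the statistic (sum of squared differences between successive residuals, divided by the sum of squared residuals) and shows its link to the lag-1 autocorrelation of the residuals.

```r
# Simulated stand-in data (hypothetical; not the album sales data)
set.seed(3)
sim <- data.frame(x = rnorm(200))
sim$y <- 1 + 0.5 * sim$x + rnorm(200)
model <- lm(y ~ x, data = sim)

e <- unname(resid(model))
n <- length(e)

# Durbin-Watson statistic from its definition
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 autocorrelation of the residuals
r1 <- sum(e[-1] * e[-n]) / sum(e^2)

# DW is approximately 2 * (1 - r1); the exact identity has a small end correction
end.correction <- (e[1]^2 + e[n]^2) / sum(e^2)
c(dw = dw, approx = 2 * (1 - r1), exact = 2 * (1 - r1) - end.correction)
```

When the autocorrelation is close to 0 the statistic is close to 2, and it moves towards 0 or 4 as the autocorrelation approaches +1 or −1.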


7.9.4. Assessing the assumption of no multicollinearity


The VIF and tolerance statistics (with tolerance being 1 divided by the VIF) are useful
statistics to assess collinearity. We can obtain the VIF using the vif() function. All we need to do
is to specify the model name within the function; so, for example, to get the VIF statistics
for the albumSales.3 model, we execute:


vif(albumSales.3) 
The tolerance doesn’t have its own function, but we can calculate it very easily, if we
remember that tolerance = 1/VIF. Therefore, we can get the values by executing:

1/vif(albumSales.3) 

It can be useful to look at the average VIF too. To calculate the average VIF we can add the
VIF values for each predictor and divide by the number of predictors (k):

$$\overline{\text{VIF}} = \frac{\sum_{i=1}^{k} \text{VIF}_i}{k} = \frac{1.015 + 1.043 + 1.038}{3} = 1.032$$

Alternatively, we can ask R to do it for us by placing the vif command above into the
mean() function and executing:

mean(vif(albumSales.3)) 

vif(albumSales.3)
 adverts  airplay  attract
1.014593 1.042504 1.038455

1/vif(albumSales.3)
  adverts   airplay   attract
0.9856172 0.9592287 0.9629695

mean(vif(albumSales.3))
[1] 1.03185

Output 7.9

These statistics are shown in Output 7.9 (the VIF first, then the tolerance, then the mean 
VIF). There are a few guidelines from section 7.7.2.4 that can be applied here: 

• If the largest VIF is greater than 10 then there is cause for concern (Bowerman &
O’Connell, 1990; Myers, 1990).

• If the average VIF is substantially greater than 1 then the regression may be biased
(Bowerman & O’Connell, 1990).

• Tolerance below 0.1 indicates a serious problem. 

• Tolerance below 0.2 indicates a potential problem (Menard, 1995). 

For our current model the VIF values are all well below 10 and the tolerance statistics all 
well above 0.2. Also, the average VIF is very close to 1. Based on these measures we can 
safely conclude that there is no collinearity within our data. 
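
To make the link between VIF, tolerance and collinearity concrete, the statistics can be computed by hand in base R. The sketch below assumes the standard definition (the VIF for a predictor is 1/(1 − R²), where R² comes from regressing that predictor on all the other predictors) and uses simulated stand-in data with hypothetical variables x1, x2, x3 and y. The last lines recompute the mean of the three VIF values printed in the output above.

```r
# Simulated stand-in data (hypothetical; x3 is mildly related to x1)
set.seed(11)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$x3 <- 0.3 * sim$x1 + rnorm(n)
sim$y  <- 1 + sim$x1 + sim$x2 + sim$x3 + rnorm(n)

predictors <- c("x1", "x2", "x3")

# VIF for each predictor: regress it on the remaining predictors
vif.by.hand <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = sim))$r.squared
  1 / (1 - r2)
})

tolerance <- 1 / vif.by.hand
vif.by.hand
tolerance
mean(vif.by.hand)

# Mean of the VIF values reported in the text
text.vif <- c(adverts = 1.014593, airplay = 1.042504, attract = 1.038455)
mean(text.vif)
```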



CRAMMING SAM’S TIPS 


Checking for multicollinearity 


To check for multicollinearity, use the VIF values. If these values are less than 10 then that indicates there probably isn’t cause 
for concern. If you take the average of VIF values, and this average is not substantially greater than 1, then that also indicates 
that there’s no cause for concern. 
7.9.5. Checking assumptions about the residuals


As a final stage in the analysis, you should visually check the assumptions that relate to 
the residuals (or errors). For a basic analysis it is worth plotting the standardized residual 
(y-axis) against the predicted value (x-axis), because this plot is useful to determine whether 
the assumptions of random errors and homoscedasticity have been met. If we wanted to 
produce high-quality graphs for publication we would use ggplot2() - see R’s Souls’ Tip 
7.3. However, if we’re just looking at these graphs to check our assumptions, we’ll use the 
simpler (but not as nice) plot() and hist() functions. 

The first useful graph is a plot of fitted values against residuals. This should look like 
a random array of dots evenly dispersed around zero. If this graph funnels out, then the 
chances are that there is heteroscedasticity in the data. If there is any sort of curve in this 
graph then the chances are that the data have violated the assumption of linearity. Figure
7.15 shows several examples of plots of standardized predicted values against
standardized residuals. The top left panel shows a situation in which the assumptions of linearity
and homoscedasticity have been met. The top right panel shows a similar plot for a data 
set that violates the assumption of homoscedasticity. Note that the points form the shape 
of a funnel so they become more spread out across the graph. This funnel shape is typical 
of heteroscedasticity and indicates increasing variance across the residuals. The bottom 
left panel shows a plot of some data in which there is a non-linear relationship between 
the outcome and the predictor. This pattern is shown up by the residuals. There is a clear 
curvilinear trend in the residuals. Finally, the bottom right panel illustrates a situation in
which the data not only represent a non-linear relationship, but also show
heteroscedasticity. Note first the curved trend in the data, and then also note that at one end of the
plot the points are very close together whereas at the other end they are widely dispersed. 
When these assumptions have been violated you will not see these exact patterns, but 
hopefully these plots will help you to understand the types of anomalies you should look 
out for. 
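To get a feel for these patterns, it can help to manufacture them yourself. The following is a minimal sketch that assumes simulated data (the variables x, y_ok, y_hetero and y_curve are made up for illustration and have nothing to do with the album sales data). It fits three simple regressions and plots standardized residuals against fitted values for each.

```r
# Simulated (hypothetical) data: one well-behaved model and two violations
set.seed(1)
n <- 300
x <- runif(n, 1, 10)

y_ok     <- 2 + 3 * x + rnorm(n, sd = 3)                       # assumptions met
y_hetero <- 2 + 3 * x + rnorm(n, sd = x)                       # error spread grows with x
y_curve  <- 2 + 3 * x + 0.8 * (x - 5.5)^2 + rnorm(n, sd = 3)   # curved relationship

fit_ok     <- lm(y_ok ~ x)
fit_hetero <- lm(y_hetero ~ x)
fit_curve  <- lm(y_curve ~ x)

# Standardized residuals (y-axis) against fitted values (x-axis), one panel per model
op <- par(mfrow = c(1, 3))
plot(fitted(fit_ok), rstandard(fit_ok), main = "Assumptions met",
     xlab = "Fitted values", ylab = "Standardized residual")
abline(h = 0, lty = 2)
plot(fitted(fit_hetero), rstandard(fit_hetero), main = "Heteroscedasticity",
     xlab = "Fitted values", ylab = "Standardized residual")
abline(h = 0, lty = 2)
plot(fitted(fit_curve), rstandard(fit_curve), main = "Non-linearity",
     xlab = "Fitted values", ylab = "Standardized residual")
abline(h = 0, lty = 2)
par(op)
```

The middle panel should funnel out from left to right and the right-hand panel should show a curve, whereas the left-hand panel should look like a shapeless cloud around zero.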

It is easy to get this plot in R: we can simply enter the name of the regression model into 
the plot() function. One of the clever things about R is that when you ask it to perform an 
action on something, it looks at what that something is before it decides what to do. For 
example, when we ask R to summarize something, using summary(x), if x is a continuous 
variable it will give the mean, but if x is a factor (categorical) variable, it will give counts. 
And if x is a regression model, it gives the parameters, R², and a couple of other things. The
same happens when we use plot(). When you specify a regression model in the plot() function,
R decides that you probably want to see four plots - the first of which is the residuals
plotted against the fitted values. 
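As a small illustration of this 'look first, then decide' behaviour, here is a sketch that assumes a simulated dataframe called demo (a hypothetical stand-in for the album sales data, with made-up values). It also uses the which argument of plot() for regression models, which lets you ask for individual plots rather than all four.

```r
# Simulated (hypothetical) stand-in for the album sales dataframe
set.seed(42)
n <- 200
demo <- data.frame(adverts = runif(n, 0, 2000), airplay = rpois(n, 25))
demo$sales <- 50 + 0.08 * demo$adverts + 3 * demo$airplay + rnorm(n, sd = 40)

demoModel <- lm(sales ~ adverts + airplay, data = demo)

# R inspects the class of an object before deciding what to do with it
class(demo$sales)   # "numeric"
class(demoModel)    # "lm"

summary(demo$sales)                    # numeric variable: mean, quartiles, etc.
summary(factor(c("a", "b", "a")))      # factor: counts per category
summary(demoModel)                     # regression model: parameters, R-squared, etc.

# For a regression model, which = 1 gives residuals against fitted values
# and which = 2 gives the normal Q-Q plot
plot(demoModel, which = 1)
plot(demoModel, which = 2)
```

Asking for one plot at a time with which saves you from clicking through plots that you don't need.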

This plot is shown in Figure 7.16; compare this plot to the examples shown in Figure 
7.15. Hopefully, it’s clear to you that the graph for the residuals in our album sales model 
shows a fairly random pattern, which is indicative of a situation in which the assumptions 
of linearity, randomness and homoscedasticity have been met. 

The second plot that is produced by the plot() function is a Q-Q plot, which shows 
up deviations from normality (see Chapter 5). The straight line in this plot represents 
a normal distribution, and the points represent the observed residuals. Therefore, in 
a perfectly normally distributed data set, all points will lie on the line. This is pretty
much what we see for the record sales data (Figure 7.17, left-hand side). However, 
next to the normal probability plot of the record sales data is an example of a plot for 
residuals that deviate from normality. In this plot, the dots are very distant from the 
line (at the extremes), which indicates a deviation from normality (in this particular 
case skew). 
FIGURE 7.15 Plots of predicted (fitted) values against standardized residuals

FIGURE 7.16 Plot of residuals against predicted (fitted) values for the album sales model (panel title: Residuals vs Fitted; x-axis: Fitted values)
Publication-quality plots


The model that is produced by lm() is a type of data set, which has variables in it. One of those variables is the 
predicted (or fitted) values for each case. It’s called fitted.values, and we can refer to it just like we refer to any
other variable, by using a $ sign: so, albumSales.3$fitted.values gives us the predicted values of the model 
albumSales.3. We can save these values in our original dataframe just as we did for the other casewise diagnostic 
variables by executing: 

album2$fitted <- albumSales.3$fitted.values 


We now have a new variable, fitted, in our original dataframe that contains the predicted values. We also have 
the studentized residuals stored in the variable studentized.residuals. Using what we learnt in Chapters 4 and 
5, we can therefore create publication-standard plots by using these two variables. For example, we could plot a 
histogram of the studentized residuals by executing (see Chapter 5 for an explanation of this code): 

histogram <- ggplot(album2, aes(studentized.residuals)) + theme(legend.position =
"none") + geom_histogram(aes(y = ..density..), colour = "black", fill = "white") +
labs(x = "Studentized Residual", y = "Density")

histogram + stat_function(fun = dnorm, args = list(mean = mean(album2$studentized.residuals,
na.rm = TRUE), sd = sd(album2$studentized.residuals, na.rm = TRUE)), colour
= "red", size = 1)

We could create a Q-Q plot of these values by executing:

qqplot.resid <- ggplot(album2, aes(sample = studentized.residuals)) + stat_qq() + labs(x =
"Theoretical Values", y = "Observed Values")

Finally, we could plot a scatterplot of studentized residuals against predicted values by executing:

scatter <- ggplot(album2, aes(fitted, studentized.residuals))

scatter + geom_point() + geom_smooth(method = "lm", colour = "Blue") + labs(x = "Fitted
Values", y = "Studentized Residual")

The resulting graphs look like this: 



[Three panels: histogram of the studentized residuals (x-axis: Studentized Residual); Q-Q plot (x-axis: Theoretical Values); scatterplot of studentized residuals against Fitted Values]


FIGURE 7.17 Histograms and Q-Q plots of normally distributed residuals (left-hand side) and non-normally distributed residuals (right-hand side)

Another useful way to check whether the residuals deviate from a normal distribution
is to inspect the histogram of the residuals (or the standardized or studentized residuals).
We can obtain this plot easily using the hist() function. We simply place a variable name
into this function and it will plot us a histogram. We saved the studentized residuals in our
dataframe earlier so we could enter this variable into the function and execute it:

hist(album2$studentized.residuals) 

If you haven’t saved the studentized residuals into your dataframe you can generate the 
same plot by entering the rstudent() function that we used earlier directly into hist():

hist(rstudent(albumSales.3)) 

Figure 7.17 shows the histogram of the data for the current example (left-hand side). The
histogram should look like a normal distribution (a bell-shaped curve). For the record company
data, the distribution is roughly normal. Compare this histogram to the non-normal
histogram next to it and it should be clear that the non-normal distribution is skewed
(asymmetrical). So, you should look for a distribution that has the same shape as the one
for the album sales data: any deviation from this shape is a sign of non-normality - the
greater the deviation, the more non-normally distributed the residuals. For both the histogram
and normal Q-Q plots, the non-normal examples are extreme cases and you should
be aware that the deviations from normality are likely to be subtler.
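If you want a quick look at both checks side by side without leaving base R, something like the following sketch would do. It assumes simulated data (the variables x and y and the model fit are hypothetical, not the album sales model), and it overlays a normal curve with the same mean and standard deviation as the residuals.

```r
# Simulated (hypothetical) data and model
set.seed(7)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

# Studentized residuals, as used in the text
res <- rstudent(fit)

op <- par(mfrow = c(1, 2))

# Histogram on the density scale so that a normal curve can be drawn over it
hist(res, freq = FALSE, main = "Studentized residuals",
     xlab = "Studentized residual")
curve(dnorm(x, mean = mean(res), sd = sd(res)), add = TRUE, col = "red")

# Normal Q-Q plot with its reference line
qqnorm(res)
qqline(res)

par(op)
```

Because these residuals were generated from a normal distribution, the bars should hug the red curve and the points should sit close to the line.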

We could summarize by saying that the model appears, in most senses, to be both accurate
for the sample and generalizable to the population. Therefore, we could conclude
that in our sample advertising budget and airplay are fairly equally important in predicting
album sales. Attractiveness of the band is a significant predictor of album sales but is less
important than the other two predictors (and probably needs verification because of possible
heteroscedasticity). The assumptions seem to have been met and so we can probably
assume that this model would generalize to any album being released.
CRAMMING SAM’S TIPS

Generalizing your model beyond your sample


You need to check some of the assumptions of regression to make sure your model generalizes beyond your sample: 


• Look at the graph of the standardized residuals plotted against the fitted (predicted) values. If it looks like a random array of 
dots then this is good. If the dots seem to get more or less spread out over the graph (look like a funnel) then this is probably 
a violation of the assumption of homogeneity of variance. If the dots have a pattern to them (i.e., a curved shape) then this 
is probably a violation of the assumption of linearity. If the dots seem to have a pattern and are more spread out at some 
points on the plot than others then this probably reflects violations of both homogeneity of variance and linearity. Any of these 
scenarios puts the validity of your model into question. Repeat the above for all partial plots too. 

• Look at a histogram of the residuals too. If the histogram looks like a normal distribution (and the Q-Q plot looks like a diagonal
line), then all is well. If the histogram looks non-normal, then things are less good. Be warned, though: distributions can
look very non-normal in small samples even when they are normal!


7.9.6. What if I violate an assumption?


It’s worth remembering that you can have a perfectly good model for your data (no outliers,
influential cases, etc.) and you can use that model to draw conclusions about your
sample, even if your assumptions are violated. However, it’s much more interesting to 
generalize your regression model and this is where assumptions become important. If they 
have been violated then you cannot generalize your findings beyond your sample. The 
options for correcting for violated assumptions are a bit limited. If residuals show problems 
with heteroscedasticity or non-normality you could try transforming the raw data - but this 
won’t necessarily affect the residuals. If you have a violation of the linearity assumption 
then you could see whether you can do logistic regression instead (described in the next 
chapter). Finally, you could do a robust regression, and this topic is next on our agenda. 


7.10. Robust regression: bootstrapping


We saw in Section 6.5.7 that we could bootstrap our estimate of a correlation to obtain 
the statistical significance and confidence intervals, and that this meant we could relax the 
distributional assumptions. We can do the same thing with regression estimates. When I 
showed you how to bootstrap correlations, we used the boot package, and we’re going to 
use the same procedure again. 14 

We first encountered bootstrapping and the boot() function in Chapter 6, but it won’t 
hurt to recap. When we use the boot() function, it takes the general form of: 

object<-boot(data, function, replications) 


14 There is a package called simpleboot, which has a function called lm.boot(). However, at the time of writing,
although simpleboot is very easy to use to bootstrap, obtaining things like the confidence intervals after you have 
bootstrapped is much harder. 
in which data specifies the dataframe you want to use, function is the function you want to
bootstrap, and replications is the number of bootstrap samples you want to use. (More is 
always better, but takes longer - I find 2000 to be a nice compromise.) 

As we did for correlations, we need to write a function (R’s Souls’ Tip 6.2) we want to bootstrap.
We’ll write one called bootReg() - this function is a little more complex than the function
we wrote for the correlation, because we are interested in more than one statistic (we have an 
intercept and three slope parameters to bootstrap). The function we need to execute is: 

bootReg <- function(formula, data, i)
{
    d <- data[i, ]
    fit <- lm(formula, data = d)
    return(coef(fit))
}


Executing this command creates an object called bootReg. The first bit of the function 
tells R what input to expect in the function: in this case we need to feed into the function 
a regression formula (just like what we’d put into the lm() function, so something like y 
~ a + b), a dataframe, and a variable that has been called i (which refers to a particular 
bootstrap sample, just as it did in the correlation example). Let’s look at what the bits of 
the function do: 

• d <- data[i,]: This creates a dataframe called d, which is a subset of the dataframe
entered into the function. The i again refers to a particular bootstrap sample. 

• fit <- lm(formula, data = d): This creates a regression model called fit using the lm() 
function (notice that the formula that we enter into the bootReg function is used 
within lm() to generate the model). 

• return(coef(fit)): The return() function, as the name suggests, just determines what 
our function bootReg returns to us. The function coef() is one that extracts the 
coefficients from a regression object; therefore, return(coef(fit)) means that the
output of bootReg will be the intercept and any slope coefficients for predictors 
in the model. 
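To see what the i is doing, you can call the function by hand before handing it over to boot(). The sketch below assumes R's built-in mtcars data and a made-up model (mpg ~ wt + hp) purely for illustration, because the album sales file isn't needed to make the point. Feeding in every row number in order reproduces the ordinary regression coefficients, whereas feeding in row numbers sampled with replacement gives the coefficients for one bootstrap sample.

```r
# The function from the text, with i holding the row numbers of a bootstrap sample
bootReg <- function(formula, data, i) {
  d <- data[i, ]                 # rows selected for this bootstrap sample
  fit <- lm(formula, data = d)   # fit the model to those rows
  return(coef(fit))              # intercept and slopes
}

# Illustration only: mtcars and mpg ~ wt + hp are stand-ins, not the album sales model
set.seed(123)
n <- nrow(mtcars)

# Every row once, in order: identical to fitting the model to the original data
original <- bootReg(mpg ~ wt + hp, mtcars, 1:n)

# Rows drawn with replacement: one bootstrap sample
i <- sample(n, replace = TRUE)
resampled <- bootReg(mpg ~ wt + hp, mtcars, i)

rbind(original, resampled)
```

The boot() function does essentially this, except that it generates the row numbers for you and repeats the process as many times as you ask.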

Having created this function (remember to execute the code), we can use the function to 
obtain the bootstrap samples: 

bootResults<-boot(statistic = bootReg, formula = sales ~ adverts + airplay + 
attract, data = album2, R = 2000) 

Executing this command creates an object called bootResults that contains the bootstrap
samples. We use the boot() function to get these. Within this function we tell it to use the
function bootReg() that we just created (statistic = bootReg); because that function requires
a formula and dataframe, we specify the model as we did for the original model (formula
= sales ~ adverts + airplay + attract), and name the dataframe (data = album2). As such,
everything in the boot() function is something that we specified as being necessary input for
the bootReg() function when we defined it. The only new thing is R, which sets the number
of bootstrap samples (in this case we have set it to 2000, which means we will throw
these instructions into the bootReg() function 2000 different times and save the results in
bootResults each time).

Instead of one statistic, we need to obtain bootstrap confidence intervals for the intercept
and the three slopes for adverts, airplay and attract. We can do this with the boot.ci()
function that we encountered in Chapter 6. However, R doesn’t know the names of the
statistics in bootResults, so we instead have to use their location in the bootResults object
(because R does know this information). The intercept is the first thing in bootResults, so
to obtain the bootstrapped confidence interval for the intercept we use index = 1:

boot.ci(bootResults, type = "bca", index = 1) 

Note that we enter the object from which the confidence intervals will come (bootResults),
and index to tell R where in bootResults to look (index = 1), and specify the type of confidence
interval that we want (in this case bias corrected and accelerated, type = "bca"). The
locations of the coefficients for adverts, airplay and attract are given by index values of 2, 
3 and 4, respectively, so we can get the bootstrap confidence intervals for those predictors 
by executing: 


boot.ci(bootResults, type = "bca", index = 2)
boot.ci(bootResults, type = "bca", index = 3)
boot.ci(bootResults, type = "bca", index = 4)


BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS 
Based on 2000 bootstrap replicates 

CALL : 

boot.ci(boot.out = bootResults, type = "bca", index = 1) 

Intervals : 

Level BCa 

95% (-58.49, 5.17 ) 

Calculations and Intervals on Original Scale 

> boot.ci(bootResults , type="bca", index=2) 

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS 
Based on 2000 bootstrap replicates 

CALL : 

boot.ci(boot.out = bootResults, type = "bca", index = 2) 

Intervals : 

Level BCa 

95% ( 0.0715, 0.0992 ) 

Calculations and Intervals on Original Scale 

> boot.ci(bootResults , type="bca", index=3) 

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS 
Based on 2000 bootstrap replicates 

CALL : 

boot.ci(boot.out = bootResults, type = "bca", index = 3) 

Intervals : 

Level BCa 

95% ( 2.736, 3.980 ) 

Calculations and Intervals on Original Scale 

> boot.ci(bootResults , type="bca", index=4) 

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS 
Based on 2000 bootstrap replicates 
CALL : 

boot.ci(boot.out = bootResults, type = "bca", index = 4) 
Intervals :

Level BCa 

95% ( 6.52, 15.32 ) 

Calculations and Intervals on Original Scale 

Output 7.10 

This gives us Output 7.10, which shows the confidence interval for the intercept is from 
-58.49 to 5.17 (remember that because of how bootstrapping works, you won’t get exactly 
the same result as me, but it should be very close). Compare that with the confidence interval
we found using the plug-in approach, shown in Output 7.4, which was from -60.83 to
7.60; the bootstrap results are pretty close. 

The first predictor (but second variable) was adverts. The plug-in approach gave us a 
confidence interval from 0.071 to 0.099; the bootstrap confidence interval is from 0.072 
to 0.099. Next came airplay, which had a plug-in confidence interval from 2.82 to 3.92 
and a bootstrap confidence interval from 2.74 to 3.98. Finally, attract had a plug-in confidence
interval from 6.28 to 15.89 and a bootstrap confidence interval from 6.52 to 15.32.
All of the bootstrap confidence intervals are very close to the plug-in confidence intervals, 
suggesting that we did not have a problem of non-normal distribution in the model. 
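Typing boot.ci() once per coefficient gets tedious, so you might prefer to collect the intervals in one table and put them next to the plug-in ones. What follows is only a sketch under some loud assumptions: it uses the built-in mtcars data and a made-up model (mpg ~ wt + hp) rather than the album sales data, the object names demoBoot, demoModel and bootCI are hypothetical, and it relies on the BCa limits being stored in the fourth and fifth columns of the bca element that boot.ci() returns.

```r
library(boot)

# Same idea as bootReg() in the text, written compactly
bootReg <- function(formula, data, i) coef(lm(formula, data = data[i, ]))

# Illustration only: mtcars and mpg ~ wt + hp are stand-ins for the album sales model
set.seed(2000)
demoBoot <- boot(data = mtcars, statistic = bootReg, R = 2000,
                 formula = mpg ~ wt + hp)
demoModel <- lm(mpg ~ wt + hp, data = mtcars)

# One BCa interval per coefficient; columns 4 and 5 of $bca hold the limits
bootCI <- t(sapply(seq_along(demoBoot$t0), function(k) {
  boot.ci(demoBoot, type = "bca", index = k)$bca[4:5]
}))
dimnames(bootCI) <- list(names(demoBoot$t0), c("boot.lower", "boot.upper"))

# Plug-in (normal-theory) intervals next to the bootstrap intervals
cbind(confint(demoModel), bootCI)
```

As in the text, your bootstrap limits will differ slightly from anyone else's unless you both set the same seed, which is why the sketch calls set.seed() first.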


7.11. How to report multiple regression


If you follow the American Psychological Association guidelines for reporting multiple 
regression then the implication seems to be that tabulated results are the way forward. The 
APA also seem in favour of reporting, as a bare minimum, the standardized betas, their significance
value and some general statistics about the model (such as the R²). If you do decide
to do a table then the beta values and their standard errors are also very useful. Personally 
I’d like to see the constant as well because then readers of your work can construct the full 
regression model if they need to. Also, if you’ve done a hierarchical regression you should 
report these values at each stage of the hierarchy. So, basically, you want to reproduce the 
table labelled Estimates from the output and omit some of the non-essential information. 
For the example in this chapter we might produce a table like that in Table 7.2. 

Look back through the output in this chapter and see if you can work out from where 
the values came. Things to note are: (1) I’ve rounded off to 2 decimal places throughout; 
(2) in line with APA convention, I’ve omitted 0 from the probability values, as these cannot 
exceed 1. All other values can, so the 0 is included. 


Table 7.2 How to report multiple regression 
Labcoat Leni’s Real Research 7.1

Why do you like your lecturers?


Chamorro-Premuzic, T, et al. (2008). Personality and Individual Differences, 44, 965-976. 


In the previous chapter we encountered a study by Chamorro-Premuzic et al. in which they measured students’ 
personality characteristics and asked them to rate how much they wanted these same characteristics in their 
lecturers (see Labcoat Leni’s Real Research 6.1 for a full description). In that chapter we correlated these scores; 
however, we could go a step further and see whether students’ personality characteristics predict the characteristics
that they would like to see in their lecturers.

The data from this study are in the file Chamorro-Premuzic.dat. Labcoat Leni wants you to carry out five 
multiple regression analyses: the outcome variable in each of the five analyses is how much students want to see 
neuroticism, extroversion, openness to experience, agreeableness and conscientiousness. For each 
of these outcomes, force Age and Gender into the analysis in the first step of the hierarchy, then in the 
second block force in the five student personality traits (Neuroticism, Extroversion, Openness to experience,
Agreeableness and Conscientiousness). For each analysis create a table of the results.

Answers are in the additional material on the companion website (or look at Table 4 in the original article). 



7.12. Categorical predictors and multiple regression



Often in regression analysis you’ll collect data about groups of people (e.g., ethnic group, 
gender, socio-economic status, diagnostic category). You might want to include these 
groups as predictors in the regression model; however, we saw from our assumptions that 
variables need to be continuous or categorical with only two categories. We saw in section
6.5.7 that a point-biserial correlation is Pearson’s r between two variables when one
is continuous and the other has two categories coded as 0 and 1. We’ve also learnt that
simple regression is based on Pearson’s r, so it shouldn’t take a great deal of imagination
to see that, like the point-biserial correlation, we could construct a regression model with
a predictor that has two categories (e.g., gender). Likewise, it shouldn’t be too inconceivable
that we could then extend this model to incorporate several predictors that had two
categories. All that is important is that we code the two categories with the values of 0
and 1. Why is it important that there are only two categories and that they’re coded 0 and
1? Actually, I don’t want to get into this here because this chapter is already too long, the
publishers are going to break my legs if it gets any longer, and I explain it anyway later in 
the book (sections 9.4.2 and 10.2.3), so, for the time being, just trust me! 


7.12.1. Dummy coding


The obvious problem with wanting to use categorical variables as predictors is that often 
you’ll have more than two categories. For example, if you’d measured religious affiliation 
you might have categories of Muslim, Jewish, Hindu, Catholic, Buddhist, Protestant, Jedi 
(for those of you not in the UK, we had a census here in 2001 in which a significant portion 
of people put down Jedi as their religion). Clearly these groups cannot be distinguished 
using a single variable coded with zeros and ones. In these cases we can use what we call
dummy variables. Dummy coding is a way of representing groups of people using only zeros 
and ones. To do it, we have to create several variables; in fact, the number of variables we 
need is one less than the number of groups we’re recoding. There are eight basic steps: 

1 Count the number of groups you want to recode and subtract 1. 

2 Create as many new variables as the value you calculated in step 1. These are your 
dummy variables. 

3 Choose one of your groups as a baseline (i.e., a group against which all other groups 
should be compared). This should usually be a control group, or, if you don’t have 
a specific hypothesis, it should be the group that represents the majority of people 
(because it might be interesting to compare other groups against the majority). 

4 Having chosen a baseline group, assign that group values of 0 for all of your dummy 
variables. 

5 For your first dummy variable, assign the value 1 to the first group that you want to 
compare against the baseline group. Assign all other groups 0 for this variable. 

6 For the second dummy variable assign the value 1 to the second group that you want 
to compare against the baseline group. Assign all other groups 0 for this variable. 

7 Repeat this until you run out of dummy variables. 

8 Place all of your dummy variables into the regression analysis! 
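The steps above can be sketched by hand in a few lines. This is an illustration under assumptions rather than the method used later in the chapter: the six-person music variable is made up, and baseline, groups and dummies are hypothetical names chosen for the example.

```r
# A tiny, made-up grouping variable with four groups
music <- factor(c("Crusty", "Indie Kid", "Metaller", "No Musical Affiliation",
                  "Crusty", "No Musical Affiliation"))

# Step 3: choose the baseline group
baseline <- "No Musical Affiliation"

# Steps 1-2: one dummy variable per non-baseline group (number of groups minus 1)
groups <- setdiff(levels(music), baseline)
length(groups)

# Steps 4-7: each dummy is 1 for its own group and 0 for everyone else,
# so the baseline group ends up with 0 on every dummy variable
dummies <- sapply(groups, function(g) as.numeric(music == g))

# Inspect the coding next to the original variable
data.frame(music, dummies, check.names = FALSE)
```

Step 8 would then simply be a matter of entering the columns of dummies as predictors in lm().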


Let’s try this out using an example. In Chapter 4 we came across an example in which 
a biologist was worried about the potential health effects of music festivals. She collected 
some data at the Download Festival, which is a music festival specializing in heavy metal. 
The biologist was worried that the findings that she had were a function of the fact that she 
had tested only one type of person: metal fans. Perhaps it’s not the festival that makes people
smelly, maybe it’s only metal fans at festivals that get smellier (as a metal fan, I would at
this point sacrifice the biologist to Satan for being so prejudiced). Anyway, to answer this 
question she went to another festival that had a more eclectic clientele. The Glastonbury 
Music Festival attracts all sorts of people because many styles of music are performed there. 
Again, she measured the hygiene of concert-goers over the three days of the festival using 
a technique that results in a score ranging between 0 (you smell like you’ve bathed in sewage)
and 4 (you smell of freshly baked bread). Now, in Chapters 4 and 5, we just looked at
the distribution of scores for the three days of the festival, but now the biologist wanted to 
look at whether the type of music you like (your cultural group) predicts whether hygiene 
decreases over the festival. The data are in the file called GlastonburyFestivalRegression.dat.
This file contains the hygiene scores for each of three days of the festival, but it also
contains a variable called change, which is the change in hygiene over the three days of the 
festival (so it’s the change from day 1 to day 3). 15 Finally, the biologist categorized people 
according to their musical affiliation: if they mainly liked alternative music she called them 
‘indie kid’, if they mainly liked heavy metal she called them a ‘metaller’, and if they mainly 
liked hippy/folky/ambient type music then she labelled them a ‘crusty’. Anyone not falling 
into these categories was labelled ‘no musical affiliation’. 

The first thing we should do is calculate the number of dummy variables. We have four
groups, so there will be three dummy variables (one less than the number of groups). Next
we need to choose a baseline group. We’re interested in comparing those who have different
musical affiliations against those who don’t, so our baseline category will be ‘no musical
affiliation’. We give this group a code of 0 for all of our dummy variables. For our first
dummy variable, we could look at the ‘crusty’ group, and to do this we give anyone that
was a crusty a code of 1, and everyone else a code of 0. For our second dummy variable,
we could look at the ‘indie kid’ group, and to do this we give anyone that was an indie kid
a code of 1, and everyone else a code of 0. We have one dummy variable left and this will
have to look at our final category: ‘metaller’; we give anyone who was a metaller a code
of 1, and everyone else a code of 0. The resulting coding scheme is shown in Table 7.3. The
thing to note is that each group has a code of 1 on only one of the dummy variables (except
the base category that is always coded as 0).

15 Not everyone could be measured on day 3, so there is a change score only for a subset of the original sample.


Table 7.3 Dummy coding for the Glastonbury Festival data

                 Dummy Variable 1   Dummy Variable 2   Dummy Variable 3
Crusty                  1                  0                  0
Indie Kid               0                  1                  0
Metaller                0                  0                  1
No Affiliation          0                  0                  0


This being R, there are several ways to code dummy variables. We’re going to have a 
look at the contrasts() function, because we will use it time and time again later in the book. 
First let’s load the data by executing: 

gfr<-read.delim(file="GlastonburyFestivalRegression.dat", header = TRUE, stringsAsFactors = TRUE)

This creates a dataframe called gfr (because we didn’t want to have to keep typing 
glastonburyFestivalRegression). These data look like this (the first 10 cases only): 



   ticknumb                  music day1 day2 day3 change
1      2111               Metaller 2.65 1.35 1.61  -1.04
2      2229                 Crusty 0.97 1.41 0.29  -0.68
3      2338 No Musical Affiliation 0.84   NA   NA     NA
4      2384                 Crusty 3.03   NA   NA     NA
5      2401 No Musical Affiliation 0.88 0.08   NA     NA
6      2405                 Crusty 0.85   NA   NA     NA
7      2467              Indie Kid 1.56   NA   NA     NA
8      2478              Indie Kid 3.02   NA   NA     NA
9      2490                 Crusty 2.29   NA   NA     NA
10     2504 No Musical Affiliation 1.11 0.44 0.55  -0.56


Note that the variable music contains text; therefore, R has intelligently decided to create
this variable as a factor (this was the default before R 4.0.0; in later versions add
stringsAsFactors = TRUE to read.delim()), and treat the levels in alphabetical order (level 1 = crusty, level 2
= indie kid, 3 = metaller, and 4 = no musical affiliation). We can use the contrasts() function
on this variable to set contrasts because it is a factor. There are several built-in contrasts 
that we can set (these are described in Chapter 10, Table 10.6, when we get into this topic 
in more detail). For now, all I’ll say is that in a situation in which we want to compare all 
groups to a baseline we can execute this command: 

contrasts(gfr$music)<-contr.treatment(4, base = 4) 

The contrasts(gfr$music) simply sets the contrast for the variable music in the gfr dataframe. 
The contr.treatment() function sets a contrast based on comparing all groups to a baseline 
(a.k.a. treatment) condition. This function takes the general form: 

contr.treatment(number of groups, base = number representing the baseline 
group) 
Therefore, in our command we have told R that there are four groups, and to use the last
group as the baseline condition. We can see what this command has done by looking at the 
music variable by executing: 

gfr$music 

Note that it now has a contrasts attribute:

attr(,"contrasts")
                       1 2 3
Crusty                 1 0 0
Indie Kid              0 1 0
Metaller               0 0 1
No Musical Affiliation 0 0 0
Levels: Crusty Indie Kid Metaller No Musical Affiliation
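If you'd like to see what the base argument does without loading the festival data, you can try it on a throwaway factor. The sketch below assumes a four-level factor called music that is typed in by hand (it is not the gfr dataframe), and it looks at the coding before and after setting the contrast.

```r
# A hand-typed factor with the same four levels as the festival data
music <- factor(c("Crusty", "Indie Kid", "Metaller", "No Musical Affiliation"))

# What the function returns on its own: a 4 x 3 matrix of zeros and ones
contr.treatment(4, base = 4)

# Default coding for a factor: the first level (Crusty) is the baseline
contrasts(music)

# Make the fourth level (No Musical Affiliation) the baseline instead
contrasts(music) <- contr.treatment(4, base = 4)
contrasts(music)
```

The baseline is always the level whose row contains nothing but zeros, so changing base just moves that row of zeros to a different level.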

You can see three contrasts that relate to those that we discussed in Table 7.3. This method
is the quickest way, but personally I prefer to set the contrasts manually. This preference is
not masochism, but because you can then give your contrasts informative names (there is
nothing worse than seeing an output with ‘contrast1’ in it and having no idea what contrast
1 is). To do this, we create variables that reflect each of the dummy variables in Table 7.3:

crusty_v_NMA<-c(1, 0, 0, 0)
indie_v_NMA<-c(0, 1, 0, 0) 
metal_v_NMA<-c(0, 0, 1, 0) 

We have created three variables, the first (crusty_v_NMA) contains the codes for the first dummy 
variable. Note that we have listed the codes in the order of the factor levels for music (so, the 
first group, crusty, gets a code of 1, the others a code of 0) and given it a name that reflects what 
it compares (crusty vs. no musical affiliation); therefore, when we see it in the output we will 
know what it represents. Similarly, the second variable (indie_v_NMA) contains the codes for
the second dummy variable. Again we list the codes in the order of the factor levels for music 
(so, the second group, indie kid, gets a code of 1, the others a code of 0). You get the idea. 

Having created the dummy variables, we can bind them together using cbind() - see R’s
Souls’ Tip 3.5 - and set them as the contrasts in a similar way to before, by executing: 

contrasts(gfr$music)<-cbind(crusty_v_NMA, indie_v_NMA, metal_v_NMA) 

When we inspect the music variable now, it again has the same contrasts, but they have 
more helpful names than before: 

attr(,"contrasts")
                       crusty_v_NMA indie_v_NMA metal_v_NMA
Crusty                            1           0           0
Indie Kid                         0           1           0
Metaller                          0           0           1
No Musical Affiliation            0           0           0
Levels: Crusty Indie Kid Metaller No Musical Affiliation
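The pay-off from naming the contrasts shows up in the names of the regression coefficients. Here is a minimal sketch assuming a made-up dataframe called toy with random change scores (so the estimates themselves are meaningless); only the labelling and the meaning of the intercept are of interest.

```r
# Made-up data: 10 people per group with random 'change' scores
set.seed(99)
groupNames <- c("Crusty", "Indie Kid", "Metaller", "No Musical Affiliation")
toy <- data.frame(music = factor(rep(groupNames, each = 10)))
toy$change <- rnorm(nrow(toy), mean = -0.5, sd = 0.7)

# Codes listed in the order of the factor levels, as in the text
crusty_v_NMA <- c(1, 0, 0, 0)
indie_v_NMA  <- c(0, 1, 0, 0)
metal_v_NMA  <- c(0, 0, 1, 0)
contrasts(toy$music) <- cbind(crusty_v_NMA, indie_v_NMA, metal_v_NMA)

toyModel <- lm(change ~ music, data = toy)

# Coefficient names are the variable name followed by the contrast name
names(coef(toyModel))

# With this coding the intercept is the mean of the baseline group
coef(toyModel)["(Intercept)"]
mean(toy$change[toy$music == "No Musical Affiliation"])
```

Each of the other coefficients is then the difference between that group's mean and the mean of the baseline group.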


7.12.2. Regression with dummy variables


Now you’ve created the dummy variables, you can run the regression in the same way as 
for any other kind of regression, by executing: 

glastonburyModel<-lm(change ~ music, data = gfr) 

summary(glastonburyModel) 
The fact that we have set contrasts for the variable music means that the three dummy
variables will be entered into the model in a single step. This is one reason why setting 
contrasts is a useful way to handle categorical predictors (but see R’s Souls’ Tip 7.4). 



A small confession


Before we move on, I have a small confession to make. R will actually dummy-code your data for you. When you 
put a variable as a predictor into a regression, R looks to see what kind of variable it is. Most variables in R are 
considered ‘numeric’. But some variables are considered to be ‘factors’ - a factor is a variable that R knows to 
be categorical. I mentioned that when we loaded the data R would intelligently create the variable music as a 
factor. If you enter a string variable or a factor into a regression equation, R knows that it is categorical, and so will 
dummy-code it for you. So, you can skip all the dummy coding nonsense and simply execute: 

lm(change ~ music, data = gfr)


So why didn’t I tell you that to start with? There are three reasons. First, to interpret the results you need to 
understand what R is doing. Second, we often want to decide what category is going to be the reference category
when we create the dummy variables, based on the meaning of the data. R doesn’t know what the data
mean (it’s not that clever), so it chooses the first group to be the reference (in this case it would have chosen
crusty, which is not what we want). Finally, and I know I keep going on about this, if we set our contrasts manually
we can give them helpful names.
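
To see the point about the reference category in action, here is a small sketch. It assumes a hypothetical factor, groups, with the same four level names as music, and relies on R's default treatment coding for unordered factors; contr.treatment() is then used to choose the baseline explicitly.

```r
# Hypothetical factor with the same four level names as music.
groups <- factor(c("Crusty", "Indie Kid", "Metaller", "No Musical Affiliation"))

# Default dummy coding: the first level (Crusty) is the reference,
# so its row is all zeros.
default_codes <- contrasts(groups)

# contr.treatment() lets us pick the baseline; base = 4 makes the
# fourth level (No Musical Affiliation) the reference category.
contrasts(groups) <- contr.treatment(4, base = 4)
chosen_codes <- contrasts(groups)

print(default_codes)
print(chosen_codes)
```

This gives the baseline we want, but the columns are labelled only with numbers, which is the remaining argument for building the contrasts by hand with named dummy variables.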


Call:
lm(formula = change ~ music, data = gfr)

Residuals:
     Min       1Q   Median       3Q      Max
-1.82569 -0.50489  0.05593  0.42430  1.59431

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)        -0.55431    0.09036  -6.134 1.15e-08 ***
musiccrusty_v_NMA  -0.41152    0.16703  -2.464   0.0152 *
musicindie_v_NMA   -0.40998    0.20492  -2.001   0.0477 *
musicmetal_v_NMA    0.02838    0.16033   0.177   0.8598

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6882 on 119 degrees of freedom
  (687 observations deleted due to missingness)
Multiple R-squared: 0.07617, Adjusted R-squared: 0.05288
F-statistic: 3.27 on 3 and 119 DF, p-value: 0.02369

Output 7.11 

Output 7.11 shows the summary of the regression model (it also shows the command
that you run to get the model). This shows that by entering the three dummy variables we
can explain 7.6% of the variance in the change in hygiene scores (R² expressed as a percentage).
In other words, 7.6% of the variance in the change in hygiene can be explained by
the musical affiliation of the person. The F-statistic (which shows the same thing as the R²
change statistic because there is only one step in this regression) tells us that the model is
significantly better at predicting the change in hygiene scores than having no model (or, put
another way, the 7.6% of variance that can be explained is a significant amount). Most of
this should be clear from what you’ve read in this chapter already; what’s more interesting
is how we interpret the individual dummy variables.

Let’s look at the coefficients for the dummy variables. The first thing to notice is that
each dummy variable appears in the table with its name (because we named them, otherwise
they’d be called something unhelpful like contrast1). The first dummy variable
(crusty_v_NMA) shows the difference between the change in hygiene scores for the
no-affiliation group and the crusty group. Remember that the beta value tells us the change in
the outcome due to a unit change in the predictor. In this case, a unit change in the predictor
is the change from 0 to 1. As such it shows the shift in the change in hygiene scores that
results from the dummy variable changing from 0 to 1. By including all three dummy variables
at the same time, our baseline category is always zero, so this actually represents the
difference in the change in hygiene scores if a person has no musical affiliation, compared
to someone who is a crusty. This difference is the difference between the two group means.

To illustrate this fact, I’ve produced a table (Output 7.12) of the group means for each 
of the four groups by executing this command: 

round(tapply(gfr$change, gfr$music, mean, na.rm=TRUE), 3) 

These means represent the average change in hygiene scores for the four groups (i.e.,
the mean of each group on our outcome variable). If we calculate the difference in these
means for the no-affiliation group and the crusty group, we get crusty − no affiliation =
(−0.966) − (−0.554) = −0.412. In other words, the change in hygiene scores is greater
for the crusty group than it is for the no-affiliation group (crusties’ hygiene decreases
more over the festival than those with no musical affiliation). This value is the same as the
regression estimate value in Output 7.11. So, the beta values tell us the relative difference
between each group and the group that we chose as a baseline category. This beta value is
converted to a t-statistic and the significance of this t reported. This t-statistic is testing,
as we’ve seen before, whether the beta value is 0, and when we have two categories coded
with 0 and 1, that means it’s testing whether the difference between group means is 0. If
it is significant then it means that the group coded with 1 is significantly different from
the baseline category - so, it’s testing the difference between two means, which is the
context in which students are most familiar with the t-statistic (see Chapter 9). For our
first dummy variable, the t-test is significant, and the beta value has a negative value so we
could say that the change in hygiene scores goes down as a person changes from having no
affiliation to being a crusty. Bear in mind that a decrease in hygiene scores represents more
change (you’re becoming smellier) so what this actually means is that hygiene decreased
significantly more in crusties compared to those with no musical affiliation.

Crusty   Indie Kid   Metaller   No Musical Affiliation
-0.966      -0.964     -0.526                   -0.554

Output 7.12 
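
The claim that each beta value is simply a difference between two group means can be checked on any data set. The sketch below is an illustration under stated assumptions: the data are simulated (they are not the Glastonbury data), and the names group, score and toy_model are made up for the example.

```r
set.seed(1)  # made-up data, not the Glastonbury file

# Hypothetical outcome for four groups of 10 people each.
group <- factor(rep(c("A", "B", "C", "Baseline"), each = 10),
                levels = c("A", "B", "C", "Baseline"))
score <- rnorm(40, mean = rep(c(-1, -1, -0.5, -0.5), each = 10))

# Dummy-code with the fourth level as the reference category.
contrasts(group) <- contr.treatment(4, base = 4)

toy_model   <- lm(score ~ group)
group_means <- tapply(score, group, mean)

# Each slope is a group mean minus the baseline mean;
# the intercept is the baseline mean itself.
slopes      <- unname(coef(toy_model)[-1])
differences <- unname(group_means[1:3] - group_means["Baseline"])
print(round(cbind(slopes, differences), 3))
```

The two printed columns should agree, which mirrors the comparison between Output 7.11 and Output 7.12.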

For the second dummy variable (indie_v_NMA), we’re comparing indie kids to those
that have no musical affiliation. The beta value again represents the shift in the change in
hygiene scores if a person has no musical affiliation, compared to someone who is an indie
kid. If we calculate the difference in the group means for the no-affiliation group and the
indie kid group, we get indie kid − no affiliation = (−0.964) − (−0.554) = −0.410. It
should be no surprise to you by now that this is the unstandardized beta value in Output
7.11. The t-test is significant, and the beta value has a negative value so, as with the first
dummy variable, we could say that the change in hygiene scores goes down as a person
changes from having no affiliation to being an indie kid. Bear in mind that a decrease in
hygiene scores represents more change (you’re becoming smellier) so what this actually
means is that hygiene decreased significantly more in indie kids compared to those with no
musical affiliation.

Moving on to our final dummy variable (metal_v_NMA), this compares metallers to
those that have no musical affiliation. The beta value again represents the shift in the
change in hygiene scores if a person has no musical affiliation, compared to someone who
is a metaller. If we calculate the difference in the group means for the no-affiliation group
and the metaller group, we get metaller − no affiliation = (−0.526) − (−0.554) = 0.028.
This value is again the same as the unstandardized beta value in Output 7.11. For this last
dummy variable, the t-test is not significant, so we could say that the change in hygiene
scores is not significantly different if a person changes from having no affiliation to being
a metaller. In other words, the change in hygiene scores is not predicted by whether someone
is a metaller compared to if they have no musical affiliation.

So, overall this analysis has shown that, compared to having no musical affiliation,
crusties and indie kids get significantly smellier across the three days of the festival, but
metallers don’t.

This section has introduced some really complex ideas that I expand upon in Chapters 9 
and 10. It might all be a bit much to take in, and so if you’re confused or want to know more 
about why dummy coding works in this way, I suggest reading sections 9.4.2 and 10.2.3 
and then coming back here. Alternatively, read Hardy’s (1993) excellent monograph! 



What have I discovered about statistics?


This chapter is possibly the longest book chapter ever written, and if you feel like you 
aged several years while reading it then, well, you probably have (look around, there are 
cobwebs in the room, you have a long beard, and when you go outside you’ll discover 
a second ice age has been and gone, leaving only you and a few woolly mammoths to 
populate the planet). However, on the plus side, you now know more or less everything 
you ever need to know about statistics. Really, it’s true; you’ll discover in the coming 
chapters that everything else we discuss is basically a variation on the theme of regression.
So, although you may be near death having spent your life reading this chapter
(and I’m certainly near death having written it) you are a stats genius - it’s official! 

We started the chapter by discovering that at 8 years old I could have really done
with regression analysis to tell me which variables are important in predicting talent
competition success. Unfortunately I didn’t have regression, but fortunately I had my
dad instead (and he’s better than regression). We then looked at how we could use statistical
models to make similar predictions by looking at the case of when you have one
predictor and one outcome. This allowed us to look at some basic principles such as the
equation of a straight line, the method of least squares, and how to assess how well our
model fits the data using some important quantities that you’ll come across in future
chapters: the model sum of squares, SS_M, the residual sum of squares, SS_R, and the total
sum of squares, SS_T. We used these values to calculate several important statistics such
as R² and the F-ratio. We also learnt how to do a regression using R, and how we can
plug the resulting beta values into the equation of a straight line to make predictions
about our outcome.

Next, we saw that the equation of a straight line can be extended to include several
predictors and looked at different methods of placing these predictors in the model
(hierarchical, forced entry, stepwise). Then we looked at factors that can affect the accuracy
of a model (outliers and influential cases) and ways to identify these factors. We
then moved on to look at the assumptions necessary to generalize our model beyond the
sample of data we’ve collected before discovering how to do the analysis using R, and
how to interpret the output, create our multiple regression model and test its reliability
and generalizability. I finished the chapter by looking at how we can use categorical
predictors in regression. In general, multiple regression is a long process and should be
done with care and attention to detail. There are a lot of important things to consider
and you should approach the analysis in a systematic fashion. I hope this chapter helps
you to do that!

So, I was starting to get a taste for the rock-idol lifestyle: I had friends, a fortune (well, 
two gold-plated winner’s medals), fast cars (a bike) and dodgy-looking 8-year-olds were 
giving me suitcases full of lemon sherbet to lick off mirrors. However, my parents and 
teachers were about to impress reality upon my young mind ... 


R packages used in this chapter 


boot 

QuantPsyc 

car 


R functions used in this chapter 

anova() 

lm() 

confint() 

lm.beta() 

contrasts()

mean() 

contr.treatment() 

plot() 

cooks.distance()

resid() 

covratio() 

return() 

coef() 

rstandard() 

dfbeta() 

rstudent() 

dffits() 

sqrt() 

durbinWatsonTest() 

sum() 

dwt() 

summary()

hatvalues() 

update() 

hist() 

vif() 

Key terms that I’ve discovered 

Adjusted predicted value 

b_i

Adjusted R²

β_i

Akaike information criterion (AIC) 

Cook’s distance 

Autocorrelation 

Covariance ratio 
Cross-validation

Perfect collinearity 

Deleted residual 

Regression coefficient 

DFBeta 

Regression model 

DFFit 

Residual 

Dummy variables 

Residual sum of squares 

Durbin-Watson test 

Shrinkage 

F-ratio

Simple regression 

Generalization 

Standardized DFBeta 

Goodness of fit 

Standardized DFFit 

Hat values

Standardized residuals 

Heteroscedasticity

Stepwise regression 

Hierarchical regression 

Studentized deleted residuals 

Homoscedasticity 

Studentized residuals 

Independent errors 

Suppressor effects 

Leverage 

t-statistic

Mean squares 

Tolerance 

Model sum of squares 

Total sum of squares 

Multicollinearity 

Unstandardized residuals 

Multiple R²

Variance inflation factor 

Multiple regression 



Smart Alex’s tasks 



• Task 1: Run a simple regression for the pubs.dat data in Jane Superbrain Box 7.1,
predicting mortality from number of pubs. Try repeating the analysis but bootstrapping
the regression parameters.

• Task 2: A fashion student was interested in factors that predicted the salaries of catwalk
models. She collected data from 231 models. For each model she asked them
their salary per day on days when they were working (salary), their age (age), how
many years they had worked as a model (years), and then got a panel of experts
from modelling agencies to rate the attractiveness of each model as a percentage,
with 100% being perfectly attractive (beauty). The data are in the file Supermodel.dat.
Unfortunately, this fashion student bought some substandard statistics text and
so doesn’t know how to analyse her data. Can you help her out by conducting a
multiple regression to see which variables predict a model’s salary? How valid is the
regression model?

• Task 3: Using the Glastonbury data from this chapter, which you should’ve already
analysed, comment on whether you think the model is reliable and generalizable.

• Task 4: A study was carried out to explore the relationship between Aggression and 
several potential predicting factors in 666 children who had an older sibling. Variables 
measured were Parenting_Style (high score = bad parenting practices), Computer_Games
(high score = more time spent playing computer games), Television (high score
= more time spent watching television), Diet (high score = the child has a good diet 
low in additives), and Sibling_Aggression (high score = more aggression seen in their 
older sibling). Past research indicated that parenting style and sibling aggression were 
good predictors of the level of aggression in the younger child. All other variables 







were treated in an exploratory fashion. The data are in the file ChildAggression.dat.
Analyse them with multiple regression.

Answers can be found on the companion website. 



Further reading 


Bowerman, B. L., & O’Connell, R. T. (1990). Linear statistical models: An applied approach (2nd 
ed.). Belmont, CA: Duxbury. (This text is only for the mathematically minded or postgraduate 
students but provides an extremely thorough exposition of regression analysis.) 

Hardy, M. A. (1993). Regression with dummy variables. Sage University Paper Series on Quantitative 
Applications in the Social Sciences, 07-093. Newbury Park, CA: Sage. 

Howell, D. C. (2006). Statistical methods for psychology (6th ed.). Belmont, CA: Duxbury. (Or you
might prefer his Fundamental Statistics for the Behavioral Sciences, also in its 6th edition, 2007.
Both are excellent introductions to the mathematics behind regression analysis.)

Miles, J. N. V., & Shevlin, M. (2001). Applying regression and correlation: A guide for students and
researchers. London: Sage. (This is an extremely readable text that covers regression in loads of 
detail but with minimum pain - highly recommended.) 

Stevens, J. (2002). Applied multivariate statistics for the social sciences (4th ed.). Hillsdale, NJ: 
Erlbaum. Chapter 3. 


Interesting real research 


Chamorro-Premuzic, T., Furnham, A., Christopher, A. N., Garwood, J., & Martin, N. (2008). Birds
of a feather: Students’ preferences for lecturers’ personalities as predicted by their own personality
and learning approaches. Personality and Individual Differences, 44, 965-976.






Logistic regression 



FIGURE 8.1 Practising for my career as a rock star by slaying the baying throng of Grove Primary
School at the age of 10. (Note the girl with her hands covering her ears.)



8.1. What will this chapter tell me?


We saw in the previous chapter that I had successfully conquered the holiday camps of 
Wales with my singing and guitar playing (and the Welsh know a thing or two about good 
singing). I had jumped on a snowboard called oblivion and thrown myself down the black 
run known as world domination. About 10 metres after starting this slippery descent I 
hit the lumpy patch of ice called ‘adults’. I was 9, life was fun, and yet every adult that I 
seemed to encounter was obsessed with my future. ‘What do you want to be when you 
grow up?’ they would ask. I was 9 and ‘grown-up’ was a lifetime away; all I knew was that I 
was going to marry Clair Sparks (more about her in the next chapter) and that I was a rock 
legend who didn’t need to worry about such adult matters as having a job. It was a difficult
question, but adults require answers and I wasn’t going to let them know that I didn’t care 
about ‘grown-up’ matters. We saw in the previous chapter that we can use regression to 
predict future outcomes based on past data, when the outcome is a continuous variable, but 
this question had a categorical outcome (e.g., would I be a fireman, a doctor, an evil dictator?).
Luckily, though, we can use an extension of regression called logistic regression to
deal with these situations. What a result; bring on the rabid wolves of categorical data. To 
make a prediction about a categorical outcome, then, as with regression, I needed to draw 
on past data: I hadn’t tried conducting brain surgery, neither had I experience of sentencing
psychopaths to prison sentences for eating their husbands, nor had I taught anyone. I
had, however, had a go at singing and playing guitar; ‘I’m going to be a rock star’ was my 
prediction. A prediction can be accurate (which would mean that I am a rock star) or it can 
be inaccurate (which would mean that I’m writing a statistics textbook). This chapter looks 
at the theory and application of logistic regression, an extension of regression that allows 
us to predict categorical outcomes based on predictor variables. 


8.2. Background to logistic regression


In a nutshell, logistic regression is multiple regression but with an outcome variable that is a 
categorical variable and predictor variables that are continuous or categorical. In its simplest 
form, this means that we can predict which of two categories a person is likely to belong 
to given certain other information. A trivial example is to look at which variables predict 
whether a person is male or female. We might measure laziness, pig-headedness, alcohol 
consumption and number of burps that a person does in a day. Using logistic regression, we 
might find that all of these variables predict the gender of the person, but the technique will 
also allow us to predict whether a person, not in our original data set, is likely to be male 
or female. So, if we picked a random person and discovered they scored highly on laziness, 
pig-headedness, alcohol consumption and the number of burps, then the regression model 
might tell us that, based on this information, this person is likely to be male. Admittedly, it 
is unlikely that a researcher would ever be interested in the relationship between flatulence 
and gender (it is probably too well established by experience to warrant research), but logistic
regression can have life-saving applications. In medical research logistic regression is used
to generate models from which predictions can be made about the likelihood that a tumour 
is cancerous or benign (for example). A database of patients can be used to establish which 
variables are influential in predicting the malignancy of a tumour. These variables can then 
be measured for a new patient and their values placed in a logistic regression model, from 
which a probability of malignancy could be estimated. If the probability value of the tumour 
being malignant is suitably low then the doctor may decide not to carry out expensive and 
painful surgery that in all likelihood is unnecessary. We might not face such life-threatening 
decisions but logistic regression can nevertheless be a very useful tool. When we are trying 
to predict membership of only two categorical outcomes the analysis is known as binary 
logistic regression, but when we want to predict membership of more than two categories 
we use multinomial (or polychotomous) logistic regression. 


8.3. What are the principles behind logistic regression?


I don’t wish to dwell on the underlying principles of logistic regression because they aren’t 
necessary to understand the test (I am living proof of this fact). However, I do wish to draw 
a few parallels to normal regression so that you can get the gist of what’s going on using
a framework that will be familiar to you already (what do you mean you haven’t read the
regression chapter yet!). To keep things simple I’m going to explain binary logistic regression,
but most of the principles extend easily to when there are more than two outcome
categories. Now would be a good time for those with equation-phobia to look away. In 
simple linear regression, we saw that the outcome variable Y is predicted from the equation 
of a straight line: 

$$Y_i = b_0 + b_1 X_{1i} + \varepsilon_i \tag{8.1}$$


in which b_0 is the Y intercept, b_1 is the gradient of the straight line, X_1 is the value of the
predictor variable and ε is a residual term. Given the values of Y and X_1, the unknown
parameters in the equation can be estimated by finding a solution for which the squared
distance between the observed and predicted values of the dependent variable is minimized
(the method of least squares).

This stuff should all be pretty familiar by now. In multiple regression, in which there are
several predictors, a similar equation is derived in which each predictor has its own coefficient.
As such, Y is predicted from a combination of each predictor variable multiplied by
its respective regression coefficient:


$$Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \dots + b_n X_{ni} + \varepsilon_i \tag{8.2}$$

in which b_n is the regression coefficient of the corresponding variable X_n. In logistic regression,
instead of predicting the value of a variable Y from a predictor variable X_1 or several
predictor variables (Xs), we predict the probability of Y occurring given known values of X_1
(or Xs). The logistic regression equation bears many similarities to the regression equations
just described. In its simplest form, when there is only one predictor variable X_1, the logistic
regression equation from which the probability of Y is predicted is given by:


$$P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_{1i})}} \tag{8.3}$$


in which P(Y) is the probability of Y occurring, e is the base of natural logarithms, and the
other coefficients form a linear combination much the same as in simple regression. In
fact, you might notice that the bracketed portion of the equation is identical to the linear
regression equation in that there is a constant (b_0), a predictor variable (X_1) and a coefficient
(or weight) attached to that predictor (b_1). Just like linear regression, it is possible to
extend this equation so as to include several predictors. When there are several predictors
the equation becomes:


$$P(Y) = \frac{1}{1 + e^{-(b_0 + b_1 X_{1i} + b_2 X_{2i} + \dots + b_n X_{ni})}} \tag{8.4}$$


Equation (8.4) is the same as the equation used when there is only one predictor except 
that the linear combination has been extended to include any number of predictors. So, 
whereas the one-predictor version of the logistic regression equation contained the simple 
linear regression equation within it, the multiple-predictor version contains the multiple 
regression equation. 
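
To make the one-predictor equation concrete, here is a minimal sketch that writes it as an R function. The function name logistic_prob and the coefficient values are made up for illustration; plogis() is base R's logistic function and is used only as a cross-check.

```r
# Equation (8.3) as a function: probability of Y for one predictor.
# b0, b1 and x are whatever values you supply; those below are made up.
logistic_prob <- function(x, b0, b1) {
  1 / (1 + exp(-(b0 + b1 * x)))
}

x_values <- c(-10, -1, 0, 1, 10)
probs <- logistic_prob(x_values, b0 = 0.5, b1 = 1.2)
print(round(probs, 4))

# Base R's plogis() is the same logistic function applied to the
# linear combination b0 + b1 * x.
print(all.equal(probs, plogis(0.5 + 1.2 * x_values)))
```

However extreme the predictor values, the result stays strictly between 0 and 1, which is what lets us read it as a probability.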

Despite the similarities between linear regression and logistic regression, there is a good 
reason why we cannot apply linear regression directly to a situation in which the outcome 
variable is categorical. The reason is that one of the assumptions of linear regression is 
that the relationship between variables is linear. We saw in section 7.7.2.1 how important 
it is that the assumptions of a model are met for it to be accurate. Therefore,
for linear regression to be a valid model, the observed data should contain a
linear relationship. When the outcome variable is categorical, this assumption
is violated (Berry, 1993). One way around this problem is to transform the data
using the logarithmic transformation (see Berry & Feldman, 1985, and Chapter
5). This transformation is a way of expressing a non-linear relationship in a linear
way. The logistic regression equation described above is based on this principle:
it expresses the multiple linear regression equation in logarithmic terms
(called the logit) and thus overcomes the problem of violating the assumption
of linearity.

The exact form of the equation can be arranged in several ways but the version
I have chosen expresses the equation in terms of the probability of Y occurring (i.e.,
the probability that a case belongs in a certain category). The resulting value from the equation,
therefore, varies between 0 and 1. A value close to 0 means that Y is very unlikely to
have occurred, and a value close to 1 means that Y is very likely to have occurred. Also, just
like linear regression, each predictor variable in the logistic regression equation has its own
coefficient. When we run the analysis we need to estimate the value of these coefficients so
that we can solve the equation. These parameters are estimated by fitting models, based on
the available predictors, to the observed data. The chosen model will be the one that, when
values of the predictor variables are placed in it, results in values of Y closest to the observed
values. Specifically, the values of the parameters are estimated using maximum-likelihood
estimation, which selects coefficients that make the observed values most likely to have occurred.
So, as with multiple regression, we try to fit a model to our data that allows us to estimate
values of the outcome variable from known values of the predictor variable or variables.



8.3.1. Assessing the model: the log-likelihood statistic


We’ve seen that the logistic regression model predicts the probability of an event occurring
for a given person (we would denote this as P(Y_i), the probability that Y occurs for the ith
person), based on observations of whether or not the event did occur for that person (we
could denote this as Y_i, the actual outcome for the ith person). So, for a given person, Y
will be either 0 (the outcome didn’t occur) or 1 (the outcome did occur), and the predicted
value, P(Y), will be a value between 0 (there is no chance that the outcome will occur) and
1 (the outcome will certainly occur). We saw in multiple regression that if we want to assess
whether a model fits the data we can compare the observed and predicted values of the
outcome (if you remember, we use R², which is the squared Pearson correlation between observed
values of the outcome and the values predicted by the regression model). Likewise, in logistic
regression, we can use the observed and predicted values to assess the fit of the model.
The measure we use is the log-likelihood:

$$\text{log-likelihood} = \sum_{i=1}^{N} \left[ Y_i \ln(P(Y_i)) + (1 - Y_i) \ln(1 - P(Y_i)) \right] \tag{8.5}$$

The log-likelihood is based on summing the probabilities associated with the predicted
and actual outcomes (Tabachnick & Fidell, 2007). The log-likelihood statistic is analogous
to the residual sum of squares in multiple regression in the sense that it is an indicator of
how much unexplained information there is after the model has been fitted. It, therefore,
follows that large values of the log-likelihood statistic (that is, values far from zero, because
the log-likelihood is never positive) indicate poorly fitting statistical models, because the
larger the value of the log-likelihood, the more unexplained observations there are.
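
The log-likelihood can be computed term by term from the observed and predicted values. The sketch below is an illustration only: it assumes simulated data (x and y are made up) and fits the model with glm() from base R, then compares the hand calculation with R's logLik().

```r
set.seed(42)  # simulated data, purely for illustration

# Hypothetical predictor and binary outcome.
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = plogis(-0.3 + 0.8 * x))

# Fit a binary logistic regression by maximum likelihood.
toy_fit <- glm(y ~ x, family = binomial())

# P(Y_i): the predicted probability for each case.
p_hat <- fitted(toy_fit)

# Equation (8.5), term by term.
log_lik_by_hand <- sum(y * log(p_hat) + (1 - y) * log(1 - p_hat))

print(log_lik_by_hand)
print(as.numeric(logLik(toy_fit)))  # R's own value for the same model
```

Both numbers should match and be negative: each case contributes the log of a probability, which can never exceed zero.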
8.3.2. Assessing the model: the deviance statistic


The deviance is very closely related to the log-likelihood: it’s given by

deviance = −2 × log-likelihood

The deviance is often referred to as −2LL because of the way it is calculated. It’s actually
rather convenient to (almost) always use the deviance rather than the log-likelihood
because it has a chi-square distribution (see Chapter 18 and the Appendix), which makes it
easy to calculate the significance of the value.

Now, it’s possible to calculate a log-likelihood or deviance for different models and to 
compare these models by looking at the difference between their deviances. One use of this 
is to compare the state of a logistic regression model against some kind of baseline state. 
The baseline state that’s usually used is the model when only the constant is included. In 
multiple regression, the baseline model we use is the mean of all scores (this is our best 
guess of the outcome when we have no other information). In logistic regression, if we 
want to predict the outcome, what would our best guess be? Well, we can’t use the mean 
score because our outcome is made of zeros and ones and so the mean is meaningless. 
However, if we know the frequency of zeros and ones, then the best guess will be the category
with the largest number of cases. So, if the outcome occurs 107 times, and doesn’t
occur only 72 times, then our best guess of the outcome will be that it occurs (because it 
occurs more often than it doesn’t). As such, like multiple regression, our baseline model is 
the model that gives us the best prediction when we know nothing other than the values 
of the outcome: in logistic regression this will be to predict the outcome that occurs most 
often - that is, the logistic regression model when only the constant is included. If we then 
add one or more predictors to the model, we can compute the improvement of the model 
as follows: 

$$\chi^2 = (-2LL(\text{baseline})) - (-2LL(\text{new})) = 2LL(\text{new}) - 2LL(\text{baseline}), \qquad df = k_{\text{new}} - k_{\text{baseline}} \tag{8.6}$$


So, we merely take the baseline model deviance (the model when only the constant is included)
and subtract from it the deviance for the new model. This difference is known as a likelihood
ratio,¹ and has a chi-square distribution with degrees of freedom equal to the number of
parameters, k, in the new model minus the number of parameters in the baseline model. The
number of parameters in the baseline model will always be 1 (the constant is the only parameter
to be estimated); any subsequent model will have a number of parameters equal to the number of
predictors plus 1 (i.e., the number of predictors plus one parameter representing the constant).
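
The comparison of a new model with the constant-only baseline can be sketched in a few lines. As before, this is an illustration under assumptions: the data are simulated, the object names are made up, and glm(), logLik() and pchisq() are standard base R functions.

```r
set.seed(42)  # simulated data, purely for illustration

x <- rnorm(100)
y <- rbinom(100, size = 1, prob = plogis(-0.3 + 0.8 * x))

baseline_fit <- glm(y ~ 1, family = binomial())  # constant only
new_fit      <- glm(y ~ x, family = binomial())  # constant + one predictor

# Deviance = -2 x log-likelihood for each model.
dev_baseline <- -2 * as.numeric(logLik(baseline_fit))
dev_new      <- -2 * as.numeric(logLik(new_fit))

# Equation (8.6): improvement chi-square and its degrees of freedom.
chi_sq  <- dev_baseline - dev_new
df_diff <- length(coef(new_fit)) - length(coef(baseline_fit))
p_val   <- pchisq(chi_sq, df = df_diff, lower.tail = FALSE)

print(c(chi_square = chi_sq, df = df_diff, p = p_val))
```

With one predictor added to the constant, the difference in the number of parameters is 2 − 1 = 1, so the chi-square is evaluated on 1 degree of freedom.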


8.3.3. Assessing the model: R and R²


When we talked about linear regression, we saw that the multiple correlation coefficient R
and the corresponding R² were useful measures of how well the model fits the data. We’ve
also just seen that the likelihood ratio is similar in that it is based on the level of
correspondence between predicted and actual values of the outcome. However,
you can calculate a more literal version of the multiple correlation in logistic
regression known as the R-statistic. This R-statistic is the partial correlation
between the outcome variable and each of the predictor variables, and it can
vary between −1 and 1. A positive value indicates that as the predictor variable
increases, so does the likelihood of the event occurring. A negative value
implies that as the predictor variable increases, the likelihood of the outcome
occurring decreases. If a variable has a small value of R then it contributes only
a small amount to the model.

1 You might wonder why it is called a ‘ratio’ when a ‘ratio’ usually means something is divided by something else,
and we’re not dividing anything here: we’re subtracting. The reason is that if you subtract logs of numbers, it’s
the same as dividing the numbers. For example, 10/5 = 2 and (try it on your calculator) log(10) − log(5) = log(2).

The equation for R is: 



$$R = \sqrt{\frac{z^2 - 2df}{-2LL(\text{baseline})}} \qquad (8.7)$$


The −2LL term is the deviance for the baseline model, z² is the Wald statistic calculated
as described in section 8.3.5, and the degrees of freedom can be read from the summary 
table for the variables in the equation. However, because this value of R is dependent upon 
the Wald statistic it is by no means an accurate measure (we’ll see in section 8.3.5 that the 
Wald statistic can be inaccurate under certain circumstances). For this reason the value of 
R should be treated with some caution, and it is invalid to square this value and interpret 
it as you would in linear regression. 

There is some controversy over what would make a good analogue to the R² in linear
regression, but one measure described by Hosmer and Lemeshow (1989) can be easily
calculated. Hosmer and Lemeshow’s ($R_L^2$) measure is calculated as:

$$R_L^2 = \frac{-2LL(\text{model})}{-2LL(\text{baseline})} \qquad (8.8)$$


As such, $R_L^2$ is calculated by dividing the model chi-square (the numerator of equation (8.8),
which represents the change from the baseline, based on the log-likelihood) by the baseline
−2LL (the deviance of the model before any predictors were entered). Given what the model
chi-square represents, another way to express this is:

$$R_L^2 = \frac{\bigl(-2LL(\text{baseline})\bigr) - \bigl(-2LL(\text{new})\bigr)}{-2LL(\text{baseline})}$$

$R_L^2$ is the proportional reduction in the absolute value of the log-likelihood measure and
as such it is a measure of how much the badness of fit improves as a result of the inclusion 
of the predictor variables. It can vary between 0 (indicating that the predictors are useless 
at predicting the outcome variable) and 1 (indicating that the model predicts the outcome 
variable perfectly). 

Cox and Snell’s $R_{CS}^2$ (1989) is based on the deviance of the model (−2LL(new)) and the
deviance of the baseline model (−2LL(baseline)), and the sample size, n:

$$R_{CS}^2 = 1 - \exp\left(\frac{\bigl(-2LL(\text{new})\bigr) - \bigl(-2LL(\text{baseline})\bigr)}{n}\right) \qquad (8.9)$$
However, this statistic never reaches its theoretical maximum of 1. Therefore, Nagelkerke
(1991) suggested the following amendment (Nagelkerke’s $R_N^2$):

$$R_N^2 = \frac{R_{CS}^2}{1 - \exp\left(-\frac{-2LL(\text{baseline})}{n}\right)} \qquad (8.10)$$


Although all of these measures differ in their computation (and the answers you get),
conceptually they are somewhat similar. So, in terms of interpretation they can be seen as
similar to the R² in linear regression in that they provide a gauge of the substantive
significance of the model.
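
To see how the three measures relate, here is a rough sketch that computes them from a fitted model's two deviances. The data are simulated for illustration, and pseudoR2s() is a hypothetical helper name of my own, not a function from any package:

```r
# Simulated data for illustration only
set.seed(42)
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))
model <- glm(y ~ x, family = binomial())

# pseudoR2s() is a hypothetical helper, written here just for this sketch
pseudoR2s <- function(fit) {
  devNew  <- fit$deviance              # -2LL(new)
  devBase <- fit$null.deviance         # -2LL(baseline)
  nCases  <- length(fit$fitted.values) # sample size, n

  hl <- (devBase - devNew) / devBase            # Hosmer and Lemeshow
  cs <- 1 - exp((devNew - devBase) / nCases)    # Cox and Snell
  nk <- cs / (1 - exp(-(devBase / nCases)))     # Nagelkerke

  c(HosmerLemeshow = hl, CoxSnell = cs, Nagelkerke = nk)
}

r2 <- pseudoR2s(model)
round(r2, 3)
```

Because Nagelkerke's measure divides Cox and Snell's by a number smaller than 1, it is always at least as large as Cox and Snell's.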


8.3.4. Assessing the model: information criteria


As we saw with linear regression, in section 7.6.3, we can use the Akaike information
criterion (AIC) and the Bayes information criterion (BIC) to judge model fit. These two
criteria exist to solve a problem with R²: that every time we add a variable to the model, R²
increases. We want a measure of fit that we can use to compare two models which penalizes
a model that contains more predictor variables. You can think of this as the price you pay
for something: you get a better value of R², but you pay a higher price, and was that higher
price worth it? These information criteria help you to decide.

The AIC is the simpler of the two; it is given by:

AIC = −2LL + 2k

in which −2LL is the deviance (described above) and k is the number of predictors in the
model. The BIC has the same form as the AIC but replaces the penalty of 2 per predictor
with the log of the number of cases:

BIC = −2LL + k × log(n)

in which n is the number of cases in the model.
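
As a hedged sketch (simulated data, object names of my own choosing), both criteria can be computed by hand and checked against R's built-in AIC() and BIC() functions. One thing to be aware of: those functions count every estimated parameter, so the constant is included in k, and BIC() applies a penalty of log(n) per parameter:

```r
# Simulated data for illustration only
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)   # a predictor unrelated to the outcome
y  <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x1))

m1 <- glm(y ~ x1, family = binomial())
m2 <- glm(y ~ x1 + x2, family = binomial())

# By hand for m1: the deviance plus a penalty per estimated parameter
k <- length(coef(m1))                      # parameters, including the constant
aicByHand <- m1$deviance + 2 * k
bicByHand <- m1$deviance + k * log(nobs(m1))

c(byHand = aicByHand, builtIn = AIC(m1))
c(byHand = bicByHand, builtIn = BIC(m1))

# Smaller values indicate the better trade-off between fit and complexity
AIC(m1, m2)
BIC(m1, m2)
```

The deviance equals −2LL here because the outcome is a set of individual zeros and ones.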


8.3.5. Assessing the contribution of predictors: the z-statistic


As in linear regression, we want to know not only how well the model overall fits the 
data, but also the individual contribution of predictors. In linear regression, we used the 
estimated regression coefficients ( b ) and their standard errors to compute a t-statistic. In 
logistic regression there is an analogous statistic - the z-statistic - which follows the normal 
distribution. Like the t-test in linear regression, the z-statistic tells us whether the b
coefficient for that predictor is significantly different from zero. If the coefficient is
significantly different from zero then we can assume that the predictor is making a significant
contribution to the prediction of the outcome (Y):

$$z = \frac{b}{SE_b} \qquad (8.11)$$
FIGURE 8.2 Abraham Wald writing ‘I must not devise test statistics prone to having inflated standard errors’ on the blackboard 100 times


Equation (8.11) shows how the z-statistic is calculated and you can see it’s basically
identical to the t-statistic in linear regression (see equation (7.6)): it is the value of the
regression coefficient divided by its associated standard error. The z-statistic is usually used to
ascertain whether a variable is a significant predictor of the outcome; however, it is
probably more accurate to examine the likelihood ratio statistics. The reason why the z-statistic
should be used a little cautiously is because, when the regression coefficient (b) is large, the
standard error tends to become inflated, resulting in the z-statistic being underestimated
(see Menard, 1995). The inflation of the standard error increases the probability of
rejecting a predictor as being significant when in reality it is making a significant contribution to
the model (i.e., you are more likely to make a Type II error). The z-statistic was developed
by Abraham Wald (Figure 8.2), and is thus sometimes known as the Wald statistic.
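
To make equation (8.11) concrete, here is a small sketch (simulated data; the object names are assumptions of mine) that divides each coefficient by its standard error and compares the result with the z value that summary() reports for a binomial glm:

```r
# Simulated data for illustration only
set.seed(42)
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))
model <- glm(y ~ x, family = binomial())

# The coefficient table holds b, SE, z and p for each parameter
coefTable <- summary(model)$coefficients

# z = b / SE, as in equation (8.11)
zByHand <- coefTable[, "Estimate"] / coefTable[, "Std. Error"]

# Two-tailed p-value from the normal distribution
pByHand <- 2 * pnorm(-abs(zByHand))

cbind(zByHand, zFromR = coefTable[, "z value"], pByHand)
```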


8.3.6. The odds ratio


More crucial to the interpretation of logistic regression is the value of the odds ratio, which
is the exponential of b (i.e., e^b or exp(b)) and is an indicator of the change in odds
resulting from a unit change in the predictor. As such, it is similar to the b coefficient in logistic
regression but easier to understand (because it doesn’t require a logarithmic
transformation). When the predictor variable is categorical the odds ratio is easier to explain, so
imagine we had a simple example in which we were trying to predict whether or not someone
got pregnant from whether or not they used a condom last time they made love. The odds
of an event occurring are defined as the probability of an event occurring divided by the
probability of that event not occurring (see equation (8.12)) and should not be confused
with the more colloquial usage of the word to refer to probability. So, for example, the
odds of becoming pregnant are the probability of becoming pregnant divided by the
probability of not becoming pregnant:
$$\text{odds} = \frac{P(\text{event})}{P(\text{no event})}$$

$$P(\text{event } Y) = \frac{1}{1 + e^{-(b_0 + b_1 X)}}$$

$$P(\text{no event } Y) = 1 - P(\text{event } Y) \qquad (8.12)$$


To calculate the change in odds that results from a unit change in the predictor, we must 
first calculate the odds of becoming pregnant given that a condom wasn’t used. We then 
calculate the odds of becoming pregnant given that a condom was used. Finally, we calcu¬ 
late the proportionate change in these two odds. 

To calculate the first set of odds, we need to use equation (8.3) to calculate the
probability of becoming pregnant given that a condom wasn’t used. If we had more than one
predictor we would use equation (8.4). There are three unknown quantities in this equation:
the coefficient of the constant (b₀), the coefficient for the predictor (b₁) and the value of the
predictor itself (X). We’ll know the value of X from how we coded the condom use variable
(chances are we would’ve used 0 = condom wasn’t used and 1 = condom was used). The
values of b₁ and b₀ will be estimated for us. We can calculate the odds as in equation (8.12).

Next, we calculate the same thing after the predictor variable has changed by one unit.
In this case, because the predictor variable is dichotomous, we need to calculate the odds of
getting pregnant, given that a condom was used. So, the value of X is now 1 (rather than 0).

We now know the odds before and after a unit change in the predictor variable. It is a 
simple matter to calculate the proportionate change in odds by dividing the odds after a 
unit change in the predictor by the odds before that change: 


$$\Delta\text{odds} = \frac{\text{odds after a unit change in the predictor}}{\text{original odds}} \qquad (8.13)$$


This proportionate change in odds is the odds ratio, and we can interpret it in terms of the 
change in odds: if the value is greater than 1 then it indicates that as the predictor increases, 
the odds of the outcome occurring increase. Conversely, a value less than 1 indicates that 
as the predictor increases, the odds of the outcome occurring decrease. We’ll see how this 
works with a real example shortly. 
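
The steps above can be sketched in R. Everything here is made up for illustration: the data are simulated, and the names used, outcome, pEvent() and odds() are my own assumptions. The point is that working through the odds before and after a unit change gives the same number as simply taking the exponential of b₁:

```r
# Simulated data: a 0/1 predictor and a 0/1 outcome (illustration only)
set.seed(1)
used    <- rbinom(300, size = 1, prob = 0.5)
outcome <- rbinom(300, size = 1, prob = plogis(0.4 - 1.1 * used))

model <- glm(outcome ~ used, family = binomial())
b0 <- coef(model)[["(Intercept)"]]
b1 <- coef(model)[["used"]]

# Probability of the event for a given value of the predictor
pEvent <- function(X) 1 / (1 + exp(-(b0 + b1 * X)))

# Odds: probability of the event divided by probability of no event
odds <- function(X) pEvent(X) / (1 - pEvent(X))

# Proportionate change in odds for a unit change in the predictor (0 to 1)
oddsRatio <- odds(1) / odds(0)

c(stepByStep = oddsRatio, expB = exp(b1))
```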


8.3.7. Methods of logistic regression


As with multiple regression (section 7.6.4), there are several different methods that can be 
used in logistic regression. 


8.3.7.1. The forced entry method

The default method of conducting the regression is simply to place predictors into the 
regression model in one block, and estimate parameters for each predictor. 


8.3.7.2. Stepwise methods

If you are undeterred by the criticisms of stepwise methods in the previous chapter, then 
you can select either a forward or a backward stepwise method, or a combination of them. 
When the forward method is employed the computer begins with a model that includes
only a constant and then adds single predictors to the model based on the criterion that
adding the variable must improve the AIC or BIC (whichever you chose). The computer
proceeds until adding any of the remaining predictors would no longer decrease the criterion.

The opposite of the forward method is the backward method. This method uses the 
same criteria, but instead of starting the model with only a constant, it begins the model 
with all predictors included. The computer then tests whether any of these predictors can 
be removed from the model without increasing the information criterion. If it can, it is 
removed from the model, and the variables are all tested again. 

Better than these two simple methods are the backward/forward and the forward/back¬ 
ward methods. These are hybrids of the two methods - the forward/backward approach 
starts off doing a forward method, but each time a variable is added, it tests whether it’s 
worth removing any variables. 
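
One way to try these methods is with the step() function from R's stats package, which adds or removes predictors according to the AIC by default. This is only a sketch on simulated data, with variable names that are my own assumptions:

```r
# Simulated data: x1 and x2 matter, x3 is noise (illustration only)
set.seed(7)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- rbinom(n, size = 1, prob = plogis(0.3 + 1.0 * d$x1 - 0.8 * d$x2))

constantOnly <- glm(y ~ 1, data = d, family = binomial())
fullModel    <- glm(y ~ x1 + x2 + x3, data = d, family = binomial())
searchScope  <- list(lower = ~ 1, upper = ~ x1 + x2 + x3)

# Forward: start with the constant and add predictors that lower the AIC
forwardModel <- step(constantOnly, scope = searchScope,
                     direction = "forward", trace = 0)

# Backward: start with everything and remove predictors that raise the AIC
backwardModel <- step(fullModel, direction = "backward", trace = 0)

# Forward/backward hybrid: after each addition, check whether anything can go
bothModel <- step(constantOnly, scope = searchScope,
                  direction = "both", trace = 0)

formula(forwardModel); formula(backwardModel); formula(bothModel)
```

Setting k = log(n) in the call to step() switches the penalty from the AIC's to the BIC's.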


8.3.7.3. How do I select a method?


As with ordinary regression (previous chapter), the method of regression chosen 
will depend on several things. The main consideration is whether you are testing 
a theory or merely carrying out exploratory work. As noted earlier, some people 
believe that stepwise methods have no value for theory testing. However, stepwise 
methods are defensible when used in situations where causality is not of interest and 
you merely wish to find a model to fit your data (Agresti & Finlay, 1986; Menard, 
1995). Also, as I mentioned for ordinary regression, if you do decide to use a step¬ 
wise method then the backward method is preferable to the forward method. This is 
because of suppressor effects, which occur when a predictor has a significant effect 
but only when another variable is held constant. Forward selection is more likely 
than backward elimination to exclude predictors involved in suppressor effects. As 
such, the forward method runs a higher risk of making a Type II error. 



8.4. Assumptions and things that can go wrong


8.4.1. Assumptions


Logistic regression shares some of the assumptions of normal regression: 

1 Linearity: In ordinary regression we assumed that the outcome had linear relation¬ 
ships with the predictors. In logistic regression the outcome is categorical and so 
this assumption is violated. As I explained before, this is why we use the log (or 
logit) of the data. The linearity assumption in logistic regression, therefore, is that 
there is a linear relationship between any continuous predictors and the logit of the 
outcome variable. This assumption can be tested by looking at whether the interac¬ 
tion term between the predictor and its log transformation is significant (Hosmer & 
Lemeshow, 1989). We will go through an example in section 8.8.1. 

2 Independence of errors: This assumption is the same as for ordinary regression 
(see section 7.7.2.1). Basically it means that cases of data should not be related; for 
example, you cannot measure the same people at different points in time (well, you 
can actually, but then you have to use a multilevel model - see Chapter 19). 
3 Multicollinearity: Although not really an assumption as such, multicollinearity is a
problem as it was for ordinary regression (see section 7.7.2.1). In essence, predictors 
should not be too highly correlated. As with ordinary regression, this assumption can 
be checked with tolerance and VIF statistics, the eigenvalues of the scaled, uncentred 
cross-products matrix, the condition indices and the variance proportions. We go 
through an example in section 8.8.1. 
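
As a rough preview of the linearity check described in point 1, here is a sketch on simulated data (the variable names are assumptions, and the predictor must be positive for its log to exist). We create the interaction between the predictor and its log, add it to the model, and look at whether it is significant:

```r
# Simulated data in which the logit really is linear in duration
set.seed(3)
n <- 250
duration <- runif(n, min = 1, max = 10)
cured    <- rbinom(n, size = 1, prob = plogis(-1 + 0.3 * duration))

# Interaction between the predictor and its own log transformation
logDurationInt <- duration * log(duration)

linearityTest <- glm(cured ~ duration + logDurationInt, family = binomial())

# A significant interaction term would suggest the linearity assumption is violated
summary(linearityTest)$coefficients["logDurationInt", ]
```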

Logistic regression also has some unique problems of its own (not assumptions, but 
things that can go wrong). R solves logistic regression problems by an iterative procedure 
(R’s Souls’ Tip 8.1). Sometimes, instead of pouncing on the correct solution quickly, you’ll 
notice nothing happening: R begins to move infinitely slowly, or appears to have got fed 
up with you asking it to do stuff and gone on strike. If it can’t find a correct solution, then 
sometimes it actually does give up, quietly offering you (without any apology) a result that 
is completely incorrect. Usually this is revealed by implausibly large standard errors. Two
circumstances can provoke this, both of which are related to the ratio of cases to
variables: incomplete information and complete separation.



R’s Souls’ Tip 8.1: Error messages about ‘failure to converge’


Many statistical procedures use an iterative process, which means that R attempts to estimate the parameters 
of the model by finding successive approximations of those parameters. Essentially, it starts by estimating the 
parameters with a ‘best guess’. It then attempts to approximate them more accurately (known as an iteration). It 
then tries again, and then again, and so on through many iterations. It stops either when the approximations of 
parameters converge (i.e., at each new attempt the ‘approximations’ of parameters are the same or very similar 
to the previous attempt), or it reaches the maximum number of attempts (iterations). 

Sometimes you will get an error message in the output that says something like 


Warning messages: 

1: glm.fit: algorithm did not converge 


What this means is that R has attempted to estimate the parameters the maximum number of times (as specified in 
the options) but they are not converging (i.e., at each iteration R is getting quite different estimates). This certainly 
means that you should ignore any output that R has produced, and it might mean that your data are beyond help. 


8.4.2. Incomplete information from the predictors


Imagine you’re trying to predict lung cancer from smoking and whether or not you eat 
tomatoes (which are believed to reduce the risk of cancer). You collect data from people 
who do and don’t smoke, and from people who do and don’t eat tomatoes; however, this 
isn’t sufficient unless you collect data from all combinations of smoking and tomato eating. 
Imagine you ended up with the following data: 


Do you smoke?    Do you eat tomatoes?    Do you have cancer?
Yes              No                      Yes
Yes              Yes                     Yes
No               No                      Yes
No               Yes                     ??????
Observing only the first three possibilities does not prepare you for the outcome of the
fourth. You have no way of knowing whether this last person will have cancer or not based 
on the other data you’ve collected. Therefore, R will have problems unless you’ve collected 
data from all combinations of your variables. This should be checked before you run the 
analysis using a crosstabulation table, and I describe how to do this in Chapter 18. While 
you’re checking these tables, you should also look at the expected frequencies in each cell 
of the table to make sure that they are greater than 1 and no more than 20% are less than 
5 (see section 18.5). This is because the goodness-of-fit tests in logistic regression make this 
assumption. 

This point applies not only to categorical variables, but also to continuous ones. Suppose 
that you wanted to investigate factors related to human happiness. These might include 
age, gender, sexual orientation, religious beliefs, levels of anxiety and even whether a per¬ 
son is right-handed. You interview 1000 people, record their characteristics, and whether 
they are happy (‘yes’ or ‘no’). Although a sample of 1000 seems quite large, is it likely to 
include an 80-year-old, highly anxious, Buddhist, left-handed lesbian? If you found one 
such person and she was happy, should you conclude that everyone else in the same cat¬ 
egory is happy? It would, obviously, be better to have several more people in this category 
to confirm that this combination of characteristics predicts happiness. One solution is to 
collect more data. 

As a general point, whenever samples are broken down into categories and one or more 
combinations are empty it creates problems. These will probably be signalled by coef¬ 
ficients that have unreasonably large standard errors. Conscientious researchers produce 
and check multiway crosstabulations of all categorical independent variables. Lazy but 
cautious ones don’t bother with crosstabulations, but look carefully at the standard errors. 
Those who don’t bother with either should expect trouble. 
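
For the conscientious, a quick sketch of such a check using the smoking and tomato example from the table above. The object names are my own, and table() is base R's crosstabulation function. Any cell containing zero is a combination we have no information about:

```r
# The three combinations we observed in the smoking and tomato example
smokes   <- factor(c("Yes", "Yes", "No"), levels = c("No", "Yes"))
tomatoes <- factor(c("No", "Yes", "No"), levels = c("No", "Yes"))

# Crosstabulate the predictors: every cell should contain some cases
cellCounts <- table(smokes, tomatoes)
cellCounts

# Row and column positions of any empty cells
emptyCells <- which(cellCounts == 0, arr.ind = TRUE)
emptyCells
```

Here the single empty cell is the non-smoker who eats tomatoes, the fourth row of the table above.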


8.4.3. Complete separation


A second situation in which logistic regression collapses might surprise you: it’s when the 
outcome variable can be perfectly predicted by one variable or a combination of variables! 
This is known as complete separation. 


FIGURE 8.3 An example of the relationship between weight (x-axis) and a dichotomous outcome variable (y-axis, 1 = Burglar, 0 = Teenager) - note that the weights in the two groups overlap

FIGURE 8.4 An example of complete separation - note that the weights (x-axis) of the two categories in the dichotomous outcome variable (y-axis, 1 = Burglar, 0 = Cat) do not overlap


Let’s look at an example: imagine you placed a pressure pad under your doormat and 
connected it to your security system so that you could detect burglars when they creep in 
at night. However, because your teenage children (which you would have if you’re old 
enough and rich enough to have security systems and pressure pads) and their friends 
are often coming home in the middle of the night, when they tread on the pad you want 
it to work out the probability that the person is a burglar and not one of your teenagers. 
Therefore, you could measure the weight of some burglars and some teenagers and use 
logistic regression to predict the outcome (teenager or burglar) from the weight. The graph 
(Figure 8.3) would show a line of triangles at zero (the data points for all of the teenagers 
you weighed) and a line of triangles at 1 (the data points for burglars you weighed). Note 
that these lines of triangles overlap (some teenagers are as heavy as burglars). We’ve seen 
that in logistic regression, R tries to predict the probability of the outcome given a value 
of the predictor. In this case, at low weights the fitted probability follows the bottom line 
of the plot, and at high weights it follows the top line. At intermediate values it tries to 
follow the probability as it changes. 

Imagine that we had the same pressure pad, but our teenage children had left home to 
go to university. We’re now interested in distinguishing burglars from our pet cat based 
on weight. Again, we can weigh some cats and weigh some burglars. This time the graph 
(Figure 8.4) still has a row of triangles at zero (the cats we weighed) and a row at 1 (the 
burglars) but this time the rows of triangles do not overlap: there is no burglar who weighs 
the same as a cat - obviously there were no cat burglars in the sample (groan now at that 
sorry excuse for a joke). This is known as perfect separation: the outcome (cats and bur¬ 
glars) can be perfectly predicted from weight (anything less than 15 kg is a cat, anything 
more than 40 kg is a burglar). If we try to calculate the probabilities of the outcome given 
a certain weight then we run into trouble. When the weight is low, the probability is 0, 
and when the weight is high, the probability is 1, but what happens in between? We have 
no data between 15 and 40 kg on which to base these probabilities. The figure shows two 
possible probability curves that we could fit to these data: one much steeper than the other. 
Either one of these curves is valid, based on the data we have available. The lack of data 
means that R will be uncertain about how steep it should make the intermediate slope and 
it will try to bring the centre as close to vertical as possible, but its estimates veer unsteadily 
towards infinity (hence large standard errors). 
This problem often arises when too many variables are fitted to too few cases. Often the
only satisfactory solution is to collect more data, but sometimes a neat answer is found by 
using a simpler model. 
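
You can watch this happen with a handful of made-up weights. This is a sketch only: the numbers are invented, and the object names are my own. The code collects whatever warnings glm() produces and then shows the coefficient table, where the standard error for weight dwarfs the estimate itself:

```r
# Invented weights (kg): five cats and five burglars, with no overlap
weight  <- c(3, 4, 4.5, 5, 6, 60, 70, 75, 80, 90)
burglar <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)

# Fit the model, collecting any warnings rather than letting them scroll past
warningsSeen <- character(0)
separatedModel <- withCallingHandlers(
  glm(burglar ~ weight, family = binomial()),
  warning = function(w) {
    warningsSeen <<- c(warningsSeen, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

warningsSeen

# Look for standard errors that are implausibly large relative to the estimates
summary(separatedModel)$coefficients
```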



CRAMMING SAM’S TIPS 


Issues in logistic regression 


• In logistic regression, like ordinary regression, we assume linearity, no multicollinearity and independence of errors. 

• The linearity assumption is that each continuous predictor has a linear relationship with the logit of the outcome variable.

• If we created a table that combined all possible values of all variables then we should ideally have some data in every cell of 
this table. If we don’t then we should watch out for big standard errors. 

• If the outcome variable can be predicted perfectly from one predictor variable (or a combination of predictor variables) then 
we have complete separation. This problem creates large standard errors too. 


8.5. Packages used in this chapter


There are several packages we will use in this chapter. Some, but not all, can be accessed 
through R Commander. You will need the packages car (to recode variables and test multi¬ 
collinearity) and mlogit (for multinomial logistic regression). If you don’t have these pack¬ 
ages installed you’ll need to install them and load them. 

install.packages("car"); install.packages("mlogit") 

Then you need to load the packages by executing these commands: 

library(car); library(mlogit)


8.6. Binary logistic regression: an example that will make you feel eel


It’s amazing what you find in academic journals sometimes. It’s a bit of a hobby of mine trying 
to unearth bizarre academic papers (really, if you find any, email them to me). I believe that sci¬ 
ence should be fun, and so I like finding research that makes me laugh. A research paper by Lo 
and colleagues is the one that (so far) has made me laugh the most (Lo, Wong, Leung, Law, &
Yip, 2004). Lo et al. report the case of a 50-year-old man who presented himself at the Accident 
and Emergency Department (ED for the Americans) with abdominal pain. A physical examina¬ 
tion revealed peritonitis so they took an X-ray of the man’s abdomen. Although it somehow 
slipped the patient’s mind to mention this to the receptionist upon arrival at the hospital, the 
X-ray revealed the shadow of an eel. The authors don’t directly quote the man’s response to 
this news, but I like to imagine it was something to the effect of ‘Oh, that! Erm, yes, well I didn’t 
think it was terribly relevant to my abdominal pain so I didn’t mention it, but I did insert an eel 
into my anus this morning. Do you think that’s the problem?’ Whatever he did say, the authors 
report that he admitted to inserting an eel into his anus to ‘relieve constipation’. 

I can have a lively imagination at times, and when I read this article I couldn’t help think¬ 
ing about the poor eel. There it was, minding its own business swimming about in a river 
(or fish tank possibly), thinking to itself ‘Well, today seems like a nice day, there are no 
eel-eating sharks about, the sun is out, the water is nice, what could possibly go
wrong?’ The next thing it knows, it’s being shoved up the anus of a man from 
Hong Kong. ‘Well, I didn’t see that coming’, thinks the eel. Putting myself in 
the mindset of an eel for a moment, he has found himself in a tight dark tunnel, 
there’s no light, there’s a distinct lack of water compared to his usual habitat, and 
he probably fears for his life. His day has gone very wrong. How can he escape 
this horrible fate? Well, doing what any self-respecting eel would do, he notices 
that his prison cell is fairly soft and decides ‘bugger this,² I’ll eat my way out of
here’. Unfortunately he didn’t make it, but he went out with a fight (there’s a 
fairly unpleasant photograph in the article of the eel biting the splenic flexure). 
The authors conclude that ‘Insertion of a live animal into the rectum causing rectal perfora¬ 
tion has never been reported. This may be related to a bizarre healthcare belief, inadvertent 
sexual behavior, or criminal assault. However, the true reason may never be known.’ Quite.

OK, so this is a really grim tale.³ It’s not really very funny for the man or the eel, but it did
make me laugh. Of course my instant reaction was that sticking an eel up your anus to ‘relieve 
constipation’ is the poorest excuse for bizarre sexual behaviour I have ever heard. But upon 
reflection I wondered if I was being harsh on the man - maybe an eel up the anus really can 
cure constipation. If we wanted to test this, we could collect some data. Our outcome might be 
‘constipated’ vs. ‘not constipated’, which is a dichotomous variable that we’re trying to predict. 
One predictor variable would be intervention (eel up the anus) vs. waiting list (no treatment). 
We might also want to factor how many days the patient had been constipated before treat¬ 
ment. This scenario is perfect for logistic regression (but not for eels). The data are in Eel.dat. 

I’m quite aware that many statistics lecturers do not share my unbridled joy at discussing 
eel-created rectal perforations with students, so I have named the variables in the file more 
generally: 

• outcome (dependent variable): Cured (cured or not cured); 

• predictor (independent variable): Intervention (intervention or no treatment); 

• predictor (independent variable): Duration (the number of days before treatment 
that the patient had the problem). 


In doing so, your tutor can adapt the example to something more palatable if they wish to, 
but you will secretly know that the example is all about putting eels up your bum. 


8.6.1. Preparing the data


To carry out logistic regression, the data must be entered as for normal regression: they 
are arranged in whatever data editor you use in three columns (one representing each vari¬ 
able). First load the data file by setting your working directory to the location of the file 
(see section 3.4.4) and executing: 

eelData<-read.delim("eel.dat", header = TRUE, stringsAsFactors = TRUE)


² Literally.

³ As it happens, it isn’t an isolated grim tale. Through this article I found myself hurtling down a road of morbid
curiosity that was best left untravelled. Although the eel was my favourite example, I could have chosen from
a very large stone (Sachdev, 1967), a test tube (Hughes, Marice, & Gathright, 1976), a baseball (McDonald &
Rosenthal, 1977), an aerosol deodorant can, hose pipe, iron bar, broomstick, penknife, marijuana, bank notes,
blue plastic tumbler, vibrator and primus stove (Clarke, Buccimazza, Anderson, & Thomson, 2005), or (a close
second place to the eel) a toy pirate ship, with or without pirates I’m not sure (Bemelman & Hammacher, 2005).
So, although I encourage you to send me bizarre research, if it involves objects in the rectum then probably don’t,
unless someone has managed to put Buckingham Palace up there.
This creates a dataframe called eelData. We can look at the data using the head() function,
which shows the first six rows of the dataframe entered into the function:

head(eelData)


      Cured Intervention Duration
1 Not Cured No Treatment        7
2 Not Cured No Treatment        7
3 Not Cured No Treatment        6
4     Cured No Treatment        8
5     Cured Intervention        7
6     Cured No Treatment        6


Note that we have three variables in different columns. The categorical data have been
entered as text. For example, the variable Cured is made up of the phrases Cured and Not
Cured. When we read in data that are text strings like this, versions of R before 4.0.0
helpfully convert them to factors without telling us; from R 4.0.0 onwards, read.delim() does
this only when it is called with stringsAsFactors = TRUE.

When we do logistic regression, we want to do it with numbers, not words. In creating 
the factors R also helpfully assigned some numbers to the variables. There is no end to how 
helpful R will try to be. The trouble is that the numbers that R has assigned might not be 
the numbers that we want. In fact, R creates levels of the factor by taking the text strings 
in alphabetical order and assigning them ascending numerical values. In other words, for 
Cured we have two categories and R will have ordered these categories alphabetically 
(i.e., ‘Cured’ and ‘Not Cured’). So, Cured will be the baseline category because it is first. 
Likewise, for Intervention the categories were Intervention and No Treatment , so given the 
alphabetic order Intervention will be the baseline category. 

However, it makes more sense to code both of these variables the opposite way around. 
For Cured it would be good if Not Cured was the baseline, or first category, because then we 
would know that the model coefficients reflect the probability of being cured (which is what 
we want to know) rather than the probability of not being cured. Similarly, for Intervention 
it would be useful if No Treatment were the first category (i.e., the baseline). Fortunately, the 
function relevel() lets us specify the baseline category for a factor. It takes the general form: 

newFactor<-relevel(oldFactor, "baseline category") 

In other words, we can create a factor by specifying an existing factor, and simply writing
the name of the baseline category in quotes. For Cured and Intervention, it makes most
sense not to create new factors, but just to overwrite the existing ones. Therefore, we
specify these variables as both the new and old factors; this will simply respecify the baseline
category of the existing variables. Execute these commands:

eelData$Cured<-relevel(eelData$Cured, "Not Cured") 

eelData$Intervention<-relevel(eelData$Intervention, "No Treatment") 

The variable Cured now has Not Cured as the first level (i.e., the baseline category), and 
Intervention now has No Treatment as the baseline category. Having set our baseline 
categories, we can get on with the analysis. 
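
To see what relevel() does without needing the data file, here is a minimal sketch on a toy factor (the object name cured is hypothetical). Only the order of the levels changes; the data themselves are untouched.

```r
# Toy factor standing in for eelData$Cured
cured <- factor(c("Not Cured", "Cured", "Cured", "Not Cured"))
levels(cured)                 # alphabetical order, so "Cured" is the baseline

# Overwrite the factor, naming the category we want as the baseline
cured <- relevel(cured, "Not Cured")
levels(cured)                 # "Not Cured" is now the first level

# The values are unchanged; only the level order is different
as.character(cured)
```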


8.6.2. The main logistic regression analysis 

8.6.2.1. Basic logistic regression analysis using R Commander 

First, import the data, using the Data=>Import data=>from text file, clipboard, or URL... 
menu to set the import options and choose the file eel.dat (see section 3.7.3). As discussed 
in the previous section, R will import the variables Cured and Intervention as factors 
(because they contain text) but we might not like the baseline category that it sets by default. 
Therefore, the first thing that we need to do is to set the baseline category to one that we 
want. We can do this by selecting Data=>Manage variables in active data set=>Reorder factor 
levels... as shown in Figure 8.5. In the first dialog box there is a list of factors (labelled 
Factor (pick one)); select the factor that you want to reorder (I have selected Cured). By 
default the function will simply overwrite the existing factor, which is why the Name for 
factor box contains <same as original>; however, if you want to create a new variable then 
replace the text in this box with a new name. Having selected a factor and named it, click 
on OK. The next dialog box displays the categories contained within the selected factor 
and their order. Note that we have two categories - Cured and Not Cured - and the 1 and 
2 reflect their order (Cured is first, and Not Cured second). We want to reverse this order, 
so we need to change the numbers so that Cured is 2 and Not Cured is 1 (which will make 
it the baseline category). Once you have edited the numbers to reflect the order you want, 
click on OK to make the change. You can repeat the process for the Intervention variable. 

FIGURE 8.5 Reordering a factor in R Commander 

FIGURE 8.6 Dialog box for generalized linear models in R Commander 

We will carry out a hierarchical regression: in model 1, we’ll include only Intervention 
as a predictor, and then in model 2 we’ll add Duration. Let’s create the first model. To run 
binary logistic regression, choose Statistics=>Fit models=>Generalized linear model... to 
access the dialog box in Figure 8.6. In the box labelled Enter name for model: we enter a 
name for the model we are going to estimate; I’ve called it eelModel.1. Next, we need to 
create the formula that describes the model. The formula consists of an outcome variable, 
which should be a dichotomous factor variable for logistic regression. In our case, the 
variable is Cured. (Notice that R Commander has labelled the variable as a factor in the list of 
variables.) First double-click on Cured, which will move it to the box under the label Model 
Formula: (which is where the cursor would be). Having specified the outcome variable, the 
cursor will hop into the box to the right, which is where we want to put any predictors. 
There are several buttons above this box that make it easy for us to add useful things like 
‘+’ (to add predictors), or ‘*’ (for interaction terms) or even brackets. We can build the 
predictors up by double-clicking on them in the list of variables and adding appropriate 
symbols. In model 1 we want only Intervention as a predictor, so double-click on this 
variable in the list (the completed box should look like the left-hand box in Figure 8.6). 

When we use a generalized linear model we need to specify a family and link function. The 
family relates to the type of distribution that we assume; for example, if we choose Gaussian, 
that means we are assuming a normal distribution. We would choose this for linear 
regression. For logistic regression we choose binomial. We also need to choose a link function - for 
logistic regression, we choose the logit. R Commander helpfully selects these by default. 

We generate the second model in much the same way. In the box labelled Enter name for 
model: enter a name for the model; I’ve called it eelModel.2. Next, double-click on Cured, 
to move it to the left-hand box under the label Model Formula: (which is where the cursor 
would be). Then to specify the predictors, double-click on Intervention to move it to the 
right-hand box under the label Model Formula:, then type ‘+’ or click on the + button, then 
double-click on Duration in the list to move it to the formula box. The finished dialog box should 
look like the right-hand dialog box in Figure 8.6. 


8.6.3. Basic logistic regression analysis using R 


To do logistic regression, we use the glm() function. The glm() function is very similar to 
the lm() function that we saw in Chapter 7. While lm stands for ‘linear model’, glm stands 
for ‘generalized linear model’ - that is, the basic linear model that has been generalized to 
other sorts of situations. The general form of this function is: 

newModel<-glm(outcome ~ predictor(s), data = dataFrame, family = name of a 
distribution, na.action = an action) 

in which: 

• newModel is an object created that contains information about the model. We can get 
summary statistics for this model by executing summary(newModel). 

• outcome is the variable that you’re trying to predict, also known as the dependent 
variable. In this example it will be the variable Cured. 

• predictor(s) lists the variable or variables from which you’re trying to predict the 
outcome variable. In this example it will be the variables Intervention and Duration. 

• dataFrame is the name of the dataframe from which your outcome and predictor 
variables come. 

• family is the name of a distribution (e.g., gaussian, binomial, poisson, Gamma). 

• na.action is an optional command. If you have complete data (as we have here) you 
can ignore it, but if you have missing values (i.e., NAs in the dataframe) then it can 
be useful (see R’s Souls’ Tip 7.1). 
The format, as you can see, is extremely similar to lm() in that we specify a formula that 
describes our model, we specify the dataframe that contains the variables in the formula, 
and we can use na.action to determine how to deal with missing cases. The only difference 
is a new option called family. This enables us to tell R the detail of the kind of regression 
that we want to do, by specifying the distribution on which our model is based. If we were 
doing an ordinary regression (as in Chapter 7) we could set this option to Gaussian (which 
is another name for a normal distribution) because ordinary regression is based on a 
normal distribution. Logistic regression is based on a binomial distribution, so we need to set 
this option to family = binomial(). 4 
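
If you want to check which link function a family will use, you can ask the family object directly. This is just a quick sketch using base R; nothing here depends on the eel data.

```r
# Each family function returns an object that records its distribution and link
binomial()$family                # name of the distribution
binomial()$link                  # default link for the binomial family
gaussian()$link                  # default link for the gaussian family

# A different link can be requested explicitly, e.g. the probit
binomial(link = "probit")$link
```

The binomial family defaults to the logit link and the gaussian family to the identity link, which is why family = binomial() alone is enough for logistic regression.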

We will carry out a hierarchical regression: in model 1, we’ll include only Intervention 
as a predictor, and then in model 2 we’ll add Duration. To create the first model we can 
execute: 

eelModel.1 <- glm(Cured ~ Intervention, data = eelData, family = binomial()) 

This command creates a model called eelModel.1 in which Cured is predicted from only 
Intervention (Cured ~ Intervention) based on a logit function. Similarly, we can create the 
second model by executing: 

eelModel.2 <- glm(Cured ~ Intervention + Duration, data = eelData, family = 
binomial()) 

This command creates a model called eelModel.2 in which Cured is predicted from both 
Intervention and Duration (Cured ~ Intervention + Duration). 
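
If you don’t have eel.dat to hand, you can still try the first command. The coefficients reported in Output 8.1 imply a particular 2 × 2 table of counts (24 cured and 32 not cured without treatment; 41 cured and 16 not cured with the intervention), and the sketch below rebuilds a stand-in dataframe from those counts. The counts are an assumption inferred from the output rather than the original file, and Duration cannot be recovered this way, so only model 1 is fitted.

```r
# Stand-in for eelData: counts inferred from Output 8.1 (an assumption,
# not the original eel.dat file)
eelData <- data.frame(
  Intervention = factor(rep(c("No Treatment", "Intervention"), times = c(56, 57))),
  Cured = factor(c(rep(c("Not Cured", "Cured"), times = c(32, 24)),   # No Treatment
                   rep(c("Not Cured", "Cured"), times = c(16, 41))))  # Intervention
)

# Set the baseline categories as described earlier
eelData$Cured <- relevel(eelData$Cured, "Not Cured")
eelData$Intervention <- relevel(eelData$Intervention, "No Treatment")

# Model 1: Cured predicted from Intervention with the default logit link
eelModel.1 <- glm(Cured ~ Intervention, data = eelData, family = binomial())
summary(eelModel.1)
```

Because the second level of Cured is treated as the event, releveling so that Not Cured comes first means the model predicts the probability of being cured.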


8.6.4. Interpreting a basic logistic regression 


To see the models that we have just generated we need to execute the summary() function 
(remembering to put the model name into the function): 

summary(eelModel.1) 
summary(eelModel.2) 

The results are shown in Outputs 8.1 and 8.3 and are discussed in the next two sections. 


8.6.5. Model 1: Intervention only 


Output 8.1 shows the model summary for model 1, which used Intervention to predict 
Cured. First, we should look at the summary statistics about the model. The overall fit 
of the model is assessed using the deviance statistic (to recap: this is -2 times the 
log-likelihood). Remember that larger values of the deviance statistic indicate poorer-fitting 
statistical models. R provides two deviance statistics: the null deviance and the residual 
deviance. The null deviance is the deviance of the model that contains no predictors other 
than the constant - in other words, -2LL(baseline). 5 The residual deviance is the deviance 
for the model - in other words, -2LL(new). 

4 R has a number of useful defaults. If you don’t specify a family, R assumes that you want to use a Gaussian family 
of distributions, which is the same as using lm(). In addition, you can specify a link function for the binomial 
family. The logit and probit are two commonly used link functions, which are specified as binomial(link = "logit") 
and binomial(link = "probit"). If you don’t specify a link function, R chooses the logit link function for you, 
which is what is needed for logistic regression so we don’t need to explicitly use a link function in our model. 

Call: 
glm(formula = Cured ~ Intervention, family = binomial(), data = eelData) 

Deviance Residuals: 
    Min       1Q   Median       3Q      Max 
-1.5940  -1.0579   0.8118   0.8118   1.3018 

Coefficients: 
                         Estimate Std. Error z value Pr(>|z|) 
(Intercept)               -0.2877     0.2700  -1.065  0.28671 
InterventionIntervention   1.2287     0.3998   3.074  0.00212 ** 

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

(Dispersion parameter for binomial family taken to be 1) 

    Null deviance: 154.08  on 112  degrees of freedom 
Residual deviance: 144.16  on 111  degrees of freedom 
AIC: 148.16 

Number of Fisher Scoring iterations: 4 

Output 8.1 

At this stage of the analysis the value of the deviance for the model should be less than 
the value for the null model (when only the constant was included in the model) because 
lower values of -2LL indicate that the model is predicting the outcome variable more 
accurately. For the null model, -2LL = 154.08, but when Intervention has been included 
this value has been reduced to 144.16. This reduction tells us that the model is better at 
predicting whether someone was cured than it was before Intervention was added. 

The question of how much better the model predicts the outcome variable can be 
assessed using the model chi-square statistic, which measures the difference between the 
model as it currently stands and the model when only the constant was included. We saw 
in section 8.3.1 that we could assess the significance of the change in a model by taking the 
log-likelihood of the new model and subtracting the log-likelihood of the baseline model 
from it. The value of the model chi-square statistic works on this principle and is, 
therefore, equal to the value of -2LL when only the constant was in the model minus the value 
of -2LL with Intervention included (154.08 - 144.16 = 9.92). This value has a chi-square 
distribution and so its statistical significance can be calculated easily. In this example, the value 
is significant at a .05 level and so we can say that overall the model is predicting whether 
a patient is cured or not significantly better than it was with only the constant included. 
The model chi-square is an analogue of the F-test for the linear regression (see Chapter 7). 
In an ideal world we would like to see a non-significant overall -2LL (indicating that the 
amount of unexplained data is minimal) and a highly significant model chi-square statistic 
(indicating that the model including the predictors is significantly better than without those 
predictors). However, in reality it is possible for both statistics to be highly significant. 

5 You can try this by running a model with only an intercept. Use: 

eelModel.0 <- glm(Cured ~ 1, data = eelData, family = binomial()) 
summary(eelModel.0) 

We can use R to automatically calculate the model chi-square and its significance. We 
can do this by treating the output model as data. The object eelModel.1 has a number of 
variables associated with it. Two of these are the deviance and null.deviance. By subtracting 
the deviance from the null deviance we can find the improvement, which gives us a 
chi-square statistic. We can reference these variables the same as any other: by appending the 
variable name to the model using a dollar sign. To calculate this value execute: 

modelChi <- eelModel.1$null.deviance - eelModel.1$deviance 

This creates a value called modelChi that is the deviance for the model eelModel.1 
subtracted from the null deviance from the same model. We can see the value by executing: 

modelChi 
[1] 9.926201 

As you can see, this value corresponds to the one we calculated by hand (allowing for 
differences in rounding). Similarly, the degrees of freedom for the model are stored in the 
variable df.residual and for the null model are stored as df.null. These are the values of 111 
and 112 in Output 8.1. We can compute the degrees of freedom associated with the 
chi-square statistic that we just computed by subtracting the degrees of freedom exactly as we 
did for the deviance values. Execute: 

chidf <- eelModel.1$df.null - eelModel.1$df.residual 

This creates a value called chidf that is the degrees of freedom for the model eelModel.1 
subtracted from the degrees of freedom for the null model. We can see the value by executing: 

chidf 
[1] 1 

As you can see, the change in degrees of freedom is 1, which reflects the fact that we have 
only one variable in the model. 

To calculate the probability associated with this chi-square statistic we can use the pchisq() 
function. This function needs two things: the value of chi-square (which we have just computed 
as modelChi) and the degrees of freedom (which we have just computed as chidf). The 
probability we want is 1 minus the value of the pchisq() function, which we can obtain by executing: 

chisq.prob <- 1 - pchisq(modelChi, chidf) 

This command creates an object called chisq.prob, which is 1 minus the result of the pchisq() 
function (note that we have placed the variables containing the chi-square statistic and 
its degrees of freedom directly into this function). To see the value we execute: 

chisq.prob 

[1] 0.001629425 

In other words, the p-value is .002 (rounded to three decimal places); because this 
probability is less than .05, we can reject the null hypothesis that the model is not better than 
chance at predicting the outcome. This value is the likelihood ratio p-value of the model 
because we only had one predictor in the model. We can report that including Intervention 
produced a significant improvement in the fit of the model, χ²(1) = 9.93, p = .002. 
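
The same probability can be obtained without the ‘1 minus’ step by asking pchisq() for the upper tail directly. The sketch below uses the rounded deviances and degrees of freedom printed in Output 8.1 rather than a fitted model, so the result differs slightly from the value above in the later decimal places.

```r
# Values read from Output 8.1 (rounded as printed there)
nullDeviance <- 154.08
residualDeviance <- 144.16
dfNull <- 112
dfResidual <- 111

modelChi <- nullDeviance - residualDeviance   # improvement in fit
chidf <- dfNull - dfResidual                  # one predictor added

# lower.tail = FALSE returns P(X > modelChi), i.e. 1 - pchisq(modelChi, chidf)
chisq.prob <- pchisq(modelChi, chidf, lower.tail = FALSE)
chisq.prob
```

Using lower.tail = FALSE is also a little more accurate when the p-value is extremely small, because it avoids subtracting a number very close to 1 from 1.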

Next, we consider the coefficients. This part is crucial because it tells us the estimates for the 
coefficients for the predictors included in the model. This section of the output gives us the 
coefficients and statistics for the variables that have been included in the model at this point 
(namely Intervention and the constant). The b-value is the same as the b-value in linear 
regression: they are the values that we need to substitute into equation (8.4) to establish the probability 
that a case falls into a certain category. We saw in linear regression that the value of b 
represents the change in the outcome resulting from a unit change in the predictor variable. The 
interpretation of this coefficient in logistic regression is very similar in that it represents the 
change in the logit of the outcome variable associated with a one-unit change in the predictor 
variable. The logit of the outcome is simply the natural logarithm of the odds of Y occurring. 

The crucial statistic is the z-statistic which has a normal distribution and tells us whether 
the b coefficient for that predictor is significantly different from zero. 6 If the coefficient is 
significantly different from zero then we can assume that the predictor is making a 
significant contribution to the prediction of the outcome (Y). We came across the z-statistic in 
section 8.3.4 and saw that it should be used cautiously because when the regression 
coefficient (b) is large, the standard error tends to become inflated, resulting in the z-statistic 
being underestimated (see Menard, 1995). However, for these data it seems to indicate that 
having the intervention (or not) is a significant predictor of whether the patient is cured 
(note that the significance of the z-statistic is less than .05). We could say that Intervention 
was a significant predictor of being cured, b = 1.23, z = 3.07, p = .002. 
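
It can help to see where the z-statistic and its p-value come from. The sketch below recomputes them from the estimate and standard error printed for Intervention in Output 8.1; because those printed values are rounded, the results agree with the output only to about three decimal places.

```r
# Estimate and standard error for Intervention, as printed in Output 8.1
b <- 1.2287
se <- 0.3998

z <- b / se                                  # z-statistic: estimate over its standard error
p <- 2 * pnorm(abs(z), lower.tail = FALSE)   # two-tailed p-value from the normal distribution
wald <- z^2                                  # Wald statistic: the square of z

c(z = z, p = p, wald = wald)
```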

In section 8.3.3 we saw that we could calculate an analogue of R using equation (8.7). 
For these data, the z-statistic and its df can be read from the R output (3.074 and 1, 
respectively), and the null model deviance was 154.08. Therefore, R can be calculated as: 

\[
R = \sqrt{\frac{3.074^2 - 2 \times 1}{154.08}} = 0.22 \tag{8.14}
\]

In the same section we saw that Hosmer and Lemeshow’s measure (R²_L) is calculated by 
dividing the model chi-square by the original -2LL. In this example the model chi-square 
after Intervention has been entered into the model is 9.93 (calculated as modelChi above), 
and the original -2LL (before any variables were entered) was 154.08 (the null.deviance). 
So, R²_L = 9.93/154.08 = .06, which is different from the value we would get by squaring 
the value of R given above (R² = .22² = 0.05). 
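
Both hand calculations are easy to verify. This is a small sketch using only the numbers quoted above (the object names are arbitrary):

```r
# Numbers quoted in the text
z <- 3.074          # z-statistic for Intervention
df <- 1             # its degrees of freedom
nullDev <- 154.08   # -2LL for the baseline model
modelChi <- 9.93    # model chi-square

R <- sqrt((z^2 - 2 * df) / nullDev)   # analogue of R, as in equation (8.14)
R2.L <- modelChi / nullDev            # Hosmer and Lemeshow's measure

round(c(R = R, R.squared = R^2, R2.L = R2.L), 2)
```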

We can get R to do this calculation for us by executing: 

R2.hl<-modelChi/eelModel.1$null.deviance 
R2.hl 

[1] 0.06442071 

The first command simply takes the value of the model chi-square (which we have already 
calculated as modelChi) and divides it by the -2LL for the original model 
(eelModel.1$null.deviance). This is a direct analogue of the equation given earlier in the chapter. The second 
command displays the value, which is .064. 

We also saw two other measures of R 2 that were described in section 8.3.3, Cox and 
Snell’s and Nagelkerke’s. There are functions available in R to calculate these, but they’re a 
bit of a pain to find and use. It’s easy enough, however, to write commands in R to calculate 
them. We can write the equation for the Cox and Snell statistic as: 

R.cs <- 1 - exp((eelModel.1$deviance - eelModel.1$null.deviance)/113) 

R.cs 

[1] 0.08409487 

The first command uses the -2LL for the model (eelModel.1$deviance) and the null model 
(eelModel.1$null.deviance) and divides the difference by the sample size (in this case 113, 
but you will need to change this value for other data sets). The second command will 
display the result: a value of .084. We can use this value to calculate Nagelkerke’s estimate 
also. Again, we just write out the equation in R-speak: 

R.n <- R.cs/(1-(exp(-(eelModel.1$null.deviance/113)))) 

R.n 

[1] 0.112992 

The first command uses the value we just calculated, R.cs, and adjusts it using the -2LL for 
the null model (eelModel.1$null.deviance) and the sample size (which, again, you’ll need to 
change for other data sets, but is 113 in the current one). The second command will display 
the result: a value of .113. 

Alternatively you could write a function to calculate all three R² values from a model, which 
has the advantage that you can reuse the function for other models (R’s Souls’ Tip 8.2). As you 
can see, all of the values of R² differ, but they can be used as effect size measures for the model. 

6 You might also come across a Wald statistic - this is the square of the z-statistic and is distributed as chi-square. 



R’s Souls’ Tip 8.2: Writing a function to compute R² 


We saw in (R’s Souls’ Tip 6.2) that it’s possible to write functions to do things for us in R. If we were running 
several logistic regression models it would get fairly tedious to keep typing out the commands to calculate the 
various values of R². Therefore, we could wrap them up in a function. We’ll call this function logisticPseudoR2s so 
that we know what it does, and we’ll enter a logistic regression model (we’ve named this LogModel) into it as the 
input. To create the function execute: 


logisticPseudoR2s <- function(LogModel) { 
    dev <- LogModel$deviance 
    nullDev <- LogModel$null.deviance 
    modelN <- length(LogModel$fitted.values) 
    R.l <- 1 - dev / nullDev 
    R.cs <- 1 - exp(-(nullDev - dev) / modelN) 
    R.n <- R.cs / (1 - (exp(-(nullDev / modelN)))) 
    cat("Pseudo R^2 for logistic regression\n") 
    cat("Hosmer and Lemeshow R^2 ", round(R.l, 3), "\n") 
    cat("Cox and Snell R^2 ", round(R.cs, 3), "\n") 
    cat("Nagelkerke R^2 ", round(R.n, 3), "\n") 
} 


Taking each line in turn: 

• dev<-LogModel$deviance extracts the model deviance (-2LL(new)) of the model entered into the function 
and calls it dev. 

• nullDev<-LogModel$null.deviance extracts the baseline deviance (-2LL(baseline)) of the model entered into 
the function and calls it nullDev. 

• modelN<-length(LogModel$fitted.values) uses the length() function on the fitted values to compute the sample 
size, which it calls modelN. 

• R.l <- 1 - dev/nullDev computes Hosmer and Lemeshow’s measure (R²_L) using the values extracted from the 
model and calls it R.l. 

• R.cs <- 1 - exp(-(nullDev - dev)/modelN) computes Cox and Snell’s measure (R²_CS) using the values extracted 
from the model and calls it R.cs. 

• R.n <- R.cs / (1 - (exp(-(nullDev / modelN)))) computes Nagelkerke’s measure (R²_N) using the values extracted 
from the model and calls it R.n. 

• cat(): The last four lines use the cat() function to print the text in quotes, plus the various versions of R² rounded 
to three decimal places. 

To use the function on our model, we simply place the name of the logistic regression model (in this case 
eelModel.1) into the function and execute: 

logisticPseudoR2s(eelModel.1) 

The output will be: 

Pseudo R^2 for logistic regression 
Hosmer and Lemeshow R^2 0.064 
Cox and Snell R^2 0.084 
Nagelkerke R^2 0.113 
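
A function that prints its results is handy at the console, but the values can’t then be used in later calculations. As a variation, the sketch below returns the three measures as a named vector instead. The function name pseudoR2 is made up for this illustration, it uses nobs() to get the sample size, and the dataframe is a stand-in rebuilt from counts inferred from Output 8.1 (an assumption, not the original eel.dat file).

```r
# Stand-in for eelData: counts inferred from Output 8.1 (an assumption)
eelData <- data.frame(
  Intervention = factor(rep(c("No Treatment", "Intervention"), times = c(56, 57))),
  Cured = factor(c(rep(c("Not Cured", "Cured"), times = c(32, 24)),
                   rep(c("Not Cured", "Cured"), times = c(16, 41))))
)
eelData$Cured <- relevel(eelData$Cured, "Not Cured")
eelData$Intervention <- relevel(eelData$Intervention, "No Treatment")
eelModel.1 <- glm(Cured ~ Intervention, data = eelData, family = binomial())

# Hypothetical variant of logisticPseudoR2s that returns values rather than printing
pseudoR2 <- function(LogModel) {
  dev <- LogModel$deviance
  nullDev <- LogModel$null.deviance
  modelN <- nobs(LogModel)                     # number of cases used in the fit
  R.l <- 1 - dev / nullDev                     # Hosmer and Lemeshow
  R.cs <- 1 - exp(-(nullDev - dev) / modelN)   # Cox and Snell
  R.n <- R.cs / (1 - exp(-(nullDev / modelN))) # Nagelkerke
  c(HosmerLemeshow = R.l, CoxSnell = R.cs, Nagelkerke = R.n)
}

r2 <- pseudoR2(eelModel.1)
round(r2, 3)
```

Because the result is an ordinary numeric vector, you can pick out a single measure (for example r2["Nagelkerke"]) or collect the values from several models into a table.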
The final thing we need to think about is the odds ratios, which were described in section 
8.3.6. To calculate the change in odds that results from a unit change in the predictor for 
this example, we must first calculate the odds of a patient being cured given that they didn’t 
have the intervention. We then calculate the odds of a patient being cured given that they 
did have the intervention. Finally, we calculate the proportionate change in these two odds. 

To calculate the first set of odds, we need to use equation (8.12) to calculate the 
probability of a patient being cured given that they didn’t have the intervention. The parameter 
coding at the beginning of the output told us that patients who did not have the 
intervention were coded with a 0, so we can use this value in place of X. The value of b1 has been 
estimated for us as 1.229 (see Coefficients: in Output 8.1), and the coefficient for the 
constant can be taken from the same table and is -0.288. We can calculate the odds as: 


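\[
\begin{aligned}
P(\text{Cured}) &= \frac{1}{1 + e^{-(b_0 + b_1 X_1)}} = \frac{1}{1 + e^{-[-0.288 + (1.229 \times 0)]}} = .428 \\
P(\text{Not Cured}) &= 1 - P(\text{Cured}) = 1 - .428 = .572 \\
\text{odds} &= \frac{.428}{.572} = 0.748
\end{aligned}
\tag{8.15}
\]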


Now we calculate the same thing after the predictor variable has changed by one unit. In 
this case, because the predictor variable is dichotomous, we need to calculate the odds 
of a patient being cured, given that they have had the intervention. So, the value of the 
intervention variable, X, is now 1 (rather than 0). The resulting calculations are as follows: 


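\[
\begin{aligned}
P(\text{Cured}) &= \frac{1}{1 + e^{-(b_0 + b_1 X_1)}} = \frac{1}{1 + e^{-[-0.288 + (1.229 \times 1)]}} = .719 \\
P(\text{Not Cured}) &= 1 - P(\text{Cured}) = 1 - .719 = .281 \\
\text{odds} &= \frac{.719}{.281} = 2.559
\end{aligned}
\tag{8.16}
\]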


We now know the odds before and after a unit change in the predictor variable. It is now 
a simple matter to calculate the proportionate change in odds by dividing the odds after a 
unit change in the predictor by the odds before that change: 

\[
\Delta\text{odds} = \frac{\text{odds after a unit change in the predictor}}{\text{original odds}} = \frac{2.56}{0.75} = 3.41 \tag{8.17}
\]
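
These hand calculations can be reproduced in a few lines. The sketch below uses the rounded coefficients quoted above and the plogis() function, which computes 1/(1 + e^(-x)); the object names are arbitrary.

```r
# Coefficients quoted in the text (rounded from Output 8.1)
b0 <- -0.288   # constant
b1 <- 1.229    # Intervention

# Probability of being cured without (X = 0) and with (X = 1) the intervention
pNoTreatment <- plogis(b0 + b1 * 0)
pIntervention <- plogis(b0 + b1 * 1)

# Odds = P(Cured) / P(Not Cured)
oddsNoTreatment <- pNoTreatment / (1 - pNoTreatment)
oddsIntervention <- pIntervention / (1 - pIntervention)

# Proportionate change in odds, and the shortcut via exp(b1)
oddsIntervention / oddsNoTreatment
exp(b1)
```

The last two lines give the same number, which is why the odds ratio can be read straight from the exponential of the b coefficient. The hand calculation gives 3.41 rather than 3.42 only because the odds were rounded to two decimal places before dividing.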

We can also calculate the odds ratio as the exponential of the b coefficient for the predictor 
variables. These coefficients are stored in a variable called coefficients, which is part of the 
model we created. Therefore, we can access this variable as: 

eelModel.1$coefficients 

This just means ‘the variable called coefficients within the model called eelModel.1’. It’s a 
simple matter to apply the exp() function to this variable to find out the odds ratio: 

exp(eelModel.1$coefficients) 

Executing this command will display the odds ratio for the predictors in the model: 

(Intercept) InterventionIntervention 
   0.750000                 3.416667 
The odds ratio for Intervention (3.417) is the same as we calculated above (allowing for 
differences in rounding). We can interpret the odds ratio in terms of the change in odds. If 
the value is greater than 1 then it indicates that as the predictor increases, the odds of the 
outcome occurring increase. Conversely, a value less than 1 indicates that as the predictor 
increases, the odds of the outcome occurring decrease. In this example, we can say that the 
odds of a patient who is treated being cured are 3.42 times higher than those of a patient 
who is not treated. 

We can also calculate confidence intervals for the odds ratios. To obtain confidence 
intervals of the parameters, we use the confint() function - just as we did for ordinary 
regression. We can also exponentiate these with the exp() function. To get the confidence 
intervals execute: 

exp(confint(eelModel.1)) 

This function computes the confidence intervals for the coefficients in the model 
(confint(eelModel.1)) and then uses exp() to exponentiate them. The resulting output is 
in Output 8.2. The way to interpret this confidence interval is the same as any other 
confidence interval (section 2.5.2): if we calculated confidence intervals for the value of the 
odds ratio in 100 different samples, then these intervals would encompass the actual value 
of the odds ratio in the population (rather than the sample) in 95 of those samples. In this 
case, we can be fairly confident that the population value of the odds ratio lies between 
1.58 and 7.63. 7 However, our sample could be one of the 5% that produces a confidence 
interval that ‘misses’ the population value. 

                             2.5 %   97.5 % 
(Intercept)              0.4374531 1.268674 
InterventionIntervention 1.5820127 7.625545 

Output 8.2 

The important thing about this confidence interval is that it doesn’t cross 1 (the values at 
each end of the interval are greater than 1). This is important because values greater than 
1 mean that as the predictor variable increases, so do the odds of (in this case) being cured. 
Values less than 1 mean the opposite: as the predictor variable increases, the odds of being 
cured decrease. The fact that both the lower and upper limits of our confidence interval are 
above 1 gives us confidence that the direction of the relationship that we have observed is 
true in the population (i.e., it’s likely that having an intervention compared to not increases 
the odds of being cured). If the lower limit had been below 1 then it would tell us that there 
is a chance that in the population the direction of the relationship is the opposite to what 
we have observed. This would mean that we could not trust that our intervention increases 
the odds of being cured. 
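
If you ever need the Wald-based intervals that some other packages report, the confint.default() function computes them from the estimates and their standard errors. The sketch below puts both kinds of interval side by side for model 1. It uses a stand-in dataframe rebuilt from counts inferred from Output 8.1 (an assumption, not the original eel.dat file), so treat the printed values as illustrative.

```r
# Stand-in for eelData: counts inferred from Output 8.1 (an assumption)
eelData <- data.frame(
  Intervention = factor(rep(c("No Treatment", "Intervention"), times = c(56, 57))),
  Cured = factor(c(rep(c("Not Cured", "Cured"), times = c(32, 24)),
                   rep(c("Not Cured", "Cured"), times = c(16, 41))))
)
eelData$Cured <- relevel(eelData$Cured, "Not Cured")
eelData$Intervention <- relevel(eelData$Intervention, "No Treatment")
eelModel.1 <- glm(Cured ~ Intervention, data = eelData, family = binomial())

# Likelihood-based intervals for the odds ratios (what confint() gives for a glm)
ci.likelihood <- exp(confint(eelModel.1))

# Wald intervals: estimate plus or minus 1.96 standard errors, then exponentiated
ci.wald <- exp(confint.default(eelModel.1))

ci.likelihood
ci.wald
```

The two sets of limits are similar but not identical; in both cases the interval for Intervention stays above 1, so the conclusion about the direction of the effect is the same.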


8.6.6. Model 2: Intervention and Duration as predictors 


Now let’s return to model 2 (eelModel.2), which we ran a long time ago. Recall that in 
model 2 we added the variable Duration to our model. Output 8.3 shows the output for 
the summary of this model. You can see that the b estimate for Duration is -0.008, a pretty 
small number. In addition, the probability value associated with that variable is not 
significant: the value of 0.964 is larger than .05. 


7 If you ever run analysis with R and another package and compare the results, you might find different confidence 
intervals. That’s because some packages use Wald test based confidence intervals, whereas R uses likelihood ratio 
based confidence intervals and thus avoids the problems of the Wald test that we identified earlier. 
Comparing model 1 with model 2, we can see that the deviance for the two models is the 
same (144.16), suggesting that model 2 is not an improvement over model 1. In addition, 
we can see that the AIC is higher in model 2 (150.16) than model 1 (148.16), indicating 
that model 1 is the better model. 


Call: 
glm(formula = Cured ~ Intervention + Duration, family = binomial(), 
    data = eelData) 

Deviance Residuals: 
    Min       1Q   Median       3Q      Max 
-1.6025  -1.0572   0.8107   0.8161   1.3095 

Coefficients: 
                          Estimate Std. Error z value Pr(>|z|) 
(Intercept)              -0.234660   1.220563  -0.192  0.84754 
InterventionIntervention  1.233532   0.414565   2.975  0.00293 ** 
Duration                 -0.007835   0.175913  -0.045  0.96447 

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

(Dispersion parameter for binomial family taken to be 1) 

    Null deviance: 154.08  on 112  degrees of freedom 
Residual deviance: 144.16  on 110  degrees of freedom 
AIC: 150.16 

Number of Fisher Scoring iterations: 4 

Output 8.3 

We can again compare the models by finding the difference in the deviance statistics. 
This difference is chi-square distributed. We can find this difference in two ways. First, we 
can subtract one deviance from the other as we did before. An easier method, though, is to 
use the anova() function (see section 7.8.4). The anova() function has the advantage that 
it also calculates the degrees of freedom for us; but the disadvantage is that it doesn’t 
calculate the significance. If we do the calculations manually we can use the same commands 
as before, except that rather than using the null.deviance and df.null variables, we use the 
deviance and df.residual variables for the two models we’re comparing: in each case we 
subtract model 2 from model 1: 

modelChi <- eelModel.1$deviance - eelModel.2$deviance 
chidf <- eelModel.1$df.residual - eelModel.2$df.residual 
chisq.prob <- 1 - pchisq(modelChi, chidf) 
modelChi; chidf; chisq.prob 

[1] 0.001983528 
[1] 1 

[1] 0.9644765 

You should find that the difference between the models (modelChi) is 0.00198, with one 
degree of freedom (chidf), and a p-value (chisq.prob) of 0.964. As this value is greater than 
.05, we can conclude that model 2 (with Intervention and Duration as predictors) is not a 
significant improvement over model 1 (which included only Intervention as a predictor). 

With the anova() function, remember that we simply list the models in the order in which 
we want to compare them. Therefore, to compare our two models we would execute: 

anova(eelModel.1, eelModel.2) 

This produces Output 8.4, which confirms (with rather less effort) that the difference 
between the models is 0.00198. 

Analysis of Deviance Table 

Model 1: Cured ~ Intervention 
Model 2: Cured ~ Intervention + Duration 
  Resid. Df Resid. Dev Df  Deviance 
1       111     144.16 
2       110     144.16  1 0.0019835 

Output 8.4 
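
If you would like anova() to report the significance as well, you can ask for a chi-square test by adding test = "Chisq". The sketch below shows the idea and checks it against the manual calculation. The dataframe is a stand-in rebuilt from counts inferred from Output 8.1, and because Duration cannot be recovered from the output, the Duration values here are simulated placeholders; the numbers will therefore not match Output 8.4.

```r
# Stand-in for eelData: counts inferred from Output 8.1 (an assumption)
eelData <- data.frame(
  Intervention = factor(rep(c("No Treatment", "Intervention"), times = c(56, 57))),
  Cured = factor(c(rep(c("Not Cured", "Cured"), times = c(32, 24)),
                   rep(c("Not Cured", "Cured"), times = c(16, 41))))
)
eelData$Cured <- relevel(eelData$Cured, "Not Cured")
eelData$Intervention <- relevel(eelData$Intervention, "No Treatment")

# Simulated placeholder for Duration (NOT the real values)
set.seed(42)
eelData$Duration <- sample(1:10, nrow(eelData), replace = TRUE)

eelModel.1 <- glm(Cured ~ Intervention, data = eelData, family = binomial())
eelModel.2 <- glm(Cured ~ Intervention + Duration, data = eelData, family = binomial())

# test = "Chisq" adds a Pr(>Chi) column to the analysis of deviance table
comparison <- anova(eelModel.1, eelModel.2, test = "Chisq")
comparison

# The same p-value computed by hand from the two deviances
manual.prob <- pchisq(eelModel.1$deviance - eelModel.2$deviance,
                      eelModel.1$df.residual - eelModel.2$df.residual,
                      lower.tail = FALSE)
manual.prob
```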



• The overall fit of the final model is shown by the deviance statistic and its associated chi-square statistic. If the significance 
of the chi-square statistic is less than .05, then the model is a significant fit to the data. 

• Check the table labelled Coefficients: to see which variables significantly predict the outcome. 

• For each variable in the model, look at the z-statistic and its significance (which again should be below .05). 

• More important, though, use the odds ratio for interpretation. You can obtain this using exp(model$coefficients), where model 
is the name of your model. If the value is greater than 1 then as the predictor increases, the odds of the outcome occurring
increase. Conversely, a value less than 1 indicates that as the predictor increases, the odds of the outcome occurring
decrease. For the aforementioned interpretation to be reliable the confidence interval of the odds ratio should not cross 1. 
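As a rough sketch of the last point, the code below obtains odds ratios and their confidence limits from a fitted model. The data are simulated and the names simData and simModel are made up for the illustration. I have used confint.default(), which gives Wald-type limits (estimate plus or minus 1.96 standard errors); limits from other methods can differ slightly.

```r
# Simulated data with one categorical predictor (invented values, not the eel data)
set.seed(2024)
n <- 113
simData <- data.frame(
  Intervention = factor(rep(c("No Treatment", "Intervention"), length.out = n),
                        levels = c("No Treatment", "Intervention"))
)
simData$Cured <- rbinom(n, 1, ifelse(simData$Intervention == "Intervention", 0.72, 0.43))
simModel <- glm(Cured ~ Intervention, data = simData, family = binomial())

# Odds ratios are the exponentiated coefficients
oddsRatios <- exp(coef(simModel))
# Wald 95% confidence limits, moved to the odds-ratio scale
oddsCI <- exp(confint.default(simModel))
cbind(oddsRatio = oddsRatios, oddsCI)
```

If the interval for a predictor does not include 1, the direction of its effect can be interpreted with more confidence.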


8.6.7. Casewise diagnostics in logistic regression


8.6.7.1. Obtaining residuals

As with linear regression, it is possible to calculate residuals (see section 7.7.1.1). These 
residual variables can then be examined to see how well the model fits the observed data. 
The commands to obtain residuals are the same as those we encountered for linear regression
in section 7.9. To obtain residuals, we can use the resid() function and include the
model name within it. 

Fitted values for logistic regression are a little different from linear regression. The fitted 
values are the predicted probabilities of Y occurring given the values of each predictor for 
a given participant. As such, they are derived from equation (8.4) for a given case. We can 
also calculate a predicted group membership, based on the most likely outcome for each 
person based on the model. The group memberships are based on the predicted probabilities,
and I will explain these values in more detail when we consider how to interpret the
residuals. Predicted probabilities are obtained with the fitted() function (again, we simply 
supply the model name to the function). 
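To make the idea of fitted values concrete, here is a small sketch with simulated data (simData and simModel are invented names). It shows that fitted() returns probabilities, that the same numbers come from passing the linear predictor through the logistic function, and one way of turning probabilities into predicted group membership. The cut-off of .5 is my assumption for 'most likely outcome'.

```r
# Simulated data with one categorical predictor (invented values)
set.seed(99)
n <- 113
simData <- data.frame(
  Intervention = factor(rep(c("No Treatment", "Intervention"), length.out = n),
                        levels = c("No Treatment", "Intervention"))
)
simData$Cured <- rbinom(n, 1, ifelse(simData$Intervention == "Intervention", 0.72, 0.43))
simModel <- glm(Cured ~ Intervention, data = simData, family = binomial())

# Predicted probabilities: fitted() works on the probability scale
predictedProb <- fitted(simModel)
# The same values from the linear predictor passed through the logistic function
fromLogit <- plogis(predict(simModel, type = "link"))
# Predicted group membership: the more likely outcome for each case (cut-off .5)
predictedGroup <- ifelse(predictedProb > 0.5, "Cured", "Not Cured")
head(data.frame(simData, predictedProb, predictedGroup))
```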

As with ordinary regression, then, we can add these casewise diagnostic variables to our 
dataframe by creating new variables to contain them and then using the various functions 
we encountered in section 7.9 to populate these variables with the appropriate values. For 
example, as a basic set of diagnostic statistics we might execute:⁸

eelData$predicted.probabilities<-fitted(eelModel.1)
eelData$standardized.residuals<-rstandard(eelModel.1)
eelData$studentized.residuals<-rstudent(eelModel.1)
eelData$dfbeta<-dfbeta(eelModel.1)
eelData$dffit<-dffits(eelModel.1)
eelData$leverage<-hatvalues(eelModel.1)

8 You might want to save the file after creating these variables by executing:

write.table(eelData, "Eel With Diagnostics.dat", sep = "\t", row.names = FALSE)

To reiterate a point from the previous chapter, running a regression without checking 
how well the model fits the data is like buying a new pair of trousers without trying them 
on - they might look fine on the hanger but get them home and you find you’re Johnny- 
tight-pants. The trousers do their job (they cover your legs and keep you warm) but they 
have no real-life value (because they cut off the blood circulation to your legs, which then 
have to be amputated). Likewise, regression does its job regardless of the data - it will create
a model - but the real-life value of the model may be limited (see section 7.7).


8.6.7.2. Predicted probabilities


Let’s take a look at the predicted probabilities. We can use the head() function again just to
look at the first few cases. Execute:

head(eelData[, c("Cured", "Intervention", "Duration", "predicted.probabilities")])

This command uses head() to display the first six cases, and we have selected a subset of
variables from the eelData dataframe (see section 3.9.1).



      Cured Intervention Duration predicted.probabilities
1 Not Cured No Treatment        7               0.4285714
2 Not Cured No Treatment        7               0.4285714
3 Not Cured No Treatment        6               0.4285714
4     Cured No Treatment        8               0.4285714
5     Cured Intervention        7               0.7192982
6     Cured No Treatment        6               0.4285714

Output 8.5





Output 8.5 shows the values of the predicted probabilities as well as the initial data. We 
found from the model that the only significant predictor of being cured was having the 
intervention. This could have a value of either 1 (have the intervention) or 0 (no intervention).
If these two values are placed into equation (8.4) with the respective regression coefficients,
then the two probability values are derived. In fact, we calculated these values as
part of equations (8.15) and (8.16), and you should note that the calculated probabilities -
P(Cured) in these equations - correspond to the values of the predicted probabilities. These
values tell us that when a patient is not treated (Intervention = 0, No Treatment), there
is a probability of .429 that they will be cured - basically, about 43% of people get better 
without any treatment. However, if the patient does have the intervention (Intervention 
= 1, yes), there is a probability of .719 that they will get better - about 72% of people 
treated get better. When you consider that a probability of 0 indicates no chance of getting 
better, and a probability of 1 indicates that the patient will definitely get better, the values 
obtained provide strong evidence that having the intervention increases your chances of 
getting better (although the probability of recovery without the intervention is still not 
bad). 
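As a quick check on where these two probabilities come from, the sketch below plugs the rounded coefficients reported later in Table 8.2 (a constant of -0.29 and a coefficient of 1.23 for Intervention) into the logistic function. I am assuming that equation (8.4) is the usual logistic form, P(Y) = 1/(1 + e^-(b0 + b1X)); the variable names are mine.

```r
# Coefficients as reported (rounded) in Table 8.2
b0 <- -0.29   # constant
b1 <- 1.23    # Intervention

# Logistic function applied to the linear predictor b0 + b1 * Intervention
pNoTreatment <- 1 / (1 + exp(-(b0 + b1 * 0)))
pIntervention <- 1 / (1 + exp(-(b0 + b1 * 1)))
round(c(pNoTreatment, pIntervention), 3)
```

Because the coefficients are rounded to two decimal places, the results land very close to, rather than exactly on, the values in Output 8.5.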

Assuming we are content that the model is accurate and that the intervention has some 
substantive significance, then we could conclude that our intervention (which, to remind 
you, was putting an eel up the anus) is the single best predictor of getting better (not being 
constipated). Furthermore, the duration of the constipation pre-intervention and its interaction
with the intervention did not significantly predict whether a person got better.


8.6.7.3. Interpreting residuals

Our conclusions so far are fine in themselves, but to be sure that the model is a good one, 
it is important to examine the residuals. 

We saw in the previous chapter that the main purpose of examining residuals in any 
regression is to (1) isolate points for which the model fits poorly, and (2) isolate points 
that exert an undue influence on the model. To assess the former we examine the residuals, 
especially the studentized residuals, standardized residuals and deviance statistics. To assess 
the latter we use influence statistics such as Cook’s distance, DFBeta and leverage statistics. 
These statistics were explained in detail in section 7.7 and their interpretation in logistic 
regression is the same; for more detail consult the previous chapter. To remind you of the 
main ones, Table 8.1 summarizes them. 

If you have saved your residuals in the dataframe then you could look at them by executing
something like:

eelData[, c("leverage", "studentized.residuals", "dfbeta")]

This command will print the leverage, studentized residuals and dfbeta values for the model.



OLIVER TWISTED 

Please Sir, can I have 
some more ... diagnostics? 


‘What about the trees?’ protests eco-warrior Oliver. ‘These outputs
take up so much room, why don’t you put them on the
website instead?’ It’s a valid point so I have produced a table of
the diagnostic statistics for this example, but it’s in the additional
material for this chapter on the companion website.


The basic residual statistics for this example (leverage, studentized residuals and DFBeta 
values) are pretty good: note that all cases have DFBetas less than 1, and leverage statistics 
are very close to the calculated expected value of 0.018. All in all, this means that there are 
no influential cases having an effect on the model. The studentized residuals all have values
within ±2 and so there seems to be very little here to concern us.

You should note that these residuals are slightly unusual because they are based on a
single predictor that is categorical. This is why there isn’t a lot of variability in the values of
the residuals. Also, even if substantial outliers or influential cases had been isolated, you would not be
justified in eliminating these cases to make the model fit better. Instead these cases should
be inspected closely to try to isolate a good reason why they were unusual. It might simply
be an error in inputting data, or it could be that the case was one which had a special reason
for being unusual: for example, there were other medical complications that might contribute
to the constipation that were noted during the patient’s assessment. In such a case, you
may have good reason to exclude the case and duly note the reasons why.


Table 8.1 Summary of residual statistics

Name                            Comment
Leverage                        Lies between 0 (no influence) and 1 (complete influence).
                                The expected leverage is (k + 1)/N, where k is the number of
                                predictors and N is the sample size. In this case it would be
                                2/113 = .018
Studentized residual,           Only 5% should lie outside ±1.96, and about 1% should lie
Standardized residual           outside ±2.58. Cases above 3 are cause for concern and cases
                                close to 3 warrant inspection
DFBeta for the constant,        Should be less than 1
DFBeta for the first predictor
(Intervention)



• Look at standardized residuals and check that no more than 5% of cases have absolute values above 2, and that no more 
than about 1% have absolute values above 2.5. Any case with a value above about 3 could be an outlier. 

• Calculate the average leverage (the number of predictors plus 1, divided by the sample size) and then look for values greater 
than twice or three times this average value. 

• Look for absolute values of DFBeta greater than 1. 
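The checks in this list are easy to automate. The following is a minimal sketch, using simulated data and invented object names (simData, simModel, checks), of how you might count the cases that break each rule of thumb; the cut-offs are the ones given in the list above.

```r
# Simulated data with one categorical predictor (invented values)
set.seed(7)
n <- 113
simData <- data.frame(
  Intervention = factor(rep(c("No Treatment", "Intervention"), length.out = n),
                        levels = c("No Treatment", "Intervention"))
)
simData$Cured <- rbinom(n, 1, ifelse(simData$Intervention == "Intervention", 0.72, 0.43))
simModel <- glm(Cured ~ Intervention, data = simData, family = binomial())

stdResid <- rstandard(simModel)
leverage <- hatvalues(simModel)
dfBetaValues <- dfbeta(simModel)

k <- length(coef(simModel)) - 1     # number of predictors
averageLeverage <- (k + 1) / n      # expected (average) leverage

checks <- c(
  propAbove2 = mean(abs(stdResid) > 2),        # should be no more than about .05
  propAbove2.5 = mean(abs(stdResid) > 2.5),    # should be no more than about .01
  nAbove3 = sum(abs(stdResid) > 3),            # possible outliers
  nHighLeverage = sum(leverage > 2 * averageLeverage),
  nLargeDFBeta = sum(abs(dfBetaValues) > 1)
)
round(checks, 3)
```

With one predictor and 113 cases, averageLeverage is 2/113, the same expected value quoted in Table 8.1.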


8.6.8. Calculating the effect size


We’ve already seen that we can use the odds ratio (see section 8.3.6) as an effect size 
measure. 


8.7. How to report logistic regression


My personal view is that you should report logistic regression much the same as linear 
regression (see section 7.11). I’d be inclined to tabulate the results, unless it’s a very 
simple model. As a bare minimum, report the beta values and their standard errors 
and significance values, and some general statistics about the model (such as the R² and
goodness-of-fit statistics). I’d also highly recommend reporting the odds ratio and its 
confidence interval. If you include the constant, readers of your work can construct the 
full regression model if they need to. You might also consider reporting the variables 
that were not significant predictors because this can be as valuable as knowing about 
which predictors were significant. 

For the example in this chapter we might produce a table like that in Table 8.2. Hopefully 
you can work out from where the values came by looking back through the chapter so 
far. As with multiple regression, I’ve rounded off to 2 decimal places throughout; for the
R² and p-values, in line with APA convention, there is no zero before the decimal point
(because these values cannot exceed 1) but for all other values less than 1 the zero is present;
the significance of the variable is denoted by an asterisk with a footnote to indicate the
significance level being used.


Table 8.2 How to report logistic regression

                                95% CI for odds ratio
              B (SE)            Lower   Odds ratio   Upper
Included
Constant      -0.29 (0.27)
Intervention   1.23* (0.40)     1.56    3.42         7.48

Note. R² = .06 (Hosmer-Lemeshow), .08 (Cox-Snell), .11 (Nagelkerke). Model χ²(1) = 9.93,
p < .01. * p < .01.
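If you want to see where the three R² values in the note come from, the sketch below rebuilds them from numbers already reported in this chapter: the residual deviance of the Intervention-only model (144.16, from Output 8.4), the model chi-square (9.93) and the sample size (113). The formulas are the standard Hosmer-Lemeshow, Cox-Snell and Nagelkerke definitions written in terms of deviances; the variable names are mine, and because the inputs are rounded the results are approximate.

```r
# Values reported in this chapter for the model with Intervention only
residualDeviance <- 144.16   # Resid. Dev for model 1 in Output 8.4
modelChi <- 9.93             # model chi-square from the note to Table 8.2
n <- 113                     # sample size

# The null deviance is the residual deviance plus the model chi-square
nullDeviance <- residualDeviance + modelChi

R2.hl <- modelChi / nullDeviance                  # Hosmer-Lemeshow
R2.cs <- 1 - exp(-modelChi / n)                   # Cox-Snell
R2.n  <- R2.cs / (1 - exp(-nullDeviance / n))     # Nagelkerke
round(c(HosmerLemeshow = R2.hl, CoxSnell = R2.cs, Nagelkerke = R2.n), 2)
```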


8.8. Testing assumptions: another example



This example was originally inspired by events in the soccer World Cup of 1998 (a long 
time ago now, but such crushing disappointments are not easily forgotten). Unfortunately 
for me (being an Englishman), I was subjected to watching England get knocked out of the 
competition by losing a penalty shootout. Reassuringly, six years later I watched England 
get knocked out of the European Championship in another penalty shootout. Even 
more reassuring, a few years after that I saw them fail to even qualify for the European 
Championship (not a penalty shootout this time, just playing like cretins). 

Now, if I were the England coach, I’d probably shoot the spoilt overpaid 
prima donnas, or I might be interested in finding out what factors predict 
whether or not a player will score a penalty. Those of you who hate soccer can 
read this example as being factors that predict success in a free throw in basketball
or netball, a penalty in hockey or a penalty kick in rugby or field goal in
American football. Now, this research question is perfect for logistic regression
because our outcome variable is a dichotomy: a penalty can be either scored
or missed. Imagine that past research (Eriksson, Beckham, & Vassell, 2004;
Hoddle, Batty, & Ince, 1998) had shown that there are two factors that reliably
predict whether a penalty kick will be missed or scored. The first factor
is whether the player taking the kick is a worrier (this factor can be measured
using a measure such as the Penn State Worry Questionnaire, PSWQ). The second
factor is the player’s past success rate at scoring (so whether the player has a good track
record of scoring penalty kicks). It is fairly well accepted that anxiety has detrimental effects 
on the performance of a variety of tasks and so it was also predicted that state anxiety might 
be able to account for some of the unexplained variance in penalty success. 

This example is a classic case of building on a well-established model, because two predictors
are already known and we want to test the effect of a new one. So, 75 soccer players
were selected at random and before taking a penalty kick in a competition they were given a 
state anxiety questionnaire to complete (to assess anxiety before the kick was taken). These 
players were also asked to complete the PSWQ to give a measure of how much they worried 
about things generally, and their past success rate was obtained from a database. Finally, a 
note was made of whether the penalty was scored or missed. The data can be found in the 
file penalty.dat, which contains four variables - each in a separate column: 


• Scored: This variable is our outcome and it is coded such that 0 = penalty missed and 
1 = penalty scored. 

• PSWQ: This variable is the first predictor variable and it gives us a measure of the
degree to which a player worries.

• Previous: This variable is the percentage of penalties scored by a particular player in
their career. As such, it represents previous success at scoring penalties.

• Anxious: This variable is our third predictor and it is a variable that has not previously
been used to predict penalty success. It is a measure of state anxiety before
taking the penalty.



SELF-TEST 

We learnt how to do hierarchical regression in the
previous chapter with linear regression and in this 
chapter with logistic regression. Try to conduct a 
hierarchical logistic regression analysis on these 
data. Enter Previous and PSWQ in the first block 
and Anxious in the second. There is a full guide on 
how to do the analysis and its interpretation in the 
additional material on the companion website. 



8.8.1. Testing for multicollinearity


First, if you haven’t already, read the data into a new dataframe, which we’ll call penaltyData,
by setting your working directory to the location of the file (see section 3.4.4) and
executing:

penaltyData<-read.delim("penalty.dat", header = TRUE)

In section 7.7.2.4 we saw how multicollinearity can affect the standard error parameters 
of a regression model. Logistic regression is just as prone to the biasing effect of collinearity 
and it is essential to test for collinearity following a logistic regression analysis. We look for 
collinearity in logistic regression in exactly the same way we look for it in linear regression. 
First, let’s re-create the model with all three predictors from the self-help task (in case you 
haven’t done it). We can create the model by executing: 

penaltyModel.2 <- glm(Scored ~ Previous + PSWQ + Anxious, data = penaltyData,
family = binomial())

This command creates a model (penaltyModel.2) in which the variable Scored is predicted
from PSWQ, Anxious, and Previous (Scored ~ Previous + PSWQ + Anxious). Having created
this model, we can get the VIF and tolerance as we did in Chapter 7 by entering the
model name into the vif() function from the car package. Execute:

vif(penaltyModel.2)
1/vif(penaltyModel.2)

The first line gives you the VIF values and the second the tolerance (which is simply the 
reciprocal of the VIF). 

 Previous      PSWQ   Anxious
35.227113  1.089767 35.581976

  Previous       PSWQ    Anxious
0.02838723 0.91762767 0.02810412

Output 8.6

The results are shown in Output 8.6. These values indicate that there is a problem of
collinearity: a VIF over 10 is usually considered problematic (see section 7.7.2.4 for more 
details). The result of this analysis is pretty clear-cut: there is collinearity between state 
anxiety and previous experience of taking penalties, and this dependency results in the 
model becoming biased. 
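To see what a VIF is actually measuring, the sketch below computes it by hand for simulated predictors: each predictor is regressed on the others and the VIF is 1/(1 - R²). Everything here is invented for illustration (the data, and the deliberate dependency of Anxious on Previous). Note also that this predictor-only calculation is not how vif() in the car package treats a glm (as far as I am aware it works from the fitted model’s coefficient covariance matrix), so expect similar rather than identical values.

```r
set.seed(11)
n <- 75
# Invented predictors: Anxious is built to be almost a linear function of Previous
Previous <- runif(n, 10, 90)
Anxious <- 60 - 0.5 * Previous + rnorm(n, sd = 2)
PSWQ <- round(runif(n, 1, 30))
predictors <- data.frame(Previous, PSWQ, Anxious)

# VIF for each predictor: 1 / (1 - R^2) from regressing it on the other predictors
vifManual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
  1 / (1 - r2)
})
round(vifManual, 2)       # VIF
round(1 / vifManual, 3)   # tolerance

# Pairwise correlations show where the dependency comes from
round(cor(predictors), 2)
```

In this made-up example the two collinear predictors have VIFs far above 10, while the unrelated one stays close to 1, which is the same pattern as in Output 8.6.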



SELF-TEST 

Using what you learned in Chapter 6, carry out a
Pearson correlation between all of the variables 
in this analysis. Can you work out why we have a 
problem with collinearity? 


If you have identified collinearity then, unfortunately, there’s not much that you can 
do about it. One obvious solution is to omit one of the variables (so for example, we 
might stick with the model from block 1 that ignored state anxiety). The problem with 
this should be obvious: there is no way of knowing which variable to omit. The resulting
theoretical conclusions are meaningless because, statistically speaking, any of the
collinear variables could be omitted. There are no statistical grounds for omitting one 
variable over another. Even if a predictor is removed, Bowerman and O’Connell (1990) 
recommend that another equally important predictor that does not have such strong 
multicollinearity replace it. They also suggest collecting more data to see whether the 
multicollinearity can be lessened. Another possibility when there are several predictors 
involved in the multicollinearity is to run a factor analysis on these predictors and to 
use the resulting factor scores as a predictor (see Chapter 17). The safest (although 
unsatisfactory) remedy is to acknowledge the unreliability of the model. So, if we were 
to report the analysis of which factors predict penalty success, we might acknowledge 
that previous experience significantly predicted penalty success in the first model, but 
propose that this experience might affect penalty taking by increasing state anxiety. 
This statement would be highly speculative because the correlation between Anxious 
and Previous tells us nothing of the direction of causality, but it would acknowledge the 
inexplicable link between the two predictors. I’m sure that many of you may find the
lack of remedy for collinearity grossly unsatisfying - unfortunately statistics is frustrating
sometimes!


8.8.2. Testing for linearity of the logit


In this example we have three continuous variables, therefore we have to check that each
one is linearly related to the logit of the outcome variable (Scored). I mentioned earlier in
this chapter that to test this assumption we need to run the logistic regression but include
predictors that are the interaction between each predictor and the log of itself (Hosmer
& Lemeshow, 1989). We need to create the interaction terms of each of the variables
with its log, using the log() function (section 5.8.3.2). First, let’s do the PSWQ variable;
we’ll call the interaction of PSWQ with its log logPSWQInt, and we create this variable by
executing:

penaltyData$logPSWQInt <- log(penaltyData$PSWQ)*penaltyData$PSWQ

This command creates a new variable called logPSWQInt in the penaltyData dataframe
that is the variable PSWQ (penaltyData$PSWQ) multiplied by the log of that variable
(log(penaltyData$PSWQ)).



SELF-TEST 

Try creating two new variables that are the interactions of
Anxious and Previous with their natural logs. (Remember that
0 has no log, so if any of the variables have a zero,
you’ll need to add a constant - see section 5.8.3.2.)



The dataframe will now look like this: 


PSWQ Anxious Previous         Scored logPSWQInt logAnxInt logPrevInt
  18      21       56 Scored Penalty   52.02669  63.93497  226.41087
  17      32       35 Scored Penalty   48.16463 110.90355  125.42316
  16      34       35 Scored Penalty   44.36142 119.89626  125.42316
  14      40       15 Scored Penalty   36.94680 147.55518   41.58883
   5      24       47 Scored Penalty    8.04719  76.27329  181.94645
   1      15       67 Scored Penalty    0.00000  40.62075  282.70702
etc.

Note that there are three new variables that reflect the interaction between each predictor 
and the log of that predictor. 
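As a worked check, the sketch below recomputes the three interaction terms for the six cases listed above. The dataframe name sixCases is mine. One assumption needs flagging: the printed logPrevInt values are reproduced when a constant of 1 is added to Previous before taking the log, so that is what the code does. This is my inference from the numbers, not something stated in the text.

```r
# First six cases of the predictors, copied from the listing above
sixCases <- data.frame(
  PSWQ = c(18, 17, 16, 14, 5, 1),
  Anxious = c(21, 32, 34, 40, 24, 15),
  Previous = c(56, 35, 35, 15, 47, 67)
)

sixCases$logPSWQInt <- log(sixCases$PSWQ) * sixCases$PSWQ
sixCases$logAnxInt <- log(sixCases$Anxious) * sixCases$Anxious
# Assumption: a constant of 1 is added to Previous before taking the log
sixCases$logPrevInt <- log(sixCases$Previous + 1) * sixCases$Previous
round(sixCases, 5)
```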

To test the assumption we need to redo the analysis exactly the same as before, except 
that we should put all variables in a single block (i.e., we don’t need to do it hierarchically), 
and we also need to put in the three new interaction terms of each predictor and its log. 
We create the model by executing: 

penaltyTest.1 <- glm(Scored ~ PSWQ + Anxious + Previous + logPSWQInt +
logAnxInt + logPrevInt, data=penaltyData, family=binomial())
summary(penaltyTest.1)

This command creates a model (penaltyTest.1) in which the variable Scored is predicted
from PSWQ, Anxious, Previous and the variables we created to be the interaction of these
variables with their logs (logPSWQInt, logAnxInt, and logPrevInt). We then use the
summary() function to display the model.

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.57212   15.00782  -0.238    0.812
PSWQ        -0.42218    1.10255  -0.383    0.702
Anxious     -2.64804    2.79283  -0.948    0.343
Previous     1.66905    1.48005   1.128    0.259
logPSWQInt   0.04388    0.29672   0.148    0.882
logAnxInt    0.68151    0.65177   1.046    0.296
logPrevInt  -0.31918    0.31687  -1.007    0.314

Output 8.7


Output 8.7 shows the part of the output that tests the assumption. We’re interested 
only in whether the interaction terms are significant. Any interaction that is significant 
indicates that the main effect has violated the assumption of linearity of the logit. All three 
interactions have significance values (the values in the column Pr(>|z|)) greater than .05,
indicating that the assumption of linearity of the logit has been met for PSWQ, Anxious
and Previous.
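If you want to pull out just the p-value for an interaction term rather than reading it off the printed summary, something like the following sketch would do it. The data are simulated so that the logit really is linear in the predictor, and the names simData, logxInt and linearityTest are invented for the example.

```r
set.seed(5)
n <- 200
x <- runif(n, 1, 40)                       # strictly positive, so log(x) is defined
y <- rbinom(n, 1, plogis(-2 + 0.1 * x))    # logit is linear in x by construction
simData <- data.frame(y, x)
simData$logxInt <- log(simData$x) * simData$x

linearityTest <- glm(y ~ x + logxInt, data = simData, family = binomial())
# p-value for the interaction of x with its log; below .05 would suggest a violation
interactionP <- coef(summary(linearityTest))["logxInt", "Pr(>|z|)"]
interactionP
```

Bear in mind that, as with any significance test, a simulated sample will occasionally give a value below .05 by chance even when the assumption holds.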

Labcoat Leni’s Real Research 8.1


Mandatory suicide? 


Lacourse, E., et al. (2001). Journal of Youth and Adolescence, 30, 321-332. 


Although I have fairly eclectic tastes in music, my favourite kind of music is heavy metal. One thing that is mildly 
irritating about liking heavy metal is that everyone assumes that you’re a miserable or aggressive bastard. When 
not listening (and often while listening) to heavy metal, I spend most of my time researching clinical psychology:
I research how anxiety develops in children. Therefore, I was literally beside myself with excitement when a few
years back I stumbled on a paper that combined these two interests: Lacourse, Claes, and Villeneuve (2001) carried
out a study to see whether a love of heavy metal could predict suicide risk. Fabulous stuff!

Eric Lacourse and his colleagues used questionnaires to measure several variables: suicide risk (yes or no), 
marital status of parents (together or divorced/separated), the extent to which the person’s mother and father 
were neglectful, self-estrangement/powerlessness (adolescents who have negative self-perceptions, are bored 
with life, etc.), social isolation (feelings of a lack of support), normlessness (beliefs that socially disapproved 
behaviours can be used to achieve certain goals), meaninglessness (doubting that school is relevant to gaining 
employment) and drug use. In addition, the authors measured liking of heavy metal; they included the sub-genres
of classic (Black Sabbath, Iron Maiden), thrash metal (Slayer, Metallica), death/black metal (Obituary, Burzum)
and gothic (Marilyn Manson). As well as liking, they measured behavioural manifestations of worshipping these 
bands (e.g., hanging posters, hanging out with other metal fans) and vicarious music listening (whether music 
was used when angry or to bring out aggressive moods). They used logistic regression to predict suicide risk 
from these predictors for males and females separately. 

The data for the female sample are in the file Lacourse et al. (2001) Females.dat. Labcoat Leni wants you 
to carry out a logistic regression predicting Suicide_Risk from all of the predictors (forced entry). (To make 
your results easier to compare to the published results, enter the predictors in the same order as in Table 3 in 
the paper: Age, Marital_Status, Mother Negligence, Father Negligence, Self_Estrangement, 
Isolation, Normlessness, Meaninglessness, Drug Use, Metal, Worshipping, Vicarious). Create 
a table of the results. Does listening to heavy metal predict girls’ suicide? If not, what does? 

Answers are in the additional material on the companion website (or look at Table 3 in the original 
article). 



8.9. Predicting several categories: multinomial logistic regression

I mentioned earlier that it is possible to use logistic regression to predict 
membership of more than two categories and that this is called multinomial 
logistic regression. Essentially, this form of logistic regression works in the 
same way as binary logistic regression, so there’s no need for any additional 
equations to explain what is going on (hooray!). The analysis breaks the outcome
variable down into a series of comparisons between two categories (which
helps explain why no extra equations are really necessary). For example,
if you have three outcome categories (A, B and C), then the analysis will consist
of two comparisons. The form that these comparisons take depends on
how you specify the analysis: you can compare everything against your first 
category (e.g., A vs. B and A vs. C), or your last category (e.g., A vs. C and B 
vs. C), or a custom category, for example category B (e.g., B vs. A and B vs. C). In practice, 
this means that you have to select a baseline category. The important parts of the analysis 
and output are much the same as we have just seen for binary logistic regression.

Let’s look at an example. There has been some work looking at how men and women
evaluate chat-up lines (Bale, Morrison, & Caryl, 2006; Cooper, O’Donnell, Caryl,
Morrison, & Bale, 2007). This research has looked at how the content (e.g., whether the 
chat-up line is funny, has sexual content, or reveals desirable personality characteristics) 
affects how favourably the chat-up line is viewed. To sum up this research, it has found 
that men and women like different things in chat-up lines: men prefer chat-up lines with a 
high sexual content, and women prefer chat-up lines that are funny and show 
good moral fibre. 

Imagine that we wanted to assess how successful these chat-up lines were.
We did a study in which we recorded the chat-up lines used by 348 men
and 672 women in a nightclub. Our outcome was whether the chat-up line 
resulted in one of the following three events: the person got no response or 
the recipient walked away, the person obtained the recipient’s phone number, 
or the person left the nightclub with the recipient. Afterwards, the chat-up 
lines used in each case were rated by a panel of judges for how funny they 
were (0 = not funny at all, 10 = the funniest thing that I have ever heard), 
sexuality (0 = no sexual content at all, 10 = very sexually direct) and whether 
the chat-up line reflected good moral values (0 = the chat-up line does not 
reflect good characteristics, 10 = the chat-up line is very indicative of good 
characteristics). For example, ‘I may not be Fred Flintstone, but I bet I could 
make your bed rock’ would score high on sexual content, low on good characteristics and 
medium on humour; ‘I’ve been looking all over for you, the woman of my dreams’ would 
score high on good characteristics, low on sexual content and low on humour (as well as 
high on cheese, had it been measured). We predict based on past research that the success 
of different types of chat-up line will interact with gender. 

This situation is perfect for multinomial regression. The data are in the file Chat-Up 
Lines.dat. There is one outcome variable (Success) with three categories (no response, 
phone number, go home with recipient) and four predictors: funniness of the chat-up 
line (Funny), sexual content of the chat-up line (Sex), degree to which the chat-up line 
reflects good characteristics (Good_Mate) and the gender of the person being chatted
up (Gender - recorded as Male or Female). Read this data file into a dataframe called
chatData by setting your working directory to the location of the file (see section 3.4.4) 
and executing: 

chatData<-read.delim("Chat-Up Lines.dat", header = TRUE) 



8.9.1. Running multinomial logistic regression in R


It’s possible to use R Commander to do multinomial logistic regression, but it uses a command
that I think is a little less friendly than the one I prefer. Hence, in this section, you
will need to use commands. It’s not so bad though. Honest. We are going to use a function 
called mlogit() from the package of the same name (so make sure it is installed and loaded). 



              Success Funny Sex Good_Mate Gender
1    Get Phone Number     3   7         6   Male
2 Go Home with Person     5   7         2   Male
3    Get Phone Number     4   6         6   Male
4 Go Home with Person     3   7         5   Male
5    Get Phone Number     5   1         6   Male
6    Get Phone Number     4   7         5   Male
etc.

Output 8.8

The data currently look like Output 8.8: data for each person are stored as a row (i.e.,
the data are wide format - see section 3.9). The outcome (Success) and Gender are stored
as text, therefore R should have imported these variables as factors. We can check this by
entering each variable into the is.factor() function:

is.factor(chatData$Success) 
is.factor(chatData$Gender) 

You should find that you get the response TRUE for both, which means that R has imported 
these variables as factors, which is what we want. (If these variables were not factors I could 
convert them, by placing them into the as.factor() function and executing.) 

One consideration at this point is that Gender will have been imported as a factor with 
‘female’ as the baseline category (because female is before male alphabetically - see the first 
example in the chapter). All of our predictions are based on females behaving differently than 
males, so it would be better (in terms of framing our interpretation) to have ‘male’ as the
baseline category. We saw earlier in the chapter that we can achieve this change by executing: 

chatData$Gender<-relevel(chatData$Gender, ref = 2) 

This command resets the levels of the variable Gender such that the reference or baseline 
category is the category currently set as 2 (i.e., males become the reference category). 9 
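
If you want to convince yourself of what relevel() does, here is a small sketch on a made-up factor (gender is a toy variable, not the real data). It also shows that the reference level can be named rather than numbered, which is arguably easier to read:

```r
# Toy factor with the same two labels as Gender (values invented).
gender <- factor(c("Male", "Female", "Female", "Male"))
levels(gender)                                   # "Female" "Male"

genderByIndex <- relevel(gender, ref = 2)        # second level becomes baseline
genderByName  <- relevel(gender, ref = "Male")   # same thing, stated by name

levels(genderByIndex)                            # "Male" "Female"
identical(genderByIndex, genderByName)           # TRUE
```

Only the order of the levels changes; the data values themselves are untouched.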

Before we can run a multinomial logistic regression, we need to get the data into a particular
format. Instead of having one row per person, we need to have one row per person per
category of the outcome variable. Each row will contain TRUE if the person was assigned to
that category, and FALSE if they weren’t. If that doesn’t make sense, you shouldn’t worry:
first, because it will make sense in a minute; and second, because we can use the mlogit.data()
function to convert our data into the correct format. This function takes the general form:

newDataframe<-mlogit.data(oldDataFrame, choice = "outcome variable", shape
= "wide"/"long")

It actually has quite a few more options than this, but we really need to use only the basic
options. This function creates a new dataframe from an old dataframe (specified in the
function). We need to tell the function the name of the categorical outcome variable, because this
is the variable it uses to restructure the data. In this example the outcome variable is Success.
Finally, we tell the function the shape of our original dataframe (wide or long) - in this case
our data are wide format. Therefore, to restructure the current data we could execute:

mlChat <- mlogit.data(chatData, choice = "Success", shape = "wide") 

This command will create a new dataframe called mlChat (which takes a lot less typing
than ‘multinomial logit chat-up lines’) from the existing dataframe (chatData). We tell the
function that the outcome variable is Success (choice = "Success") and the format of the
original dataframe is wide (shape = "wide"). The new dataframe looks like Output 8.9.


                        Success Funny Sex Good_Mate Gender chid
1.Get Phone Number         TRUE     3   7         6   Male    1
1.Go Home with Person     FALSE     3   7         6   Male    1
1.No response/Walk Off    FALSE     3   7         6   Male    1
2.Get Phone Number        FALSE     5   7         2   Male    2
2.Go Home with Person      TRUE     5   7         2   Male    2
2.No response/Walk Off    FALSE     5   7         2   Male    2
etc.

Output 8.9


9 Making this change will affect the parameter estimates for the main effects, but not for the interaction terms, 
which are the effects in which we’re actually interested. 
Now let’s compare Output 8.9 with the first six rows of the original dataframe
(chatData) in Output 8.8. The first person in the chatData dataframe was assigned to
the category of ‘get phone number’. In the new dataframe, that person has been split
into three rows, each labelled 1 on the far left (to indicate that this is person 1). Next
to each 1 is one of the three possible outcomes: get phone number, go home, and no
response/walk off. The next column (Success) tells us which of those events occurred. The first
person, remember, was assigned to the ‘get phone number’ category, so in the new
dataframe they have been given a response of TRUE next to ‘get phone number’ but
in the next two rows (which represent the same person’s responses but to the other
two possible outcomes) they have been assigned FALSE (because these outcomes didn’t
happen). The next variable in mlChat is Funny. The first row in the original data
set (Output 8.8) has a score of 3 on this variable; because this person’s data are now
spread over three rows, the first three rows (which represent one person) score a 3 on
this variable in mlChat (Output 8.9). Hopefully you can see how the data have been
restructured.
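
To make the restructuring less mysterious, here is a rough base-R sketch of the same idea: one row per person per outcome category, with TRUE marking the category that actually happened. This is not what mlogit.data() does internally (that is an assumption-free zone I’m not entering); it simply mimics the layout of Output 8.9 using two made-up people, and the names wide, long and alt are mine:

```r
# Two made-up people in wide format (one row each), mirroring Output 8.8.
wide <- data.frame(
  Success   = factor(c("Get Phone Number", "Go Home with Person"),
                     levels = c("Get Phone Number", "Go Home with Person",
                                "No response/Walk Off")),
  Funny     = c(3, 5),
  Sex       = c(7, 7),
  Good_Mate = c(6, 2),
  Gender    = c("Male", "Male")
)

alternatives <- levels(wide$Success)
nAlt <- length(alternatives)

# Repeat every person's row once per alternative ...
long <- wide[rep(seq_len(nrow(wide)), each = nAlt), ]
long$chid <- rep(seq_len(nrow(wide)), each = nAlt)   # person identifier
long$alt  <- rep(alternatives, times = nrow(wide))   # alternative on this row

# ... and mark with TRUE the alternative that actually happened.
long$Success <- as.character(long$Success) == long$alt
rownames(long) <- paste(long$chid, long$alt, sep = ".")
long
```

Each person now occupies three rows that share the same predictor scores and differ only in the alternative and in whether Success is TRUE.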

Now we are ready to run the multinomial logistic regression, using the mlogit() function.
The mlogit() function looks very similar to the glm() function that we met earlier in the
chapter for logistic regression. It takes the general form:

newModel<-mlogit(outcome ~ predictor(s), data = dataFrame, na.action = an 
action, reflevel = a number representing the baseline category for the 
outcome) 

in which: 

• newModel is an object created that contains information about the model. We can get
summary statistics for this model by executing summary(newModel).

• outcome is the variable that you’re trying to predict. In this example it will be the
variable Success.

• predictor(s) lists the variable or variables from which you’re trying to predict the
outcome variable.

• dataFrame is the name of the dataframe from which your outcome and predictor
variables come.

• na.action is an optional command. If you have complete data (as here) you can ignore
it, but if you have missing values (i.e., NAs in the dataframe) then it can be useful
(see R’s Souls’ Tip 7.1).

• reflevel is a number representing the outcome category that you want to use as a
baseline.

As you can see, the basic idea is the same as the lm() and glm() commands with which 
you should be familiar. However, one important difference is that we need to specify the 
reference or baseline category. 



SELF-TEST

Think about the three categories that we have as
an outcome variable. Which of these categories do
you think makes most sense to use as a baseline
category?
The best option is probably ‘No response/Walk off’. This is referenced with number
3 in the data (it is the third category listed in Output 8.9). We can specify this by using
reflevel = 3 in the function.

The next issue is what to include within the model. In this example, the main effects
are not particularly interesting: based on past research, we don’t necessarily expect
funny chat-up lines to be successful, but we do expect them to be more successful when
used on women than on men. What this prediction implies is that the interaction of
Gender and Funny will be significant. Similarly, chat-up lines with a high sexual content
might not be successful overall, but we expect them to be relatively successful when
used on men. Again, this means that we might not expect the Sex main effect to be
significant, but we do expect the Sex × Gender interaction to be significant. As such, we
need to enter these interaction terms (Sex × Gender and Funny × Gender) into the model.
To evaluate these interactions we must also include the main effects. However, we are
not particularly interested in higher-order interactions such as Sex × Funny × Gender
because we don’t (theoretically) predict that the success of chat-up lines should vary across
genders with the combination of being sexy and funny. We can, therefore, create the
model by executing:

chatModel <- mlogit(Success ~ 1 | Good_Mate + Funny + Gender + Sex +
    Gender:Sex + Funny:Gender, data = mlChat, reflevel = 3)

This command looks (as I said) very like the glm() model. However, notice that instead of
the outcome variable just being Success we write ‘Success ~ 1 |’. We won’t worry about why,
that’s just how you do it. Then you put the formula, as with the glm() or the lm() functions.
Notice that the model contains all main effects but just two interactions: Sex × Gender and
Funny × Gender.
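
If you’d like to check which effects a formula asks for without fitting anything, base R’s terms() function will list them. The sketch below uses an ordinary formula containing only the predictors (it leaves out the ‘1 |’ part that is specific to mlogit(), and the object names are mine), and contrasts it with what the ‘*’ shorthand would have requested:

```r
# The predictors from the model, written as an ordinary R formula.
f <- Success ~ Good_Mate + Funny + Gender + Sex + Gender:Sex + Funny:Gender

# terms() expands a formula without needing any data.
modelTerms <- attr(terms(f), "term.labels")
modelTerms   # four main effects and two two-way interactions

# By contrast, '*' expands to main effects plus every interaction,
# including the three-way one that we decided to leave out.
fullTerms <- attr(terms(Success ~ Funny * Gender * Sex), "term.labels")
fullTerms
```

Using ‘:’ rather than ‘*’ is what lets us pick the two interactions we care about and no others.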


8.9.2. Interpreting the multinomial logistic regression output


The summary of the model can be obtained by executing: 
summary(chatModel) 




SELF-TEST

What does the log-likelihood measure?


Call:
mlogit(formula = Success ~ 1 | Good_Mate + Funny + Gender + Sex +
    Gender:Sex + Funny:Gender, data = mlChat, reflevel = 3, method = "nr",
    print.level = 0)

Frequencies of alternatives:
No response/Walk Off     Get Phone Number  Go Home with Person
             0.39216              0.47549              0.13235

nr method
6 iterations, 0h:0m:0s
g'(-H)^-1g = 0.00121
successive fonction values within tolerance limits

Coefficients :
                                           Estimate Std. Error t-value  Pr(>|t|)
altGet Phone Number                       -1.783070   0.669772 -2.6622 0.0077631 **
altGo Home with Person                    -4.286354   0.941398 -4.5532 5.284e-06 ***
altGet Phone Number:Good_Mate              0.131840   0.053726  2.4539 0.0141306 *
altGo Home with Person:Good_Mate           0.130019   0.083521  1.5567 0.1195351
altGet Phone Number:Funny                  0.139389   0.110126  1.2657 0.2056135
altGo Home with Person:Funny               0.318456   0.125302  2.5415 0.0110376 *
altGet Phone Number:GenderFemale          -1.646223   0.796247 -2.0675 0.0386891 *
altGo Home with Person:GenderFemale       -5.626369   1.328589 -4.2348 2.287e-05 ***
altGet Phone Number:Sex                    0.276206   0.089197  3.0966 0.0019577 **
altGo Home with Person:Sex                 0.417283   0.122083  3.4180 0.0006307 ***
altGet Phone Number:GenderFemale:Sex      -0.348326   0.105875 -3.2900 0.0010020 **
altGo Home with Person:GenderFemale:Sex   -0.476639   0.163434 -2.9164 0.0035409 **
altGet Phone Number:Funny:GenderFemale     0.492441   0.139992  3.5176 0.0004354 ***
altGo Home with Person:Funny:GenderFemale  1.172404   0.199240  5.8844 3.996e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log-Likelihood: -868.74
McFadden R^2:  0.13816
Likelihood ratio test : chisq = 278.52 (p.value=< 2.22e-16)

Output 8.10

Output 8.10 shows the model parameters. There’s a lot of information scattered about
in there to look at. We get the log-likelihood of the overall model and a likelihood ratio
test. Remember that the log-likelihood is a measure of how much unexplained variability
there is in the data; therefore, the difference or change in log-likelihood indicates how much
new variance has been explained by the model. The chi-square test tests the decrease in
unexplained variance from the baseline model (if you ran that model you would find the
log-likelihood was -1008.00) 10 to the final model (-868.74), which is a difference of 139.26.
We need to multiply this by 2 to get the chi-square test (because we want to compare the
-2LL so we multiply by 2), which gives 278.52. This change is significant, which means that
our final model explains a significant amount of the original variability (in other words, it’s a
better fit than the original model). Just above the likelihood ratio test we are also given a
McFadden R², a measure of effect size.
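
The arithmetic in that paragraph is easy to reproduce by hand. The sketch below uses only the two log-likelihoods quoted above. Two things are my assumptions rather than anything printed in the output: the degrees of freedom (taken as the 14 estimated parameters minus the 2 intercepts in the baseline model) and the formula 1 - LL(final)/LL(baseline) for McFadden’s R², which does at least reproduce the printed value:

```r
llBase  <- -1008.00   # intercept-only model (value quoted in the text)
llFinal <- -868.74    # final model (Output 8.10)

chiSq <- 2 * (llFinal - llBase)   # 278.52

# Assumption: df = 14 parameters in the final model minus 2 intercepts.
dfChange <- 14 - 2
pValue <- pchisq(chiSq, df = dfChange, lower.tail = FALSE)

# Assumption: McFadden's R-squared as 1 - LL(final) / LL(baseline).
mcFadden <- 1 - llFinal / llBase  # about 0.138

round(c(chiSq = chiSq, mcFadden = mcFadden), 3)
pValue
```

The p-value is so small that R’s summary simply prints it as being below 2.22e-16.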

To help with the interpretation we can exponentiate the coefficients, using the exp() function,
as we did with the logistic coefficients. These coefficients are stored in a variable called
coefficients attached to the model, so we can access them using chatModel$coefficients. To
see the exponentiated versions of them, we could execute:

exp(chatModel$coefficients) 


10 If you like, try this out by executing (note that all of the main effects and predictors are removed from the 
formula, so this represents a model including only the intercept): 

chatBase<-mlogit(Success ~ 1, data = mlChat, reflevel = 3) 
summary(chatBase) 
The resulting output is a bit horrible, and we can make it nicer by asking R to print the
variable as a dataframe by enclosing the above command in the data.frame() function:

data.frame(exp(chatModel$coefficients))

The resulting odds ratios are shown in Output 8.11. 

                                          exp.chatModel.coefficients.
altGet Phone Number                                        0.16812128
altGo Home with Person                                     0.01375498
altGet Phone Number:Good_Mate                              1.14092570
altGo Home with Person:Good_Mate                           1.13885057
altGet Phone Number:Funny                                  1.14957104
altGo Home with Person:Funny                               1.37500360
altGet Phone Number:GenderFemale                           0.19277659
altGo Home with Person:GenderFemale                        0.00360163
altGet Phone Number:Sex                                    1.31811957
altGo Home with Person:Sex                                 1.51783194
altGet Phone Number:GenderFemale:Sex                       0.70586855
altGo Home with Person:GenderFemale:Sex                    0.62086652
altGet Phone Number:Funny:GenderFemale                     1.63630634
altGo Home with Person:Funny:GenderFemale                  3.22974620

Output 8.11

Now let’s look at the individual parameter estimates from Outputs 8.10 and 8.11. Note
that each predictor has two parameters associated with it. This is because these parameters
compare pairs of outcome categories. We specified No response/walk off as our
reference category; therefore, the parts of the table outputs labelled Get Phone Number
are comparing this category against the No response/walk off category. Similarly, the
parts labelled Go home with person are comparing this category against the No response/
walk off category.

We can get confidence intervals for these coefficients using the confint() function (and
again we exponentiate to make these confidence intervals for the odds ratios). Execute:

exp(confint(chatModel))

The resulting confidence intervals are shown in Output 8.12.

                                                 2.5 %     97.5 %
altGet Phone Number                       0.0452388315 0.62478988
altGo Home with Person                    0.0021734046 0.08705211
altGet Phone Number:Good_Mate             1.0268939646 1.26762012
altGo Home with Person:Good_Mate          0.9668821194 1.34140512
altGet Phone Number:Funny                 0.9263950895 1.42651186
altGo Home with Person:Funny              1.0755891423 1.75776681
altGet Phone Number:GenderFemale          0.0404843865 0.91795423
altGo Home with Person:GenderFemale       0.0002664414 0.04868514
altGet Phone Number:Sex                   1.1066999501 1.56992797
altGo Home with Person:Sex                1.1948318258 1.92814902
altGet Phone Number:GenderFemale:Sex      0.5735912484 0.86865066
altGo Home with Person:GenderFemale:Sex   0.4506952417 0.85529022
altGet Phone Number:Funny:GenderFemale    1.2436632929 2.15291265
altGo Home with Person:Funny:GenderFemale 2.1856202907 4.77267737

Output 8.12
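
To see where numbers like these come from, here is a sketch that rebuilds one odds ratio and its interval by hand from the estimate and standard error in Output 8.10. I am assuming that the interval is the usual normal-theory one (the estimate plus or minus about 1.96 standard errors, then exponentiated); that assumption does reproduce the values printed for Good_Mate in the Get Phone Number row:

```r
b  <- 0.131840   # altGet Phone Number:Good_Mate (Output 8.10)
se <- 0.053726   # its standard error

z <- qnorm(0.975)                    # about 1.96
oddsRatio <- exp(b)                  # compare with Output 8.11
ci <- exp(b + c(-1, 1) * z * se)     # compare with Output 8.12

round(c(lower = ci[1], oddsRatio = oddsRatio, upper = ci[2]), 3)
```

The three values should land close to the 1.03, 1.14 and 1.27 shown in the outputs. Because the interval does not contain 1, it agrees with the significance test for this predictor.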

Let’s look at the effects one by one; because we are just comparing two categories
the interpretation is the same as for binary logistic regression (so if you don’t
understand my conclusions reread the start of this chapter). First let’s look at the parts
of the outputs for the Get phone number category compared to the No response/walk
off category:

• Good_Mate: Whether the chat-up line showed signs of good moral fibre significantly
predicted whether you got a phone number or no response, b = 0.13, p <
.05. The odds ratio (1.14) tells us that as chat-up lines show one more unit of
moral fibre, the change in the odds of getting a phone number (rather than no
response) is 1.14. In short, you’re more likely to get a phone number than not if
you use a chat-up line that demonstrates good moral fibre.

• Funny: Whether the chat-up line was funny did not significantly predict whether 
you got a phone number or no response, b = 0.14, p > .05. Note that although this 
predictor is not significant, the odds ratio (1.15) is approximately the same as for the 
previous predictor (which was significant). So, the effect size is comparable, but the 
non-significance stems from a relatively higher standard error. (Note that this effect 
is superseded by the interaction with gender below.) 

• Gender: The gender of the person being chatted up significantly predicted whether
they gave out their phone number or gave no response, b = -1.65, p < .05. This
is the effect of females compared to males. The odds ratio tells us that as gender
changes from male (0) to female (1) the change in the odds of giving out a phone
number compared to not responding is 0.19. In other words, the odds of a man giving
out his phone number compared to not responding are 1/0.19 = 5.26 times the
odds for a woman. Men are cheap.

• Sex: The sexual content of the chat-up line significantly predicted whether you got a
phone number or no response, b = 0.28, p < .01. The odds ratio tells us that as the
sexual content increased by a unit, the change in the odds of getting a phone number
(rather than no response) is 1.32. In short, you’re more likely to get a phone number
than not if you use a chat-up line with high sexual content. (But this effect is superseded
by the interaction with gender.)

• Funny × Gender: The success of funny chat-up lines depended on whether they were
delivered to a man or a woman because in interaction these variables predicted
whether or not you got a phone number, b = 0.49, p < .001. Bearing in mind
how we interpreted the effect of gender above, the odds ratio tells us that as gender
changes from male (0) to female (1) in combination with funniness increasing, the
change in the odds of giving out a phone number compared to not responding was
1.64. In other words, as funniness increases, women become more likely to hand out
their phone number than men. Funny chat-up lines are more successful when used
on women than men.

• Sex × Gender: The success of chat-up lines with sexual content depended on whether
they were delivered to a man or a woman because in interaction these variables
predicted whether or not you got a phone number, b = -0.35, p < .01. Bearing in mind
how we interpreted the interaction above (note that b is negative here but positive
above), the odds ratio tells us that as gender changes from male (0) to female (1) in
combination with the sexual content increasing, the change in the odds of giving out
a phone number compared to not responding is 0.71. In other words, as sexual content
increases, women become less likely than men to hand out their phone number.
Chat-up lines with a high sexual content are more successful when used on men than
women.
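
One way to make the interaction concrete is to work out the odds ratio for funniness separately for men and women. The sketch below uses the Get Phone Number estimates from Output 8.10 and relies on the coding described above (males are the baseline, so the Funny coefficient on its own applies to men, and the interaction coefficient is the extra amount added for women). The object names are mine:

```r
# Estimates for 'Get Phone Number' vs. 'No response/Walk Off' (Output 8.10).
bFunny       <- 0.139389   # effect of Funny for males (the baseline gender)
bFunnyFemale <- 0.492441   # extra effect of Funny for females (interaction)

orMale   <- exp(bFunny)                  # about 1.15
orFemale <- exp(bFunny + bFunnyFemale)   # about 1.88

orFemale / orMale                        # about 1.64, the interaction odds ratio
```

So each extra unit of funniness multiplies a man’s odds of handing over his number by roughly 1.15 but a woman’s by roughly 1.88, and the ratio of those two is the 1.64 reported for the interaction.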
Now let’s look at the individual parameter estimates for the Go home with person category
compared to the No response/walk off category. We can interpret these effects as
follows:

• Good_Mate: Whether the chat-up line showed signs of good moral fibre did not
significantly predict whether you went home with the date or got no response, b = 0.13,
p > .05. In short, you’re not significantly more likely to go home with the person if
you use a chat-up line that demonstrates good moral fibre.

• Funny: Whether the chat-up line was funny significantly predicted whether you went
home with the date or got no response, b = 0.32, p < .05. The odds ratio tells us that
as chat-up lines are one unit funnier, the change in the odds of going home with the
person (rather than no response) is 1.38. In short, you’re more likely to go home with
the person than get no response if you use a chat-up line that is funny. (This effect,
though, is superseded by the interaction with gender below.)

• Gender: The gender of the person being chatted up significantly predicted whether
they went home with the person or gave no response, b = -5.63, p < .001. The odds
ratio tells us that as gender changes from male (0) to female (1) the change in the
odds of going home with the person compared to not responding is 0.004. In other
words, the odds of a man going home with someone compared to not responding are
1/0.004 = 250 times the odds for a woman. Men are really cheap.

• Sex: The sexual content of the chat-up line significantly predicted whether you went
home with the date or got no response, b = 0.42, p < .01. The odds ratio tells us that
as the sexual content increased by a unit, the change in the odds of going home with
the person (rather than no response) is 1.52: you’re more likely to go home with the
person than not if you use a chat-up line with high sexual content. (Note that this
effect is superseded by the interaction with gender below.)

• Funny × Gender: The success of funny chat-up lines depended on whether they were
delivered to a man or a woman because in interaction these variables predicted
whether or not you went home with the date, b = 1.17, p < .001. The odds ratio
tells us that as gender changes from male (0) to female (1) in combination with
funniness increasing, the change in the odds of going home with the person compared to
getting no response is 3.23. As funniness increases, women become more likely to go
home with the person than men. Funny chat-up lines are more successful when used
on women compared to men.

• Sex × Gender: The success of chat-up lines with sexual content depended on whether
they were delivered to a man or a woman because in interaction these variables
predicted whether or not you went home with the date, b = -0.48, p < .01. The odds
ratio tells us that as gender changes from male (0) to female (1) in combination with
the sexual content increasing, the change in the odds of going home with the date
compared to not responding is 0.62. As sexual content increases, women become less
likely than men to go home with the person. Chat-up lines with sexual content are
more successful when used on men than women.
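
Odds ratios can feel abstract, so it sometimes helps to turn the coefficients back into predicted probabilities for a particular chat-up line. The sketch below does this by hand with the estimates from Output 8.10: it computes the two logits (each relative to No response/Walk off, whose logit is fixed at 0) and converts them to probabilities that sum to 1. The function predictChat() and the example scores of 5 are hypothetical and purely for illustration:

```r
# Coefficients from Output 8.10, in the order: intercept, Good_Mate, Funny,
# GenderFemale, Sex, GenderFemale:Sex, Funny:GenderFemale.
bPhone <- c(-1.783070, 0.131840, 0.139389, -1.646223,
            0.276206, -0.348326, 0.492441)
bHome  <- c(-4.286354, 0.130019, 0.318456, -5.626369,
            0.417283, -0.476639, 1.172404)

# Hypothetical helper: predicted probabilities for one chat-up line.
# 'female' is 0 for a male target and 1 for a female target.
predictChat <- function(goodMate, funny, sex, female) {
  x <- c(1, goodMate, funny, female, sex, female * sex, funny * female)
  logits <- c("No response/Walk Off" = 0,   # reference category
              "Get Phone Number"     = sum(bPhone * x),
              "Go Home with Person"  = sum(bHome * x))
  exp(logits) / sum(exp(logits))
}

# A made-up line scoring 5 on every scale, used on a man and on a woman.
pMale   <- predictChat(5, 5, 5, female = 0)
pFemale <- predictChat(5, 5, 5, female = 1)
round(rbind(male = pMale, female = pFemale), 3)
```

Changing the scores and rerunning the function is a quick way to see the interactions at work, for example by raising funny while holding the other scores constant.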




SELF-TEST

Use what you learnt earlier in this chapter to check
the assumptions of multicollinearity and linearity of
the logit.
Reporting the results


We can report the results as with binary logistic regression using a table (see Table 8.3). 
Note that I have split the table by the outcome categories being compared, but otherwise it 
is the same as before. These effects are interpreted as in the previous section. 


Table 8.3 How to report multinomial logistic regression

                                                  95% CI for odds ratio
                                B (SE)            Lower  Odds Ratio  Upper

Phone number vs. no response
  Intercept                    -1.78 (0.67)**
  Good Mate                     0.13 (0.05)*       1.03     1.14      1.27
  Funny                         0.14 (0.11)        0.93     1.15      1.43
  Female                       -1.65 (0.80)*       0.04     0.19      0.92
  Sexual Content                0.28 (0.09)**      1.11     1.32      1.57
  Female × Funny                0.49 (0.14)***     1.24     1.64      2.15
  Female × Sex                 -0.35 (0.11)**      0.57     0.71      0.87

Going home vs. no response
  Intercept                    -4.29 (0.94)***
  Good Mate                     0.13 (0.08)        0.97     1.14      1.34
  Funny                         0.32 (0.13)*       1.08     1.38      1.76
  Female                       -5.63 (1.33)***     0.00     0.00      0.05
  Sexual Content                0.42 (0.12)***     1.20     1.52      1.93
  Female × Funny                1.17 (0.20)***     2.19     3.23      4.77
  Female × Sex                 -0.48 (0.16)**      0.45     0.62      0.86

What have I discovered about statistics?


At the age of 10 I thought I was going to be a rock star. Such was my conviction about 
this that even today (many years on) I’m still not entirely sure how I ended up not being 
a rock star (lack of talent, not being a very cool person, inability to write songs that 
don’t make people want to throw rotting vegetables at you, are all possible explana¬ 
tions). Instead of the glitzy and fun life that I anticipated I am instead reduced to writing 
chapters about things that I don’t even remotely understand. 









We began the chapter by looking at why we can’t use linear regression when we have
a categorical outcome, but instead have to use binary logistic regression (two outcome
categories) or multinomial logistic regression (several outcome categories). We then
looked into some of the theory of logistic regression by looking at the regression equation
and what it means. Then we moved onto assessing the model and talked about the
log-likelihood statistic and the associated chi-square test. I talked about different methods
of obtaining equivalents to R² in regression (Hosmer-Lemeshow, Cox-Snell and
Nagelkerke). We also discovered the z-statistic and odds ratio. The rest of the chapter
looked at three examples using R to carry out various logistic regressions. So, hopefully,
you should have a pretty good idea of how to conduct and interpret a logistic regression
by now.

Having decided that I was going to be a rock star I put on my little denim jacket with 
Iron Maiden patches sewn onto it and headed off down the rocky road of stardom. The 
first stop was ... my school. 


R packages used in this chapter 


car 

mlogit 

R functions used in this chapter 

anova()
as.factor()
binomial()
confint()
dfbeta()
dffits()
exp()
factor()
fitted()
function()
glm()
hatvalues()
head()
is.factor()
log()
mlogit()
mlogit.data()
pchisq()
relevel()
rstandard()
rstudent()
summary()
vif()

Key terms that I’ve discovered 

-2LL
Binary logistic regression
Chi-square distribution
Complete separation
Cox and Snell’s R²_CS
Deviance
Hosmer and Lemeshow’s R²_L
Interaction effect
Likelihood
Logistic regression
Log-likelihood
Main effect
Maximum-likelihood estimation
Multinomial logistic regression
Nagelkerke’s R²_N
Normal distribution
Odds
Odds ratio
Polychotomous logistic regression
Suppressor effects
Wald statistic
z-statistic
Smart Alex’s tasks


• Task 1: A psychologist was interested in whether children’s understanding of display 
rules can be predicted from their age, and whether the child possesses a theory of 
mind. A display rule is a convention of displaying an appropriate emotion in a given 
situation. For example, if you receive a Christmas present that you don’t like, the 
appropriate emotional display is to smile politely and say ‘Thank you, Auntie Kate, 
I’ve always wanted a rotting cabbage’. The inappropriate emotional display is to start 
crying and scream ‘Why did you buy me a rotting cabbage, you selfish old bag?’ Using 
appropriate display rules has been linked to having a theory of mind (the ability to 
understand what another person might be thinking). To test this theory, children 
were given a false belief task (a task used to measure whether someone has a theory 
of mind), a display rule task (which they could either pass or fail) and their age in 
months was measured. The data are in Display.dat. Run a logistic regression to see 
whether possession of display rule understanding (did the child pass the test? - yes/ 
no) can be predicted from possession of a theory of mind (did the child pass the false 
belief task? - yes/no), age in months and their interaction.



• Task 2: Recent research has shown that lecturers are among the most stressed workers.
A researcher wanted to know exactly what it was about being a lecturer that
created this stress and subsequent burnout. She took 467 lecturers and administered
several questionnaires to them that measured: Burnout (burnt out or not), Perceived
Control (high score = low perceived control), Coping Style (high score = high ability
to cope with stress), Stress from Teaching (high score = teaching creates a lot of stress
for the person), Stress from Research (high score = research creates a lot of stress for
the person) and Stress from Providing Pastoral Care (high score = providing pastoral
care creates a lot of stress for the person). The outcome of interest was burnout, and
Cooper, Sloan, and Williams’s (1988) model of stress indicates that perceived control
and coping style are important predictors of this variable. The remaining predictors
were measured to see the unique contribution of different aspects of a lecturer’s work
to their burnout. Can you help her out by conducting a logistic regression to see
which factors predict burnout? The data are in Burnout.dat.


• Task 3: A health psychologist interested in research into HIV wanted to know the factors
that influenced condom use with a new partner (relationship less than 1 month
old). The outcome measure was whether a condom was used (use: condom used = 1,
not used = 0). The predictor variables were mainly scales from the Condom Attitude
Scale (CAS) by Sacco, Levine, Reed, and Thompson (1991): gender (gender of the
person); safety (relationship safety, measured out of 5, indicates the degree to which
the person views this relationship as ‘safe’ from sexually transmitted disease); sexexp
(sexual experience, measured out of 10, indicates the degree to which previous experience
influences attitudes towards condom use); previous (a measure not from the
CAS, this variable measures whether or not the couple used a condom in their previous
encounter: 1 = condom used, 0 = not used, 2 = no previous encounter with this
partner); selfcon (self-control, measured out of 9, indicates the degree of self-control
that a person has when it comes to condom use, i.e., whether they get carried away
with the heat of the moment, or exert control); perceive (perceived risk, measured
out of 6, indicates the degree to which the person feels at risk from unprotected sex).
Previous research (Sacco, Rickman, Thompson, Levine, & Reed, 1993) has shown
that gender, relationship safety and perceived risk predict condom use. Carry out
an appropriate analysis to verify these previous findings, and to test whether self-control,
previous usage and sexual experience can predict any of the remaining
variance in condom use. (1) Interpret all important parts of the R output. (2) How
reliable is the final model? (3) What are the probabilities that participants 12, 53 and
75 will use a condom? (4) A female who used a condom in her previous encounter
with her new partner scores 2 on all variables except perceived risk (for which she
scores 6). Use the model to estimate the probability that she will use a condom in her
next encounter. Data are in the file condom.dat.

Answers can be found on the companion website. 


Further reading 


Hutcheson, G., & Sofroniou, N. (1999). The multivariate social scientist. London: Sage. Chapter 4. 

Menard, S. (1995). Applied logistic regression analysis. Sage University Paper Series on Quantitative 
Applications in the Social Sciences, 07-106. Thousand Oaks, CA: Sage. (This is a fairly advanced 
text, but great nevertheless. Unfortunately, few basic-level texts include logistic regression so 
you’ll have to rely on what I’ve written!) 

Miles, J., & Shevlin, M. (2001). Applying regression and correlation: A guide for students and
researchers. London: Sage. (Chapter 6 is a nice introduction to logistic regression.)


Interesting real research 


Bale, C., Morrison, R., & Caryl, P. G. (2006). Chat-up lines as male sexual displays. Personality and 
Individual Differences, 40(4), 655-664. 

Bemelman, M., & Hammacher, E. R. (2005). Rectal impalement by pirate ship: A case report. Injury 
Extra, 36, 508-510. 

Cooper, M., O’Donnell, D., Caryl, P. G., Morrison, R., & Bale, C. (2007). Chat-up lines as male 
displays: Effects of content, sex, and personality. Personality and Individual Differences, 43(5), 
1075-1085. 

Lacourse, E., Claes, M., & Villeneuve, M. (2001). Heavy metal music and adolescent suicidal risk. 
Journal of Youth and Adolescence, 30(3), 321-332. 

Lo, S. F., Wong, S. H., Leung, L. S., Law, I. C., & Yip, A. W. C. (2004). Traumatic rectal perforation
by an eel. Surgery, 135(1), 110-111.





Comparing two means 





FIGURE 9.1 My (probably) eighth birthday. From left to right: my brother Paul (who still
hides behind cakes rather than have his photo taken), Paul Spreckley, Alan Palsey, Clair
Sparks and me

9.1. What will this chapter tell me?


Having successfully slayed audiences at holiday camps around the country, my next step 
towards global domination was my primary school. I had learnt another Chuck Berry song 
(‘Johnny B. Goode’), but also broadened my repertoire to include songs by other artists (I 
have a feeling ‘Over the Edge’ by Status Quo was one of them). 1 Needless to say, when the 
opportunity came to play at a school assembly I jumped at it. The headmaster tried to have 


1 This would have been about 1982, so just before they became the most laughably bad band on the planet. Some 
would argue that they were always the most laughably bad band on the planet, but they were the first band that 
I called my favourite band. 
me banned, 2 but the show went on. It was a huge success (I want to reiterate my earlier
point that 10-year-olds are very easily impressed). My classmates carried me around the 
playground on their shoulders. I was a hero. Around this time I had a childhood sweetheart 
called Clair Sparks. Actually, we had been sweethearts since before my new-found rock 
legend status. I don’t think the guitar playing and singing impressed her much, but she 
rode a motorbike (really, a little child’s one) which impressed me quite a lot; I was utterly 
convinced that we would one day get married and live happily ever after. I was utterly 
convinced, that is, until she ran off with Simon Hudson. Being 10, she probably literally 
did run off with him - across the playground. To make this important decision of which 
boyfriend to have, Clair had needed to compare two men (Andy and Simon) to see which 
one was better; sometimes in science we want to do the same thing, to compare one man 
against another to see if there is evidence that one is different from the other. Sorry, did 
I write ‘man’? I forgot the ‘e’: this chapter is about the process of comparing two means, 
not men. 


9.2. Packages used in this chapter


There are several packages we will use in this chapter. You will need the packages pastecs
(for descriptive statistics), ggplot2 (for graphs), WRS (for robust methods) and of course
Rcmdr (R Commander) if you’re going to use that rather than commands (see section 3.6).
If you don’t have these packages installed you’ll need to install them by executing:

install.packages("ggplot2"); install.packages("pastecs"); install.packages("WRS")

Then you need to load the packages by executing these commands:

library(ggplot2); library(pastecs); library(WRS)


9.3. Looking at differences


Rather than looking at relationships between variables, researchers are sometimes interested
in looking at differences between groups of people. In particular, in experimental
research we often want to manipulate what happens to people so that we can make causal
inferences. For example, if we take two groups of people and randomly assign one group
a programme of dieting pills and the other group a programme of sugar pills (which they
think will help them lose weight) then if the people who take the dieting pills lose more
weight than those on the sugar pills we can infer that the diet pills caused the weight loss.
This is a powerful research tool because it goes one step beyond merely observing variables
and looking for relationships (as in correlation and regression). 3 This chapter is the
first of many that look at this kind of research scenario, and we start with the simplest
scenario: when we have two groups, or, to be more specific, when we want to compare
two means. As we have seen (Chapter 1), there are two different ways of collecting data:


2 Seriously! Can you imagine a headmaster banning a 10-year-old from assembly? By this time I had an electric 
guitar and he used to play hymns on an acoustic guitar; I can assume only that he somehow lost all perspective 
on the situation and decided that a 10-year-old blasting out some Quo in a squeaky little voice was subversive or 
something. 

3 People sometimes get confused and think that certain statistical procedures allow causal inferences and others 
don’t. This isn’t true (see Jane Superbrain Box 1.4). 
we can either expose different people to different experimental manipulations (between-group
or independent design), or take a single group of people and expose them to different
experimental manipulations at different points in time (repeated-measures design).
Sometimes people are tempted to compare artificially created groups by, for example, 
dividing people into groups based on a median score; however, this is generally a bad idea 
(see Jane Superbrain Box 9.1). 



JANE SUPERBRAIN 9.1

Are median splits the devil’s work?


Often in research papers you see that people have analysed their data using a ‘median split’.
In our spider phobia example, this means that you measure scores on a spider phobia
questionnaire and calculate the median. You then classify anyone with a score above the
median as a ‘phobic’, and those below the median as ‘non-phobic’. In doing this you
‘dichotomize’ a continuous variable. This practice is quite common, but is it sensible?

MacCallum, Zhang, Preacher, and Rucker (2002) wrote a splendid paper pointing out various
problems with turning a perfectly decent continuous variable into a categorical variable:

1. Imagine there are four people: Peter, Birgit, Jip and Kiki. We measure how scared of
spiders they are as a percentage and get Jip (100%), Kiki (60%), Peter (40%) and Birgit
(0%). If we split these four people at the median (50%) then we’re saying that Jip and Kiki
are the same (they get a score of 1 = phobic) and Peter and Birgit are the same (they both
get a score of 0 = not phobic). In reality, Kiki and Peter are the most similar of the four
people, but they have been put in different groups. So, median splits change the original
information quite dramatically (Peter and Kiki are originally very similar but become very
different after the split, Jip and Kiki are relatively dissimilar originally but become
identical after the split).

2. Effect sizes get smaller: if you correlate two continuous variables then the effect size
will be larger than if you correlate the same variables after one of them has been
dichotomized. Effect sizes also get smaller in ANOVA and regression.

3. There is an increased chance of finding spurious effects. 
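The second point in the list above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are simulated rather than taken from MacCallum et al., the population correlation of .5 and the sample size of 10,000 are arbitrary choices, and the variable names (x, y, xSplit) are hypothetical.

```r
# Simulated illustration (not real data): the correlation between two
# continuous variables before and after one of them is median-split.
set.seed(42)                                   # make the simulation reproducible
n <- 10000                                     # arbitrary, large sample size
x <- rnorm(n)                                  # continuous predictor
y <- 0.5 * x + sqrt(1 - 0.5^2) * rnorm(n)      # outcome correlating about .5 with x

rContinuous <- cor(x, y)                       # effect size with x left continuous

xSplit <- as.numeric(x > median(x))            # dichotomize: 1 = above median, 0 = otherwise
rSplit <- cor(xSplit, y)                       # effect size after the median split

print(round(c(continuous = rContinuous, medianSplit = rSplit), 3))
```

In a run like this the correlation after the split should come out noticeably smaller than the correlation based on the continuous scores, which is the loss of effect size described in point 2.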

So, if your supervisor has just told you to do a median 
split, have a good think about whether it is the right thing 
to do (and read MacCallum et al.'s paper). One of the 
rare situations in which dichotomizing a continuous variable
is justified, according to MacCallum et al., is when
there is a clear theoretical rationale for distinct categories 
of people based on a meaningful break point (i.e., not the 
median); for example, phobic versus not phobic based 
on diagnosis by a trained clinician would be a legitimate 
dichotomization of anxiety. 


9.3.1. A problem with error bar graphs of repeated-measures designs


We saw in Chapter 4 that it is important to visualize group differences using error bars. 
We’re now going to look at a problem that occurs when we graph repeated-measures 
error bars. To do this, we’re going to look at an example that I use throughout this chapter
(not because I am too lazy to think up different data sets, but because it allows me to
illustrate various things). The example relates to whether arachnophobia (fear of spiders) 
is specific to real spiders or whether pictures of spiders can evoke similar levels of anxiety. 
Table 9.1 Data from spiderLong.dat

Participant   Group         Anxiety
1             Picture       30
2             Picture       35
3             Picture       45
4             Picture       40
5             Picture       50
6             Picture       35
7             Picture       55
8             Picture       25
9             Picture       30
10            Picture       45
11            Picture       40
12            Picture       50
13            Real Spider   40
14            Real Spider   35
15            Real Spider   50
16            Real Spider   55
17            Real Spider   65
18            Real Spider   55
19            Real Spider   50
20            Real Spider   35
21            Real Spider   30
22            Real Spider   50
23            Real Spider   60
24            Real Spider   39



Twenty-four arachnophobes were used in all. Twelve were asked to play with a big hairy 
tarantula spider with big fangs and an evil look in its eight eyes. Their subsequent anxiety 
was measured. The remaining 12 were shown only pictures of the same big hairy tarantula 
and again their anxiety was measured. The data are in Table 9.1 (and spiderLong.dat if 
you’re having difficulty entering them into R yourself). Remember that each row in the 
data represents a different participant’s data. Therefore, you need a column representing 
the group to which they belonged and a second column representing their anxiety. 
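As a minimal sketch of that layout (one of several ways to enter the data, and not necessarily how spiderLong.dat itself was created), the scores in Table 9.1 can be typed in directly. The column names Group and Anxiety are assumed here because they match the names used with these data later in the chapter; groupMeans is a hypothetical helper variable.

```r
# Anxiety scores copied from Table 9.1: participants 1-12 saw the picture,
# participants 13-24 met the real spider.
anxiety <- c(30, 35, 45, 40, 50, 35, 55, 25, 30, 45, 40, 50,
             40, 35, 50, 55, 65, 55, 50, 35, 30, 50, 60, 39)

# gl() generates a factor with 2 levels, each repeated 12 times in a block
group <- gl(2, 12, labels = c("Picture", "Real Spider"))

# One row per participant: a coding variable plus a single column of scores
spiderLong <- data.frame(Group = group, Anxiety = anxiety)

# Quick check of the group means
groupMeans <- tapply(spiderLong$Anxiety, spiderLong$Group, mean)
print(groupMeans)
```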




SELF-TEST 

• Enter these data into a dataframe called spiderLong.
Using what you learnt in Chapter 4, plot an error bar
graph of the spider data.
Table 9.2 Data from spiderWide.dat

Participant   Picture (anxiety score)   Real (anxiety score)
1             30                        40
2             35                        35
3             45                        50
4             40                        55
5             50                        65
6             35                        55
7             55                        50
8             25                        35
9             30                        30
10            45                        50
11            40                        60
12            50                        39


OK, now let’s imagine that we’d collected these data using the same participants; that is, 
all participants had their anxiety rated after seeing the real spider, but also after seeing the 
picture (in counterbalanced order obviously). The data would now be arranged differently 
in R. Instead of having a coding variable, and a single column with anxiety scores in, we 
would arrange the data in two columns (one representing the picture condition and one 
representing the real condition). The data are displayed in Table 9.2 (and spiderWide.dat 
if you’re having difficulty entering them into R yourself). Note that the anxiety scores are 
identical to the between-group data (Table 9.1) - it’s just that we’re pretending that they 
came from the same people rather than different people. 
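A minimal sketch of the wide layout, typed in from Table 9.2, might look like this. The column names picture and real are assumed because they are the names used in the commands that follow; conditionMeans is a hypothetical helper variable.

```r
# One row per participant, one column per condition (values from Table 9.2)
spiderWide <- data.frame(
  picture = c(30, 35, 45, 40, 50, 35, 55, 25, 30, 45, 40, 50),
  real    = c(40, 35, 50, 55, 65, 55, 50, 35, 30, 50, 60, 39)
)

# The condition means are the same as the group means in the long format
conditionMeans <- colMeans(spiderWide)
print(conditionMeans)
```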




SELF-TEST 

• Enter these data into a dataframe called spiderWide.



Figure 9.2 shows the error bar graphs from the two different designs. Remember that 
the data are exactly the same; all that has changed is whether the design used the same
participants (repeated measures) or different (independent). Now, we discovered in Chapter
1 that repeated-measures designs eliminate some extraneous variables (such as age, IQ and 
so on) and so can give us more sensitivity in the data. Therefore, we would expect our 
graphs to be different: the repeated-measures graph should reflect the increased sensitivity 
in the design. Looking at the two error bar graphs, can you spot this difference between 
the graphs? 

I can’t either; and this is the problem. The graphs should not be the same. The moral is: 
Don’t use error bar graphs when you have repeated measures groups. Or if you do, adjust 
the data before you plot the graph (Loftus & Masson, 1994). 
FIGURE 9.2 Two error bar graphs of anxiety data in the presence of a real spider or a photograph. The data on the left (Independent Design) are treated as though they are different participants, whereas those on the right (Repeated Measures Design) are treated as though they are from the same participants
9.3.2. Step 1: calculate the mean for each participant


To correct the repeated-measures error bars requires several steps, but none of them are 
particularly difficult. To begin with, we need to calculate the average anxiety for each 
participant. We’re using the spiderWide dataframe so participants’ scores are stored in two 
columns, therefore we need to simply add these columns and divide by 2 by executing: 

spiderWide$pMean<-(spiderWide$picture + spiderWide$real)/2 

This command creates a variable called pMean in the dataframe spiderWide, by adding the
scores for picture and real (from the same dataframe) and dividing by 2. 


9.3.3. Step 2: calculate the grand mean


The grand mean is the mean of all scores (regardless of from which condition the score 
comes) and so for the current data this value will be the mean of all 24 scores. A fairly 
simple way to calculate this value is to use the c() function, with which we’re familiar, to 
combine the picture and real variables into a single variable, and then apply the mean() 
function to this new variable. We can do this in a single command:

grandMean<-mean(c(spiderWide$picture, spiderWide$real))

Executing this command creates a variable called grandMean, which is the mean of picture
and real combined into a single variable (c(spiderWide$picture, spiderWide$real)); in other
words, it’s the mean of all scores.




9.3.4. Step 3: calculate the adjustment factor


If you look at the variable labelled pMean, you should notice that the values for each 
participant are different, which tells us that some people had greater anxiety than others 
did across the conditions. The fact that participants’ mean anxiety scores differ represents 
individual differences between different people (so it represents the fact that some of the 
participants are generally more scared of spiders than others). These differences in natural 
anxiety contaminate the error bar graphs, which is why if we don’t adjust the values that 
we plot, we will get the same graph as if an independent design had been used. Loftus and
Masson (1994) argue that to eliminate this contamination we should equalize the means 
between participants (i.e., adjust the scores in each condition such that when we take the 
mean score across conditions, it is the same for all participants). To do this, we need to 
calculate an adjustment factor by subtracting each participant’s mean score (pMean) from 
the grand mean (grandMean): 

spiderWide$adj<-grandMean-spiderWide$pMean 

Executing this command creates a variable adj (short for adjustment) in the spiderWide 
dataframe by taking the variable grandMean (which we just computed) and subtracting 
from it the mean anxiety for each participant (which is stored in the variable pMean, which 
we computed earlier on). The dataframe now looks like this: 



  picture real pMean   adj
1      30   40  35.0   8.5
2      35   35  35.0   8.5
3      45   50  47.5  -4.0
4      40   55  47.5  -4.0
5      50   65  57.5 -14.0
6      35   55  45.0  -1.5
etc.

There is a new variable in the data editor called adj. The scores in this column represent 
the difference between each participant’s mean anxiety and the mean anxiety level across 
all participants. You’ll notice that some of the values are positive, and these participants 
are ones who were less anxious than average. Other participants were more anxious than 
average, and they have negative adjustment scores. We can now use these adjustment values 
to eliminate the between-subject differences in anxiety. 


9.3.5. Step 4: create adjusted values for each variable


So far, we have calculated the difference between each participant’s mean score and the 
mean score of all participants (the grand mean). This difference can be used to adjust the 
existing scores for each participant. First we need to adjust the scores in the picture
condition. All we do is take the original score (picture) and add to it the value of the
adjustment (adj):

spiderWide$picture_adj<-spiderWide$picture + spiderWide$adj 

Executing this command creates a variable picture_adj in the spiderWide dataframe by adding
the adjustment (spiderWide$adj) to the original anxiety scores after seeing the picture
(spiderWide$picture). We can do exactly the same to create adjusted values of real:

spiderWide$real_adj<-spiderWide$real + spiderWide$adj 

Executing this command creates a variable real_adj in the spiderWide dataframe by adding 
the adjustment (spiderWide$adj) to the original anxiety scores after seeing the real spider 
(spiderWide$real). The dataframe now looks like this: 



  picture real pMean   adj picture_adj real_adj
1      30   40  35.0   8.5        38.5     48.5
2      35   35  35.0   8.5        43.5     43.5
3      45   50  47.5  -4.0        41.0     46.0
4      40   55  47.5  -4.0        36.0     51.0
5      50   65  57.5 -14.0        36.0     51.0
6      35   55  45.0  -1.5        33.5     53.5
etc.
Now, the variables real_adj and picture_adj represent the anxiety experienced in each
condition, adjusted so as to eliminate any between-subject differences. If you don’t believe
me, then use the mean() function to create a variable pMean2 that is the average of real_adj
and picture_adj (just like we did in section 9.3.2). You should find that the value in this
column is the same for every participant, thus proving that the between-subject variability
in means is gone: the value will be 43.50 - the grand mean. We can also wrap all of these
steps together in a function for use with other dataframes (R’s Souls’ Tip 9.1).
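The check described above can be sketched in one self-contained block. It repeats steps 1 to 4 using the same commands as the text, then computes pMean2 by adding the two adjusted columns and dividing by 2 (the same arithmetic as in section 9.3.2). The dataframe is re-entered here from Table 9.2 so that the block runs on its own.

```r
# Re-enter the repeated-measures data (Table 9.2)
spiderWide <- data.frame(
  picture = c(30, 35, 45, 40, 50, 35, 55, 25, 30, 45, 40, 50),
  real    = c(40, 35, 50, 55, 65, 55, 50, 35, 30, 50, 60, 39)
)

# Step 1: mean for each participant
spiderWide$pMean <- (spiderWide$picture + spiderWide$real)/2
# Step 2: grand mean of all 24 scores
grandMean <- mean(c(spiderWide$picture, spiderWide$real))
# Step 3: adjustment factor
spiderWide$adj <- grandMean - spiderWide$pMean
# Step 4: adjusted scores in each condition
spiderWide$picture_adj <- spiderWide$picture + spiderWide$adj
spiderWide$real_adj <- spiderWide$real + spiderWide$adj

# Check: the mean of the adjusted scores should be identical for everyone
spiderWide$pMean2 <- (spiderWide$picture_adj + spiderWide$real_adj)/2
print(unique(spiderWide$pMean2))
```

If the adjustment has worked, unique() returns a single value equal to grandMean.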




SELF-TEST 

• Create an error bar chart of the mean of the adjusted
values that you have just made (real_adj and
picture_adj).


FIGURE 9.3 Error bar graph of the adjusted values of the spiderWide dataframe
The resulting error bar graph is shown in Figure 9.3. Compare this graph to the graphs
in Figure 9.2 - what differences do you see? The first thing to notice is that the means in
the two conditions have not changed. However, the error bars have changed: they have
got smaller. Also, whereas in Figure 9.2 the error bars overlap, in this new graph they do
not. In Chapter 2 we discovered that when error bars do not overlap we can be fairly confident
that our samples have not come from the same population (and so our experimental
manipulation has been successful). Therefore, when we plot the proper error bars for the 
repeated-measures data it shows the extra sensitivity that this design has: the differences 
between conditions appear to be significant, whereas when different participants are used, 
there does not appear to be a significant difference. (Remember that the means in both 
situations are identical, but the sampling error is smaller in the repeated-measures design.) 
I expand upon this point in section 9.7. 
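To see the shrinkage in numbers rather than on a graph, the standard error of each condition can be compared before and after the adjustment. This is a rough sketch: the helper function se() and the table seTable are hypothetical names, and the standard error is computed in the usual way as the standard deviation divided by the square root of the number of scores.

```r
# Raw scores (Table 9.2) and the adjusted scores from steps 1 to 4
picture <- c(30, 35, 45, 40, 50, 35, 55, 25, 30, 45, 40, 50)
real    <- c(40, 35, 50, 55, 65, 55, 50, 35, 30, 50, 60, 39)
adj <- mean(c(picture, real)) - (picture + real)/2
picture_adj <- picture + adj
real_adj <- real + adj

# Hypothetical helper: standard error of the mean
se <- function(x) sd(x)/sqrt(length(x))

seTable <- data.frame(
  condition = c("picture", "real"),
  mean_raw  = c(mean(picture), mean(real)),
  mean_adj  = c(mean(picture_adj), mean(real_adj)),
  se_raw    = c(se(picture), se(real)),
  se_adj    = c(se(picture_adj), se(real_adj))
)
print(seTable)
```

The means should be unchanged by the adjustment, while the standard errors (and so the error bars) should be smaller for the adjusted scores.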
How to impress your friends


The process we went through to adjust the scores for the fact they were from a repeated-measures design 
could be wrapped up in a function (see R’s Souls’ Tip 6.2). This would enable us to apply the function to a 
dataframe, which would be useful if we wanted to adjust lots of pairs of variables. The function would look 
like this: 


rmMeanAdjust<-function(dataframe)
{
	# Adjusts the first two columns so that every row has the same mean
	varNames<-names(dataframe)
	pMean<-(dataframe[,1] + dataframe[,2])/2
	grandmean<-mean(c(dataframe[,1], dataframe[,2]))
	adj<-grandmean-pMean
	varA_adj<-dataframe[,1] + adj
	varB_adj<-dataframe[,2] + adj
	output<-data.frame(varA_adj, varB_adj)
	names(output)<-c(paste(varNames[1], "_adj", sep = ""), paste(varNames[2], "_adj", sep = ""))
	return(output)
}

Executing these commands creates a function called rmMeanAdjust which takes a dataframe as input, and outputs
a dataframe containing the adjusted scores. Let’s look at the contents of the function:

• varNames<-names(dataframe) gets the names of the variables in the dataframe entered into the function 
and stores them in varNames. 

• pMean<-(dataframe[,1] + dataframe[,2])/2 computes pMean by adding the first and second columns of
the dataframe and dividing by 2.

• grandmean<-mean(c(dataframe[,1], dataframe[,2])) computes the grand mean by merging the first two 
columns of the dataframe and computing the mean. 

• adj<-grandmean-pMean calculates the adjustment for each row of the dataframe by subtracting pMean 
from grandmean 

• varA_adj<-dataframe[,1] + adj creates a new variable (varA_adj) that is the first column of the dataframe 
plus the adjustment factor. 

• varB_adj<-dataframe[,2]+adj creates a new variable (varB_adj) that is the second column of the dataframe 
plus the adjustment factor. 

• output<-data.frame(varA_adj, varB_adj) binds varA_adj and varB_adj together in a dataframe named 
output. 

• names(output)<-c(paste(varNames[1], "_adj", sep = ""), paste(varNames[2], "_adj", sep = "")) renames
the columns of the dataframe as the name of the original variable in the original dataframe plus “_adj”. So,
a variable called picture becomes picture_adj.

• return(output) returns the dataframe of adjusted values. 

We can now use this function on a dataframe (remember that it will adjust the first two columns of the dataframe
so we’re assuming that we’re entering a two-column dataframe with the scores for the two repeated-measures
conditions in each column). We apply the function to the original dataframe spiderWide (the one that contained only
picture and real) by executing:

rmMeanAdjust(spiderWide) 
The result is:

   picture_adj real_adj
1         38.5     48.5
2         43.5     43.5
3         41.0     46.0
4         36.0     51.0
5         36.0     51.0
6         33.5     53.5
7         46.0     41.0
8         38.5     48.5
9         43.5     43.5
10        41.0     46.0
11        33.5     53.5
12        49.0     38.0


Pretty cool, I think you’ll agree. By ‘cool’, I mean sad, obviously. 


9.4. The t-test


We have seen in previous chapters that the t-test is very versatile: it can 
be used to test whether a correlation coefficient is different from 0; it can 
also be used to test whether a regression coefficient, b, is different from 0. 
However, it can also be used to test whether two group means are different. 
It is to this use that we now turn. 

The simplest form of experiment that can be done is one with only one 
independent variable that is manipulated in only two ways and only one outcome
is measured. More often than not the manipulation of the independent
variable involves having an experimental condition and a control group (see 
Field & Hole, 2003). Some examples of this kind of design are: 

• Is the movie Scream 2 scarier than the original Scream? We could measure heart rates
(which indicate anxiety) during both films and compare them. 

• Does listening to music while you work improve your work? You could get some 
people to write an essay (or book!) while listening to their favourite music, and then 
write a different essay while working in silence (this is a control group). You could 
then compare the essay marks. 

• Does listening to Andy’s favourite music improve your work? You could repeat the 
above but rather than letting people work with their favourite music, you could play 
them some of my favourite music (as listed in the acknowledgements) and watch the 
quality of their work plummet. 



The t-test can analyse these sorts of scenarios. Of course, there are more complex experimental
designs and we will look at these in subsequent chapters. There are, in fact, two
different t-tests and the one you use depends on whether the independent variable was 
manipulated using the same participants or different: 


• Independent-means t-test: This test is used when there are two experimental conditions
and different participants were assigned to each condition (this is sometimes
called the independent-measures or independent-samples t-test).
• Dependent-means t-test: This test is used when there are two experimental conditions
and the same participants took part in both conditions of the experiment (this
test is sometimes referred to as the matched-pairs or paired-samples t-test).


Rationale for the t-test


Both t-tests have a similar rationale, which is based on what we learnt in Chapter 2 about 
hypothesis testing: 

• Two samples of data are collected and the sample means calculated. These means 
might differ by either a little or a lot. 

• If the samples come from the same population, then we expect their means to be roughly
equal (see section 2.5.1). Although it is possible for their means to differ by chance
alone, we would expect large differences between sample means to occur very infrequently.
Under the null hypothesis we assume that the experimental manipulation has
no effect on the participants: therefore, we expect the sample means to be very similar.

• We compare the difference between the sample means that we collected to the difference
between the sample means that we would expect to obtain if there were no
effect (i.e., if the null hypothesis were true). We use the standard error (see section
2.5.1) as a gauge of the variability between sample means. If the standard error is
small, then we expect most samples to have very similar means. When the standard
error is large, large differences in sample means are more likely. If the difference
between the samples we have collected is larger than we would expect based on the
standard error then we can assume one of two things:

  o There is no effect and sample means in our population fluctuate a lot and we have, by
    chance, collected two samples that are atypical of the population from which they came.
  o The two samples come from different populations but are typical of their respective
    parent population. In this scenario, the difference between samples represents a
    genuine difference between the samples (and so the null hypothesis is incorrect).

• As the observed difference between the sample means gets larger, the more confident we 
become that the second explanation is correct (i.e., that the null hypothesis should be 
rejected). If the null hypothesis is incorrect, then we gain confidence that the two sample 
means differ because of the different experimental manipulation imposed on each sample. 
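The logic in the list above can be made concrete with a small simulation of the null hypothesis. This is a sketch under assumptions rather than anything taken from the spider study: both samples are drawn from the same normal population, and the population mean (43.5), standard deviation (10), sample size (12 per group) and number of repetitions (5000) are arbitrary illustrative choices. The formula used for theoreticalSE is the standard one for the difference between two independent means of equal-sized samples.

```r
# Simulate many pairs of samples from ONE population (the null hypothesis is true)
set.seed(123)                 # reproducible simulation
nPerGroup <- 12               # assumed sample size per group
populationMean <- 43.5        # assumed population mean
populationSD <- 10            # assumed population standard deviation

differences <- replicate(5000, {
  sampleA <- rnorm(nPerGroup, populationMean, populationSD)
  sampleB <- rnorm(nPerGroup, populationMean, populationSD)
  mean(sampleA) - mean(sampleB)      # difference between the two sample means
})

# The spread of these differences is what the standard error estimates
empiricalSE <- sd(differences)
theoreticalSE <- sqrt(2 * populationSD^2/nPerGroup)

# How often does chance alone produce a difference of at least 7 points?
proportionLarge <- mean(abs(differences) >= 7)

print(round(c(empiricalSE = empiricalSE, theoreticalSE = theoreticalSE,
              proportionLarge = proportionLarge), 3))
```

Most simulated differences cluster around zero; how far they typically stray is governed by the standard error, which is why the standard error serves as the yardstick for judging an observed difference.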

I mentioned in section 2.6.1 that most test statistics can be thought of as the ‘variance 
explained by the model’ divided by the ‘variance that the model can’t explain’. In other 
words, effect/error. When comparing two means the ‘model’ that we fit to the data (the 
effect) is the difference between the two group means. We saw also in Chapter 2 that means 
vary from sample to sample (sampling variation) and that we can use the standard error 
as a measure of how much means fluctuate (in other words, the error in the estimate of 
the mean). Therefore, we can also use the standard error of the differences between the 
two means as an estimate of the error in our model (or the error in the difference between 
means). Therefore, we calculate the t-test using the following equation: 

\[
t = \frac{\left(\text{observed difference between sample means}\right) - \left(\text{expected difference between population means (if null hypothesis is true)}\right)}{\text{estimate of the standard error of the difference between two sample means}} \tag{9.1}
\]
The top half of the equation is the ‘model’ - our model being that the difference between
means (which we expect to be non-zero) is bigger than the expected difference (which 
in most cases will be zero). The bottom half is the ‘error’. So, just as I said in Chapter 2, 
we’re basically getting the test statistic by dividing the model (or effect) by the error in the 
model. The exact form that this equation takes depends on whether the same or different 
participants were used in each experimental condition. 
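As a rough worked example of equation (9.1), here is the calculation by hand for the spider data treated as two independent groups. One assumption needs flagging: the text has not yet given the exact form of the standard error, so this sketch uses the standard estimate for two independent groups, the square root of the sum of each group’s variance divided by its sample size. The variable names are hypothetical.

```r
# Spider data treated as two independent groups (Table 9.1)
picture <- c(30, 35, 45, 40, 50, 35, 55, 25, 30, 45, 40, 50)
real    <- c(40, 35, 50, 55, 65, 55, 50, 35, 30, 50, 60, 39)

# Top half of equation (9.1): observed difference minus expected difference
observedDifference <- mean(real) - mean(picture)
expectedDifference <- 0       # under the null hypothesis

# Bottom half: estimated standard error of the difference between means
# (assumed formula for independent groups, see the note above)
seDifference <- sqrt(var(picture)/length(picture) + var(real)/length(real))

tStatistic <- (observedDifference - expectedDifference)/seDifference
print(round(c(difference = observedDifference, se = seDifference, t = tStatistic), 3))
```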


The t-test as a general linear model


A lot of you might think it’s odd that a t-test can be used to test whether a correlation 
coefficient (or b in regression) is different from 0 and yet now I’m telling you that it can 
be used to test differences between two means. You might well be thinking ‘but correlations
and bs show relationships, not differences between means - what is this fool going on
about?’. You may be starting not to trust me, or stuffing the book in a box to post it back 
for a refund. 

I used to think this too until I read a fantastic paper by Cohen (1968), which made 
me realize what I’d been missing; the complex, thorny, weed-infested and Andy-eating- 
tarantula-inhabited world of statistics suddenly turned into a beautiful meadow filled with 
tulips and little bleating lambs all jumping for joy at the wonder of life. Actually, I’m still a 
bumbling fool trying desperately to avoid having the blood sucked from my flaccid corpse 
by the tarantulas of statistics, but it was a good paper. Recall from section 2.4.3 that all 
statistical procedures are basically the same, they’re just more or less elaborate versions of 
this simple model: 


\[ \text{outcome}_i = (\text{model}) + \text{error}_i \]


In Chapter 7 we saw that the t-test was used to test whether the regression coefficient 
of a predictor was equal to zero. The experimental design for which the independent t-test 
is used can be conceptualized as a regression equation (after all, there is one independent 
variable (predictor) and one dependent variable (outcome)). If we want to predict our outcome,
then we can use the general equation that I’ve repeated above.

If we want to use a linear model, then we saw that this general equation becomes equation
(7.2) in which the model is defined by the slope and intercept of a straight line.
Equation (9.2) shows a very similar equation in which $A_i$ is the dependent variable (outcome),
$b_0$ is the intercept, $b_1$ is the weighting of the predictor and $G_i$ is the independent
variable (predictor). Now, I’ve also included the same equation but with some of the letters
replaced with what they represent in the spider experiment (so A = anxiety, G = group).
When we run an experiment with two conditions, the independent variable has only two
values (group 1 or group 2). There are several ways in which these groups can be coded (in
the spider example we coded group 1 with the value 0 and group 2 with the value 1). This
coding variable is known as a dummy variable and values of this variable represent groups
of entities. We have come across this coding in section 7.12:

\[
\begin{aligned}
A_i &= (b_0 + b_1 G_i) + \varepsilon_i \\
\text{anxiety}_i &= (b_0 + b_1\,\text{group}_i) + \varepsilon_i
\end{aligned}
\tag{9.2}
\]

Using the spider example, we know that the mean anxiety of the picture group was 40, 
and that the group variable is equal to 0 for this condition. Look at what happens when the 
group variable is equal to 0 (the picture condition): equation (9.2) becomes (if we ignore
the residual term)

\[
\begin{aligned}
\bar{X}_{\text{picture}} &= b_0 + (b_1 \times 0) \\
b_0 &= \bar{X}_{\text{picture}} \\
b_0 &= 40
\end{aligned}
\]

Therefore, $b_0$ (the intercept) is equal to the mean of the picture group (i.e., it is the mean
of the group coded as 0). Now let’s look at what happens when the group variable is equal
to 1. This condition is the one in which a real spider was used, and the mean anxiety ($\bar{X}_{\text{real}}$)
of this condition was 47. Remembering that we have just found out that $b_0$ is equal to the
mean of the picture group ($\bar{X}_{\text{picture}}$), equation (9.2) becomes

\[
\begin{aligned}
\bar{X}_{\text{real}} &= b_0 + (b_1 \times 1) \\
\bar{X}_{\text{real}} &= \bar{X}_{\text{picture}} + b_1 \\
b_1 &= \bar{X}_{\text{real}} - \bar{X}_{\text{picture}} \\
b_1 &= 47 - 40 \\
&= 7
\end{aligned}
\]

$b_1$, therefore, represents the difference between the group means. As such, we can represent
a two-group experiment as a regression equation in which the coefficient of the
independent variable ($b_1$) is equal to the difference between group means, and the intercept
($b_0$) is equal to the mean of the group coded as 0. In regression, the t-test is used to
ascertain whether the regression coefficient ($b_1$) is equal to 0, and when we carry out a
t-test on grouped data we, therefore, test whether the difference between group means is
equal to 0.



SELF-TEST 

• Let me prove that I’m not making it up as I go along.
Using the lm() function, run a regression on the data
in spiderLong.dat with Group as the predictor and
Anxiety as the outcome.



The resulting R output should contain the regression summary table shown in Output
9.1. The first thing to notice is the value of the constant ($b_0$): its value is 40, the same as the
mean of the base category (the picture group). The second thing to notice is that the value
of the regression coefficient $b_1$ is 7, which is the difference between the two group means
(47 - 40 = 7). Finally, the t-statistic, which tests whether $b_1$ is significantly different from
zero, is not significant, indicating that $b_1$ (i.e., the difference between group means) is not
significantly different from zero.

Call:
lm(formula = Anxiety ~ Group, data = spiderLong)

Residuals:
   Min     1Q Median     3Q    Max
 -17.0   -8.5    1.5    8.0   18.0

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)        40.000      2.944  13.587 3.53e-12 ***
GroupReal Spider    7.000      4.163   1.681    0.107

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.2 on 22 degrees of freedom
Multiple R-squared: 0.1139, Adjusted R-squared: 0.07359
F-statistic: 2.827 on 1 and 22 DF, p-value: 0.1068

Output 9.1 
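
The claim that the intercept is the mean of the baseline group and the slope is the difference between means can be checked without reading Output 9.1 at all. What follows is a minimal sketch, assuming the 24 anxiety scores are typed in by hand (as in section 9.5.2.2) rather than read from spiderLong.dat; the object names spiderModel, groupMeans, b0 and b1 are illustrative only.

```r
# Recreate the spider data by hand (assumption: the chapter itself reads spiderLong.dat)
Group <- gl(2, 12, labels = c("Picture", "Real Spider"))
Anxiety <- c(30, 35, 45, 40, 50, 35, 55, 25, 30, 45, 40, 50,
             40, 35, 50, 55, 65, 55, 50, 35, 30, 50, 60, 39)
spiderLong <- data.frame(Group, Anxiety)

# Group means computed directly, with no model involved
groupMeans <- tapply(spiderLong$Anxiety, spiderLong$Group, mean)

# The same information recovered from the regression coefficients
spiderModel <- lm(Anxiety ~ Group, data = spiderLong)
b0 <- coef(spiderModel)[["(Intercept)"]]
b1 <- coef(spiderModel)[["GroupReal Spider"]]

groupMeans                      # Picture = 40, Real Spider = 47
c(intercept = b0, slope = b1)   # 40 and 7
```

Because Picture is the first level of the factor, R treats it as the group coded 0, which is why its mean turns up as the intercept.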

Although we have looked at the situation in which the different groups are independent 
(i.e., different entities are tested in different conditions) repeated-measures designs can be 
conceptualized in much the same way; however, because in this situation data points will 
not be independent it’s more complicated to explain how it works (and also unnecessary). 
However, we will get into this subject more in Chapters 13, 14 and 19. For now, I hope to 
have demonstrated that differences between means can be represented in terms of linear 
models. If you have understood this section, then you are well on your way to understand¬ 
ing the next six chapters of this book. 


Assumptions of the t-test 


Given that the t-test is basically regression, it has much the same assumptions. Both the 
independent t-test and the dependent t-test are parametric tests based on the normal distri¬ 
bution (see Chapter 5). Therefore, they assume: 

• The sampling distribution is normally distributed. In the dependent t-test this means 
that the sampling distribution of the differences between scores should be normal, not 
the scores themselves (see section 9.6.3.4). 

• Data are measured at least at the interval level. 

The independent t-test, because it is used to test different groups of people, also assumes: 

• Scores in different treatment conditions are independent (because they come from 
different people). 

• Homogeneity of variance - well, at least in theory we assume equal variances, but in 
reality we don’t (Jane Superbrain Box 9.2). 

These assumptions were explained in detail in Chapter 5 and, in that chapter, I empha¬ 
sized the need to check these assumptions before you reach the point of carrying out your 
statistical test. Let’s now look at each of the two t-tests in more detail. 


9.5. The independent t-test 

9.5.1. The independent t-test equation explained 


We’ll stick with the situation in which different entities have been tested in the different 
conditions of your experiment. This is a situation in which the independent t-test is used. 

JANE SUPERBRAIN 9.2 

What about the assumption of homogeneity of variance? 

You might have read about homogeneity of variance as being an assumption that is made by the independent t-test. It is the same assumption that we came across in regression, as the homoscedasticity assumption, and statisticians used to recommend testing for it (using Levene’s test) and if the assumption was violated, use an adjustment to correct for it. However, more recently statisticians have stopped using this approach, for two reasons. First, violating this assumption only matters if you have unequal group sizes; if you don’t have unequal group sizes, the assumption is pretty much irrelevant and can be ignored. Second, the tests of homogeneity of variance tend to work very well when you have equal group sizes and large samples (when it doesn’t matter as much if you have violated the assumption) and don’t work as well with unequal group sizes and smaller samples - which is exactly when it matters. 

Plus, there is an adjustment (called Welch’s t-test) which is able to correct for violation of this assumption - it’s quite hard to do if you have to do it by hand, but very easy to do if you have a computer. If you have violated the assumption, a correction is made - and if you haven’t violated the assumption, a correction is not made, so you might as well always do Welch’s t-test and forget about the assumption. If you’re really interested in this, I like the article by Zimmerman (2004). 


If you choose not to think about the t-test and calculating the t-statistic as a form of regression, then you can think of it in terms of two equations that differ depending on whether 
the samples contain an equal number of people. We can calculate the t-statistic by using a 
numerical version of equation (9.1); in other words, we are comparing the model or effect 
against the error. When different participants participate in different conditions, pairs of 
scores will differ not just because of the experimental manipulation, but also because of 
other sources of variance (such as individual differences between participants’ motivation, 
IQ, etc.). Therefore, we make comparisons on a per condition basis (by looking at the 
overall effect in a condition): 

$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\text{estimate of the standard error}} \qquad (9.3)$$

Instead of looking at differences between pairs of scores, we now look at differences 
between the overall means of the two samples and compare them to the differences we 
would expect to get between the means of the two populations from which the samples 
come. If the null hypothesis is true then the samples have been drawn from the same population. Therefore, under the null hypothesis $\mu_1 = \mu_2$ and therefore $\mu_1 - \mu_2 = 0$. Therefore, 
under the null hypothesis the equation becomes 


$$t = \frac{\bar{X}_1 - \bar{X}_2}{\text{estimate of the standard error}} \qquad (9.4)$$

For the independent t-test we are looking at differences between groups and so we divide 
by the standard deviation of differences between groups. We can apply the logic of sampling 
distributions to this situation. Now, imagine we took several pairs of samples - each pair containing one sample from the two different populations - and compared the means of these 
samples. From what we have learnt about sampling distributions, we know that the majority 
of samples from a population will have fairly similar means. Therefore, if we took several 
pairs of samples (from different populations), the differences between the sample means will 
be similar across pairs. However, often the difference between a pair of sample means will 
deviate by a small amount and very occasionally it will deviate by a large amount. If we could 
plot a sampling distribution of the differences between every pair of sample means that could 
be taken from two populations, then we would find that it had a normal distribution with a 
mean equal to the difference between population means ($\mu_1 - \mu_2$). The sampling distribution 
would tell us by how much we can expect the means of two (or more) samples to differ. As 
before, the standard deviation of the sampling distribution (the standard error) tells us how 
variable the differences between sample means are by chance alone. If the standard deviation 
is high then large differences between sample means can occur by chance; if it is small then 
only small differences between sample means are expected. It, therefore, makes sense that we 
use the standard error of the sampling distribution to assess whether the difference between 
two sample means is statistically meaningful or simply a chance result. Specifically, we divide 
the difference between sample means by the standard deviation of the sampling distribution. 
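
The idea of a sampling distribution of differences is easier to believe once you have watched one appear. Here is a rough simulation sketch rather than anything from the spider study: the population means, the common standard deviation of 10, the seed and the number of pairs are all made-up values chosen for illustration.

```r
set.seed(42)                        # arbitrary seed so the run is repeatable
nPairs <- 5000                      # how many pairs of samples to draw
N <- 12                             # size of each sample
mu1 <- 47; mu2 <- 40; sigma <- 10   # hypothetical population values

# For each pair, draw one sample from each population and keep the difference in means
meanDiffs <- replicate(nPairs, mean(rnorm(N, mu1, sigma)) - mean(rnorm(N, mu2, sigma)))

mean(meanDiffs)                     # close to mu1 - mu2 = 7
sd(meanDiffs)                       # the empirical standard error of the difference

# Value expected from adding the variances of the two sampling distributions
theoreticalSE <- sqrt(sigma^2 / N + sigma^2 / N)
theoreticalSE
```

A histogram of meanDiffs (for example with hist(meanDiffs)) should look roughly normal and centred on the difference between the population means.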

So, how do we obtain the standard deviation of the sampling distribution of differences 
between sample means? Well, we use the variance sum law, which states that the variance of 
a difference between two independent variables is equal to the sum of their variances (see, 
for example, Howell, 2006). This statement means that the variance of the sampling distri¬ 
bution is equal to the sum of the variances of the two populations from which the samples 
were taken. We saw earlier that the standard error is the standard deviation of the sampling 
distribution of a population. We can use the sample standard deviations to calculate the 
standard error of each population’s sampling distribution: 


$$\text{SE of sampling distribution of population 1} = \frac{s_1}{\sqrt{N_1}}$$

$$\text{SE of sampling distribution of population 2} = \frac{s_2}{\sqrt{N_2}}$$


Therefore, remembering that the variance is simply the standard deviation squared, we can 
calculate the variance of each sampling distribution: 

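$$\text{variance of sampling distribution of population 1} = \left(\frac{s_1}{\sqrt{N_1}}\right)^2 = \frac{s_1^2}{N_1}$$

$$\text{variance of sampling distribution of population 2} = \left(\frac{s_2}{\sqrt{N_2}}\right)^2 = \frac{s_2^2}{N_2}$$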

The variance sum law means that to find the variance of the sampling distribution of dif¬ 
ferences we merely add together the variances of the sampling distributions of the two 
populations: 

$$\text{variance of sampling distribution of differences} = \frac{s_1^2}{N_1} + \frac{s_2^2}{N_2}$$


To find out the standard error of the sampling distribution of differences we merely take 
the square root of the variance (because variance is the standard deviation squared): 



$$\text{SE of sampling distribution of differences} = \sqrt{\frac{s_1^2}{N_1} + \frac{s_2^2}{N_2}}$$

Therefore, equation (9.4) becomes: 

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{N_1} + \dfrac{s_2^2}{N_2}}} \qquad (9.5)$$
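
As a worked check on equation (9.5), the spider data can be plugged in by hand. The numbers used here are taken from elsewhere in the chapter: means of 47 (real spider) and 40 (picture), variances of 121.636 and 86.364 (Output 9.2), and 12 people per group. Which group is labelled 1 is an arbitrary choice made for this illustration.

```latex
\begin{aligned}
SE_{\text{difference}} &= \sqrt{\frac{s_1^2}{N_1} + \frac{s_2^2}{N_2}}
  = \sqrt{\frac{121.636}{12} + \frac{86.364}{12}}
  = \sqrt{10.136 + 7.197}
  = \sqrt{17.333} \approx 4.163 \\[4pt]
t &= \frac{\bar{X}_1 - \bar{X}_2}{SE_{\text{difference}}}
  = \frac{47 - 40}{4.163} \approx 1.681
\end{aligned}
```

Both values agree with the standard error and t-value reported for the Group coefficient in Output 9.1.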



Equation (9.5) is true only when the sample sizes are equal. Often in science it is not 
possible to collect samples of equal size (because, for example, people may not complete 
an experiment). When we want to compare two groups that contain different numbers of 
participants then equation (9.5) is not appropriate. Instead the pooled variance estimate 
t-test is used, which takes account of the difference in sample size by weighting the variance of each sample. We saw in Chapter 1 that large samples are better than small ones 
because they more closely approximate the population; therefore, we weight the variance 
by the size of sample on which it’s based (we actually weight by the number of degrees of 
freedom, which is the sample size minus 1). Therefore, the pooled variance estimate is: 


$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$


This is simply a weighted average in which each variance is multiplied (weighted) by its 
degrees of freedom, and then we divide by the sum of weights (or sum of the two degrees 
of freedom). The resulting weighted average variance is then just replaced in the t-test 
equation: 


$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}}$$
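
To see the weighting at work, it helps to use groups of different sizes. The scores below are invented purely for illustration (they are not from the spider study), and groupA, groupB, pooledVar and tPooled are hypothetical names. The sketch computes the pooled variance by hand and then cross-checks the result against t.test() with var.equal = TRUE, which requests the pooled-variance version of the test.

```r
# Hypothetical scores for two groups of unequal size (not data from the book)
groupA <- c(12, 15, 11, 18, 14, 16, 13)   # n = 7
groupB <- c(20, 17, 22, 19)               # n = 4

n1 <- length(groupA)
n2 <- length(groupB)

# Weight each sample variance by its degrees of freedom, then divide by the total df
pooledVar <- ((n1 - 1) * var(groupA) + (n2 - 1) * var(groupB)) / (n1 + n2 - 2)

# Substitute the pooled variance into the equation for t
tPooled <- (mean(groupA) - mean(groupB)) / sqrt(pooledVar / n1 + pooledVar / n2)

# Cross-check against R's built-in pooled-variance t-test
tBuiltIn <- t.test(groupA, groupB, var.equal = TRUE)$statistic[["t"]]
c(byHand = tPooled, builtIn = tBuiltIn)
```

The two values should agree to within rounding error, because the built-in test is doing the same arithmetic.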



We can compare the obtained value of t against the maximum value we would expect 
to get by chance alone in a t-distribution with the same degrees of freedom (these values can be found in the Appendix); if the value we obtain exceeds this critical value we 
can be confident that this reflects an effect of our independent variable. One thing that 
should be apparent from the equation for t is that to compute it you don’t actually need 
any raw data. All you need are the means, standard deviations and sample sizes (see R’s 
Souls’ Tip 9.2). 
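
Instead of looking the critical value up in a table, you can ask R for it with qt(), the quantile function of the t-distribution. The sketch below assumes a two-tailed test at the conventional .05 level and borrows the spider example's 22 degrees of freedom and t-value from Output 9.1; tCritical and tObserved are illustrative names.

```r
df <- 22                             # 12 + 12 - 2 for the spider data
alpha <- 0.05                        # assumed two-tailed significance criterion
tCritical <- qt(1 - alpha / 2, df)   # two-tailed critical value, roughly 2.07
tObserved <- 1.681                   # value of t reported in Output 9.1

tCritical
abs(tObserved) > tCritical           # FALSE: the observed t does not exceed the critical value
```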

The derivation of the t-statistic is merely to provide a conceptual grasp of what we are 
doing when we carry out a t-test using R. Therefore, if you don’t know what on earth I’m 
babbling on about then don’t worry about it (just spare a thought for my cat: he has to 
listen to this rubbish all the time) because R knows how to do it and that’s all that matters. 


9.5.2. Doing the independent t-test 


I have probably bored most of you to the point of wanting to eat your own legs by now. 
Equations are boring and that is why R was invented to help us minimize our contact with 
them. Using our spider data again (spiderLong.dat), we have 12 arachnophobes who were 
exposed to a picture of a spider and 12 different spider-phobes who were exposed to a 
real-life tarantula (the groups are coded using the variable Group). Their anxiety was measured in each condition (Anxiety). I have already described how the data are arranged (see 
section 9.2), so we can move straight onto doing the test itself. 

R’s Souls’ Tip 9.2: Computing t from means, SDs and Ns 

You can compute a t-test in R from only the two group means, the two group standard deviations and the two 
group sizes. First we’ll calculate: x1 (mean of group 1), x2 (mean of group 2), sd1 (standard deviation of group 
1), sd2 (standard deviation of group 2), n1 (sample size of group 1) and n2 (sample size of group 2). 


x1 <- mean(spiderLong[spiderLong$Group=="Real Spider",]$Anxiety) 
x2 <- mean(spiderLong[spiderLong$Group=="Picture",]$Anxiety) 
sd1 <- sd(spiderLong[spiderLong$Group=="Real Spider",]$Anxiety) 
sd2 <- sd(spiderLong[spiderLong$Group=="Picture",]$Anxiety) 
n1 <- length(spiderLong[spiderLong$Group=="Real Spider",]$Anxiety) 
n2 <- length(spiderLong[spiderLong$Group=="Picture",]$Anxiety) 


Now we can calculate the t-test by writing and executing a function (see R’s Souls’ Tip 6.2): 

ttestfromMeans<-function(x1, x2, sd1, sd2, n1, n2) 
{ 
	df<-n1 + n2 - 2 
	poolvar<-(((n1-1)*sd1^2)+((n2-1)*sd2^2))/df 
	t<-(x1-x2)/sqrt(poolvar*((1/n1)+(1/n2))) 
	sig<-2*(1-(pt(abs(t),df))) 
	paste("t(df = ", df, ") = ", t, ", p = ", sig, sep = "") 
} 


Executing these commands creates a function called ttestfromMeans which takes the means, standard devia¬ 
tions and sample sizes of the two groups, and outputs the resulting t-test to compare those two means. Let’s look 
at the contents of the function: 


• df<-n1 + n2 - 2 computes the degrees of freedom. 

• poolvar<-(((n1-1)*sd1^2)+((n2-1)*sd2^2))/df computes the pooled variance estimate, $s_p^2$. 

• t<-(x1-x2)/sqrt(poolvar*((1/n1)+(1/n2))) computes the t-statistic. 

• sig<-2*(1-(pt(abs(t),df))) calculates the p-value. 

• paste("t(df = ", df, ") = ", t, ", p = ", sig, sep = "") pastes together some text and the values of t, df and 
the p-value to print to the console. 

We can now use this function on the means, standard deviations and sample sizes that we computed earlier 
by executing: 

ttestfromMeans(x1, x2, sd1, sd2, n1, n2) 

The result is the same as if we’d computed it from the raw data (see Output 9.3): 

[1] "t(df = 22) = 1.68134561495341, p = 0.106839192382597" 


9.5.2.1. General procedure for the independent t-test 

To conduct an independent t-test you should follow this general procedure: 

1 Enter data. 

2 Explore your data: as with any analysis, it’s a good idea to begin by graphing your 
data and computing some descriptive statistics. You should also check distributional 
assumptions (see Chapter 5). 

3 Compute the test: you can then run the t-test. Depending on what you found in the 
previous step, you might need to run a robust version of the test. 

4 Calculate an effect size: it is useful to quantify your effect with an effect size. 

We will work through these steps in turn. 

9.5.2.2. Entering data 


One of the strange things about the t-test using R is that if you use the t.test() function, 
it actually doesn’t matter how you enter your data: you can enter it in either wide or long 
format, because the function contains an option paired = TRUE/FALSE, which tells it 
whether to treat data as dependent or independent. However, R 
Commander does care, so we’ll stick to convention and enter the data in a long format, 
which R Commander expects. You should have already entered the data if you completed 
the self-test earlier in the chapter; if not, you can enter the data as: 

Group<-gl(2, 12, labels = c( "Picture", "Real Spider")) 

Anxiety<-c(30, 35, 45, 40, 50, 35, 55, 25, 30, 45, 40, 50, 40, 35, 50, 55, 
65, 55, 50, 35, 30, 50, 60, 39) 

The data are entered in two columns (one called Group which specifies whether a real 
spider or picture was used and one called Anxiety which indicates the person’s anxiety 
when faced with the picture/real spider). These commands create a variable called Anxiety 
with the 24 anxiety scores contained within it, and a variable called Group, which uses the 
gl() function to create a factor variable with two groups each containing 12 participants. 
Finally, we can merge these variables into a dataframe called spiderLong by executing: 

spiderLong<-data.frame(Group, Anxiety) 
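
If you ever need the same scores in wide format (one column per condition), you can build it from the long dataframe rather than typing everything again. This is only a sketch of one way to do it: the column names picture and real are chosen to match the spiderWide listing shown later in this chapter, and splitting by group like this only makes sense because both groups contain 12 scores.

```r
# Long-format data, entered as in the commands above
Group <- gl(2, 12, labels = c("Picture", "Real Spider"))
Anxiety <- c(30, 35, 45, 40, 50, 35, 55, 25, 30, 45, 40, 50,
             40, 35, 50, 55, 65, 55, 50, 35, 30, 50, 60, 39)
spiderLong <- data.frame(Group, Anxiety)

# Wide format: pull out each group's scores into its own column
spiderWide <- data.frame(
  picture = spiderLong$Anxiety[spiderLong$Group == "Picture"],
  real    = spiderLong$Anxiety[spiderLong$Group == "Real Spider"]
)

head(spiderWide, 3)   # first three rows: 30/40, 35/35, 45/50
```

In an independent design the rows of the wide dataframe have no meaning (row 1 is not one person measured twice); the layout is just a convenient container for two sets of scores.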


9.5.2.3. The independent t-test using R Commander 


As always, first import the data, using Data=>Import data=>from text file, clipboard, or 
URL... (see section 3.7.3), click OK and choose the file spiderLong.dat. 




FIGURE 9.4 The independent t-test using R Commander 

To run an independent t-test, choose Statistics=>Means=>Independent samples t-test. 
Figure 9.4 shows the dialog box that appears. On the left-hand side, in the list labelled 
Groups (pick one), choose a variable that distinguishes your two experimental groups. R 
Commander expects this variable to be a factor - there is only one in our data set, so this 
variable has been highlighted already (Group). On the right-hand side, in the list labelled 
Response Variable (pick one), choose the outcome variable. R Commander expects this 
variable to be numeric, and has highlighted the only variable in our data set, Anxiety. 

Our hypothesis is two-sided (or two-tailed), so that option can be left as it is, and we’d 
like 95% confidence intervals - although if we’d like a different confidence level, we can 
change .95 to a different value (.99, to get 99% confidence intervals, for example). Finally, 
we don’t want to make the assumption of equal variances: if we make the assumption, 
and we’re wrong, then our p-value will be wrong; however, if we don’t make the assump¬ 
tion when it would have been OK to make the assumption, it doesn’t matter, because the 
p-value won’t change (Jane Superbrain Box 9.2). To run the analysis click on OK. The 
output is described in section 9.5.2.6. 


9.5.2.4. Exploring data and testing assumptions 


In Chapter 4 we saw that it is always a good idea to look at a graph of your data. In this 
case we will produce a line graph with error bars. 




SELF-TEST 

Use ggplot2 to produce a boxplot and bar chart with 
error bars showing confidence intervals for the spider 
data. 


The bar chart should look like Figure 9.2 and the boxplot is shown in Figure 9.5. The 
bar chart shows that the error bars overlap, indicating that, on face value, there are no 
between-group differences (although this measure is only approximate). The boxplot 
shows that the spread of scores was reasonably similar in the two groups, although the real 
spider condition exhibits a spread of scores below the median that is wider than that above 
the median. 

FIGURE 9.5 

To get some descriptive statistics for each group we can use the by() function that we 
encountered in Chapter 5. Remember that this function takes the general form: 

by(variable, group, output) 

in which variable is the thing that you want to summarize (in this case Anxiety), group is 
the variable that defines the groups by which you want to organize the output (in this case 
Group), and output is a function that tells R what output you would like to see (i.e., the 
mean). If we use the function stat.desc() from the package pastecs then R will output a host 
of useful descriptive statistics. Therefore, by combining by() and stat.desc(), we can get a 
table of descriptives for each group in a single line of code: 

by(spiderLong$Anxiety, spiderLong$Group, stat.desc, basic = FALSE, norm = 
TRUE) 

Output 9.2 shows the resulting descriptive statistics (I have edited the output slightly 
to fit the page so you will see more decimal places). From this output, we can see that the 
group who saw the picture of the spider had a mean anxiety of 40, with a standard deviation of 9.29. What’s more, the standard error of that group (the standard deviation of the 
sampling distribution) is 2.68 (SE = 9.293/√12 = 9.293/3.464 = 2.68). In addition, the 
table tells us that the average anxiety level in participants who were shown a real spider 
was 47, with a standard deviation of 11.03 and a standard error of 3.18 (SE = 11.029/√12 
= 11.029/3.464 = 3.18). Also, both normality tests are non-significant (p = .852 for the 
picture group and p = .621 for the real group) implying that we can probably assume normality of errors in the model. 

spiderLong$Group: Picture 
    median       mean    SE.mean CI.mean.0.95        var    std.dev   coef.var 
    40.000     40.000      2.683        5.905     86.364      9.293      0.232 
  skewness   skew.2SE   kurtosis     kurt.2SE normtest.W normtest.p 
     0.000      0.000     -1.394       -0.566      0.965      0.852 

spiderLong$Group: Real Spider 
    median       mean    SE.mean CI.mean.0.95        var    std.dev   coef.var 
    50.000     47.000     3.1834        7.007    121.636     11.029      0.235 
  skewness   skew.2SE   kurtosis     kurt.2SE normtest.W normtest.p 
    -0.006     -0.004     -1.460       -0.592      0.949      0.621 

Output 9.2 
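
If the pastecs package isn’t to hand, the headline numbers in Output 9.2 can be reproduced with base R alone. The following is a sketch under that assumption rather than a replacement for stat.desc(): it types the data in by hand, uses tapply() to split the scores by group, and uses shapiro.test() for the normality test. The names groupMean, groupSD, groupN, groupSE and normP are made up for this example.

```r
# Spider data entered by hand (assumption: same scores as spiderLong.dat)
Group <- gl(2, 12, labels = c("Picture", "Real Spider"))
Anxiety <- c(30, 35, 45, 40, 50, 35, 55, 25, 30, 45, 40, 50,
             40, 35, 50, 55, 65, 55, 50, 35, 30, 50, 60, 39)

groupMean <- tapply(Anxiety, Group, mean)
groupSD   <- tapply(Anxiety, Group, sd)
groupN    <- tapply(Anxiety, Group, length)

# Standard error of each mean: standard deviation divided by the square root of N
groupSE <- groupSD / sqrt(groupN)

round(cbind(mean = groupMean, sd = groupSD, se = groupSE), 3)

# Shapiro-Wilk normality test within each group; only the p-values are kept
normP <- tapply(Anxiety, Group, function(x) shapiro.test(x)$p.value)
normP
```

The standard errors should come out at about 2.68 and 3.18, matching the hand calculations in the text.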


9.5.2.5. The independent t-test using R 


To do a t-test we use the function t.test(). There are two different ways that you can use this 
function and it depends on whether your group data are in a single column (as they are in 
spiderLong.dat) or if they are in two different columns (as they are in spiderWide.dat). If 
you have the data for different groups stored in a single column, then the t.test() function 
is used like the lm() function (in other words, like a regression): 

newModel<-t.test(outcome ~ predictor, data = dataFrame, paired = FALSE/TRUE) 

in which: 

• newModel is an object created that contains information about the model. We can get 
summary statistics for this model by executing the name of the model. 

• outcome is a variable that contains the scores for the outcome measure (in this case 

Anxiety). 

• predictor is a variable that tells us to which group a score belongs (in this case Group). 

• dataFrame is the name of the dataframe containing the aforementioned variables. 

• paired determines whether or not you want to do a paired/dependent t-test (in which 
case include paired = TRUE) or an independent t-test (in which case exclude the 
option because this is the default, or include paired = FALSE). 

However, if you have the data for different groups stored in two columns, then the 
t.test() function takes this form: 

newModel<-t.test(scores group 1, scores group 2, paired = FALSE/TRUE) 
in which, the options are the same as before except: 

• scores group 1 is a variable that contains the scores for the first group. 

• scores group 2 is a variable that contains the scores for the second group. (If you want 
to do a one-sample t-test then simply exclude this second variable.) 

In both forms of the function, there are additional options that can be specified, but do 
not need to be if you are happy to use the defaults. These are: 

• alternative = “two.sided”/“less”/“greater”: This option determines whether you’re 
doing a two-tailed test, and if not the direction of your hypothesis. It has three possible values: the default value is to do a two-tailed test (alternative = “two.sided”, or 
don’t include the option). If you want to do a one-tailed test then you can specify 
either alternative = “less” (you predict that the difference between means will be less 
than zero) or alternative = “greater” (you predict that the difference between means 
will be greater than zero). 

• mu = 0: A difference between means of zero is the default null hypothesis, but can be 
changed. For example, including mu = 3 in the function would test the null hypothesis that the difference between means is equal to 3. 

• var.equal: By default the function assumes that variances are unequal ( var.equal = 
FALSE). If for some reason you want to assume equal variances (we can’t think why 
you would), then include the option var.equal = TRUE. 

• conf.level = 0.95: This determines the confidence level used for the confidence interval (it does not change the p-value). By default it is 0.95 (for 95% confidence intervals) and usually you’d exclude 
this option, but if you want to use a different value, say 99%, you could include 
conf.level = 0.99. 

• na.action: If you have complete data (as we have here) you can exclude this option, 
but if you have missing values (i.e., ‘NA’s in the dataframe) then it can be useful to 
use na.action = na.exclude, which will exclude all cases with missing values - see R’s 
Souls’ Tip 7.1. 
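
To get a feel for what these options actually change, it can help to run the same comparison a few different ways. The sketch below assumes the spider data have been typed in by hand; the object names welch, pooled and ci99 are invented for the example. It compares the default (unequal variances) with var.equal = TRUE, and the default 95% interval with a 99% one.

```r
# Spider data in long format (assumption: same scores as spiderLong.dat)
Group <- gl(2, 12, labels = c("Picture", "Real Spider"))
Anxiety <- c(30, 35, 45, 40, 50, 35, 55, 25, 30, 45, 40, 50,
             40, 35, 50, 55, 65, 55, 50, 35, 30, 50, 60, 39)
spiderLong <- data.frame(Group, Anxiety)

welch  <- t.test(Anxiety ~ Group, data = spiderLong)                    # default: var.equal = FALSE
pooled <- t.test(Anxiety ~ Group, data = spiderLong, var.equal = TRUE)  # assumes equal variances
ci99   <- t.test(Anxiety ~ Group, data = spiderLong, conf.level = 0.99) # wider interval

# Degrees of freedom: adjusted under the default, N1 + N2 - 2 when variances are pooled
c(welch = welch$parameter[["df"]], pooled = pooled$parameter[["df"]])

# p-values differ only slightly here because the groups are the same size
c(welch = welch$p.value, pooled = pooled$p.value)

# Changing conf.level widens the interval but leaves the p-value alone
welch$conf.int
ci99$conf.int
```

The pooled version reproduces the 22 degrees of freedom and p-value of the linear model in Output 9.1.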

Therefore, we could carry out an independent t-test on the data in the spiderLong dataframe (which looks like this): 

         Group Anxiety 
1      Picture      30 
2      Picture      35 
3      Picture      45 
4      Picture      40 
5      Picture      50 
6      Picture      35 
7      Picture      55 
8      Picture      25 
9      Picture      30 
10     Picture      45 
11     Picture      40 
12     Picture      50 
13 Real Spider      40 
14 Real Spider      35 
15 Real Spider      50 
16 Real Spider      55 
17 Real Spider      65 
18 Real Spider      55 
19 Real Spider      50 
20 Real Spider      35 
21 Real Spider      30 
22 Real Spider      50 
23 Real Spider      60 
24 Real Spider      39 


by executing: 

ind.t.test<-t.test(Anxiety ~ Group, data = spiderLong) 
ind.t.test 

which creates a model called ind.t.test based on predicting anxiety scores (Anxiety) from 
group membership (Group). We can view this model by executing its name (hence the 
second command). 

Alternatively, if we’d input the data as in spiderWide, which looks like this: 



   picture real 
1       30   40 
2       35   35 
3       45   50 
4       40   55 
5       50   65 
6       35   55 
7       55   50 
8       25   35 
9       30   30 
10      45   50 
11      40   60 
12      50   39 


we would need to run the t-test by executing: 

ind.t.test<-t.test(spiderWide$real, spiderWide$picture) 
ind.t.test 

These commands create a model called ind.t.test based on the variables real and picture in 
the spiderWide dataframe. As before, we view this model by executing its name. 

9.5.2.6. Output from the independent t-test 


Regardless of how you enter the data or specify the t.test() function, the output is basically 
identical: see Output 9.3. First we are given the value for t, the degrees of freedom and 
the p-value. The p-value is greater than .05, and hence we cannot reject the null hypothesis of no difference between the groups. Note that the size of the t-value is the same 
as when we ran the analysis as a linear model in section 9.4.2 (Output 9.1); only its sign differs, because here the mean of the real spider group is subtracted from the mean of the picture group. The p-value 
is a little bit different, because the degrees of freedom have been adjusted to correct for 
heteroscedasticity. 4 
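
You can confirm for yourself that the two ways of calling t.test() agree. This is a minimal sketch, assuming the data are typed in by hand and that the wide columns are called picture and real as in the spiderWide listing; longTest and wideTest are hypothetical names. The only difference you should see is the sign of t, which depends on which mean is subtracted from which.

```r
# Long format: one column of scores, one column coding the group
Group <- gl(2, 12, labels = c("Picture", "Real Spider"))
Anxiety <- c(30, 35, 45, 40, 50, 35, 55, 25, 30, 45, 40, 50,
             40, 35, 50, 55, 65, 55, 50, 35, 30, 50, 60, 39)
spiderLong <- data.frame(Group, Anxiety)

# Wide format: one column of scores per group
spiderWide <- data.frame(picture = Anxiety[Group == "Picture"],
                         real    = Anxiety[Group == "Real Spider"])

longTest <- t.test(Anxiety ~ Group, data = spiderLong)      # Picture mean minus Real Spider mean
wideTest <- t.test(spiderWide$real, spiderWide$picture)     # real mean minus picture mean

# Same size of t, opposite signs; identical degrees of freedom and p-value
c(long = longTest$statistic[["t"]], wide = wideTest$statistic[["t"]])
c(long = longTest$parameter[["df"]], wide = wideTest$parameter[["df"]])
c(long = longTest$p.value, wide = wideTest$p.value)
```

Swapping the order of the two arguments in the wide version would flip the sign again, so the sign of t on its own carries no information beyond which group you listed first.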

The confidence intervals give the range of the difference that we would expect to include 
the true difference on 95% of occasions. The interval ranges from -15.6 to +1.65, which 
indicates a quite wide range of possible values for the true difference. Finally, we’re given 
the means of the two groups: 40 and 47 (we knew these already from the descriptive 
statistics). 

You might notice that the degrees of freedom are weird. Earlier on we said that the 
degrees of freedom are calculated by adding the two sample sizes and then subtracting 
the number of samples ($df = N_1 + N_2 - 2 = 12 + 12 - 2 = 22$); however, the output 
reports 21.39. This discrepancy is because this function uses Welch’s t-test, which does 
not make the assumption of homogeneity of variance. Welch’s test uses a correction which 
adjusts the degrees of freedom according to how unequal the variances are, so rather than 
22 degrees of freedom (as we’d expect) we have 21.39 degrees of freedom. This has had 
the effect of changing the p-value from 0.1068 to 0.1072, both of which we would report 
as 0.107 anyway. When the sample sizes are equal, the adjustment will not make very 
much difference. (The formula is really big and complicated, and doesn’t make much 
sense to me anyway, but if you’re interested, Wikipedia has it: 
http://en.wikipedia.org/wiki/Welch's_t_test.) 
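
For the curious, the adjustment is the standard Welch-Satterthwaite approximation, and it is less frightening once numbers go in. The worked example below uses the variances from Output 9.2 (86.364 and 121.636, so 86.364/12 = 7.197 and 121.636/12 = 10.136); the notation $df_{\text{Welch}}$ is introduced here for illustration and is not used elsewhere in the chapter.

```latex
df_{\text{Welch}}
= \frac{\left(\dfrac{s_1^2}{N_1} + \dfrac{s_2^2}{N_2}\right)^2}
       {\dfrac{\left(s_1^2/N_1\right)^2}{N_1 - 1} + \dfrac{\left(s_2^2/N_2\right)^2}{N_2 - 1}}
= \frac{(7.197 + 10.136)^2}{\dfrac{7.197^2}{11} + \dfrac{10.136^2}{11}}
= \frac{300.444}{14.049}
\approx 21.385
```

This matches the df = 21.385 shown in Output 9.3. When the two variances are identical and the groups are the same size, the expression reduces to $N_1 + N_2 - 2$.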

Welch Two Sample t-test 
data: Anxiety by Group 

t = -1.6813, df = 21.385, p-value = 0.1072 

alternative hypothesis: true difference in means is not equal to 0 

95 percent confidence interval: 

-15.648641 1.648641 

sample estimates: 

mean in group Picture mean in group Real Spider 
40 47 


Output 9.3 


9.5.2.7. Robust methods to compare independent means 


Wilcox (2005) describes some robust procedures for comparing two means from independent groups. Load these functions using the instructions in section 5.8.4. Having done this, 
we now have access to Wilcox’s functions. Regardless of whether your data come from the 
same or different entities, these functions require the data to be in two different columns 
(one for each experimental condition). We already have the data in this format in the spiderWide dataframe. 

4 It’s possible to make the same adjustment with the lm() function to correct for heteroscedasticity, but it’s much 
more complicated. The approach is called a ‘sandwich’ estimator, and it’s done using the sandwich() function. 
The sandwich function takes the arguments bread and meat. (Some descriptions written by vegetarians use bread 
and tofu.) (Really, we’re not joking: see http://www.bsos.umd.edu/gvpt/uslaner/robustregression.pdf.) 

The first robust function, yuen(), is based on a trimmed mean. It takes the general form: 

yuen(scores group 1, scores group 2, tr = .2, alpha = .05) 

in which 

• scores group 1 is a variable that contains the scores for the first group. 

• scores group 2 is a variable that contains the scores for the second group. 

• tr is the proportion of trimming to be done. The default is .2 or 20%, and you need 
to use this option only if you want to specify an amount other than 20%. 

• alpha sets the alpha level for the test. You need to include this option only if you don’t 
want to use the conventional level of .05. 

As such, for a test of independent means based on 20% trimming we simply execute: 
yuen(spiderWide$real, spiderWide$picture) 

If we wanted to trim only 10% of the data then we could execute: 
yuen(spiderWide$real, spiderWide$picture, tr = .1) 

If you execute the first of these commands (the one using the default 20% trimming) you will see Output 9.4, which shows that based on this 
robust test there is not a significant difference in anxiety scores across the two spider 
groups, T(13.91) = 1.296, p = .216. 
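
If you’d like to see what a trimmed-means test is doing under the bonnet, the calculation can be written out in base R. To be clear, this is a from-scratch sketch of Yuen’s procedure and not Wilcox’s own code: the function name yuenSketch and everything inside it are written for this illustration. Each group is trimmed by the proportion tr, the variance is estimated from Winsorized scores (the trimmed-off scores are replaced by the nearest score that was kept), and the degrees of freedom are adjusted in the same spirit as Welch’s test.

```r
# A hand-rolled sketch of Yuen's test on trimmed means (not Wilcox's function)
yuenSketch <- function(x, y, tr = 0.2) {
  # Winsorized variance: replace the g most extreme scores at each end with the
  # nearest retained score, then take the ordinary variance
  winVar <- function(v) {
    v <- sort(v)
    n <- length(v)
    g <- floor(tr * n)
    if (g > 0) {
      v[1:g] <- v[g + 1]
      v[(n - g + 1):n] <- v[n - g]
    }
    var(v)
  }
  # Number of scores left in a group after trimming
  keep <- function(v) length(v) - 2 * floor(tr * length(v))

  h1 <- keep(x)
  h2 <- keep(y)
  # Squared standard error of each trimmed mean
  d1 <- (length(x) - 1) * winVar(x) / (h1 * (h1 - 1))
  d2 <- (length(y) - 1) * winVar(y) / (h2 * (h2 - 1))

  dif <- mean(x, trim = tr) - mean(y, trim = tr)   # difference between trimmed means
  se <- sqrt(d1 + d2)
  teststat <- abs(dif) / se
  df <- (d1 + d2)^2 / (d1^2 / (h1 - 1) + d2^2 / (h2 - 1))
  list(dif = dif, se = se, teststat = teststat, df = df,
       p.value = 2 * (1 - pt(teststat, df)))
}

picture <- c(30, 35, 45, 40, 50, 35, 55, 25, 30, 45, 40, 50)
real    <- c(40, 35, 50, 55, 65, 55, 50, 35, 30, 50, 60, 39)
yuenSketch(real, picture)
```

With 20% trimming the two lowest and two highest scores in each group are dropped before the means are computed, which is why the difference is 6.75 rather than 7.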

We can also compare trimmed means but include a bootstrap by using yuenbt(), which 
takes the general form: 

yuenbt(scores group 1, scores group 2, tr = .2, nboot = 599, alpha = .05, 
side = F) 

As you can see, this function takes the same form as yuen(), but has two additional 
instructions: 

• nboot = 599: This specifies the number of bootstrap samples to be used. If you 
exclude this option then the default is 599, which, if anything, you might want to 
increase (but it’s probably not necessary to use more than 2000). 

• side = F: By default the function bootstraps confidence intervals as is, which means 
that they can be asymmetric. If you want to force the confidence intervals to be sym¬ 
metrical then include side = T in the function. If you do this you will get a p-value, 
but by default you won’t (although you can infer significance from whether the con¬ 
fidence interval crosses zero). 

For a bootstrap test of independent means based on 20% trimming we simply execute: 

yuenbt(spiderWide$real, spiderWide$picture, nboot = 2000) 

If you execute this command you will see Output 9.4, which shows that based on this 
robust test there is not a significant difference (because the confidence interval crosses 
zero) in anxiety scores across the two spider groups, Y = 1.19 (-5.40, 17.87). 

yuen() output          yuenbt() output        pb2gen() output 
$ci                    $ci                    $ci 
[1] -4.429 17.929      [1] -5.399 17.869      [1] -2.25 20.00 
$p.value               $test.stat             $p.value 
[1] 0.2161433          [1] 1.193625           [1] 0.16 
$dif                   $p.value               $sq.se 
[1] 6.75               [1] NA                 [1] 21.96355 
$se 
[1] 5.209309 
$teststat 
[1] 1.295757 
$crit 
[1] 2.146035 
$df 
[1] 13.91372 




Output 9.4 

A final method is to use a bootstrap and an M-estimator (rather than a trimmed mean) by 
applying the pb2gen() function. This function has the general form: 

pb2gen(spiderWide$real, spiderWide$picture, alpha=.05, nboot=2000, est = 
mom) 

which is the same as yuenbt() except that we can choose an estimator (the default of mom is 
fine). As such, for a bootstrap test of independent M-estimators we execute: 

pb2gen(spiderWide$real, spiderWide$picture, nboot=2000) 

If you execute the pb2gen() function with the default settings you will see Output 9.4, 
which shows that based on this robust test there is not a significant difference (because 
the confidence interval crosses zero) in anxiety scores across the two spider groups, p = 
.16. In short, all three robust methods suggest that the type of spider stimulus does not 
affect anxiety. 
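
To make the idea of bootstrapping a trimmed mean a bit more concrete, here is a minimal sketch in base R. It is not what yuenbt() or pb2gen() do internally (yuenbt() uses a bootstrap-t method and pb2gen() defaults to an M-estimator); it is a plain percentile bootstrap of the difference between 20% trimmed means. The helper name bootTrimDiff and the seed are made up for this illustration, and the scores are typed in directly so that the example stands alone.

```r
# Anxiety scores from the spider example (the same values as in spiderWide)
picture <- c(30, 35, 45, 40, 50, 35, 55, 25, 30, 45, 40, 50)
real    <- c(40, 35, 50, 55, 65, 55, 50, 35, 30, 50, 60, 39)

# Hypothetical helper (not one of Wilcox's functions): percentile bootstrap
# confidence interval for the difference between two trimmed means
bootTrimDiff <- function(x, y, tr = 0.2, nboot = 2000, alpha = 0.05) {
  diffs <- replicate(nboot, {
    # Resample each group independently, with replacement
    xStar <- sample(x, replace = TRUE)
    yStar <- sample(y, replace = TRUE)
    mean(xStar, trim = tr) - mean(yStar, trim = tr)
  })
  list(
    dif = mean(x, trim = tr) - mean(y, trim = tr),
    ci  = unname(quantile(diffs, c(alpha / 2, 1 - alpha / 2)))
  )
}

set.seed(123)  # arbitrary seed so that the resampling is repeatable
bootResult <- bootTrimDiff(real, picture)
bootResult
```

Because the resampling is random, the interval will differ a little from the ones in Output 9.4, but the difference between trimmed means (6.75) should match the $dif value reported by yuen().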


9.5.2.8. Calculating the effect size 

Even though our t-statistic is not statistically significant, this doesn’t necessarily mean that 
our effect is unimportant in practical terms. To discover whether the effect is substantive 
we need to use what we know about effect sizes (see section 2.6.4). I’m going to stick with 
the effect size r because it’s widely understood and frequently used. Converting a t-value 
into an r-value is actually really easy; we can use the following equation (e.g. Rosenthal, 
1991; Rosnow & Rosenthal, 2005): 


$$r = \sqrt{\frac{t^2}{t^2 + df}}$$

We know the value of t and the df from the R output and so we can compute r as follows: 

$$r = \sqrt{\frac{(-1.681)^2}{(-1.681)^2 + 21.39}} = \sqrt{\frac{2.826}{24.22}} = .34$$

We can also calculate r (the effect size) using R. The value of t is stored in our model 
as a variable called statistic[[1]] and the degrees of freedom are stored as parameter[[1]]. 
(Actually statistic and parameter can contain many things and so the ‘[[1]]’ tells R that we 
want the first value only.) We can access these values just as we would any other variable: 
we tell R where to find them (i.e., the name of the model, in this case ind.t.test) and then 
append a dollar sign and the name of the variable. Therefore, we can create a variable t that 
contains the value of t by executing: 

t<-ind.t.test$statistic[[1]] 

We can similarly create a variable called df containing the degrees of freedom by 
executing: 

df<-ind.t.test$parameter[[1]] 

We can then calculate r by executing: 

r <- sqrt(t^2/(t^2+df)) 

This command is simply the equation above but in R-speak, and it creates a variable called 
r. If we want to see the value we could execute the variable name, or use the round() 
function to display it rounded off to, say, 3 decimal places: 

round(r, 3) 

The result is the same as if we calculated by hand (r = .342). If you think back to our 
benchmarks for effect sizes, this represents a medium effect (it is around .3, the threshold for a 
medium effect). Therefore, even though the effect was non-significant, it still represented 
a fairly substantial effect. 
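
If you expect to convert t to r more than once, the calculation can be wrapped in a small function. The sketch below is one way of doing it: the name tToR is invented for this example, and the scores are entered directly so that the code runs on its own. It also passes the two sets of scores to t.test() as separate vectors rather than using the ind.t.test model from earlier, so the sign of t comes out positive (real is listed first), which makes no difference to r because t is squared.

```r
# Anxiety scores from the spider example
picture <- c(30, 35, 45, 40, 50, 35, 55, 25, 30, 45, 40, 50)
real    <- c(40, 35, 50, 55, 65, 55, 50, 35, 30, 50, 60, 39)

# Hypothetical helper: convert a t-statistic and its degrees of freedom to r
tToR <- function(t, df) {
  sqrt(t^2 / (t^2 + df))
}

# Welch t-test treating the two sets of scores as independent groups
ind <- t.test(real, picture)

# Extract t and df from the model and convert them to an effect size
rInd <- tToR(ind$statistic[[1]], ind$parameter[[1]])
round(rInd, 3)
```

The rounded value should agree with the hand calculation above (r = .342).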

9.5.2.9. Reporting the independent t-test 

There is a fairly standard way to report any test statistic: you usually state the finding to which 
the test relates and then report the test statistic, its degrees of freedom and the probability 
value of that test statistic. There has also been a recent move (by the American Psychological 
Association among others) to recommend that an estimate of the effect size is routinely reported. 
Although effect sizes are still rather sporadically used, I want to get you into good habits so 
we’ll start thinking about effect sizes now. The R output tells us that the value of t was -1.68, 
that the number of degrees of freedom on which this was based was 21.39, and that it was 
not significant at p < .05. We can also see the means for each group. We could write this as: 

✓ On average, participants experienced greater anxiety from real spiders (M = 47.00, 
SE = 3.18) than from pictures of spiders (M = 40.00, SE = 2.68). This difference was 
not significant, t(21.39) = -1.68, p > .05; however, it did represent a medium-sized 
effect, r = .34. 

Note how we’ve reported the means in each group (and standard errors) as before. For 
the test statistic everything is much the same as before except that I’ve had to report that 
p was greater than (>) .05 rather than less than (<). Finally, note that I’ve commented on 
the effect size at the end. 

CRAMMING SAM’S TIPS 


The independent t-test 


• The independent t-test compares two means, when those means have come from different groups of entities; for example, 
if you have used different participants in each of two experimental conditions. 

• Look at the value labelled p-value. If the value is less than .05 then the means of the two groups are significantly different. 

• Look at the values of the means to tell you how the groups differ. 

• Report the t-statistic, the degrees of freedom and the p-value. Also report the means and their corresponding standard errors 
(or draw an error bar chart). 

• Calculate and report the effect size. 


9.6. The dependent t-test 


As with the independent t-test, the dependent t-test is a numeric version of equation (9.1): 

$$t = \frac{\bar{D} - \mu_D}{s_D / \sqrt{N}} \qquad (9.6)$$

Equation (9.6) compares the mean difference between our samples ($\bar{D}$) to the difference 
that we would expect to find between population means ($\mu_D$), and then takes into account 
the standard error of the differences ($s_D / \sqrt{N}$). If the null hypothesis is true, then we expect 
there to be no difference between the population means (hence $\mu_D = 0$). 


9.6.1. Sampling distributions and the standard error 


In equation (9.1) I referred to the lower half of the equation as the standard error of 
differences. The standard error was introduced in section 2.5.1 and is simply the standard 
deviation of the sampling distribution. Have a look back at this section now to refresh your 
memory about sampling distributions and the standard error. Sampling distributions have 
several properties that are important. For one thing, if the population is normally 
distributed then so is the sampling distribution; in fact, if the samples contain more than about 50 
scores the sampling distribution should be normally distributed. The mean of the sampling 
distribution is equal to the mean of the population, so the average of all possible sample 
means should be the same as the population mean. This property makes sense because if a 
sample is representative of the population then you would expect its mean to be equal to 
that of the population. However, sometimes samples are unrepresentative and their means 
differ from the population mean. On average, though, a sample mean will be very close to 
the population mean and only rarely will the sample mean be substantially different from 
that of the population. A final property of a sampling distribution is that its standard 
deviation is equal to the standard deviation of the population divided by the square root of the 
number of observations in the sample. As I mentioned before, this standard deviation is 
known as the standard error. 

We can extend this idea to look at the differences between sample means. If you were to 
take several pairs of samples from a population and calculate their means, then you could 
also calculate the difference between their means. I mentioned earlier that on average 
sample means will be very similar to the population mean: as such, most samples will have very 
similar means. Therefore, most of the time the difference between sample means from the 
same population will be zero, or close to zero. However, sometimes one or both of the 
samples could have a mean very deviant from the population mean and so it is possible 
to obtain large differences between sample means by chance alone. However, this would 
happen less frequently. 

In fact, if you plotted these differences between sample means as a histogram, you would 
again have a sampling distribution with all of the properties previously described. The 
standard deviation of this sampling distribution is called the standard error of differences. 
A small standard error tells us that most pairs of samples from a population will have very 
similar means (i.e., the difference between sample means should normally be very small). 
A large standard error tells us that sample means can deviate quite a lot from the 
population mean and so differences between pairs of samples can be quite large by chance alone. 
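
You can watch a sampling distribution of differences appear by simulating it. The sketch below draws many pairs of samples from a single normal population and stores the difference between each pair of means. The population mean, standard deviation, sample size, number of pairs and seed are all arbitrary values chosen for illustration; nothing here comes from the spider data.

```r
set.seed(42)      # arbitrary seed so the simulation is repeatable

popMean <- 40     # illustrative population mean
popSD   <- 10     # illustrative population standard deviation
n       <- 20     # scores in each sample
nPairs  <- 10000  # number of pairs of samples to draw

# For each pair: take two random samples from the same population and
# store the difference between their means
meanDiffs <- replicate(nPairs, {
  mean(rnorm(n, popMean, popSD)) - mean(rnorm(n, popMean, popSD))
})

mean(meanDiffs)       # centre of the distribution: should be close to zero
sd(meanDiffs)         # empirical standard error of differences
popSD * sqrt(2 / n)   # value expected in theory for two independent samples
```

Executing hist(meanDiffs) afterwards draws the histogram described above: most differences pile up around zero and large differences are rare. Increasing popSD (or decreasing n) widens the distribution, which is what a larger standard error of differences means.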


9.6.2. The dependent t-test equation explained 


In an experiment, a person’s score in condition 1 will be different from their 
score in condition 2, and this difference could be very large or very small. If 
we calculate the differences between each person’s score in each condition 
and add up these differences we would get the total amount of difference. 
If we then divide this total by the number of participants we get the average 
difference (thus how much, on average, a person’s score differed in 
condition 1 compared to condition 2). This average difference is $\bar{D}$ in equation 
(9.6) and it is an indicator of the systematic variation in the data (i.e., it 
represents the experimental effect). We need to compare this systematic 
variation against some kind of measure of the ‘systematic variation that we could 
naturally expect to find’. In Chapter 2 we saw that the standard deviation 
was a measure of the ‘fit’ of the mean to the observed data (i.e., it measures the error in 
the model when the model is the mean), but it does not measure the fit of the mean to 
the population. To do this we need the standard error (see the previous section, where we 
revised this idea). 

The standard error is a measure of the error in the mean as a model of the population. 
In this context, we know that if we had taken two random samples from a population (and 
not done anything to these samples) then the means could be different just by chance. The 
standard error tells us by how much these samples could differ. A small standard error 
means that sample means should be quite similar, so a big difference between two sample 
means is unlikely. In contrast, a large standard error tells us that big differences between 
the means of two random samples are more likely. Therefore it makes sense to compare 
the average difference between means against the standard error of these differences. This 
gives us a test statistic that, as I’ve said numerous times in previous chapters, represents 
model/error. Our model is the average difference between condition means, and we divide 
by the standard error which represents the error associated with this model (i.e., how 
similar two random samples are likely to be from this population). 

To clarify, imagine that an alien came down and cloned me millions of times. This 
population is known as Landy of the Andys (this would be possibly the most dreary and strangely 
terrifying place I could imagine). Imagine the alien was interested in arachnophobia in this 
population (because I am petrified of spiders). Everyone in this population (my clones) will 
be the same as me, and would behave in an identical way to me. If you took two samples 
from this population and measured their fear of spiders, then the means of these samples 
would be the same (we are clones), so the difference between sample means would be zero. 
Also, because we are all identical, then all samples from the population will be perfect 
reflections of the population (the standard error would be zero also). Therefore, if we were 
to get two samples that differed even very slightly then this would be very unlikely indeed 
(because our population is full of cloned Andys). Therefore, a difference between samples 
must mean that they have come from different populations. Of course, in reality we don’t 
have samples that perfectly reflect the population, but the standard error gives an idea of 
how well samples reflect the population from which they came. 

Therefore, by dividing by the standard error we are doing two things: (1) standardizing 
the average difference between conditions (this just means that we can compare values of 
t without having to worry about the scale of measurement used to measure the outcome 
variable); and (2) contrasting the difference between means that we have against the 
difference that we could expect to get based on how well the samples represent the 
populations from which they came. If the standard error is large, then large differences between 
samples are more common (because the distribution of differences is more spread out). 
Conversely, if the standard error is small, then large differences between sample means are 
uncommon (because the distribution is very narrow and centred around zero). Therefore, 
if the average difference between our samples is large, and the standard error of differences 
is small, then we can be confident that the difference we observed in our sample is not a 
chance result. If the difference is not a chance result then it must have been caused by the 
experimental manipulation. 

In a perfect world, we could calculate the standard error by taking all possible pairs 
of samples from a population, calculating the differences between their means, and then 
working out the standard deviation of these differences. However, in reality this is 
impossible. Therefore, we estimate the standard error from the standard deviation of differences 
obtained within the sample ($s_D$) and the sample size ($N$). Think back to section 2.5.1 where 
we saw that the standard error is simply the standard deviation divided by the square root 
of the sample size; likewise the standard error of differences ($\sigma_{\bar{D}}$) is simply the standard 
deviation of differences divided by the square root of the sample size: 

$$\sigma_{\bar{D}} = \frac{s_D}{\sqrt{N}}$$


If the standard error of differences is a measure of the unsystematic variation within the 
data, and the sum of difference scores represents the systematic variation, then it should 
be clear that the t-statistic is simply the ratio of the systematic variation in the experiment 
to the unsystematic variation. If the experimental manipulation creates any kind of effect, 
then we would expect the systematic variation to be much greater than the unsystematic 
variation (so at the very least, t should be greater than 1). If the experimental manipulation 
is unsuccessful then we might expect the variation caused by individual differences to be 
much greater than that caused by the experiment (so t will be less than 1). We can 
compare the obtained value of t against the maximum value we would expect to get by chance 
alone in a t-distribution with the same degrees of freedom (these values can be found in the 
Appendix); if the value we obtain exceeds this critical value we can be confident that this 
reflects an effect of our independent variable. 
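
To see equation (9.6) at work, the following sketch computes the dependent t-statistic step by step for the spider scores, assuming that each position in the two vectors belongs to the same participant. The variable names (D, meanD, seD and so on) are just labels chosen for this example. The critical value comes from qt() rather than from the table in the Appendix.

```r
# Anxiety scores for the same 12 participants in each condition
picture <- c(30, 35, 45, 40, 50, 35, 55, 25, 30, 45, 40, 50)
real    <- c(40, 35, 50, 55, 65, 55, 50, 35, 30, 50, 60, 39)

D <- real - picture        # difference score for each participant
N <- length(D)             # number of participants

meanD <- mean(D)           # average difference (systematic variation)
seD   <- sd(D) / sqrt(N)   # standard error of differences (unsystematic variation)

# Equation (9.6) with the population mean difference set to 0 (null hypothesis)
tByHand <- (meanD - 0) / seD

# Two-tailed critical value at the .05 level with N - 1 degrees of freedom
tCrit <- qt(0.975, df = N - 1)

c(meanD = meanD, seD = seD, t = tByHand, critical = tCrit)
```

The average difference is 7 and the resulting t exceeds the critical value, which is the comparison described in the paragraph above.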


9.6.3. Dependent t-tests using R 


Using our spider data again, we’ll now assume that the data were collected using the same 
participants (spiderWide.dat): we have 12 arachnophobes who were exposed to a picture 
of a spider (picture) and on a separate occasion a real live tarantula (real). Their anxiety 
was measured in each condition (half of the participants were exposed to the picture before 
the real spider while the other half were exposed to the real spider first). 


9.6.3.1. General procedure for the dependent t-test 

To conduct a dependent t-test you should follow the same general procedure as for the 
independent t-test: 

1 Enter data. 

2 Explore your data: begin by graphing your data and computing some descriptive 
statistics. You should also check distributional assumptions (see Chapter 5). We have 
done this already so we won’t do it again - the data are the same as the previous 
example. 

3 Compute the test: you can then run the t-test. Depending on what you found in the 
previous step, you might need to run a robust version of the test. 

4 Calculate an effect size: it is useful to quantify your effect with an effect size. 

We will work through these steps in turn. 


9.6.3.2. Entering data 


As I mentioned before, R does not expect your data in a particular format for the t-test, 
though R Commander does. For the dependent t-test it expects data in the wide format. 
You should have already entered the data in the dataframe spiderWide; if not, you can enter 
the data as: 

picture<-c(30, 35, 45, 40, 50, 35, 55, 25, 30, 45, 40, 50) 
real<-c(40, 35, 50, 55, 65, 55, 50, 35, 30, 50, 60, 39) 

These commands create a variable called picture which contains the anxiety scores when a 
picture was used and a variable called real which contains the corresponding anxiety scores 
when faced with the real spider. We can merge these variables into a dataframe called 
spiderWide by executing: 

spiderWide<-data.frame(picture, real) 


9.6.3.3. The dependent t-test using R Commander 

As always, import the data, using Data=>Import data=>from text file, clipboard, or URL... 
(see section 3.7.3), click on OK and choose the file spiderWide.dat. 

To run a dependent t-test, choose Statistics=>Means=>Paired t-test.... Figure 9.6 shows 
the dialog box that appears. On the left-hand side, in the list labelled First variable (pick 
one) choose a variable representing your first experimental group (I’ve chosen picture). 
On the right-hand side, in the list labelled Second variable (pick one), choose the variable 
representing your second experimental group (I’ve chosen real). 

Our hypothesis is two-sided (or two-tailed), so that option can be left as it is, and we’d 
like 95% confidence intervals - although if we’d like a different confidence level, we can 
change .95 to a different value (.99 to get 99% confidence intervals, for example). To run 
the analysis click on OK. The output is described in section 9.6.3.6. 

FIGURE 9.6 The dependent t-test using R Commander 




9.6.3.4. Exploring data and testing assumptions 

We’ve explored these data already in section 9.5.2.4. I’ll just remind you that with the data 
in this format you can get descriptive statistics by executing the following command: 

stat.desc(spiderWide, basic = FALSE, norm = TRUE) 

We talked about the assumption of normality in Chapter 5 and discovered that parametric 
tests (like the dependent t-test) assume that the sampling distribution is normal. This should 
be true in large samples, but in small samples people often check the normality of their data 
because if the data themselves are normal then the sampling distribution is likely to be also. 
With the dependent t-test we analyse the differences between scores because we’re 
interested in the sampling distribution of these differences (not the raw data). Therefore, if you 
want to test for normality before a dependent t-test then what you should do is compute 
the differences between scores, and then check if this new variable is normally distributed 
(or use a big sample and not worry about normality). It is possible to have two measures 
that are highly non-normal that produce beautifully distributed differences. 



SELF-TEST 

Using the spiderWide.dat data, compute the 
differences between the picture and real condition 
and check the assumption of normality for these 
differences. 
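
If you get stuck on the self-test, here is one possible approach, sketched in base R. It assumes the scores are entered as in section 9.6.3.2, stores the differences in a new column (the name diff is my choice, not something the data file contains) and uses shapiro.test() as the normality check; the stat.desc() command shown earlier with norm = TRUE is an alternative if you have the relevant package loaded.

```r
# Re-create the wide-format dataframe so the example runs on its own
picture <- c(30, 35, 45, 40, 50, 35, 55, 25, 30, 45, 40, 50)
real    <- c(40, 35, 50, 55, 65, 55, 50, 35, 30, 50, 60, 39)
spiderWide <- data.frame(picture, real)

# Difference between conditions for each participant
spiderWide$diff <- spiderWide$real - spiderWide$picture

# Shapiro-Wilk test of normality on the differences (not on the raw scores)
normTest <- shapiro.test(spiderWide$diff)
normTest
```

A non-significant Shapiro-Wilk test (p greater than .05) would suggest that the differences do not deviate significantly from normality. With only 12 scores the test has little power, so a Q-Q plot from qqnorm(spiderWide$diff) is worth a look as well.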


9.6.3.5. The dependent t-test using R 


To do a dependent t-test we again use the function t.test() but this time include the option 
paired = TRUE. In section 9.5.2.5 we saw that the form of the command depends on the 
format of the data. In this case we have scores from different groups stored in different 
columns, so we could execute: 

dep.t.test<-t.test(spiderWide$real, spiderWide$picture, paired = TRUE) 
dep.t.test 

Note that this command is identical to one of the ones we used in section 9.5.2.5 for the 
independent t-test, except that we have included paired = TRUE so that R knows to treat 
the scores as dependent. These commands create a model called dep.t.test based on the 
variables real and picture in the spiderWide dataframe. We view this model by executing its 
name (hence the second command). 

If we had our data stored in long format so that our group scores are in a single column 
and group membership is expressed in a second column (as they are in spiderLong.dat), 
we can still run a dependent t-test. Again, we run it in the same way that we did for an 
independent t-test but we include the paired = TRUE option: 

dep.t.test<-t.test(Anxiety ~ Group, data = spiderLong, paired = TRUE) 

dep.t.test 

which creates a model called dep.t.test based on predicting anxiety scores (Anxiety) from 
group membership (Group). We can view this model by executing its name. 


9.6.3.6. Output from the dependent t-test 


Regardless of how you enter the data or specify the t.test() function, the output is 
identical: see Output 9.5. The test statistic, t, is calculated by dividing the mean of differences by 
the standard error of differences (see equation (9.6)): t = 2.47. The size of t is compared 
against known values based on the degrees of freedom. When the same participants have 
been used, the degrees of freedom are simply the sample size minus 1 (df = N - 1 = 11); 
you should check this value is what you expect it to be, to ensure you haven’t made a 
mistake. R uses the degrees of freedom to calculate the exact probability that a value 
of t as big as the one obtained could occur if the null hypothesis were true (i.e., there 
was no difference between these means). The probability for the spider data is very low 
(p = .031) and in fact it tells us that there is only a 3.1% chance that a value of t this big could 
happen if the null hypothesis were true. We saw in Chapter 2 that we generally accept a 
p < .05 as statistically meaningful; therefore, this t is significant because .031 is smaller 
than .05. The fact that the t-value is a positive number tells us that the first condition (the 
real condition) had a larger mean than the second (the picture condition) and so the real 
spider led to greater anxiety than the picture. Therefore, we can conclude that exposure to 
a real spider caused significantly more reported anxiety in arachnophobes than exposure 
to a picture, t(11) = 2.47, p < .05. 

        Paired t-test 

data:  spiderWide$real and spiderWide$picture 
t = 2.4725, df = 11, p-value = 0.03098 
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval: 
  0.7687815 13.2312185 
sample estimates: 
mean of the differences 
                      7 

Output 9.5 

This output provides a 95% confidence interval for the mean difference. Imagine we took 
100 samples from a population of difference scores and calculated their means ($\bar{D}$) and a 
confidence interval for that mean. In 95 of those samples the constructed confidence intervals 
contain the true value of the mean difference. The confidence interval tells us the boundaries 
within which the true mean difference is likely to lie. So, assuming this sample’s confidence 
interval is one of the 95 out of 100 that contains the population value, we can say that the true 
mean difference lies between 0.77 and 13.23. The importance of this interval is that it does not 
contain zero (i.e., both limits are positive) because this tells us that the true value of the mean 
difference is unlikely to be zero. Crucially, if we were to compare pairs of random samples 
from a population we would expect most of the differences between sample means to be zero. 
But since our interval, based on our two samples, does not contain zero, we can be confident 
that our two samples do not represent random samples from the same population. Instead they 
represent samples from different populations induced by the experimental manipulation. 
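
The p-value and the confidence interval in Output 9.5 can be rebuilt from the difference scores, which is a useful check on what t.test() is doing. The sketch below is a hand calculation under the usual assumptions of the dependent t-test; the variable names are arbitrary, and the scores are entered directly so that it runs on its own.

```r
# Anxiety scores for the same 12 participants in each condition
picture <- c(30, 35, 45, 40, 50, 35, 55, 25, 30, 45, 40, 50)
real    <- c(40, 35, 50, 55, 65, 55, 50, 35, 30, 50, 60, 39)

D     <- real - picture     # difference scores
N     <- length(D)
dfD   <- N - 1              # degrees of freedom
seD   <- sd(D) / sqrt(N)    # standard error of differences
tObs  <- mean(D) / seD      # observed t

# Two-tailed probability of a t at least this extreme if the null were true
pTwoSided <- 2 * pt(-abs(tObs), df = dfD)

# 95% confidence interval: mean difference plus or minus critical t times SE
ciByHand <- mean(D) + c(-1, 1) * qt(0.975, df = dfD) * seD

round(pTwoSided, 5)
round(ciByHand, 4)
```

These values should match the p-value (0.03098) and the interval (0.7688 to 13.2312) in Output 9.5, and the lower limit sits above zero, as discussed above.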


9.6.3.7. Robust methods to compare dependent means 


As with independent means, there are equivalent robust functions to test dependent groups 
in Rand Wilcox’s (2005) book. Load these functions using the instructions in section 5.8.4. 
As with the functions for independent designs, the data need to be in two different columns 
(one for each experimental condition). We already have the data in this format in the 
spiderWide dataframe. 

The first robust function, yuend(), is based on a trimmed mean. It takes the general form: 

yuend(scores group 1, scores group 2, tr = .2, alpha = .05) 

In other words, it works in exactly the same way as the yuen() function in section 9.5.2.7. 
Refer back to that section for a more detailed description of the format of these functions. 
As such, for a test of dependent means based on 20% trimming we simply execute: 

yuend(spiderWide$real, spiderWide$picture) 

If you execute this command you will see Output 9.6, which shows that based on this 
robust test there is not a significant difference in anxiety scores across the two spider 
groups, T(7) = 1.86, p = .106. 


yuend() output              ydbt() output 

$ci                         $ci 
[1] -1.843818 15.343818     [1] -1.6298 15.1298 

$siglevel                   $dif 
[1] 0.1056308               [1] 6.75 

$dif                        $p.value 
[1] 6.75                    [1] 0.105 [1] NA 

$se 
[1] 3.634327 

$teststat 
[1] 1.85729 

$df 
[1] 7 

Output 9.6 

We can also compare trimmed means but include a bootstrap by using ydbt(), which takes 
the general form: 

ydbt(scores group 1, scores group 2, tr = .2, nboot = 599, alpha = .05, side = F) 

As you can see, this function takes the same form as yuenbt() in section 9.5.2.7. As such, for 
a bootstrap test of dependent means based on 20% trimming we simply execute: 

ydbt(spiderWide$real, spiderWide$picture, nboot = 2000) 

If you execute this command you will see Output 9.6, which shows that based on this 
robust test there is not a significant difference (because the confidence interval crosses 
zero) in anxiety scores across the two spider groups, Y = 6.75 (-1.63, 15.13), p = .105. 

A final method is to use a bootstrap and an M-estimator (rather than trimmed mean) by 
applying the bootdpci() function. This function has the general form: 

bootdpci(scores group 1, scores group 2, alpha=.05, nboot=2000, est = tmean) 

For a bootstrap test of dependent M-estimators we execute: 

bootdpci(spiderWide$real, spiderWide$picture, est=tmean, nboot=2000) 

If you execute the bootdpci() function with the default settings you will see Output 9.7, 
which shows that based on this robust test there is a significant difference (because the 
confidence interval does not cross zero and p is less than .05) in anxiety scores across the 
two spider groups, ψ = 7.5 (0.50, 13.13), p = .037. In short, the robust methods disagree 
about whether the type of spider stimulus affects anxiety. 

$output 

con.num psihat p.value p.crit ci.lower ci.upper 
[1,] 1 7.5 0.037 0.05 0.5 13.125 

Output 9.7 

9.6.3.8. Calculating the effect size 


Even though our t-statistic is statistically significant, this doesn’t mean our effect is 
important in practical terms. To discover whether the effect is substantive we need to use what 
we know about effect sizes (see section 2.6.4). We can compute this value in the same way 
that we did for the independent t-test (section 9.5.2.8) by executing: 5 

t<-dep.t.test$statistic[[1]] 
df<-dep.t.test$parameter[[1]] 
r <- sqrt(t^2/(t^2+df)) 
round(r, 3) 

[1] 0.598 

Notice that the code is identical to last time we used it except that we have used the dep.t.test 
model to get the values of t and the degrees of freedom. You may also notice that the effect 
has grown. If you think back to our benchmarks for effect sizes this represents a very large 
effect (it is above .5, the threshold for a large effect). Therefore, as well as being statistically 
significant, this effect is large and probably a substantive finding. This growth in the effect 
size might seem slightly odd given that we used exactly the same data (but see section 9.7). 


5 Actually, this will overestimate the effect size because of the correlation between the two conditions. This is quite 
a technical issue and I’m trying to keep things simple here, but bear this in mind and if you’re interested read 
Dunlap, Cortina, Vaslow, and Burke (1996). 

9.6.3.9. Reporting the dependent t-test 


The rules that I made up - erm, I mean, reported - for the independent t-test pretty much 
apply for the dependent t-test. In this example the R output tells us that the value of t was 
2.47, that this was based on 11 degrees of freedom, and that it was significant at p = .031. 
We can also see the means for each group. We could write this as: 

✓ On average, participants experienced significantly greater anxiety from real spiders 
(M = 47.00, SE = 3.18) than from pictures of spiders (M = 40.00, SE = 2.68), t(11) = 
2.47, p < .05, r = .60. 

Note how we’ve reported the means in each group (and standard errors) in the standard 
format. For the test statistic, note that we’ve used an italic t to denote the fact that we’ve 
calculated a t-statistic, then in brackets we’ve put the degrees of freedom and then stated 
the value of the test statistic. The probability can be expressed in several ways: often people 
report things to a standard level of significance (such as .05) as I have done here, but some¬ 
times people will report the exact significance. Finally, note that I’ve reported the effect 
size at the end - you won’t always see this in published papers but that’s no excuse for you 
not to report it! 

Try to avoid writing vague, unsubstantiated things like this: 

X People were more scared of real spiders (t = 2.47). 


More scared than what? Where are the degrees of freedom? Was the result statistically 
significant? Was the effect important (what was the effect size)? 



CRAMMING SAM’S TIPS 


The dependent t-test 


• The dependent t-test compares two means, when those means have come from the same entities; for example, if you have 
used the same participants in each of two experimental conditions. 

• Look at the p-value. If the value is less than .05 then the means of the two conditions are significantly different. 

• Look at the values of the means to tell you how the conditions differ. 

• Report the t-statistic, the degrees of freedom and the significance value. Also report the means and their corresponding 
standard errors. 

• If you’re feeling brave, calculate and report the effect size too. 


9.7. Between groups or repeated measures? 


The two examples in this chapter are interesting (honestly!) because they illustrate the dif¬ 
ference between data collected using the same participants and data collected using different 
participants. The examples use the same scores in each condition. When analysed as though 
the data came from the same participants the result was a significant difference between 
means, but when analysed as though the data came from different participants there was no 
significant difference between group means. This may seem like a puzzling finding - after 
all the numbers were identical in both examples. What this illustrates is the relative power 
of repeated-measures designs. When the same participants are used across conditions the 
unsystematic variance (often called the error variance) is reduced dramatically, making it 
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Labcoat Leni’s Real Research 9.1 


Board, B. J., & Fritzon, K. (2005). Psychology, Crime & Law , 11 , 17-32. 


You don’t have to 
be mad to work here, 
but it helps® 


In the UK you often see the ‘humorous’ slogan ‘You don’t have to be mad to work here, but it helps’ displayed
in work places. Well, Board and Fritzon (2005) took this a step further by measuring whether 39 senior business
managers and chief executives from leading UK companies were mad (well, had personality disorders).
They tested them with the Minnesota Multiphasic Personality Inventory Scales for DSM III Personality Disorders 
(MMPI-PD), which is a well-validated measure of 11 personality disorders: Histrionic, Narcissistic, Antisocial, 
Borderline, Dependent, Compulsive, Passive-aggressive, Paranoid, Schizotypal, Schizoid and Avoidant. They 
needed a comparison group, and what better one to choose than 317 legally classified psychopaths at Broadmoor 
Hospital (a high-security psychiatric hospital in the UK). 

The authors report the means and standard deviations for these two groups in Table 2 of their paper. Using
these values we can run t-tests on these means. The data from Board and Fritzon’s Table 2 are in the file
Board&Fritzon2005.dat. Use this file to run t-tests to see whether managers score higher on personality
disorder questionnaires than legally classified psychopaths.

Report these results. What do you conclude? 

Answers are in the additional material on the companion website (or look at Table 2 in the original article). 




easier to detect any systematic variance. It is often assumed that the way in which you collect
data is irrelevant, but I hope to have illustrated that it can make the difference between
detecting a difference and not detecting one. In fact, researchers have carried out studies
using the same participants in experimental conditions, then repeated the study using different
participants in experimental conditions, and then used the method of data collection as
an independent variable in the analysis. Typically, they have found that the method of data
collection interacts significantly with the results found (see Erlebacher, 1977).
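
The contrast between the two analyses can be sketched in a few lines of R. The scores below are invented purely for illustration (they are not the data analysed in this chapter), and the names cond_a and cond_b are hypothetical; the only point is that identical numbers can lead to different conclusions depending on whether t.test() is told that the scores are paired.

```r
# Hypothetical scores: the same numbers are analysed in two different ways.
cond_a <- c(3, 5, 7, 9, 11, 13)
cond_b <- c(4, 7, 8, 11, 12, 15)

# Repeated-measures analysis: scores treated as coming from the same entities.
dependent <- t.test(cond_b, cond_a, paired = TRUE)

# Between-groups analysis: scores treated as coming from different entities
# (t.test() applies Welch's correction by default).
independent <- t.test(cond_b, cond_a, paired = FALSE)

# Same numbers, different p-values.
round(c(dependent = dependent$p.value, independent = independent$p.value), 3)
```

With these made-up scores the paired analysis removes the large differences between individuals from the error term, so the small but consistent difference between conditions is detected; the independent analysis leaves those individual differences in the error term and the same difference is not significant.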


What have I discovered about statistics?


We started this chapter by looking at my relative failures as a human being compared to
Simon Hudson before investigating some problems with the way R produces error bars for
repeated-measures designs. We then had a look at some general conceptual features of the
t-test, a parametric test that’s used to test differences between two means. After this general
taster, we moved on to look specifically at the dependent t-test (used when your conditions
involve the same entities). I explained how it was calculated, how to do it in R and how to
interpret the results. We then discovered much the same for the independent t-test (used
when your conditions involve different entities). I also rattled on excitedly about how a situation
with two conditions can be conceptualized as a general linear model, by which point
those of you who have a life had gone to the pub for a stiff drink. My excitement about
things like general linear models could explain why Clair Sparks chose Simon Hudson all
those years ago. Perhaps she could see the writing on the wall! Fortunately, I was a ruthless
pragmatist at the age of 10, and the Clair Sparks episode didn’t seem to concern me
unduly; I just set my sights elsewhere during the obligatory lunchtime game of kiss chase.
These games were the last I would see of women for quite some time ...
R packages used in this chapter


ggplot2

Rcmdr 

pastecs 

WRS 

R functions used in this chapter 

bootdpci()
by()
data.frame()
head()
length()
lm()
mean()
names()
paste()
pb2gen()
return()
round()
sandwich()
sd()
sqrt()
stat.desc()
summary()
t.test()
ydbt()
yuen()
yuenbt()
yuend()

Key terms that I’ve discovered 

Dependent t-test

Standard error of 

Grand mean 

Variance sum law 

Independent t-test

Welch's t-test


Smart Alex’s tasks 


These scenarios are taken from Field and Hole (2003). In each case analyse the data in R. 



• Task 1: One of my pet hates is ‘pop psychology’ books. Along with banishing Freud
from all bookshops, it is my avowed ambition to rid the world of these rancid putrefaction-ridden
wastes of trees. Not only do they give psychology a very bad name by
stating the bloody obvious and charging people for the privilege, but they are also
considerably less enjoyable to look at than the trees killed to produce them (admittedly
the same could be said for the turgid tripe that I produce in the name of education,
but let’s not go there just for now). Anyway, as part of my plan to rid the world
of popular psychology I did a little experiment. I took two groups of people who
were in relationships and randomly assigned them to one of two conditions. One
group read the famous popular psychology book Women Are from Bras, Men Are from
Penis, whereas another group read Marie Claire. I tested only 10 people in each of
these groups, and the dependent variable was an objective measure of their happiness
with their relationship after reading the book. I didn’t make any specific prediction
about which reading material would improve relationship happiness. The data are in
the file Penis.dat. Analyse them with the appropriate t-test.
• Task 2: Imagine Twaddle and Sons, the publishers of Women Are from Bras, Men
Are from Penis, were upset about my claims that their book was about as useful as a
paper umbrella. They decided to take me to task and design their own experiment in
which participants read their book and one of my books (Field and Hole) at different
times. Relationship happiness was measured after reading each book. To maximize
their chances of finding a difference they used a sample of 500 participants, but got
each participant to take part in both conditions (they read both books). The order
in which books were read was counterbalanced and there was a delay of 6 months
between reading the books. They predicted that reading their wonderful contribution
to popular psychology would lead to greater relationship happiness than reading
some dull and tedious book about experiments. The data are in FieldHole.dat.
Analyse them using the appropriate t-test.

Answers can be found on the companion website (or for more detail see Field and Hole, 
2003). 



Further reading 


Field, A. P., & Hole, G. (2003). How to design and report experiments. London: Sage. (In my completely
unbiased opinion this is a useful book to get some more background on experimental
methods.)

Miles, J. N. V., & Banyard, P. (2007). Understanding and using statistics in psychology: A practical
introduction. London: Sage. (A fantastic and amusing introduction to statistical theory.)

Rosnow, R. L., & Rosenthal, R. (2005). Beginning behavioral research: A conceptual primer (5th ed.).
Upper Saddle River, NJ: Pearson/Prentice Hall.

Wright, D. B., & London, K. (2009). First steps in statistics (2nd ed.). London: Sage. (This book has
very clear introductions to the t-test.)


Interesting real research 


Board, B. J., & Fritzon, K. (2005). Disordered personalities at work. Psychology, Crime & Law,
11(1), 17-32.





10 Comparing several means: ANOVA (GLM 1)



FIGURE 10.1 My brother Paul (left) and I (right) in our very fetching school uniforms.



10.1. What will this chapter tell me?


There are pivotal moments in everyone’s life, and one of mine was at the age of 11. Where 
I grew up in England there were three choices when leaving primary school and moving 
onto secondary school: (1) state school (where most people go); (2) grammar school (where 
clever people who pass an exam called the 11+ go); and (3) private school (where rich 
people go). My parents were not rich and I am not clever and consequently I failed my 11+, 
so private school and grammar school (where my clever older brother had gone) were out. 
This left me to join all of my friends at the local state school. I could not have been happier.
Imagine everyone’s shock when my parents received a letter saying that some extra spaces 
had become available at the grammar school; although the local authority could scarcely 
believe it and had checked the 11+ papers several million times to confirm their findings, 
I was next on their list. I could not have been unhappier. So, I waved goodbye to all of my 
friends and trundled off to join my brother at Ilford County High School for Boys (a school 
that still hit students with a cane if they were particularly bad and that, for some considerable
time and with good reason, had ‘H.M. Prison’ painted in huge white letters on its
roof). It was goodbye to normality, and hello to 6 years of learning how not to function in 
society. I often wonder how my life would have turned out had I not gone to this school; 
in the parallel universes where the letter didn’t arrive and Andy went to state school, or 
where my parents were rich and Andy went to private school, what became of him? If we 
wanted to compare these three situations we couldn’t use a t-test because there are more
than two conditions.¹ However, this chapter tells us all about the statistical models that we
use to analyse situations in which we want to compare more than two conditions: analysis of
variance (or ANOVA to its friends). This chapter will begin by explaining the theory of
ANOVA when different participants are used (independent ANOVA). We’ll then look at how
to carry out the analysis in R and interpret the results.


10.2. The theory behind ANOVA


10.2.1. Inflated error rates


Before explaining how ANOVA works, it is worth mentioning why we don’t simply
carry out several t-tests to compare all combinations of groups that have been
tested. Imagine a situation in which there were three experimental conditions and
we were interested in differences between these three groups. If we were to carry
out t-tests on every pair of groups, then that would involve doing three separate
tests: one to compare groups 1 and 2, one to compare groups 1 and 3, and one
to compare groups 2 and 3. If each of these t-tests uses a .05 level of significance
then for each test the probability of falsely rejecting the null hypothesis (known
as a Type I error) is only 5%. Therefore, the probability of no Type I errors is .95
(95%) for each test. If we assume that each test is independent (hence, we can
multiply the probabilities) then the overall probability of no Type I errors is
$.95^3 = .95 \times .95 \times .95 = .857$, because the probability of no Type I errors is .95 for each test and there
are three tests. Given that the probability of no Type I errors is .857, then we can calculate
the probability of making at least one Type I error by subtracting this number from 1
(remember that the maximum probability of any event occurring is 1). So, the probability
of at least one Type I error is $1 - .857 = .143$, or 14.3%. Therefore, across this group of
tests, the probability of making a Type I error has increased from 5% to 14.3%, a value
greater than the criterion accepted by scientists. This error rate across statistical tests conducted
on the same experimental data is known as the familywise or experimentwise error
rate. An experiment with three conditions is a relatively simple design, and so the effect
of carrying out several tests is not severe. If you imagine that we now increase the number
of experimental conditions from three to five (which is only two more groups) then the



¹ Really, this is the least of our problems: there’s the small issue of needing access to parallel universes.
number of t-tests that would need to be done increases to 10.² The familywise error rate can
be calculated using the following general equation:

\[ \text{familywise error} = 1 - (0.95)^{n} \qquad (10.1) \]

in which n is the number of tests carried out on the data. With 10 tests carried out, the
familywise error rate is $1 - .95^{10} = .40$, which means that there is a 40% chance of having
made at least one Type I error. For this reason we use ANOVA rather than conducting lots
of t-tests.
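
The arithmetic in this section is easy to check in R. The sketch below wraps equation (10.1) and the number-of-comparisons formula from the footnote in two small helper functions; the function names (n_comparisons and familywise_error) are hypothetical ones chosen for this illustration, and the calculation assumes, as the text does, that the tests are independent.

```r
# Number of pairwise comparisons among k groups: C = k! / (2 * (k - 2)!)
n_comparisons <- function(k) factorial(k) / (2 * factorial(k - 2))

# Familywise error rate for n independent tests at a given alpha (equation 10.1)
familywise_error <- function(n, alpha = 0.05) 1 - (1 - alpha)^n

n_comparisons(3)                 # three groups need 3 tests
n_comparisons(5)                 # five groups need 10 tests
round(familywise_error(3), 3)    # 0.143
round(familywise_error(10), 2)   # 0.4
```

Calling familywise_error(n_comparisons(k)) for larger values of k shows how quickly the error rate grows as conditions are added.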


10.2.2. Interpreting F



When we perform a t-test, we test the hypothesis that the two samples have the same 
mean. Similarly, ANOVA tells us whether three or more means are the same, so it 
tests the null hypothesis that all group means are equal. An ANOVA produces an 
F-statistic or F-ratio, which is similar to the t-statistic in that it compares the amount 
of systematic variance in the data to the amount of unsystematic variance. In other 
words, F is the ratio of the model to its error. 

ANOVA is an omnibus test, which means that it tests for an overall effect:
so, it does not provide specific information about which groups were
affected. Suppose an experiment was conducted with three different groups,
and the F-ratio tells us that the means of these three samples are not equal (i.e., that
$\bar{X}_1 = \bar{X}_2 = \bar{X}_3$ is not true). There are several ways in which the means can differ. The
first possibility is that all three sample means are significantly different ($\bar{X}_1 \neq \bar{X}_2 \neq \bar{X}_3$).
A second possibility is that the means of groups 1 and 2 are the same but group 3
has a significantly different mean from both of the other groups ($\bar{X}_1 = \bar{X}_2 \neq \bar{X}_3$).
Another possibility is that groups 2 and 3 have similar means but group 1 has a significantly
different mean ($\bar{X}_1 \neq \bar{X}_2 = \bar{X}_3$). Finally, groups 1 and 3 could have similar means
but group 2 has a significantly different mean from both ($\bar{X}_1 = \bar{X}_3 \neq \bar{X}_2$). So, in an experiment,
the F-ratio tells us only that the experimental manipulation has had some effect, but
it doesn’t tell us specifically what the effect was.


10.2.3. ANOVA as regression


I’ve hinted several times that all statistical tests boil down to variants on regression. In fact, 
ANOVA is just a special case of regression. This surprises many scientists because ANOVA 
and regression are usually used in different situations. The reason is largely historical in that 


² These comparisons are group 1 vs. 2, 1 vs. 3, 1 vs. 4, 1 vs. 5, 2 vs. 3, 2 vs. 4, 2 vs. 5, 3 vs. 4, 3 vs. 5 and 4 vs. 5.
The number of tests required - let’s call it C - is calculated using this equation:

\[ C = \frac{k!}{2(k-2)!} \]

in which k is the number of experimental conditions. The ! symbol stands for factorial, which means that you
multiply the value preceding the symbol by all of the whole numbers between zero and that value (so
$5! = 5 \times 4 \times 3 \times 2 \times 1 = 120$). Thus, with five conditions we find that:

\[ C = \frac{5!}{2(5-2)!} = \frac{120}{2 \times 6} = 10 \]
two distinct branches of methodology developed in the sciences: correlational research and
experimental research. Researchers interested in controlled experiments adopted ANOVA 
as their procedure of choice, whereas those looking for real-world relationships adopted 
multiple regression. As we all know, scientists are intelligent, mature and rational people, 
and so neither group was tempted to slag off the other and claim that their own choice 
of methodology was far superior to the other (yeah, right!). With the divide in methodologies
came a chasm between the statistical methods adopted by the two opposing
camps (Cronbach, 1957, documents this divide in a lovely article). This divide has lasted 
many decades, to the extent that now students are generally taught regression and ANOVA 
in very different contexts and many textbooks teach ANOVA and regression in entirely 
different ways. 

Although many considerably more intelligent people than me have attempted to redress 
the balance (notably the great Jacob Cohen, 1968), I am passionate about making my own 
small, feeble-minded attempt to enlighten you (and I set the ball rolling in sections 7.12 
and 9.4.2). There are several good reasons why I think ANOVA should be taught within 
the context of regression. First, it provides a familiar context: I wasted many trees trying 
to explain regression, so why not use this base of knowledge to explain a new concept? (It 
should make it easier to understand.) Second, the traditional method of teaching ANOVA 
(known as the variance-ratio method) is fine for simple designs, but becomes impossibly 
cumbersome in more complex situations (such as analysis of covariance). The regression 
model extends very logically to these more complex designs without anyone needing to 
get bogged down in mathematics. Finally, the variance-ratio method becomes extremely 
unmanageable in unusual circumstances such as when you have unequal sample sizes.³ The
regression method makes these situations considerably simpler. Although these reasons are 
good enough, it is also the case that R very much deals with ANOVA in a regression-y sort 
of way (known as the general linear model, or GLM). 

I have mentioned that ANOVA is a way of comparing the ratio of systematic variance to 
unsystematic variance in an experimental study. The ratio of these variances is known as 
the F-ratio. However, any of you who have read Chapter 7 should recognize the F-ratio
(see section 7.2.3) as a way to assess how well a regression model can predict an outcome 
compared to the error within that model. If you haven’t read Chapter 7 (surely not!), have 
a look before you carry on (it should only take you a couple of weeks to read). How can the 
F-ratio be used to test differences between means and whether a regression model fits the 
data? The answer is that when we test differences between means we are fitting a regression 
model and using F to see how well it fits the data, but the regression model contains only 
categorical predictors (i.e., grouping variables). So, just as the t-test could be represented 
by the linear regression equation (see section 9.4.2), ANOVA can be represented by the 
multiple regression equation in which the number of predictors is one less than the number 
of categories of the independent variable. 

Let’s take an example. There was a lot of controversy, when I wrote the first edition of 
my SPSS book, surrounding the drug Viagra. Admittedly there’s less controversy now, but 
the controversy has been replaced by an alarming number of spam emails on the subject (for 
which I’ll no doubt be grateful in 20 years’ time), so I’m going to stick with the example. 
Viagra is a sexual stimulant (used to treat impotence) that broke into the black market 
under the belief that it will make someone a better lover (oddly enough, there was a glut of 
journalists taking the stuff at the time in the name of ‘investigative journalism’... hmmm!). 
In the psychology literature, sexual performance issues have been linked to a loss of libido 
(Hawton, 1989). Suppose we tested this hypothesis by taking three groups of participants 
and administering one group with a placebo (such as a sugar pill), one group with a low 
dose of Viagra and one with a high dose. The dependent variable was an objective measure 



³ Having said this, it is well worth the effort in trying to obtain equal sample sizes in your different conditions
because unbalanced designs do cause statistical complications (see section 10.3). 
Table 10.1 Data in Viagra.dat

        Placebo   Low Dose   High Dose
        3         5          7
        2         2          4
        1         4          5
        1         2          3
        4         3          6
X̄       2.20      3.20       5.00
s       1.30      1.30       1.58
s²      1.70      1.70       2.50

Grand mean = 3.467   Grand SD = 1.767   Grand variance = 3.124


of libido (I will tell you only that it was measured over the course of a week - the rest I 
will leave to your own imagination). The data can be found in the file Viagra.dat (which is 
described in detail later in this chapter) and are in Table 10.1. 

If we want to predict levels of libido from the different levels of Viagra then we can use 
the general equation that keeps popping up: 

\[ \text{outcome}_i = (\text{model}) + \text{error}_i \]

If we want to use a linear model, then we saw in section 9.4.2 that when there are only 
two groups we could replace the ‘model’ in this equation with a linear regression equation 
with one dummy variable to describe two groups. This dummy variable was a categorical 
variable with two numeric codes (0 for one group and 1 for the other). With three groups, 
however, we can extend this idea and use a multiple regression model with two dummy 
variables. In fact, as a general rule we can extend the model to any number of groups and 
the number of dummy variables needed will be one less than the number of categories of 
the independent variable. In the two-group case, we assigned one category as a base category
(remember that in section 9.4.2 we chose the picture condition to act as a base) and
this category was coded 0. When there are three categories we also need a base category 
and you should choose the condition to which you intend to compare the other groups. 
Usually this category will be the control group. In most well-designed science experiments 
there will be a group of participants who act as a baseline for other categories. This baseline
group should act as the reference or base category, although the group you choose will
depend upon the particular hypotheses that you want to test. In unbalanced designs (in 
which the group sizes are unequal) it is important that the base category contains a fairly 
large number of cases to ensure that the estimates of the regression coefficients are reliable. 
In the Viagra example, we can take the placebo group as the base category because this 
group was a placebo control. We are interested in comparing both the high- and low-dose 
groups to the group that received no Viagra at all. If the placebo group is the base category 
then the two dummy variables that we have to create represent the other two conditions:
so, we should have one dummy variable called high and the other one called low. The
resulting equation is described as:

\[ \text{libido}_i = b_0 + b_2\,\text{high}_i + b_1\,\text{low}_i + \varepsilon_i \qquad (10.2) \]

In equation (10.2), a person’s libido can be predicted from knowing their group code
(i.e., the code for the high and low dummy variables) and the intercept ($b_0$) of the model.
The dummy variables in equation (10.2) can be coded in several ways, but the simplest way
is to use a similar technique to that of the t-test. The base category is always coded 0. If a
participant was given a high dose of Viagra then they are coded 1 for the high dummy variable
and 0 for all other variables. If a participant was given a low dose of Viagra then they
are coded 1 for the low dummy variable and 0 for all other variables (this is the same type
of scheme we used in section 7.12). Using this coding scheme we can express each group
by combining the codes of the two dummy variables (see Table 10.2).


Table 10.2 Dummy coding for the three-group experimental design

Group       Dummy variable 1 (high)   Dummy variable 2 (low)
Placebo     0                         0
Low dose    0                         1
High dose   1                         0



Placebo group: Let’s examine the model for the placebo group. In the placebo group both
the high and low dummy variables are coded 0. Therefore, if we ignore the error term ($\varepsilon_i$),
the regression equation becomes:

\[ \begin{aligned} \text{libido}_i &= b_0 + (b_1 \times 0) + (b_2 \times 0) = b_0 \\ \bar{X}_{\text{placebo}} &= b_0 \end{aligned} \]

This is a situation in which the high- and low-dose groups have both been excluded (because
they are coded with 0). We are looking at predicting the level of libido when both doses of
Viagra are ignored, and so the predicted value will be the mean of the placebo group (because
this group is the only one included in the model). Hence, the intercept of the regression
model, $b_0$, is always the mean of the base category (in this case the mean of the placebo group).

High-dose group: If we examine the high-dose group, the dummy variable high will be
coded 1 and the dummy variable low will be coded 0. If we substitute the values of these
codes into equation (10.2) the model becomes:

\[ \text{libido}_i = b_0 + (b_1 \times 0) + (b_2 \times 1) = b_0 + b_2 \]

We know already that $b_0$ is the mean of the placebo group. If we are interested in only the
high-dose group then the model should predict that the value of libido for a given participant
equals the mean of the high-dose group. Given this information, the equation becomes:

\[ \begin{aligned} \text{libido}_i &= b_0 + b_2 \\ \bar{X}_{\text{high}} &= \bar{X}_{\text{placebo}} + b_2 \\ b_2 &= \bar{X}_{\text{high}} - \bar{X}_{\text{placebo}} \end{aligned} \]

Hence, $b_2$ represents the difference between the means of the high-dose group and the
placebo group.

Low-dose group: Finally, if we look at the model when a low dose of Viagra has been
taken, the dummy variable low is coded 1 (and hence high is coded as 0). Therefore, the
regression equation becomes:

\[ \text{libido}_i = b_0 + (b_1 \times 1) + (b_2 \times 0) = b_0 + b_1 \]

We know that the intercept is equal to the mean of the base category and that for the low-dose
group the predicted value should be the mean libido for a low dose. Therefore the
model can be reduced to:

\[ \begin{aligned} \text{libido}_i &= b_0 + b_1 \\ \bar{X}_{\text{low}} &= \bar{X}_{\text{placebo}} + b_1 \\ b_1 &= \bar{X}_{\text{low}} - \bar{X}_{\text{placebo}} \end{aligned} \]

Hence, $b_1$ represents the difference between the means of the low-dose group and the
placebo group. This form of dummy variable coding is the simplest form, but as we will
see later, there are other ways in which variables can be coded to test specific hypotheses.
These alternative coding schemes are known as contrasts (see section 10.4.2). The idea
behind contrasts is that you code the dummy variables in such a way that the b-values represent
differences between groups that you are interested in testing.




SELF-TEST 

To illustrate exactly what is going on I have created a
file called dummy.dat. This file contains the Viagra
data but with two additional variables (dummy1
and dummy2) that specify to which group a data
point belongs (as in Table 10.2). Access this file
and run multiple regression analysis using libido
as the outcome and dummy1 and dummy2
as the predictors. If you’re stuck on how to run
the regression then read Chapter 7 again (these
chapters are ordered for a reason).


The resulting analysis is shown in Output 10.1. It might be a good idea to remind yourself
of the group means from Table 10.1. The first thing to notice is that, just as in the
regression chapter, an ANOVA has been used to test the overall fit of the model. This test
is significant, F(2, 12) = 5.12, p < .05. Given that our model represents the group differences,
this ANOVA tells us that using group means to predict scores is significantly better
than using the overall mean: in other words, the group means are significantly different.

In terms of the regression coefficients, bs, the constant is equal to the mean of the base
category (the placebo group). The regression coefficient for the first dummy variable ($b_2$)
is equal to the difference between the means of the high-dose group and the placebo group
(5.0 - 2.2 = 2.8). Finally, the regression coefficient for the second dummy variable ($b_1$) is
equal to the difference between the means of the low-dose group and the placebo group
(3.2 - 2.2 = 1). This analysis demonstrates how the regression model represents the three-group
situation. We can see from the significance values of the t-tests that the difference
between the high-dose group and the placebo group ($b_2$) is significant because p < .05.
The difference between the low-dose and the placebo group is not, however, significant
(p = .282).

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.2000     0.6272   3.508  0.00432 **
dummy1        2.8000     0.8869   3.157  0.00827 **
dummy2        1.0000     0.8869   1.127  0.28158

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.402 on 12 degrees of freedom
Multiple R-squared: 0.4604, Adjusted R-squared: 0.3704
F-statistic: 5.119 on 2 and 12 DF, p-value: 0.02469

Output 10.1 
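
An analysis of this kind can be sketched as follows. Rather than reading dummy.dat (whose exact layout is not shown here), the sketch types in the scores from Table 10.1 and builds the two dummy variables by hand; the object names viagra and dummy_model are hypothetical, and the assumption that dummy1 codes the high-dose group and dummy2 the low-dose group is taken from the coefficients reported in Output 10.1.

```r
# Viagra data from Table 10.1 (5 participants per group)
libido <- c(3, 2, 1, 1, 4,   # placebo
            5, 2, 4, 2, 3,   # low dose
            7, 4, 5, 3, 6)   # high dose
dose <- rep(c("placebo", "low", "high"), each = 5)

# Dummy variables: the placebo (base) group is coded 0 on both
viagra <- data.frame(
  libido = libido,
  dummy1 = as.numeric(dose == "high"),  # 1 = high dose, 0 = otherwise
  dummy2 = as.numeric(dose == "low")    # 1 = low dose, 0 = otherwise
)

# Multiple regression with the two dummy variables as predictors
dummy_model <- lm(libido ~ dummy1 + dummy2, data = viagra)
summary(dummy_model)

# The coefficients can be compared directly with the group means
coef(dummy_model)
tapply(libido, dose, mean)
```

The intercept should match the placebo mean, and each of the other two coefficients should match the difference between a dose group's mean and the placebo mean.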

A four-group experiment can be described by extending the three-group scenario. I 
mentioned earlier that you will always need one less dummy variable than the number 
of groups in the experiment: therefore, this model requires three dummy variables. As 
before, we need to specify one category as a base category (a control group). This base 
category should have a code of 0 for all three dummy variables. The remaining three 
conditions will have a code of 1 for the dummy variable that described that condition 
and a code of 0 for the other two dummy variables. Table 10.3 illustrates how the coding 
scheme would work. 


Table 10.3 Dummy coding for the four-group experimental design

                 Dummy variable 1   Dummy variable 2   Dummy variable 3
Group 1          1                  0                  0
Group 2          0                  1                  0
Group 3          0                  0                  1
Group 4 (base)   0                  0                  0
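
The four-group coding scheme does not have to be typed in by hand. As a minimal sketch, base R's contr.treatment() generates this kind of dummy coding; the row and column labels below are added only for readability, and the object name codes is hypothetical.

```r
# Dummy (treatment) coding for four groups, with group 4 as the base category
codes <- contr.treatment(4, base = 4)
dimnames(codes) <- list(
  c("Group 1", "Group 2", "Group 3", "Group 4 (base)"),
  c("Dummy 1", "Dummy 2", "Dummy 3")
)
codes
```

Each non-base group has a single 1 in its own column, and the base group is 0 throughout. Assigning a matrix like this to a factor with contrasts() is one way of asking lm() to use the coding.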


10.2.4. Logic of the F-ratio


In Chapter 7 we learnt a little about the F-ratio and its calculation. To recap, we learnt 
that the F-ratio is used to test the overall fit of a regression model to a set of observed 
data. In other words, it is the ratio of how good the model is compared to how bad it is 
(its error). I have just explained how ANOVA can be represented as a regression equation, 
and this should help you to understand what the F-ratio tells you about your data. Figure 
10.2 shows the Viagra data in graphical form (including the group means, the overall mean 
and the difference between each case and the group mean). In this example, there were 
three groups; therefore, we want to test the hypothesis that the means of three groups 
are different (so the null hypothesis is that the group means are the same). If the group 
means were all the same, then we would not expect the placebo group to differ from the 
low-dose group or the high-dose group, and we would not expect the low-dose group to 
differ from the high-dose group. Therefore, on the diagram, the three shaded blue lines 
would be in the same vertical position (the exact position would be the grand mean - the 
solid horizontal line in the figure). We can see from the diagram that the group means 
are actually different because the horizontal blue lines (the group means) are in different 
vertical positions. We have just found out that in the regression model, $b_2$ represents the
difference between the means of the placebo and the high-dose group, and $b_1$ represents the
difference in means between the placebo and the low-dose groups. These two distances are 
represented in Figure 10.2 by the vertical arrows. If the null hypothesis is true and all the 
groups have the same means, then these b coefficients should be zero (because if the group 
means are equal then the difference between them will be zero). 
FIGURE 10.2 The Viagra data in graphical form. The shaded blue horizontal lines represent the mean libido of each group. The shapes represent the libido of individual participants (different shapes indicate different experimental groups). The black horizontal line is the average libido of all participants


The logic of ANOVA follows from what we understand about regression: 

• The simplest model we can fit to a set of data is the grand mean (the mean of the outcome
variable). This basic model represents ‘no effect’ or ‘no relationship between
the predictor variable and the outcome’.

• We can fit a different model to the data collected that represents our hypotheses. 
If this model fits the data well then it must be better than using the grand mean. 
Sometimes we fit a linear model (the line of best fit), but in experimental research we 
often fit a model based on the means of different conditions. 

• The intercept and one or more regression coefficients can describe the chosen model. 

• The regression coefficients determine the shape of the model that we have fitted; 
therefore, the bigger the coefficients, the greater the deviation between the line and 
the grand mean. 

• In correlational research, the regression coefficients represent the slope of the line, 
but in experimental research they represent the differences between group means. 

• The bigger the differences between group means, the greater the difference between 
the model and the grand mean. 

• If the differences between group means are large enough, then the resulting model 
will be a better fit of the data than the grand mean. 

• If this is the case we can infer that our model (i.e., predicting scores from the group 
means) is better than not using a model (i.e., predicting scores from the grand mean). 
Put another way, our group means are significantly different. 

Just like when we used ANOVA to test a regression model, we can compare the improvement
in fit due to using the model (rather than the grand mean) to the error that still
remains. Another way of saying this is that when the grand mean is used as a model, there
will be a certain amount of variation between the data and the grand mean. When a model
is fitted it will explain some of this variation, but some will be left unexplained. The F-ratio
is the ratio of the explained to the unexplained variation. Look back at section 7.2.3 to
refresh your memory on these concepts before reading on. This may all sound quite
complicated, but actually most of it boils down to variations on one simple equation (see Jane
Superbrain Box 10.1).


JANE SUPERBRAIN 10.1 

You might be surprised to know that ANOVA boils down to one equation (well, sort of)

At every stage of the ANOVA we're assessing variation (or deviance) from a particular model
(be that the most basic model, or the most sophisticated model). We saw back in section 2.4.1
that the extent to which a model deviates from the observed data can be expressed, in
general, in the form of equation (10.3). So, in ANOVA, as in regression, we use equation
(10.3) to calculate the fit of the most basic model, and then the fit of the best model
(the line of best fit). If the best model is any good then it should fit the data
significantly better than our basic model:

$$\text{deviation} = \sum (\text{observed} - \text{model})^2 \qquad (10.3)$$

The interesting point is that all of the sums of squares in ANOVA are variations on this one
basic equation. All that changes is what we use as the model, and what the corresponding
observed data are. Look through the various sections on the sums of squares and compare the
resulting equations to equation (10.3); hopefully, you can see that they are all basically
variations on this general form of the equation!


10.2.5. Total sum of squares (SS_T)


To find the total amount of variation within our data we calculate the difference between 
each observed data point and the grand mean. We then square these differences and add 
them together to give us the total sum of squares (SS T ): 

$$SS_T = \sum_{i=1}^{N} (x_i - \bar{x}_{\text{grand}})^2 \qquad (10.4)$$

We also saw in section 2.4.1 that the variance and the sums of squares are related such that
variance, s^2 = SS/(N - 1), where N is the number of observations. Therefore, we can
calculate the total sums of squares from the variance of all observations (the grand variance) by
rearranging the relationship (SS = s^2(N - 1)). The grand variance is the variation between all
scores, regardless of the experimental condition from which the scores come. Figure 10.3
shows the different sums of squares graphically (note the similarity to Figure 7.4 which
we looked at when we learnt about regression). The top left panel shows the total sum of
squares: it is the sum of the squared distances between each point and the solid horizontal
line (which represents the mean of all scores).

The grand variance for the Viagra data is given in Table 10.1, and if we count the number 
of observations we find that there were 15 in all. Therefore, SS T is calculated as follows: 


$$SS_T = s^2_{\text{grand}}(N - 1) = 3.124(15 - 1) = 3.124 \times 14 = 43.74$$

FIGURE 10.3 Graphical representation of the different sums of squares in ANOVA designs

Before we move on, it is important to understand degrees of freedom, so have a look 
back at Jane Superbrain Box 2.2 to refresh your memory. We saw before that when we 
estimate population values, the degrees of freedom are typically one less than the number 
of scores used to calculate the population value. This is because to get these estimates we 
have to hold something constant in the population (in this case the mean), which leaves 
all but one of the scores free to vary (see Jane Superbrain Box 2.2). For SS_T we used the
entire sample (i.e., 15 scores) to calculate the sums of squares and so the total degrees
of freedom (df_T) are one less than the total sample size (N - 1). For the Viagra data, this
value is 14.

10.2.6. Model sum of squares (SS_M)


So far, we know that the total amount of variation within the data is 43.74 units. We now 
need to know how much of this variation the regression model can explain. In the ANOVA 
scenario, the model is based upon differences between group means, and so the model 
sums of squares tell us how much of the total variation can be explained by the fact that 
different data points come from different groups. 

In section 7.2.3 we saw that the model sum of squares is calculated by taking the
difference between the values predicted by the model and the grand mean (see Figure 7.4). In
ANOVA, the values predicted by the model are the group means (the dashed horizontal
lines in Figure 10.3). The bottom panel in Figure 10.3 shows the model sum of squared
error: it is the sum of the squared distances between what the model predicts for each data
point (i.e., the dashed horizontal line for the group to which the data point belongs) and
the overall mean of the data (the solid horizontal line).

For each participant the value predicted by the model is the mean for the group to which 
the participant belongs. In the Viagra example, the predicted value for the five participants 
in the placebo group will be 2.2, for the five participants in the low-dose condition it will 
be 3.2, and for the five participants in the high-dose condition it will be 5. The model sum 
of squares requires us to calculate the differences between each participant’s predicted 
value and the grand mean. These differences are then squared and added together (for 
reasons that should be clear in your mind by now). We know that the predicted value for 
participants in a particular group is the mean of that group. Therefore, the easiest way to 
calculate SS M is to do the following: 


1 Calculate the difference between the mean of each group and the grand mean. 

2 Square each of these differences. 

3 Multiply each result by the number of participants within that group (n_k).

4 Add the values for each group together. 

The mathematical expression for this process is: 

$$SS_M = \sum_{n=1}^{k} n_k (\bar{x}_k - \bar{x}_{\text{grand}})^2 \qquad (10.5)$$


Using the means from the Viagra data, we can calculate SS M as follows: 

$$\begin{aligned}
SS_M &= 5(2.200 - 3.467)^2 + 5(3.200 - 3.467)^2 + 5(5.000 - 3.467)^2 \\
&= 5(-1.267)^2 + 5(-0.267)^2 + 5(1.533)^2 \\
&= 8.025 + 0.355 + 11.755 \\
&= 20.135
\end{aligned}$$
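
The four steps above can be sketched in a few lines of Python (used here only for illustration; it is not the R code of this book). The group means and sizes are the ones quoted in the text; the variable names are hypothetical.

```python
# Group means and sizes as quoted in the text
group_means = {"placebo": 2.2, "low": 3.2, "high": 5.0}
group_sizes = {"placebo": 5, "low": 5, "high": 5}

n_total = sum(group_sizes.values())

# The grand mean is the size-weighted average of the group means
grand_mean = sum(group_sizes[g] * group_means[g] for g in group_means) / n_total

# Steps 1-4: difference from the grand mean, square it, weight by group size, add up
ss_model = sum(
    group_sizes[g] * (group_means[g] - grand_mean) ** 2 for g in group_means
)

# Degrees of freedom: number of groups minus one
df_model = len(group_means) - 1

print(round(grand_mean, 3), round(ss_model, 3), df_model)
```

Working with the unrounded grand mean gives a value of about 20.133; the hand calculation arrives at 20.135 because the intermediate values were rounded.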

For SS_M, the degrees of freedom (df_M) will always be one less than the number of
parameters estimated. In short, this value will be the number of groups minus one (which you’ll
see denoted as k - 1). So, in the three-group case the degrees of freedom will always be 2
(because the calculation of the sums of squares is based on the group means, two of which
will be free to vary in the population if the third is held constant).

10.2.7. Residual sum of squares (SS_R)


We now know that there are 43.74 units of variation to be explained in our data, and that
our model can explain 20.14 of these units (nearly half). The final sum of squares is the
residual sum of squares (SS_R), which tells us how much of the variation cannot be explained
by the model. This value is the amount of variation caused by extraneous factors such as
individual differences in weight, testosterone or whatever. Knowing SS_T and SS_M already,
the simplest way to calculate SS_R is to subtract SS_M from SS_T (SS_R = SS_T - SS_M); however,
telling you to do this provides little insight into what is being calculated and, of course, if
you’ve messed up the calculations of either SS_M or SS_T (or indeed both!) then SS_R will be
incorrect also.

We saw in section 7.2.3 that the residual sum of squares is the difference between what
the model predicts and what was actually observed. In ANOVA, the values predicted
by the model are the group means (the dashed horizontal lines in Figure 10.3). The top
right panel shows the residual sum of squared error: it is the sum of the squared distances
between each point and the dashed horizontal line for the group to which the data point
belongs.

We already know that for a given participant, the model predicts the mean of the group 
to which that person belongs. Therefore, SS R is calculated by looking at the difference 
between the score obtained by a person and the mean of the group to which the person 
belongs. In graphical terms the vertical lines in Figure 10.3 represent this sum of squares. 
These distances between each data point and the group mean are squared and then added 
together to give the residual sum of squares, SS R , thus: 

$$SS_R = \sum_{i=1}^{n} (x_{ik} - \bar{x}_k)^2 \qquad (10.6)$$

Now, the sum of squares for each group represents the sum of squared differences
between each participant’s score in that group and the group mean. Therefore, we can
express SS_R as SS_R = SS_group1 + SS_group2 + SS_group3 + .... Given that we know the
relationship between the variance and the sums of squares, we can use the variances for each
group in the Viagra data to create an equation like we did for the total sum of squares. As
such, SS_R can be expressed as:

$$SS_R = \sum s_k^2 (n_k - 1) \qquad (10.7)$$

This just means take the variance from each group (s_k^2) and multiply it by one less than the
number of people in that group (n_k - 1). When you’ve done this for each group, add them
all up. For the Viagra data, this gives us:

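$$\begin{aligned}
SS_R &= s^2_{\text{group1}}(n_1 - 1) + s^2_{\text{group2}}(n_2 - 1) + s^2_{\text{group3}}(n_3 - 1) \\
&= (1.70)(5 - 1) + (1.70)(5 - 1) + (2.50)(5 - 1) \\
&= (1.70 \times 4) + (1.70 \times 4) + (2.50 \times 4) \\
&= 6.8 + 6.8 + 10 \\
&= 23.60
\end{aligned}$$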
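
Equation (10.7) and the subtraction shortcut (SS_R = SS_T - SS_M) can be compared directly. This is a minimal Python sketch, not the book's R code; it uses only the variances, group sizes and sums of squares quoted in the text.

```python
# Group variances and sizes as quoted in the text
group_variances = {"placebo": 1.70, "low": 1.70, "high": 2.50}
group_sizes = {"placebo": 5, "low": 5, "high": 5}

# Equation (10.7): each group's variance times (n_k - 1), summed over groups
ss_residual = sum(
    group_variances[g] * (group_sizes[g] - 1) for g in group_variances
)

# Shortcut: subtract the model sum of squares from the total (rounded text values)
ss_total = 43.74
ss_model = 20.135
ss_residual_by_subtraction = ss_total - ss_model

print(round(ss_residual, 2), round(ss_residual_by_subtraction, 3))
```

The two answers differ only in the third decimal place, which reflects rounding in the hand calculations rather than any real disagreement.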

The degrees of freedom for SS_R (df_R) are the total degrees of freedom minus the degrees
of freedom for the model (df_R = df_T - df_M = 14 - 2 = 12). Put another way, it’s N - k: the
total sample size, N, minus the number of groups, k.

10.2.8. Mean squares


SS_M tells us the total variation that the regression model (e.g., the experimental
manipulation) explains, and SS_R tells us the total variation that is due to extraneous factors.
However, because both of these values are summed values they will be influenced by the
number of scores that were summed; for example, SS_M used the sum of only 3 different
values (the group means) compared to SS_R and SS_T, which used the sum of 12 and 15
values, respectively. To eliminate this bias we can calculate the average sum of squares (known
as the mean squares, MS), which is simply the sum of squares divided by the degrees of
freedom. We divide by the degrees of freedom rather than the number of parameters used
to calculate the SS because we are trying to extrapolate to a population, and so some
parameters within that population will be held constant (this is the same reason
why we divide by N - 1 when calculating the variance; see Jane Superbrain Box 2.2).
So, for the Viagra data we find the following mean squares:


$$MS_M = \frac{SS_M}{df_M} = \frac{20.135}{2} = 10.067$$

$$MS_R = \frac{SS_R}{df_R} = \frac{23.60}{12} = 1.967$$


MS_M represents the average amount of variation explained by the model (e.g., the
systematic variation), whereas MS_R is a gauge of the average amount of variation explained by
extraneous variables (the unsystematic variation).
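
As a small illustration, the mean squares are just each sum of squares divided by its degrees of freedom. The Python sketch below (not the book's R code) uses the values from the text; the helper name `mean_square` is hypothetical.

```python
def mean_square(sum_of_squares: float, degrees_of_freedom: int) -> float:
    """Average variation per degree of freedom."""
    return sum_of_squares / degrees_of_freedom


# Sums of squares and degrees of freedom from the text
ms_model = mean_square(20.135, 2)      # systematic variation
ms_residual = mean_square(23.60, 12)   # unsystematic variation

print(ms_model, ms_residual)
```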


10.2.9. The F-ratio


The F-ratio is a measure of the ratio of the variation explained by the model and the
variation explained by unsystematic factors. In other words, it is the ratio of how good the
model is against how bad it is (how much error there is). It can be calculated by dividing
the model mean squares by the residual mean squares.

$$F = \frac{MS_M}{MS_R} \qquad (10.8)$$


As with the independent t-test, the F-ratio is, therefore, a measure of the ratio of
systematic variation to unsystematic variation. In experimental research, it is the ratio of the
experimental effect to the individual differences in performance. An interesting point about
the F-ratio is that because it is the ratio of systematic variance to unsystematic variance, if its
value is less than 1 then it must, by definition, represent a non-significant effect. This
is true because if the F-ratio is less than 1 it means that MS_R is greater
than MS_M, which in real terms means that there is more unsystematic than systematic
variance. You can think of this in terms of the effect of natural differences in ability being greater
than differences brought about by the experiment. In this scenario, we can, therefore, be
sure that our experimental manipulation has been unsuccessful (because it has brought
about less change than if we left our participants alone!). For the Viagra data, the F-ratio is:

$$F = \frac{MS_M}{MS_R} = \frac{10.067}{1.967} = 5.12$$

This value is greater than 1, which indicates that the experimental manipulation had some
effect above and beyond the effect of individual differences in performance. However, it 
doesn’t yet tell us whether the F-ratio is large enough to not be a chance result. To discover 
this we can compare the obtained value of F against the maximum value we would expect 
to get by chance if the group means were equal in an F-distribution with the same degrees 
of freedom (these values can be found in the Appendix); if the value we obtain exceeds this 
critical value we can be confident that this reflects an effect of our independent variable 
(because this value would be very unlikely if there were no effect in the population). In this 
case, with 2 and 12 degrees of freedom the critical values are 3.89 (p = .05) and 6.93 (p = 
.01). The observed value, 5.12, is, therefore, significant at a .05 level of significance but 
not significant at a .01 level. The exact significance produced by R should, therefore, fall 
somewhere between .05 and .01 (which, incidentally, it does). 
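
The comparison with the critical values can be reproduced without tables in this particular case. The Python sketch below (for illustration only; it is not the book's R code) relies on an assumption worth stating loudly: when the model has exactly 2 degrees of freedom, the upper-tail probability of the F-distribution has the closed form (1 + 2F/df_R)^(-df_R/2). This shortcut does not apply to other numerator degrees of freedom, and the function names are hypothetical.

```python
def f_upper_tail_df1_2(f_value: float, df_residual: int) -> float:
    """P(F > f_value) for an F-distribution with 2 and df_residual degrees of freedom.

    Closed form valid ONLY when the numerator degrees of freedom equal 2.
    """
    return (1 + 2 * f_value / df_residual) ** (-df_residual / 2)


def f_critical_df1_2(alpha: float, df_residual: int) -> float:
    """Critical F for 2 and df_residual degrees of freedom (inverse of the tail above)."""
    return (df_residual / 2) * (alpha ** (-2 / df_residual) - 1)


# Mean squares from the text
f_ratio = 10.067 / 1.967
df_residual = 12

critical_05 = f_critical_df1_2(0.05, df_residual)
critical_01 = f_critical_df1_2(0.01, df_residual)
p_value = f_upper_tail_df1_2(f_ratio, df_residual)

print(round(f_ratio, 2), round(critical_05, 2), round(critical_01, 2), round(p_value, 3))
```

The critical values agree with the tabled 3.89 and 6.93, and the exact probability falls between .01 and .05, as the text says it should.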


10.3. Assumptions of ANOVA


The assumptions under which the F-statistic is reliable are the same as for all parametric
tests based on the normal distribution (see section 5.2). That is, the variances in each
experimental condition need to be fairly similar (homogeneity of variance), observations should be
independent and the dependent variable should be measured on at least an interval scale. In
terms of normality, what matters is that distributions within groups are normally distributed.


10.3.1. Homogeneity of variance


As with the t-test, there is an assumption that the variances of the groups are equal. This
assumption can be tested using Levene’s test, which tests the null hypothesis that the
variances of the groups are the same (see section 5.7.1). Basically, it is an ANOVA test
conducted on the absolute differences between the observed data and the mean or median
from which the data came (see Oliver Twisted). If Levene’s test is significant (i.e., the
p-value is less than .05) then we can say that the variances are significantly different. This
would mean that we had violated one of the assumptions of ANOVA and we would have
to take steps to rectify this matter.
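
The description of Levene's test as 'an ANOVA on absolute differences' can be made concrete. The Python sketch below is illustrative only and is not the R function used in this book. It assumes the median as the centre (the mean is the other option mentioned above), the scores are illustrative values chosen to match the summary statistics quoted in this chapter, and the function names are hypothetical.

```python
from statistics import mean, median


def one_way_f(groups):
    """F-ratio for a one-way independent design: MS_M / MS_R."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    ss_model = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_residual = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_model = len(groups) - 1
    df_residual = len(all_values) - len(groups)
    return (ss_model / df_model) / (ss_residual / df_residual)


def levene_f(groups, center=median):
    """Levene's test statistic: a one-way F on absolute deviations from each group's centre."""
    deviations = [[abs(x - center(g)) for x in g] for g in groups]
    return one_way_f(deviations)


# Illustrative scores (assumed): placebo, low dose, high dose
groups = [[3, 2, 1, 1, 4], [5, 2, 4, 2, 3], [7, 4, 5, 3, 6]]

print(round(one_way_f(groups), 2), round(levene_f(groups), 3))
```

For these scores the Levene statistic is far below 1, so there would be no evidence that the group variances differ.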


OLIVER TWISTED Please Sir, can I have some more ... Levene’s test?

‘Liar! Liar! Pants on fire!’ screams Oliver, his cheeks red and his eyes about to explode,
‘You promised in Chapter 5 to explain Levene’s test properly and you haven’t, you spatula
head’. True enough, Oliver, I do have a spatula for a head. I also have a very nifty little
demonstration of Levene’s test in the additional material for this chapter on the companion
website. It will tell you more than you could possibly want to know. Let’s go fry an egg ...


10.3.2. Is ANOVA robust?


You often hear people say ‘ANOVA is a robust test’, which means that it doesn’t matter
much if we break the assumptions of the test: the F-ratio will still be accurate. There is
some truth to this statement, but it is also an oversimplification of the situation.
For one thing, the term ANOVA covers many different situations and the performance
of F has been investigated in only some of those situations. There are two
issues to consider. First, does F control the Type I error rate or is it significant
even when there are no differences between means? Second, does F have enough
power (i.e., is it able to detect differences when they are there)? Let’s have a look
at the evidence.

Looking at normality first, Glass et al. (1972) reviewed a lot of evidence that
suggests that F controls the Type I error rate well under conditions of skew, kurtosis
and non-normality. Skewed distributions seem to have little effect on the error
rate and power for two-tailed tests (but can have serious consequences for one-tailed
tests). However, some of this evidence has been questioned (see Jane Superbrain Box
5.1). In terms of kurtosis, leptokurtic distributions make the Type I error rate too low (too
few null effects are significant) and the power too high; platykurtic distributions
have the opposite effect. The effects of kurtosis seem unaffected by whether sample
sizes are equal or not. One study that is worth mentioning in a bit of detail is by Lunney 
(1970) who investigated the use of ANOVA in just about the most non-normal situation 
you could imagine: when the dependent variable is binary (it could have values of only 0 or 
1). The results showed that when the group sizes were equal, ANOVA was accurate when 
there were at least 20 degrees of freedom and the smallest response category contained at 
least 20% of all responses. If the smaller response category contained less than 20% of all 
responses then ANOVA performed accurately only when there were 40 or more degrees 
of freedom. The power of F also appears to be relatively unaffected by non-normality 
(Donaldson, 1968). This evidence suggests that when group sizes are equal the F-statistic 
can be quite robust to violations of normality. 

However, when group sizes are not equal the accuracy of F is affected by skew, and
non-normality also affects the power of F in quite unpredictable ways (Wilcox, 2005). One
situation that Wilcox describes shows that when means are equal the error rate (which 
should be 5%) can be as high as 18%. If you make the differences between means bigger 
you should find that power increases, but actually he found that initially power decreased 
(although it increased when he made the group differences bigger still). As such F can be 
biased when normality is violated. 

Turning to violations of the assumption of homogeneity of variance, ANOVA is fairly
robust in terms of the error rate when sample sizes are equal. However, when sample
sizes are unequal, ANOVA is not robust to violations of homogeneity of variance (this is
why earlier on I said it’s worth trying to collect equal-sized samples of data across
conditions!). When groups with larger sample sizes have larger variances than the groups with
smaller sample sizes, the resulting F-ratio tends to be conservative. That is, it’s more likely
to produce a non-significant result when a genuine difference does exist in the
population. Conversely, when the groups with larger sample sizes have smaller variances than
the groups with smaller sample sizes, the resulting F-ratio tends to be liberal. That is, it
is more likely to produce a significant result when there is no difference between groups
in the population (put another way, the Type I error rate is not controlled); see Glass et al.
(1972) for a review. When variances are proportional to the means then the power of
F seems to be unaffected by the heterogeneity of variance and trying to stabilize variances
does not substantially improve power (Budescu, 1982; Budescu & Appelbaum, 1981).

Violations of the assumption of independence are very serious indeed. Scariano and 
Davenport (1987) showed that when this assumption is broken (i.e., observations across 
groups are correlated) then the Type I error rate is substantially inflated. For example, 
using the conventional .05 Type I error rate when observations are independent, if these 
observations are made to correlate moderately (say, with a Pearson coefficient of .5), when 
comparing three groups, each of 10 observations, the actual Type I error rate is .74 (a 
substantial inflation!). Therefore, if observations are correlated you might think that you
are working with the accepted .05 error rate (i.e., you’ll incorrectly find a significant result
only 5% of the time) when in fact your error rate is closer to .75 (i.e., you’ll find a
significant result on 75% of occasions when, in reality, there is no effect in the population).

There are various things that can be done to combat the litany of woe that you have just 
read. To find out more see Jane Superbrain Box 10.2. 



JANE SUPERBRAIN 10.2 

What do I do in ANOVA when assumptions are broken?

As we saw in Chapter 5, one common way to rectify problems with assumptions is to transform
all of the data and then reanalyse these transformed values. When homogeneity of variance
is the problem there are versions of the F-ratio that have been derived to be robust when
homogeneity of variance has been violated. One that can be implemented in R is Welch’s F
(1951); see Oliver Twisted.

If you have distributional problems, then there are robust (see section 5.8.4) variants of
ANOVA that have been implemented in R by Wilcox (2005). These methods are based on
bootstrapping or trimmed means and M-estimators (both of which can also include a
bootstrap). We’ll cover these methods later in the chapter.

On balance, if you have the stomach for it, Wilcox’s robust methods are probably the best
approach to dealing with violations of assumptions. If you don’t have the stomach for it,
there are a group of tests (often called assumption-free, distribution-free or non-parametric
tests, none of which are particularly accurate names) that you can use instead. The one-way
independent ANOVA has a non-parametric counterpart called the Kruskal-Wallis test. If you
have non-normally distributed data, or have violated some other assumption, then this
test can be a useful way around the problem. This test is described in Chapter 15.


OLIVER TWISTED Please Sir, can I have some more ... Welch’s F?

‘You don’t understand Welch’s F,’ taunts Oliver, ‘Andy, Andy, brains all sandy ....’
Whatever, Oliver. Welch’s F adjusts F and the residual degrees of freedom to combat problems
arising from violations of the homogeneity of variance assumption. There is a lengthy
explanation about Welch’s F in the additional material available on the companion website.
Oh, and Oliver, microchips are made of sand.
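
To give a flavour of how F and the residual degrees of freedom are adjusted, here is a minimal Python sketch (not the book's R code). It assumes the usual textbook statement of Welch's procedure, in which each group is weighted by its size divided by its variance; the formula is not given in this chapter, so treat it as an assumption rather than the companion website's explanation. The function name is hypothetical; the means, variances and group sizes are those quoted for the Viagra data.

```python
def welch_f(means, variances, sizes):
    """Welch-adjusted F, with its model and (adjusted) residual degrees of freedom."""
    k = len(means)
    # Groups with small variances (precise means) get more weight
    weights = [n / v for n, v in zip(sizes, variances)]
    total_weight = sum(weights)
    weighted_grand_mean = sum(w * m for w, m in zip(weights, means)) / total_weight

    numerator = sum(
        w * (m - weighted_grand_mean) ** 2 for w, m in zip(weights, means)
    ) / (k - 1)

    # Correction term that grows as the weights become more unequal
    lam = sum((1 - w / total_weight) ** 2 / (n - 1) for w, n in zip(weights, sizes))
    denominator = 1 + (2 * (k - 2) / (k ** 2 - 1)) * lam

    f_value = numerator / denominator
    df_residual = (k ** 2 - 1) / (3 * lam)  # usually not a whole number
    return f_value, k - 1, df_residual


f_value, df_model, df_residual = welch_f(
    means=[2.2, 3.2, 5.0], variances=[1.70, 1.70, 2.50], sizes=[5, 5, 5]
)
print(round(f_value, 2), df_model, round(df_residual, 2))
```

Under these assumptions the adjusted F is smaller than the unadjusted 5.12 and the residual degrees of freedom drop from 12 to roughly 8, which is the sense in which the test becomes more cautious.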


10.4. Planned contrasts


The F-ratio tells us only whether the model fitted to the data accounts for more
variation than extraneous factors, but it doesn’t tell us where the differences between groups
lie. So, if the F-ratio is large enough to be statistically significant, then we know only that
one or more of the differences between means are statistically significant (e.g., either b_2
or b_1 is statistically significant). It is, therefore, necessary after conducting an ANOVA to
carry out further analysis to find out which groups differ. In multiple regression, each
b coefficient is tested individually using a t-test and we could do the same for ANOVA.
However, we would need to carry out two t-tests, which would inflate the familywise
error rate (see section 10.2). Therefore, we need a way to contrast the different groups
without inflating the Type I error rate. There are two ways in which to achieve this goal:
the first is to break down the variance accounted for by the model into component parts;
the second is to compare every group (as if conducting several t-tests) but to use a stricter
acceptance criterion such that the familywise error rate does not rise above .05. The
first option can be done using planned comparisons (also known as planned contrasts),⁴
whereas the latter option is done using post hoc comparisons (see next section). The
difference between planned comparisons and post hoc tests can be likened to the difference
between one- and two-tailed tests in that planned comparisons are done when you have
specific hypotheses that you want to test, whereas post hoc tests are done when you have
no specific hypotheses. Let’s first look at planned contrasts.
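
To see why running several tests inflates the familywise error rate, and what a 'stricter acceptance criterion' might look like, consider the Python sketch below (illustrative only, not the book's R code). It assumes the tests are independent, in which case the familywise error rate for n tests is 1 - (1 - α)^n. Dividing α by the number of tests (the Bonferroni correction) is used here as one common example of a stricter criterion; it is not named in this section, and the function names are hypothetical.

```python
def familywise_error(alpha: float, n_tests: int) -> float:
    """Probability of at least one Type I error across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test criterion that keeps the familywise rate at or below alpha."""
    return alpha / n_tests


for n_tests in (1, 2, 3):
    uncorrected = familywise_error(0.05, n_tests)
    corrected = familywise_error(bonferroni_alpha(0.05, n_tests), n_tests)
    print(n_tests, round(uncorrected, 4), round(corrected, 4))
```

With two uncorrected tests the chance of at least one false positive is already close to 10%; with the stricter per-test criterion it stays just under 5%.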


10.4.1. Choosing which contrasts to do


In the Viagra example we could have had very specific hypotheses. For one thing, we 
would expect any dose of Viagra to change libido compared to the placebo group. As a 
second hypothesis we might believe that a high dose should increase libido more than a 
low dose. To do planned comparisons, these hypotheses must be derived before the data are 
collected. It is fairly standard in science to want to compare experimental conditions to the 
control conditions as the first contrast, and then to see where the differences lie between 
the experimental groups. ANOVA is based upon splitting the total variation into two
component parts: the variation due to the experimental manipulation (SS_M) and the variation
due to unsystematic factors (SS_R) (see Figure 10.4).



FIGURE 10.4 Partitioning variance for ANOVA


Planned comparisons take this logic a step further by breaking down the variation due 
to the experiment into component parts (see Figure 10.5). The exact comparisons that are 
carried out depend upon the hypotheses you want to test. Figure 10.5 shows a situation in 
which the experimental variance is broken down to look at how much variation is created 
by the two drug conditions compared to the placebo condition (contrast 1). Then the
variation explained by taking Viagra is broken down to see how much is explained by taking a
high dose relative to a low dose (contrast 2).

4 The terms comparison and contrast are used interchangeably.

FIGURE 10.5 Partitioning of experimental variance into component comparisons

Typically, students struggle with the notion of planned comparisons, but there are three
rules that can help you to work out what to do:

1 If we have a control group, this is usually because we want to compare it against the 
other groups. 

2 Each contrast must compare only two ‘chunks’ of variation. 

3 Once a group has been singled out in a contrast it can’t be used in another contrast. 

Let’s look at these rules in detail. First, if a group is singled out in one comparison, then 
it should not reappear in another comparison. The important thing to remember is that we 
are breaking down one chunk of variation into smaller independent chunks. So, in Figure 
10.5 contrast 1 involved comparing the placebo group to the experimental groups; because 
the placebo group is singled out, it should not be incorporated into any other contrasts. 
You can think of partitioning variance as being similar to slicing up a cake. You begin with 
a cake (the total sum of squares) and you then cut this cake into two pieces (SS M and SS R ). 
You then take the piece of cake that represents SS M and divide this up into smaller pieces. 
Once you have cut off a piece of cake you cannot stick that piece back onto the original 
slice, and you cannot stick it onto other pieces of cake, but you can divide it into smaller 
pieces of cake. Likewise, once a slice of variance has been split from a larger chunk, it 
cannot be attached to any other pieces of variance, it can only be subdivided into smaller 
chunks of variance. Now, all of this talk of cake is making me hungry, but hopefully it 
illustrates a point. 

If you follow the independence of contrasts rule that I’ve just explained (the cake
slicing!), and always compare only two pieces of variance, then you should always end up
with one less contrast than the number of groups; there will be k - 1 contrasts (where k is
the number of conditions you’re comparing).
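
The cake analogy can be checked numerically: with three groups there are k - 1 = 2 independent contrasts, and their sums of squares should add up to SS_M with nothing left over. The Python sketch below (not the book's R code) makes two assumptions that go beyond this section. The contrasts are expressed as numeric weights (placebo versus both doses as -2, 1, 1; low versus high dose as 0, -1, 1), and the sum of squares for a contrast is taken to be (Σ w_k x̄_k)² / Σ(w_k² / n_k), the usual textbook form. The function name is hypothetical.

```python
# Group order: placebo, low dose, high dose (means and sizes from the text)
group_means = [2.2, 3.2, 5.0]
group_sizes = [5, 5, 5]

# Assumed weights: contrast 1 singles out the placebo; contrast 2 splits the two doses
contrast_1 = [-2, 1, 1]
contrast_2 = [0, -1, 1]


def contrast_ss(weights, means, sizes):
    """Sum of squares explained by one contrast (assumed textbook formula)."""
    estimate = sum(w * m for w, m in zip(weights, means))
    return estimate ** 2 / sum(w ** 2 / n for w, n in zip(weights, sizes))


# Model sum of squares, computed from the group means as in equation (10.5)
grand_mean = sum(n * m for n, m in zip(group_sizes, group_means)) / sum(group_sizes)
ss_model = sum(n * (m - grand_mean) ** 2 for n, m in zip(group_sizes, group_means))

ss_contrast_1 = contrast_ss(contrast_1, group_means, group_sizes)
ss_contrast_2 = contrast_ss(contrast_2, group_means, group_sizes)

# With equal group sizes, independent contrasts have weights whose products sum to zero
cross_product = sum(a * b for a, b in zip(contrast_1, contrast_2))

print(round(ss_contrast_1, 3), round(ss_contrast_2, 3), round(ss_model, 3), cross_product)
```

Under these assumptions the two slices account for all of SS_M, which is exactly what 'you cannot stick a piece back onto the cake' implies.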

Second, each contrast must compare only two chunks of variance. This rule is so that we 
can draw firm conclusions about what the contrast tells us. The F-ratio tells us that some of 
our means differ, but not which ones, and if we were to perform a contrast on more than 
two chunks of variance we would have the same problem. By comparing only two chunks 
of variance we can be sure that a significant result represents a difference between these 
two portions of experimental variation.

Finally, in most social science research we use at least one control condition, and in the
vast majority of experimental designs we predict that the experimental conditions will
differ from the control condition (or conditions). As such, the biggest hint that I can give you
is that when planning comparisons the chances are that your first contrast should be one
that compares all of the experimental groups with the control group (or groups). Once you
have done this first comparison, any remaining comparisons will depend upon which of
the experimental groups you predict will differ.

To illustrate these principles, Figures 10.6 and 10.7 show the contrasts that might be 
done in a four-group experiment. The first thing to notice is that in both scenarios there 
are three possible comparisons (one less than the number of groups). Also, every contrast 
compares only two chunks of variance. What’s more, in both scenarios the first contrast is 
the same: the experimental groups are compared against the control group or groups. In 
Figure 10.6 there was only one control condition and so this portion of variance is used 
only in the first contrast (because it cannot be broken down any further). In Figure 10.7 
there were two control groups, and so the portion of variance due to the control conditions 
(contrast 1) can be broken down again so as to see whether or not the scores in the control 
groups differ from each other (contrast 3). 

In Figure 10.6, the first contrast contains a chunk of variance that is due to the three
experimental groups and this chunk of variance is broken down by first looking at whether
groups E1 and E2 differ from E3 (contrast 2). It is equally valid to use contrast 2 to
compare groups E1 and E3 to E2, or to compare groups E2 and E3 to E1. The exact
comparison that you choose depends upon your hypotheses. For contrast 2 in Figure 10.6 to be
valid we need to have a good reason to expect group E3 to be different from the other
two groups. The third comparison in Figure 10.6 depends on the comparison chosen for
contrast 2. Contrast 2 necessarily had to involve comparing two experimental
groups against a third, and the experimental groups chosen to be combined
must be separated in the final comparison. As a final point, you’ll notice that
in Figures 10.6 and 10.7, once a group has been singled out in a comparison,
it is never used in any subsequent contrasts.

FIGURE 10.6 Partitioning variance for planned comparisons in a four-group experiment using
one control group

FIGURE 10.7 Partitioning variance for planned comparisons in a four-group experiment using
two control groups

When we carry out a planned contrast we compare ‘chunks’ of variance, 
and these chunks often consist of several groups. It is perhaps confusing to 
understand exactly what these contrasts tell us. Well, when you design a 
contrast that compares several groups to one other group, you are comparing
the means of the groups in one chunk with the mean of the group in
the other chunk. As an example, for the Viagra data I suggested that an appropriate first 
contrast would be to compare the two dose groups with the placebo group. The means of 
the groups are 2.20 (placebo), 3.20 (low dose) and 5.00 (high dose) and so the first comparison,
which compared the two experimental groups to the placebo, is comparing 2.20
(the mean of the placebo group) to the average of the other two groups ((3.20 + 5.00)/2 
= 4.10). If this first contrast turns out to be significant, then we can conclude that 4.10 
is significantly greater than 2.20, which in terms of the experiment tells us that the average
of the experimental groups is significantly different from the average of the controls.
You can probably see that logically this means that, if the standard errors are the same, 
the experimental group with the highest mean (the high-dose group) will be significantly 
different from the mean of the placebo group. However, the experimental group with the 
lower mean (the low-dose group) might not necessarily differ from the placebo group; 
we have to use the final comparison to make sense of the experimental conditions. For
the Viagra data the final comparison looked at whether the two experimental groups differ
(i.e., is the mean of the high-dose group significantly different from the mean of the
low-dose group?). If this comparison turns out to be significant then we can conclude that 
having a high dose of Viagra significantly affected libido compared to having a low dose. 
If the comparison is non-significant then we have to conclude that the dosage of Viagra 
made no significant difference to libido. In this latter scenario it is likely that both doses 
affect libido more than placebo, whereas the former case implies that having a low dose 
may be no different than having a placebo. However, the word implies is important here: 
it is possible that the low-dose group might not differ from the placebo. To be completely 
sure we must carry out post hoc tests. 
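
As a quick check of the arithmetic behind the first contrast, the following R sketch computes the two chunk means from the group means quoted above. The object names are illustrative and are not taken from the book's data files.

```r
# Group means for the Viagra data, as quoted in the text
means <- c(placebo = 2.20, low = 3.20, high = 5.00)

# Contrast 1 compares the mean of the experimental chunk with the control chunk
experimental <- mean(means[c("low", "high")])  # (3.20 + 5.00) / 2
control <- unname(means["placebo"])

# The quantity that the first contrast tests
difference <- experimental - control
print(c(experimental = experimental, control = control, difference = difference))
```

The sketch only reproduces the comparison of means; whether the difference is significant still depends on the standard error, which needs the raw scores.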


10.4.2. Defining contrasts using weights


Hopefully by now you have got some idea of how to plan which comparisons to do (i.e., if 
your brain hasn’t exploded by now). Much as I’d love to tell you that all of the hard work 
is now over and R will magically carry out the comparisons that you’ve selected, it won’t. 
To get R to carry out planned comparisons we need to tell it which groups we would like 
to compare, and doing this can be quite complex. In fact, when we carry out contrasts we 
assign values to certain variables in the regression model (sorry, I’m afraid that I have to 
start talking about regression again) - just as we did when we used dummy coding for the 
main ANOVA. To carry out contrasts we assign certain values to the dummy variables in 
the regression model. Whereas before we defined the experimental groups by assigning the 
dummy variables values of 1 or 0, when we perform contrasts we use different values to 
specify which groups we would like to compare. The resulting coefficients in the regression
model ($b_1$ and $b_2$) represent the comparisons in which we are interested. The values
assigned to the dummy variables are known as weights. 

This procedure can seem horribly confusing, but there are a few basic rules for assigning 
values to the dummy variables to obtain the comparisons you want. I will explain these 
simple rules before showing how the process actually works. Remember the previous section
when you read through these rules, and remind yourself of what I mean by a ‘chunk’
of variation. 

• Rule 1: Choose sensible comparisons. Remember that you want to compare only two 
chunks of variation and that if a group is singled out in one comparison, that group 
should be excluded from any subsequent contrasts. 

• Rule 2: Groups coded with positive weights will be compared against groups coded 
with negative weights. So, assign one chunk of variation positive weights and the 
opposite chunk negative weights. 

• Rule 3: The sum of weights for a comparison should be zero. If you add up the 
weights for a given contrast the result should be zero. 

• Rule 4: If a group is not involved in a comparison, automatically assign it a weight of 
0. If we give a group a weight of 0 then this eliminates that group from all calculations. 

• Rule 5: For a given contrast, the weights assigned to the group(s) in one chunk of 
variation should be equal to the number of groups in the opposite chunk of variation. 

OK, let’s follow some of these rules to derive the weights for the Viagra data. The first 
comparison we chose was to compare the two experimental groups against the control: 

[Figure: contrast 1 compares chunk 1 (low dose and high dose) with chunk 2 (placebo)]


Therefore, the first chunk of variation contains the two experimental groups, and the 
second chunk contains only the placebo group. Rule 2 states that we should assign one 
chunk positive weights, and the other negative. It doesn’t matter which way round we 
do this, but for convenience let’s assign chunk 1 positive weights, and chunk 2 negative 
weights: 

Using rule 5, the weight we assign to the groups in chunk 1 should be equivalent to the
number of groups in chunk 2. There is only one group in chunk 2 and so we assign each
group in chunk 1 a weight of 1. Likewise, we assign a weight to the group in chunk 2 that is
equal to the number of groups in chunk 1. There are two groups in chunk 1 so we give the
placebo group a weight of 2. Then we combine the sign of the weights with the magnitude
to give us weights of −2 (placebo), 1 (low dose) and 1 (high dose):

Rule 3 states that for a given contrast, the weights should add up to zero, and by following
rules 2 and 5 this rule will always be followed (if you haven’t followed these rules properly
then this will become clear when you add the weights). So, let’s check by adding the
weights: sum of weights = 1 + 1 − 2 = 0.

The second contrast was to compare the two experimental groups and so we want to 
ignore the placebo group. Rule 4 tells us that we should automatically assign this group a 
weight of 0 (because this will eliminate this group from any calculations). We are left with 
two chunks of variation: chunk 1 contains the low-dose group and chunk 2 contains the 
high-dose group. By following rules 2 and 5 it should be obvious that one group is assigned 
a weight of +1 while the other is assigned a weight of −1. The control group is ignored
(and so given a weight of 0). If we add the weights for contrast 2 we should find that they
again add up to zero: sum of weights = 1 − 1 + 0 = 0.
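
The rules above can be sketched as a small R function. This is a minimal illustration rather than anything built into R: `contrast_weights()` and its arguments are hypothetical names, and the choice of which chunk is positive is arbitrary, as rule 2 allows.

```r
# Hypothetical helper: derive the weights for one contrast from its two chunks
contrast_weights <- function(groups, positive, negative) {
  stopifnot(all(positive %in% groups), all(negative %in% groups),
            length(intersect(positive, negative)) == 0)
  # Rule 4: groups not involved in the contrast keep a weight of 0
  weights <- setNames(numeric(length(groups)), groups)
  # Rules 2 and 5: magnitude = number of groups in the opposite chunk,
  # one chunk positive and the other negative
  weights[positive] <- length(negative)
  weights[negative] <- -length(positive)
  # Rule 3: the weights must sum to zero
  stopifnot(sum(weights) == 0)
  weights
}

groups <- c("placebo", "low", "high")
contrast1 <- contrast_weights(groups, positive = c("low", "high"), negative = "placebo")
contrast2 <- contrast_weights(groups, positive = "high", negative = "low")
print(rbind(contrast1, contrast2))
```

Here contrast 2 gives the high-dose group the positive weight, which matches the codes in Table 10.4; swapping the two arguments would only flip the sign of the resulting coefficient.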

[Figure: deriving the weights for contrast 2 (sign of weight, magnitude and weight for each chunk; the placebo group is not in the contrast and receives a weight of 0)]

The weights for each contrast are codings for the two dummy variables in equation
(10.2). Hence, these codings can be used in a multiple regression model in which $b_1$ represents
contrast 1 (comparing the experimental groups to the control), $b_2$ represents contrast
2 (comparing the high-dose group to the low-dose group), and $b_0$ is the grand mean:

$$\text{libido}_i = b_0 + b_1\,\text{contrast}_{1i} + b_2\,\text{contrast}_{2i} \qquad (10.9)$$

Each group is specified now not by the 0 and 1 coding scheme that we initially used, but 
by the coding scheme for the two contrasts. A code of −2 for contrast 1 and a code of 0
for contrast 2 identifies participants in the placebo group. Likewise, the high-dose group
is identified by a code of 1 for both variables, and the low-dose group has a code of 1 for
one contrast and a code of −1 for the other (see Table 10.4).


Table 10.4 Orthogonal contrasts for the Viagra data

Group       Dummy variable 1   Dummy variable 2   Product
            (contrast 1)       (contrast 2)       (contrast 1 × contrast 2)
Placebo     −2                 0                  0
Low dose    1                  −1                 −1
High dose   1                  1                  1
Total       0                  0                  0


It is important that the weights for a comparison sum to zero because it ensures that you 
are comparing two unique chunks of variation. Therefore, we can perform a t-test. A more 
important consideration is that when you multiply the weights for a particular group, these 
products should also add up to zero (see the final column of Table 10.4). If the products add 
to zero then we can be sure that the contrasts are independent or orthogonal. It is important
for interpretation that contrasts are orthogonal. When we used dummy variable
coding and ran a regression on the Viagra data, I commented that we
couldn’t look at the individual t-tests done on the regression coefficients 
because the familywise error rate is inflated (see section 10.4). However, if the 
contrasts are independent then the t-tests done on the b coefficients are also 
independent and so the resulting p-values are uncorrelated. You might think 
that it is very difficult to ensure that the weights you choose for your contrasts 
conform to the requirements for independence but, provided you follow the 
rules I have laid out, you should always derive a set of orthogonal comparisons.
You should double-check by looking at the sum of the multiplied weights
and if this total is not zero then go back to the rules and see where you have gone wrong (see 
the final column of Table 10.4). 
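
The double-check described above takes only a few lines of R. This is a minimal sketch using the weights from Table 10.4; the object names are illustrative.

```r
# Contrast weights from Table 10.4 (order: placebo, low dose, high dose)
contrast1 <- c(-2, 1, 1)
contrast2 <- c(0, -1, 1)

# Each set of weights should sum to zero
sum(contrast1)
sum(contrast2)

# Multiply the weights group by group (the final column of Table 10.4)
products <- contrast1 * contrast2
products

# If the products sum to zero, the two contrasts are orthogonal
sum(products)
```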

Earlier on, I mentioned that when you used contrast codings in dummy variables in 
a regression model the b-values represented the differences between the means that the 
contrasts were designed to test. Although it is reasonable for you to trust me on this issue, 
for the more advanced students I’d like to take the trouble to show you how the regression 
model works (this next part is not for the faint-hearted and so those with an equation phobia
should move on to the next section!). When we do planned contrasts, the intercept $b_0$ is
equal to the grand mean (i.e., the value predicted by the model when group membership is
not known), which when group sizes are equal is:

$$b_0 = \text{grand mean} = \frac{\bar{X}_{\text{high}} + \bar{X}_{\text{low}} + \bar{X}_{\text{placebo}}}{3}$$

Placebo group: If we use the contrast codings for the placebo group (see Table 10.4), the
predicted value of libido equals the mean of the placebo group. The regression equation
can, therefore, be expressed as:

$$\text{libido}_i = b_0 + b_1\,\text{contrast}_1 + b_2\,\text{contrast}_2$$

$$\bar{X}_{\text{placebo}} = \frac{\bar{X}_{\text{high}} + \bar{X}_{\text{low}} + \bar{X}_{\text{placebo}}}{3} + (-2b_1) + (b_2 \times 0)$$

Now, if we rearrange this equation and then multiply everything by 3 (to get rid of the
fraction) we get:

$$2b_1 = \frac{\bar{X}_{\text{high}} + \bar{X}_{\text{low}} + \bar{X}_{\text{placebo}}}{3} - \bar{X}_{\text{placebo}}$$

$$6b_1 = \bar{X}_{\text{high}} + \bar{X}_{\text{low}} + \bar{X}_{\text{placebo}} - 3\bar{X}_{\text{placebo}} = \bar{X}_{\text{high}} + \bar{X}_{\text{low}} - 2\bar{X}_{\text{placebo}}$$

We can then divide everything by 2 to reduce the equation to its simplest form:

$$3b_1 = \frac{\bar{X}_{\text{high}} + \bar{X}_{\text{low}}}{2} - \bar{X}_{\text{placebo}}$$

This equation shows that $b_1$ represents the difference between the average of the two
experimental groups and the control group:

$$3b_1 = \frac{\bar{X}_{\text{high}} + \bar{X}_{\text{low}}}{2} - \bar{X}_{\text{placebo}} = \frac{5 + 3.2}{2} - 2.2 = 1.9$$
We planned contrast 1 to look at the difference between the average of the experimental 
groups and the control, and so it should now be clear how $b_1$ represents this difference.
The observant among you will notice that rather than being the true value of the difference
between experimental and control groups, $b_1$ is actually a third of this difference ($b_1$ = 1.9/3
= 0.633). The reason for this division is that the familywise error is controlled by making 
the regression coefficient equal to the actual difference divided by the number of groups in 
the contrast (in this case 3). 

High-dose group: For the situation in which the codings for the high-dose group (see
Table 10.4) are used, the predicted value of libido is the mean for the high-dose group, and
so the regression equation becomes:

$$\text{libido}_i = b_0 + b_1\,\text{contrast}_1 + b_2\,\text{contrast}_2$$

$$\bar{X}_{\text{high}} = b_0 + (b_1 \times 1) + (b_2 \times 1)$$

$$b_2 = \bar{X}_{\text{high}} - b_1 - b_0$$

We know already what $b_1$ and $b_0$ represent, so we place these values into the equation and
then multiply by 3 to get rid of some of the fractions:

$$b_2 = \bar{X}_{\text{high}} - b_1 - b_0$$

$$b_2 = \bar{X}_{\text{high}} - \frac{1}{3}\left[\frac{\bar{X}_{\text{high}} + \bar{X}_{\text{low}}}{2} - \bar{X}_{\text{placebo}}\right] - \frac{\bar{X}_{\text{high}} + \bar{X}_{\text{low}} + \bar{X}_{\text{placebo}}}{3}$$

$$3b_2 = 3\bar{X}_{\text{high}} - \left[\frac{\bar{X}_{\text{high}} + \bar{X}_{\text{low}}}{2} - \bar{X}_{\text{placebo}}\right] - \left(\bar{X}_{\text{high}} + \bar{X}_{\text{low}} + \bar{X}_{\text{placebo}}\right)$$

If we multiply everything by 2 to get rid of the other fraction, expand all of the brackets
and then simplify the equation we get:

$$6b_2 = 6\bar{X}_{\text{high}} - \left(\bar{X}_{\text{high}} + \bar{X}_{\text{low}} - 2\bar{X}_{\text{placebo}}\right) - 2\left(\bar{X}_{\text{high}} + \bar{X}_{\text{low}} + \bar{X}_{\text{placebo}}\right)$$

$$= 6\bar{X}_{\text{high}} - \bar{X}_{\text{high}} - \bar{X}_{\text{low}} + 2\bar{X}_{\text{placebo}} - 2\bar{X}_{\text{high}} - 2\bar{X}_{\text{low}} - 2\bar{X}_{\text{placebo}}$$

$$= 3\bar{X}_{\text{high}} - 3\bar{X}_{\text{low}}$$

Finally, we can divide the equation by 6 to find out what $b_2$ represents (remember that
3/6 = 1/2):

$$b_2 = \frac{1}{2}\left(\bar{X}_{\text{high}} - \bar{X}_{\text{low}}\right)$$

We planned contrast 2 to look at the difference between the experimental groups:

$$\bar{X}_{\text{high}} - \bar{X}_{\text{low}} = 5 - 3.2 = 1.8$$

It should now be clear how $b_2$ represents this difference. Again, rather than being the
absolute value of the difference between the experimental groups, $b_2$ is actually half of this
difference (1.8/2 = 0.9). The familywise error is again controlled, by making the regression
coefficient equal to the actual difference divided by the number of groups in the contrast
(in this case 2).
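
The algebra above can be checked numerically. The following R sketch is only an illustration: it treats the three group means as predicted values and solves the three regression equations (one per group) for $b_0$, $b_1$ and $b_2$, using the contrast codes from Table 10.4. The object names are illustrative.

```r
# Group means quoted in the text (order: placebo, low dose, high dose)
means <- c(placebo = 2.2, low = 3.2, high = 5.0)

# One row per group: intercept, contrast 1 code, contrast 2 code (Table 10.4)
design <- cbind(intercept = 1,
                contrast1 = c(-2, 1, 1),
                contrast2 = c(0, -1, 1))

# Solve design %*% b = means for the three coefficients
b <- as.numeric(solve(design, means))
names(b) <- c("b0", "b1", "b2")
round(b, 3)

# Compare with the derivation: grand mean, 1.9 / 3 and 1.8 / 2
c(grand_mean = mean(means), third = 1.9 / 3, half = 1.8 / 2)
```

Because the model has as many coefficients as groups, it reproduces the group means exactly, which is why solving the equations gives the same values as the derivation.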




SELF-TEST

To illustrate these principles, I have created a file
called Contrast.dat in which the Viagra data are
coded using the contrast coding scheme used in
this section. Run multiple regression analyses on
these data using libido as the outcome and using
dummy1 and dummy2 as the predictor variables
(leave all default options).



Output 10.2 shows the result of this regression. The F-statistic for the model is the same 
as when dummy coding was used (compare it to Output 10.1), showing that the model 
fit is the same (it should be because the model represents the group means and these have 
not changed); however, the regression coefficients have now changed. The first thing to 
notice is that the intercept is the grand mean, 3.467 (see, I wasn’t telling lies). Second, the 
regression coefficient for contrast 1 is one-third of the difference between the average of 
the experimental conditions and the control condition (see above). Finally, the regression 
coefficient for contrast 2 is half of the difference between the experimental groups (see 
above). So, when a planned comparison is done in ANOVA a t-test is conducted comparing
the mean of one chunk of variation with the mean of a different chunk. From the
significance values of the t-tests we can see that our experimental groups were significantly
different from the control (p < .05) but that the experimental groups were not significantly
different (p > .05).




Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.4667     0.3621   9.574 5.72e-07 ***
dummy1        0.6333     0.2560   2.474   0.0293 *
dummy2        0.9000     0.4435   2.029   0.0652 .
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.402 on 12 degrees of freedom
Multiple R-squared: 0.4604, Adjusted R-squared: 0.3704
F-statistic: 5.119 on 2 and 12 DF, p-value: 0.02469

Output 10.2
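
To see where the t and p columns of Output 10.2 come from, the sketch below recomputes them from the printed estimates and standard errors. It assumes only what the output shows (12 residual degrees of freedom and two-tailed tests); because the printed values are rounded, the results agree with the output only to about three decimal places.

```r
# Estimates and standard errors as printed in Output 10.2
estimate  <- c(intercept = 3.4667, dummy1 = 0.6333, dummy2 = 0.9000)
std_error <- c(intercept = 0.3621, dummy1 = 0.2560, dummy2 = 0.4435)
df_resid  <- 12  # residual degrees of freedom from the output

# Each t-value is the estimate divided by its standard error
t_value <- estimate / std_error

# Two-tailed p-value from the t-distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

round(cbind(t_value, p_value), 4)
```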
CRAMMING SAM’S TIPS

Planned contrasts

• After an ANOVA you need more analysis to find out which groups differ.

• When you have generated specific hypotheses before the experiment, use planned contrasts.

• Each contrast compares two ‘chunks’ of variance. (A chunk can contain one or more groups.)

• The first contrast will usually be experimental groups vs. control groups.

• The next contrast will be to take one of the chunks that contained more than one group (if there were any) and divide it into
two chunks.

• You then repeat this process: if there are any chunks in previous contrasts that contained more than one group that haven’t
already been broken down into smaller chunks, then create a new contrast that breaks it down into smaller chunks.

• Carry on creating contrasts until each group has appeared in a chunk on its own in one of your contrasts.

• You should end up with one less contrast than the number of experimental conditions. If not, you’ve done it wrong.

• In each contrast assign a ‘weight’ to each group that is the value of the number of groups in the opposite chunk in that
contrast.

• For a given contrast, randomly select one chunk, and for the groups in that chunk change their weights to be negative
numbers.

• Breathe a sigh of relief.


10.4.3. Non-orthogonal comparisons


I have spent a lot of time labouring how to design appropriate orthogonal comparisons without
mentioning the possibilities that non-orthogonal contrasts provide. Non-orthogonal
contrasts are comparisons that are in some way related, and the best way to get them is to
disobey rule 1 in the previous section. Using my cake analogy again, non-orthogonal comparisons
are where you slice up your cake and then try to stick slices of cake together again.
So, for the Viagra data a set of non-orthogonal contrasts might be to have the same initial 
contrast (comparing experimental groups against the placebo), but then to compare the 
high-dose group to the placebo. This disobeys rule 1 because the placebo group is singled 
out in the first contrast but used again in the second contrast. The coding for this set of 
contrasts is shown in Table 10.5, and by looking at the last column it is clear that when you 
multiply and add the codings from the two contrasts the sum is not zero. This tells us that 
the contrasts are not orthogonal. 


Table 10.5 Non-orthogonal contrasts for the Viagra data

Group       Dummy variable 1   Dummy variable 2   Product
            (contrast 1)       (contrast 2)       (contrast 1 × contrast 2)
Placebo     −2                 −1                 2
Low dose    1                  0                  0
High dose   1                  1                  1
Total       0                  0                  3

There is nothing intrinsically wrong with performing non-orthogonal
contrasts. However, if you choose to perform this type of contrast you must
be very careful about how you interpret the results. With non-orthogonal
contrasts, the comparisons you do are related and so the resulting test statistics
and p-values will be correlated to some extent. For this reason you
should use a more conservative probability level to accept that a given contrast
is statistically meaningful (see section 10.5).
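
For a set of contrasts stored as columns of a matrix, R's `crossprod()` returns every pairwise sum of products at once, which gives a compact way to spot non-orthogonality. The sketch below uses the codes from Table 10.5; the object names are illustrative.

```r
# Codes from Table 10.5 (rows: placebo, low dose, high dose)
weights <- cbind(contrast1 = c(-2, 1, 1),
                 contrast2 = c(-1, 0, 1))
rownames(weights) <- c("placebo", "low", "high")

# Each contrast still sums to zero...
colSums(weights)

# ...but the off-diagonal entry (sum of products) is 3, not 0,
# so the two contrasts are not orthogonal
crossprod(weights)
```

The off-diagonal value matches the total in the final column of Table 10.5.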


10.4.4. Standard contrasts


Although under most circumstances you will design your own contrasts, there are special 
contrasts that have been designed to compare certain situations. Some of these contrasts 
are orthogonal, whereas others are non-orthogonal. 

Table 10.6 shows the contrasts that are available in R using the contrasts() function. This 
function is used to code any categorical variable and the resulting codings can be used in 
pretty much any linear model (ANOVA, regression, logistic regression, etc.). Although 
the exact codings are not provided in Table 10.6, examples of the comparisons done in a 
three- and four-group situation are given (where the groups are labelled 1, 2, 3 and 1, 2, 
3, 4, respectively). When you code variables R will treat the lowest-value code as group 1, 
the next highest code as group 2, and so on. Therefore, depending on which comparisons 
you want to make you should code your grouping variable appropriately (and then use 
Table 10.6 as a guide to which comparisons R will carry out). One thing that clever readers 
might notice about the contrasts in Table 10.6 is that some are orthogonal (i.e., Helmert 
contrasts) while others are non-orthogonal (e.g., treatment). You might also notice that the 
comparisons calculated using treatment contrasts are the same as those given by using the
dummy variable coding described in Table 10.2.
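
The coding matrices behind these standard contrasts can be inspected directly. The sketch below is an illustration under a few assumptions: `is_orthogonal_set()` is a hypothetical helper (not part of R), and `dose` is a made-up three-level factor. Note also that R's `contr.helmert()` compares each level with the mean of the preceding levels, so its ordering may not match the description in Table 10.6.

```r
# Hypothetical helper: TRUE if every column sums to zero and every pair of
# columns has a zero sum of products
is_orthogonal_set <- function(codes) {
  sums_to_zero <- all(abs(colSums(codes)) < 1e-8)
  cross <- crossprod(codes)
  products_zero <- all(abs(cross[upper.tri(cross)]) < 1e-8)
  sums_to_zero && products_zero
}

# Built-in coding matrices for a three-group variable
contr.treatment(3)
contr.helmert(3)

is_orthogonal_set(contr.treatment(3))  # treatment codes do not sum to zero
is_orthogonal_set(contr.helmert(3))

# Attaching a coding scheme to a factor (illustrative factor)
dose <- factor(c("placebo", "low", "high"),
               levels = c("placebo", "low", "high"))
contrasts(dose) <- contr.helmert(3)
contrasts(dose)
```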


Table 10.6 Standard contrasts available in R

[...] compared to the mean effect of all subsequent categories

Contrast   Three groups   Four groups
1          [...]          [...] 4)
2          2 vs. 3        2 vs. (3, 4)
3                         3 vs. 4

CHAPTER 10 COMPARING SEVERAL MEANS: ANOVA (GLM 1) 




10.4.5. Polynomial contrasts: trend analysis


One type of contrast deliberately omitted from Table 10.6 is the polynomial contrast, which 
can be obtained using contr.poly(). This contrast tests for trends in the data, and in its most 
basic form it looks for a linear trend (i.e., that the group means increase proportionately). 
However, there are other trends such as quadratic, cubic and quartic trends that can be 
examined. Figure 10.8 shows examples of the types of trend that can exist in data sets. The 
linear trend should be familiar to you all by now and represents a simple proportionate 
change in the value of the dependent variable across ordered categories (the diagram shows 
a positive linear trend, but of course it could be negative). A quadratic trend is where there is 
one change in the direction of the line (e.g., the line is curved in one place). An example of 
this might be a situation in which a drug enhances performance on a task at first, but then 
as the dose increases the performance drops again. To find a quadratic trend you need at 
least three groups (because in the two-group situation there are not enough categories of 
the independent variable for the means of the dependent variable to change one way and 
then another). A cubic trend is where there are two changes in the direction of the trend. 
So, for example, the mean of the dependent variable at first goes up across the first couple 
of categories of the independent variable, then across the succeeding categories the means 
go down, but then across the last few categories the means rise again. To have two changes 
in the direction of the mean you must have at least four categories of the independent variable.
The final trend that you are likely to come across is the quartic trend, and this trend has
three changes of direction (so you need at least five categories of the independent variable). 

Polynomial trends should be examined in data sets in which it makes sense to order the 
categories of the independent variable (so, for example, if you have administered five doses 
of a drug it makes sense to examine the five doses in order of magnitude). For the Viagra 
data there are only three groups and so we can expect to find only a linear or quadratic 
trend (and it would be pointless to test for any higher-order trends). 

FIGURE 10.8 Linear, quadratic, cubic and quartic trends across five groups

Each of these trends has a set of codes for the dummy variables in the regression model, so
we are doing the same thing that we did for planned contrasts except that the codings have
already been devised to represent the type of trend of interest. In fact, the graphs in Figure 10.8
have been constructed by plotting the coding values for the five groups. Also, if you add the
codes for a given trend the sum will equal zero and if you multiply the codes you will find that
the sum of the products also equals zero. Hence, these contrasts are orthogonal. The great thing
about these contrasts is that you don’t need to construct your own coding values to do them,
because the codings already exist.
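
The two properties just described can be verified for the five-group case with a short R sketch. It uses `contr.poly()` from base R; the tolerance of 1e-8 is an arbitrary choice to allow for floating-point rounding.

```r
# Polynomial codes for five ordered groups: one column per trend
codes <- contr.poly(5)
colnames(codes)   # linear, quadratic, cubic and quartic columns
round(codes, 3)

# The codes for each trend sum to zero
round(colSums(codes), 10)

# Sums of products: the off-diagonal entries are zero, so the trends are
# orthogonal (R also scales each column so the diagonal entries equal 1)
round(crossprod(codes), 10)
```

Plotting any one column against the group number (for example `plot(1:5, codes[, ".Q"], type = "b")`) should give the general shape of the corresponding trend.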


10.5. Post hoc procedures


Often it is the case that you have no specific a priori predictions about the data you have 
collected and instead you are interested in exploring the data for any between-group differences
between means that exist. This procedure is sometimes called data mining or exploring
data. Now, personally I have always thought that these two terms have certain ‘rigging
the data’ connotations to them and so I prefer to think of these procedures as ‘finding the 
differences that I should have predicted if only I’d been clever enough’. 

Post hoc tests consist of pairwise comparisons that are designed to compare all different
combinations of the treatment groups. So, it is rather like taking every pair of groups
and then performing a t-test on each pair of groups. Now, this might seem like a particularly
stupid thing to say in the light of what I have already told you about the problems
of inflated familywise error rates. However, pairwise comparisons control the familywise
error by correcting the level of significance for each test such that the overall Type I error
rate (α) across all comparisons remains at .05. There are several ways in which the familywise
error rate can be controlled. The most popular (and easiest) way is to divide α by
the number of comparisons, k, thus ensuring that the cumulative Type I error is below .05:

$$p_{\text{crit}} = \frac{\alpha}{k}$$


Therefore, if we conduct 10 tests, we use .005 as our criterion for significance. This method 
is known as the Bonferroni correction (Figure 10.9). There is a trade-off for controlling the 
familywise error rate, and that is a loss of statistical power. This means that the probability 
of rejecting an effect that does actually exist is increased (this is called a Type II error). 
By being more conservative in the Type I error rate for each comparison, we increase the 
chance that we will miss a genuine difference in the data. 


FIGURE 10.9 Carlo Bonferroni before the celebrity of his correction led to drink, drugs and statistics groupies

Let’s look at this method, and some variations, using an example. Some research has suggested
that children wearing superhero costumes might be more likely to harm themselves
because of the unrealistic impression of invincibility that these costumes could create. 
For example, there are case studies of children reporting to hospital with severe injuries 
because of falling from windows or trying ‘to initiate flight without having planned for 
landing strategies’ (Davies, Surridge, Hole, & Munro-Davies, 2007). Having spent a lot 
of my childhood dressed in various costumes, I can relate to the imagined power that it 
bestows upon you; even now, I have been known to dress up as Fisher by donning a beard 
and glasses and trailing a goat around on a lead in the hope that it might make me more 
knowledgeable about statistics. 

Imagine that we wanted to see whether different types of superhero costumes led to 
more severe injuries. We measured the severity of injury on a scale from 0 to 100 (0 = not 
at all injured, 100 = dead), and made a note of the type of costume a child was wearing. 
Let’s also entertain the possibility that children fell (probably because of trying to fly) into 
four groups: Spiderman, Superman, the Hulk, and Teenage Mutant Ninja Turtles (let’s face 
it, who wouldn’t want to dress up as a ninja turtle?). These entirely fabricated data are in 
Superhero.dat. There is a task at the end of the chapter to analyse these data, but for now, 
let’s look at comparing all of these groups; we would end up with the six comparisons 
in Table 10.7. The table shows the unadjusted p-value that you get for each comparison. 
The critical value of p based on a Bonferroni correction for each comparison is the Type
I error rate divided by the number of comparisons, α/k = .05/6 = .0083. If the observed
p is smaller than the critical value then the comparison is significant (at α = .05). In this
case, there is a significant difference between ninja turtle and Superman costumes (because 
.0000 is less than .0083) and between Superman and Hulk costumes (because .0014 is 
smaller than .0083). In all other cases p is bigger than the critical value so the difference is 
not significant. 
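
The Bonferroni decisions above can be reproduced in a few lines of R. This is a minimal sketch that takes the unadjusted p-values as printed in Table 10.7 (so the first one is the rounded value .0000); the vector names are illustrative.

```r
# Unadjusted p-values from Table 10.7
p <- c("NT-Super" = .0000, "Super-Hulk" = .0014, "Spider-Super" = .0127,
       "NT-Spider" = .0252, "NT-Hulk" = .1704, "Spider-Hulk" = .3431)

alpha <- .05
k <- length(p)          # number of comparisons

# Bonferroni: the same critical value for every comparison
p_crit <- alpha / k
round(p_crit, 4)

# Comparisons whose observed p falls below the critical value
significant <- p < p_crit
names(p)[significant]
```

An equivalent route is to adjust the p-values instead of the criterion: `p.adjust(p, method = "bonferroni")` multiplies each p by k (capped at 1), and the adjusted values are then compared with α itself.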


Table 10.7 Critical values for p based on variations on Bonferroni (* indicates that a comparison
is significant)

                         Bonferroni             Holm                        Benjamini-Hochberg
Comparison     p         p_crit = α/k    Sig.   j    p_crit = α/j    Sig.   j    p_crit = (j/k)α
NT-Super       .0000     .0083           *      6    .0083           *      1    .0083
Super-Hulk     .0014     .0083           *      5    .0100           *      2    .0167
Spider-Super   .0127     .0083                  4    .0125                  3    .0250
NT-Spider      .0252     .0083                  3    .0167                  4    .0333
NT-Hulk        .1704     .0083                  2    .0250                  5    .0417
Spider-Hulk    .3431     .0083                  1    .0500                  6    .0500

There are various improvements that have been made to the Bonferroni correction over
the years and the general principle behind them is easy to understand so it’s worth explaining.
In an attempt to make the Bonferroni correction less conservative (i.e., to make it
better at detecting differences that actually exist), authors such as Hommel, Hochberg
and Holm⁵ have suggested stepped approaches (Hochberg, 1988; Holm, 1979; Hommel,
1988). Holm’s method is very simple to explain. You begin by computing the p-value for
all of the pairs of groups in your data, and you then order them from smallest to largest. We
assign each p in the list an index (I’ve labelled it j) that tells us where in the list it falls.
Table 10.7 shows this process: for the largest p we assign an index of 1, the next largest 2,
and so on until the smallest one, which will be indexed as the number of comparisons (k),
in this case 6. The critical value for a given comparison is the Type I error rate divided by
the index variable (j):

$$p_{\text{crit}} = \frac{\alpha}{j}$$

5 Their names all begin with ‘Ho’, which I find a strange coincidence. If your surname begins with ‘Ho’ too,
beware: a life in multiple comparison research could await you.

Starting from the smallest p-value, this means that you begin with the normal Bonferroni 
correction because j = k for this first comparison. However, notice that in subsequent 
comparisons we do not correct for every comparison made, instead we correct only for the 
remaining comparisons. Unlike the standard Bonferroni correction, the critical value of p 
gets bigger (and less conservative) for each comparison. The key idea behind this method 
is it is stepped. This means that as long as a comparison is significant, we proceed to the 
next one, but at the point that we encounter a non-significant comparison we stop and 
assume that all remaining comparisons are nonsignificant also. In Table 10.7, we see a 
significant difference between Ninja Turtle and Superman costumes (because .0000 is less 
than .0083); therefore, we move onto the next one down and see a significant difference 
between Superman and Hulk costumes (because .0014 is smaller than .01); therefore we 
move down again but find a non-significant difference between Spiderman and Superman 
costumes (because .0127 is larger than .0125); because of this non-significance we stop and 
do not consider any further comparisons. 
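
The stepping logic can be sketched in a few lines of base R. This is an illustrative sketch rather than code from the chapter: the six p-values are the ones quoted from Table 10.7 in the text, and the object names (p, crit, significant, builtin) are made up for the example.

```r
# p-values quoted in the text from Table 10.7
p <- c(.0000, .0014, .0127, .0252, .1704, .3431)
alpha <- .05
k <- length(p)

p <- sort(p)        # Holm works down from the smallest p
j <- k:1            # smallest p gets index k, largest gets index 1
crit <- alpha / j   # critical value for each comparison

# Step down: a comparison counts only if it and every earlier one is below crit
significant <- cumprod(p < crit) == 1

print(data.frame(p, j, crit = round(crit, 4), significant))

# Cross-check with R's built-in Holm adjustment (adjusted p compared with alpha)
builtin <- p.adjust(p, method = "holm") < alpha
```

The built-in p.adjust() function reaches the same decisions by inflating the p-values instead of shrinking the critical values, which is usually the more convenient form in practice.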

A more modern take on this kind of sequential approach to multiple comparisons is to 
worry not about the familywise error rate, but to focus on the false discovery rate (FDR). 
By focusing on the familywise error rate we are obsessing (in some people, literally) about 
the possibility of making one or more Type I errors. The corresponding belief system can be 
summed up as ‘if I make even one Type I error then my entire set of conclusions is meaningless’. 
With a belief system like that it’s no wonder people look depressed when they’re 
analysing data. Benjamini and Hochberg think about things differently. Their belief system 
can be summed up as the rather more joyful ‘let’s try to estimate how many Type I errors 
(or false discoveries) we have made’. The FDR is simply the proportion of falsely rejected 
null hypotheses: 

$$\text{FDR} = \frac{\text{number of falsely rejected null hypotheses}}{\text{total number of rejected null hypotheses}}$$

As such, the FDR approach to multiple comparisons is less strict than Bonferroni-based 
methods because it is concerned with keeping the FDR rather than the familywise error 
rate under control. In Benjamini and Hochberg’s method (Benjamini & Hochberg, 1995, 
2000) you start by computing the p-value for all of the pairs of groups in your data. You 
then order them and, as with Holm’s method, index the order with the letter j (notice we 
order them the opposite way around to Holm’s method). For each comparison you deem 
it significant if the observed p is smaller than a critical value defined as: 

$$p_{\text{crit}} = \left(\frac{j}{k}\right)\alpha$$

Table 10.7 again shows this process. Here the smallest p has an index of 1 and the largest 
an index of k. For the smallest p-value we again have the normal Bonferroni correction 
(i.e., α/k); for the other comparisons we use a more liberal criterion. 
Like Holm’s method this procedure is stepped; however, rather than working down the 
table we work up (hence it is known as a ‘step-up’ procedure). So, we begin at the bottom 
and conclude a non-significant difference between Spiderman and Hulk costumes (because 
.3431 is greater than .05); given this non-significance we move up the table and see a 
non-significant difference between Ninja Turtle and Hulk costumes (because .1704 is greater 
than .0417); given this non-significance we again move up the table and see a significant 
difference between Ninja Turtle and Spiderman costumes (because .0252 is less than the 
critical value of .0333); because of this significance we stop and assume that all other 
comparisons are also significant. Procedurally this step-up approach is the opposite of Holm’s 
step-down procedure. 
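
The step-up logic can be sketched in base R in the same way. Again this is only an illustration under stated assumptions: the p-values are those quoted from Table 10.7, the index runs from 1 for the smallest p to k for the largest, and the object names are invented for the example.

```r
# p-values quoted in the text from Table 10.7
p <- c(.0000, .0014, .0127, .0252, .1704, .3431)
alpha <- .05
k <- length(p)

p <- sort(p)
j <- seq_len(k)            # smallest p gets index 1, largest gets index k
crit <- (j / k) * alpha    # critical value grows as we move down the table

# Step up: find the largest p that beats its critical value;
# that comparison and every one above it in the table is declared significant
below <- p < crit
last <- if (any(below)) max(which(below)) else 0
significant <- j <= last

print(data.frame(p, j, crit = round(crit, 4), significant))

# Cross-check with R's built-in Benjamini-Hochberg adjustment
builtin <- p.adjust(p, method = "BH") < alpha
```

Working up from the bottom and stopping at the first significant comparison is the same as finding the largest index whose p-value is below its critical value, which is what the sketch does.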

There are many other post hoc procedures. I have explained only a few of the main ones 
that can be implemented in R. I could go into all of the other methods in tedious detail but 
there are some excellent texts already available for those who wish to know (Klockars & 
Sax, 1986; Toothaker, 1993) and R does not implement most of them anyway. (That said, 
the nice thing about R of course is that you could write your own function to do them if 
you had a few spare hours, a maths degree, and a bottle of gin.) However, it is important 
that you have an idea of which post hoc tests perform best. ‘Best’ is a word that can mean 
many things. For post hoc procedures, deciding on what’s ‘best’ requires us to consider 
three things: whether the test controls the Type I error rate; whether the test controls the 
Type II error rate (i.e., has good statistical power); and whether the test is reliable when the 
test assumptions of ANOVA have been violated. 


10.5.1. Post hoc procedures and Type I (α) and Type II error rates 


The Type I error rate and the statistical power of a test are linked. Therefore, there is 
always a trade-off: if a test is conservative (the probability of a Type I error is small) then 
it is likely to lack statistical power (the probability of a Type II error will be high). So, it is 
important that multiple comparison procedures control the Type I error rate but without a 
substantial loss in power. If a test is too conservative then we are likely to reject differences 
between means that are, in reality, meaningful. 

Bonferroni’s and Tukey’s HSD⁶ tests both control the Type I error rate very well but are 
conservative tests (they lack statistical power). Of the two, Bonferroni has more power 
when the number of comparisons is small, whereas Tukey is more powerful when testing 
large numbers of means. Tukey generally has greater power than other tests of which you 
might have heard such as Dunn and Scheffé. Holm’s method should have more power than 
Bonferroni, and the Benjamini-Hochberg method should have more power than Holm’s 
procedure. If you are obsessed with controlling the Type I error rate, it is worth remembering 
that the Benjamini-Hochberg method does not attempt to do this: it controls the FDR. 
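
To see the ordering of power in action, the sketch below counts how many of the six comparisons each correction declares significant. It is an illustration only: it reuses the p-values quoted from Table 10.7 and relies on base R's p.adjust(); the names methods and n_sig are invented for the example.

```r
# p-values quoted in the text from Table 10.7
p <- c(.0000, .0014, .0127, .0252, .1704, .3431)
alpha <- .05

# Count the comparisons flagged as significant under each correction
methods <- c("bonferroni", "holm", "BH")
n_sig <- sapply(methods, function(m) sum(p.adjust(p, method = m) < alpha))
print(n_sig)
```

For these particular values Bonferroni and Holm agree, and the Benjamini-Hochberg method flags more comparisons. Holm can never flag fewer comparisons than Bonferroni, but one example like this cannot show how large the gain in power is in general.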



10.5.2. Post hoc procedures and violations of test assumptions 


Most research on post hoc tests has looked at whether the test performs well when the 
group sizes are different (an unbalanced design), when the population variances are very 
different, and when data are not normally distributed. The good news is that most multiple 
comparison procedures perform relatively well under small deviations from normality. The 
bad news is that they perform badly when group sizes are unequal and when population 
variances are different. 

⁶ HSD stands for ‘honest significant difference’, which has a slightly dodgy ring to it if you ask me! 

There are a variety of tests designed to deal with these situations, none of which are 
implemented in R. Hochberg’s GT2 is one such test and is worth mentioning because it 
is not implemented in R and is completely different than the Hochberg and 
Benjamini-Hochberg methods that I have already mentioned. Therefore, don’t use the Hochberg 
option in R thinking it can cope with unequal variances: it is a different test. 

Instead of telling you what can’t be done, it might be more helpful to tell you what 
can be done. There are some robust methods that have been implemented in R by Wilcox 
(2005). As with methods for the ANOVA itself, these methods are based on bootstrapping 
or trimmed means and M-estimators (both of which can also include a bootstrap). All of 
these methods are very new and so there is very little on which to base advice on what to do 
for the best. However, all methods have been shown to control the Type I error well when 
applied to some very extreme distributions. If Type I error control is your main concern then 
the bootstrap seems to offer a small advantage, and if power is your concern then there are 
some benefits to methods based on M-estimators (Wilcox, 2003). However, the bottom line 
is that using any of these methods is undoubtedly better than using a non-robust method. 


10.5.3. Summary of post hoc procedures 



The choice of comparison procedure will depend on the exact situation you have and 
whether it is more important for you to keep strict control over the familywise error rate, 
the FDR, or to have greater statistical power. However, some general guidelines can be 
drawn (Toothaker, 1993). When you have equal sample sizes and you are confident that 
your population variances are similar then Tukey has good power and tight control over 
the Type I error rate. Bonferroni is generally conservative, but if you want guaranteed 
control over the Type I error rate then this is the test to use. If there is any doubt over the 
underlying assumptions (e.g., unequal population variances) then use a robust method 
based on a bootstrap, trimmed means, or M-estimators. 


CRAMMING SAM’S TIPS 


Post hoc tests 


• After an ANOVA you need a further analysis to find out which groups differ. 

• When you have no specific hypotheses before the experiment, use post hoc tests. 

• When you have equal sample sizes and group variances are similar, use Tukey. 

• If you want guaranteed control over the Type I error rate, then use Bonferroni. 

• If there is any doubt that group variances are equal, then use a robust method (e.g., bootstrap or trimmed means). 


10.6. One-way ANOVA using R 


Hopefully you should all have some appreciation for the theory behind ANOVA, so let’s 
put that theory into practice by conducting an ANOVA test on the Viagra data. 


10.6.1. Packages for one-way ANOVA in R 


There are several packages that we will use in this chapter. If you’re using R Commander 
(see the next section) then you don’t need to worry: it will load everything it needs 
automatically. If you’re using commands (which we recommend), you will need the packages 
car (for Levene’s test), compute.es (for effect sizes), ggplot2 (for graphs), multcomp (for 
post hoc tests), pastecs (for descriptive statistics), and WRS (for robust tests). If you do not 
have these packages installed (some should be installed from previous chapters), you can 
install them by executing the following commands: 

install.packages("compute.es"); install.packages("car"); install.packages("ggplot2"); 
install.packages("multcomp"); install.packages("pastecs"); 
install.packages("WRS", repos = "http://R-Forge.R-project.org") 

You then need to load these packages by executing these commands: 

library(compute.es); library(car); library(ggplot2); library(multcomp); 
library(pastecs); library(WRS) 
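
Before loading, it can help to check which of these packages are already available. The following is a minimal sketch using base R's requireNamespace(); the names pkgs, installed and missing_pkgs are invented for the example, and it only reports what is missing rather than installing anything.

```r
# Packages named in this section
pkgs <- c("car", "compute.es", "ggplot2", "multcomp", "pastecs", "WRS")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need install.packages()
missing_pkgs <- pkgs[!installed]
print(missing_pkgs)
```

Any names printed here are the ones to pass to install.packages(), remembering that the text installs WRS from the R-Forge repository rather than from CRAN.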


10.6.2. General procedure for one-way ANOVA 


To conduct one-way ANOVA you should follow this general procedure: 

1 Enter data: obviously you need to enter your data. 

2 Explore your data: as with any analysis, it’s a good idea to begin by graphing your data 
and computing some descriptive statistics. You should also check distributional 
assumptions and use Levene’s test to check for homogeneity of variance (see Chapter 5). 

3 Compute the basic ANOVA: you can then run the main analysis of variance. 
Depending on what you found in the previous step, you might need to run a robust 
version of the test. 

4 Compute contrasts or post hoc tests: having conducted the main ANOVA you can 
follow it up with either contrasts or post hoc tests. Again, the exact methods you choose 
will depend upon what you unearth in step 2. 

We will work through these steps in turn. 


10.6.3. Entering data 


As with the independent t-test, we need to enter the data into R using a coding variable to 
specify to which of the three groups the data belong. So, the data must be entered in two 
columns (one called dose which specifies how much Viagra the participant was given and 
one called libido which indicates the person’s libido over the following week). The data 
are in the file Viagra.dat, but I recommend entering them by hand to gain practice in data 
entry. I have coded the grouping variable so that 1 = placebo, 2 = low dose and 3 = high 
dose (see section 3.5.4.3). 

This data set is small (only 15 cases); therefore, we could enter the data directly into R 
by executing the following code: 

libido<-c(3,2,1,1,4,5,2,4,2,3,7,4,5,3,6) 
dose<-gl(3,5, labels = c("Placebo", "Low Dose", "High Dose")) 
viagraData<-data.frame(dose, libido) 

These commands create a variable called libido with the 15 libido scores contained within 
it, and a variable called dose, which uses the gl() function to create a factor variable with 
three groups each containing five participants. These variables are merged into a dataframe 
called viagraData. We can look at the contents of the dataframe by executing: 

viagraData 

You will see the following displayed in the console: 



        dose libido
1    Placebo      3
2    Placebo      2
3    Placebo      1
4    Placebo      1
5    Placebo      4
6   Low Dose      5
7   Low Dose      2
8   Low Dose      4
9   Low Dose      2
10  Low Dose      3
11 High Dose      7
12 High Dose      4
13 High Dose      5
14 High Dose      3
15 High Dose      6
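
If gl() is unfamiliar, the sketch below shows what it is doing by building the same grouping variable the long way with rep() and factor(). The data are the 15 scores given above; dose_long, same_labels and group_sizes are names made up for the illustration.

```r
libido <- c(3, 2, 1, 1, 4, 5, 2, 4, 2, 3, 7, 4, 5, 3, 6)
dose <- gl(3, 5, labels = c("Placebo", "Low Dose", "High Dose"))
viagraData <- data.frame(dose, libido)

# gl(3, 5, ...) means: 3 levels, each repeated 5 times in a row.
# The same factor built by hand:
group_names <- c("Placebo", "Low Dose", "High Dose")
dose_long <- factor(rep(group_names, each = 5), levels = group_names)
same_labels <- identical(as.character(dose), as.character(dose_long))

# Quick structural checks on the dataframe
str(viagraData)
group_sizes <- table(viagraData$dose)
print(group_sizes)
```

Checking the group sizes after data entry is a cheap way to catch a score typed into the wrong group.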


10.6.4. One-way ANOVA using R Commander 


Running ANOVA using commands gives you much more versatility than R Commander. 
However, you can do a basic one-way ANOVA using R Commander. First load the data 
from the file Viagra.dat by using the Data=>Import data=>from text file, clipboard, or 
URL... menu (see section 3.7.3). This data set has two variables: dose, which is the 
grouping variable (1 = placebo, 2 = low dose, 3 = high dose); and libido, which is each 
participant’s libido score. Once the data are loaded in a dataframe (I have called the dataframe 
viagraData), you need to convert the variable dose into a factor - see section 3.6.2 to 
remind yourself how to do that. 

Once you have done that, you need to explore the data: get some descriptive statistics 
and test the assumptions. This is explained in Chapter 5. Levene’s test looks at whether 
variances across conditions are equal - in other words, it tests the assumption of 
homogeneity of variance (see section 10.3.1). Use the Statistics=>Variances=>Levene’s test... menu 
to run the analysis. The resulting dialog box is fairly self-explanatory (Figure 10.10): 
select a factor from the list labelled Groups (in this case we have only one factor, dose) and 
select the outcome variable from the list labelled Response Variable (in this case libido). By 
default, R Commander will base Levene’s test on deviations from the median, which is a 
better measure than using deviations from the mean, but you can change this option if you 
like. Click on OK to run the analysis. The resulting output is described in section 10.6.5. 

FIGURE 10.10 Levene’s test using R Commander 

FIGURE 10.11 One-way ANOVA using R Commander 

To do the ANOVA, use the Statistics=>Means=>One-way ANOVA... menu.⁷ The resulting 
dialog box is fairly self-explanatory (Figure 10.11). You need to enter a name for the 
model that you’re going to create (I have chosen viagraModel) in the box labelled Enter 
name for model:, select a factor from the list labelled Groups (in this case we have only 
one factor, dose) and select the outcome variable (in this case libido) from the list labelled 
Response Variable. You cannot do planned comparisons using R Commander, but if you 
want a basic set of post hoc tests then select Pairwise comparisons of means. Click on OK to run the 
analysis. The resulting output is described in sections 10.6.6.1 and 10.6.8.2. 

⁷ If this menu isn’t active it could be because you haven’t converted dose into a factor. You need to have at least 
one factor in the dataframe for this menu to be active. 

FIGURE 10.12 Error bar chart of the Viagra data (95% bootstrapped confidence intervals) 


10.6.5. Exploring the data 


In Chapter 4 we saw that it is always a good idea to look at a graph of your data. In this 
case we will produce a line graph with error bars. 



SELF-TEST 

Use ggplot2 to produce a line chart with error bars 
showing bootstrapped confidence intervals for the 
Viagra data. 


Figure 10.12 shows a line chart with error bars of the Viagra data. It’s clear from this 
chart that all of the error bars overlap, indicating that, at face value, there are no between- 
group differences (although this measure is only approximate). The line that joins the 
means seems to indicate a linear trend in that, as the dose of Viagra increases, so does the 
mean level of libido. 
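
To make the idea of a bootstrapped confidence interval concrete, here is a rough base-R sketch that computes a percentile bootstrap interval for each group mean. It is not the ggplot2 answer to the self-test and is not claimed to be the exact method behind Figure 10.12: the function boot_ci, the seed and the choice of 2000 resamples are all assumptions made for the illustration.

```r
libido <- c(3, 2, 1, 1, 4, 5, 2, 4, 2, 3, 7, 4, 5, 3, 6)
dose <- gl(3, 5, labels = c("Placebo", "Low Dose", "High Dose"))

set.seed(42)   # arbitrary seed so the resampling is repeatable

# Percentile bootstrap: resample a group with replacement many times,
# record the mean each time, and take the middle 95% of those means
boot_ci <- function(x, n_boot = 2000, level = .95) {
  boot_means <- replicate(n_boot, mean(sample(x, replace = TRUE)))
  tail_p <- (1 - level) / 2
  ci <- quantile(boot_means, probs = c(tail_p, 1 - tail_p), names = FALSE)
  c(mean = mean(x), lower = ci[1], upper = ci[2])
}

# One row per dose group: mean, lower bound, upper bound
ci_table <- t(sapply(split(libido, dose), boot_ci))
print(round(ci_table, 2))
```

With only five scores per group the intervals are wide, which is why overlapping error bars in the chart are a rough guide at best.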


[Figure 10.12: error bar chart; x-axis: Dose of Viagra] 


To get some descriptive statistics for each group we can use the by() function that we 
encountered in Chapter 5. Remember that this function takes the general form: 

by(variable, group, output) 

in which variable is the thing that you want to summarize (in this case libido), group is the 
variable that defines the groups by which you want to organize the output (in this case 
dose), and output is a function that tells R what output you would like to see (i.e., the 
mean). If we use the function stat.desc() from the package pastecs then R will output a host 
of useful descriptive statistics. Therefore, by combining by() and stat.desc(), we can get a 
table of descriptives for each group in a single line of code: 

by(viagraData$libido, viagraData$dose, stat.desc) 

Output 10.3 shows the resulting descriptive statistics (I have edited the output slightly to 
fit the page so you will see more decimal places and a few extra variables). Most of the 
variables are self-explanatory: we have the number of valid cases (nbr.val), minimum (min) 
and maximum (max) libido, the range, median, mean and variance (var), standard 
deviation (std.dev), standard error (SE.mean) and confidence interval (CI.mean.0.95). 

viagraData$dose: Placebo 
 nbr.val    min    max  range    sum  median    mean  SE.mean
   5.000   1.00   4.00   3.00  11.00    2.00  2.2000   0.5831
CI.mean.0.95        var    std.dev   coef.var
   1.6189318  1.7000000  1.3038405  0.5926548

viagraData$dose: Low Dose 
 nbr.val    min    max  range    sum  median    mean  SE.mean
   5.000   2.00   5.00   3.00  16.00    3.00  3.2000   0.5831
CI.mean.0.95        var    std.dev   coef.var
   1.6189318  1.7000000  1.3038405  0.4074502

viagraData$dose: High Dose 
 nbr.val    min    max  range    sum  median    mean  SE.mean
   5.000   3.00   7.00   4.00  25.00    5.00  5.0000   0.7071
CI.mean.0.95        var    std.dev   coef.var
   1.9632432  2.5000000  1.5811388  0.3162278

Output 10.3 
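
The SE.mean and CI.mean.0.95 values in this output can be reproduced with base R, which shows where they come from. This is a sketch under one assumption, namely that the confidence half-width is the standard error multiplied by the critical t-value for n - 1 degrees of freedom (which matches the values printed above); describe and desc are names invented for the example.

```r
libido <- c(3, 2, 1, 1, 4, 5, 2, 4, 2, 3, 7, 4, 5, 3, 6)
dose <- gl(3, 5, labels = c("Placebo", "Low Dose", "High Dose"))

# A small stand-in for stat.desc() covering only a few of its statistics
describe <- function(x) {
  n <- length(x)
  se <- sd(x) / sqrt(n)                 # standard error of the mean
  c(mean = mean(x), var = var(x), std.dev = sd(x), SE.mean = se,
    CI.mean.0.95 = qt(.975, df = n - 1) * se)   # half-width of the 95% CI
}

# Same layout as the by() call in the text
print(by(libido, dose, describe))

# The same numbers collected into one matrix (statistics in rows, groups in columns)
desc <- sapply(split(libido, dose), describe)

# Lower and upper bounds of the placebo interval: mean -/+ half-width
ci_placebo <- desc["mean", "Placebo"] + c(-1, 1) * desc["CI.mean.0.95", "Placebo"]
print(round(ci_placebo, 4))
```

Because CI.mean.0.95 is a half-width rather than an interval, the last two lines do the subtraction and addition needed to get the bounds.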

The first thing to notice from Output 10.3 is that the means and standard deviations 
correspond to those shown in Table 10.1. In addition, we are told the standard error. You 
should remember that the standard error is the standard deviation of the sampling 
distribution of these data (so for the placebo group, if you took lots of samples from the population 
from which these data come, the means of these samples would have a standard deviation 
of 0.5831). 

We are also given confidence intervals for the mean. By now, you should be familiar 
with what a confidence interval tells us, and that is that if we took 100 samples from 
the population from which the placebo group came and constructed confidence 
intervals for the mean, then 95 of these intervals would contain the true value of the mean. 
CI.mean.0.95 doesn’t give you the interval itself, but the value to add or subtract from 
the mean to create the interval. For example, in the placebo group the lower bound of 
the CI would be the mean minus CI.mean.0.95 (i.e., 2.2000 - 1.6189 = 0.5811) and the 
upper bound of the CI would be the mean plus CI.mean.0.95 (i.e., 2.2000 + 1.6189 = 
3.8189). In other words, the true value of the mean is likely to be between 0.5811 and 
3.8189. Although these diagnostics are not immediately important, we will refer back to 
them throughout the analysis. 

The final thing before we get to the ANOVA itself is to compute Levene’s test (see 
Chapter 5 and section 10.3.1). We encountered the leveneTest() function from the car 
package in Chapter 5, and we can again use it here. Just to remind you, it takes the general 
form: 

leveneTest(outcome variable, group, center = median/mean) 

So, if we want to do a Levene’s test to see whether the variance in libido (the outcome) 
varies across groups that received different doses of the drug (dose), we can execute: 

leveneTest(viagraData$libido, viagraData$dose, center = median) 

The output (Output 10.4) shows that Levene’s test is very non-significant, F(2, 12) = 
0.118, p = .89. This means that for these data the variances are very similar (hence the high 
probability value); in fact, if you look at Output 10.3 you’ll see that the variances of the 
placebo and low-dose groups are identical. Had this test been significant, we could instead 
conduct and report the results of Welch’s F or a robust version of ANOVA, which we’ll 
cover in the next section. 

Levene's Test for Homogeneity of Variance 
      Df F value Pr(>F)
group  2  0.1176   0.89
      12

Output 10.4 
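
It can be reassuring to see that there is nothing mysterious inside Levene's test. The sketch below reproduces the F-value in base R on the assumption that the median-centred test is simply a one-way ANOVA on each score's absolute deviation from its group median; abs_dev and levene_by_hand are names invented for the illustration.

```r
libido <- c(3, 2, 1, 1, 4, 5, 2, 4, 2, 3, 7, 4, 5, 3, 6)
dose <- gl(3, 5, labels = c("Placebo", "Low Dose", "High Dose"))

# Absolute deviation of each score from the median of its own group
abs_dev <- abs(libido - ave(libido, dose, FUN = median))

# One-way ANOVA on those deviations: do the groups differ in spread?
levene_by_hand <- anova(lm(abs_dev ~ dose))
print(levene_by_hand)

f_value <- levene_by_hand[["F value"]][1]
p_value <- levene_by_hand[["Pr(>F)"]][1]
```

Swapping median for mean in the ave() call gives the mean-centred version of the test.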


10.6.6. The main analysis 


10.6.6.1. When the test assumptions are met 

There are two functions that can be used for ANOVA: lm(), which we used in Chapter 7, 
and aov(). As I explained earlier in the chapter, ANOVA is just a special case of the general 
linear model; therefore, we can use the linear model function, lm(), to run the analysis. 
For the current example, we are predicting libido from group membership (i.e., dose of 
Viagra) so our model is: 

$$\text{libido}_i = \text{dose}_i + \text{error}_i$$

Therefore, we can create a model (which I’ve called viagraModel ) using lm() by executing: 

viagraModel<-lm(libido~dose, data = viagraData) 

where libido ~ dose simply creates the model ‘libido predicted from dose’. 

The other function we can use is aov(), which stands for analysis of variance. Actually, 
aov() and lm() are exactly the same as each other. However, aov() takes the output from 
lm() and returns it to us in a way that is more in keeping with a traditional ANOVA 
approach. It’s what is known as a ‘wrapper’: it is lm() but ‘wrapped’ up differently. I’m 
going to stick with the aov() function because it yields output that maps onto traditional 
ANOVA methods, but be clear that underneath we’re actually using lm() to do the hard 
work. 

The aov() function has the following general format: 

newModel<-aov(outcome ~ predictor(s), data = dataFrame, na.action = an action) 

in which: 

• newModel is an object created that contains information about the model. We can 
get summary statistics for this model by executing summary(newModel) for the main 
ANOVA summary and summary.lm(newModel) for specific parameters of the model. 

• outcome is the variable that you’re trying to predict, also known as the dependent 
variable. In this example it will be the variable libido. 

• predictor(s) lists the variable or variables from which you’re trying to predict the 
outcome variable, also known as the independent variable(s). In this example it will 
be the variable dose. In more complex designs we can specify several predictors or 
independent variables, but we’ll come to that in subsequent chapters. 

• dataFrame is the name of the dataframe from which your outcome and predictor 
variables come. 

• na.action is an optional command. If you have complete data (as we have here) you 
can ignore it, but if you have missing values (i.e., NAs in the dataframe) then it can be 
useful to use na.action = na.exclude, which will exclude all cases with missing values 
(see R’s Souls’ Tip 7.1). 

For the current example, then, we could execute the following command: 

viagraModel<-aov(libido ~ dose, data = viagraData) 

to generate the model (note that the command is basically identical to when we used lm() 
to run an ANOVA above). We now have an object called viagraModel that contains 
information about how well dose predicts libido. To see the summary statistics execute: 

summary(viagraModel) 

Executing this command generates Output 10.5. The output is divided into effects due 
to the model (the experimental effect) and residuals (this is the unsystematic variation in 
the data). The effect labelled dose is the overall experimental effect. In this row we are told 
the sums of squares for the model (SS_M = 20.13) and this value corresponds to the value 
calculated in section 10.2.6. The degrees of freedom are equal to 2 and the mean squares 
value for the model corresponds to that calculated in section 10.2.8 (10.067). The sum of 
squares and mean squares represent the experimental effect. The row labelled Residuals 
gives details of the unsystematic variation within the data (the variation due to natural 
individual differences in libido and different reactions to Viagra). The table tells us how 
much unsystematic variation exists (the residual sum of squares, SS_R) and this value (23.60) 
corresponds to the value calculated in section 10.2.7. The table then gives the average 
amount of unsystematic variation, the mean squares (MS_R), which corresponds to the value 
(1.967) calculated in section 10.2.8. The test of whether the group means are the same is 
represented by the F-ratio for the effect of dose. The value of this ratio is 5.12, which is 
the same as was calculated in section 10.2.9. Finally, R tells us whether this value is likely 
to have happened by chance. The final column labelled Pr(>F) indicates the likelihood of 
an F-ratio the size of the one obtained occurring if there was no effect in the population 
(see also R’s Souls’ Tip 10.1). In this case, there is a probability of .025 that an F-ratio of 
this size would occur if in reality there was no effect (that’s only a 2.5% chance!). We have 
seen in previous chapters that we use a cut-off point of .05 as a criterion for statistical 
significance. Hence, because the observed significance value is less than .05 we can say that 
there was a significant effect of Viagra. However, at this stage we still do not know exactly 
what the effect of Viagra was (we don’t know which groups differed). One thing that is 
interesting here is that we obtained a significant experimental effect, yet our error bar plot 
indicated that no significant difference would be found. This contradiction illustrates how 
the error bar chart can act only as a rough guide to the data. 

            Df Sum Sq Mean Sq F value  Pr(>F)
dose         2 20.133 10.0667  5.1186 0.02469 *
Residuals   12 23.600  1.9667

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Output 10.5 
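
Every number in this table can be rebuilt from the raw scores, which is a useful check that the output means what we think it means. The following base-R sketch computes the sums of squares, F-ratio and p-value by hand and compares them with aov(); the object names (ss_m, ss_r, f_ratio and so on) are invented for the illustration.

```r
libido <- c(3, 2, 1, 1, 4, 5, 2, 4, 2, 3, 7, 4, 5, 3, 6)
dose <- gl(3, 5, labels = c("Placebo", "Low Dose", "High Dose"))

grand_mean  <- mean(libido)
group_means <- tapply(libido, dose, mean)
group_n     <- tapply(libido, dose, length)

# Model sum of squares: group means versus the grand mean
ss_m <- sum(group_n * (group_means - grand_mean)^2)
# Residual sum of squares: each score versus its own group mean
ss_r <- sum((libido - ave(libido, dose))^2)

df_m <- nlevels(dose) - 1
df_r <- length(libido) - nlevels(dose)

f_ratio <- (ss_m / df_m) / (ss_r / df_r)
p_value <- pf(f_ratio, df_m, df_r, lower.tail = FALSE)
print(c(SS_M = ss_m, SS_R = ss_r, F = f_ratio, p = p_value))

# The same table as produced by aov()
aov_table <- summary(aov(libido ~ dose))[[1]]
```

The p-value is the area of the F-distribution (with 2 and 12 degrees of freedom) that lies above the observed F-ratio, which is exactly what the column heading Pr(>F) says.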





R’s Souls’ Tip 10.1 One- and two-tailed tests in ANOVA 


A question I get asked a lot by students is ‘is the significance of the ANOVA one- or two-tailed, and if it’s two-tailed 
can I divide by 2 to get the one-tailed value?’ The answer is that to do a one-tailed test you have to be making a 
directional hypothesis (i.e., the mean for cats is greater than for dogs). ANOVA is a non-specific test, so it just tells 
us generally whether there is a difference or not, and because there are several means you can’t possibly make 
a directional hypothesis. As such, it’s invalid to halve the significance value. 


The aov() function also automatically generates some plots that we can use to test the 
assumptions. We can see these graphs by executing: 

plot(viagraModel) 

The results are in Figure 10.13. You will actually see four graphs, but the first two are the 
most important for ANOVA. The first graph (on the left of the figure) can be used for 
testing homogeneity of variance. We encountered this kind of plot in Chapter 7: essentially, 
if it has a funnel shape then we’re in trouble. The plot we have shows points that are 
equally spread for the three groups, which implies that variances are similar across groups 
(which was also the conclusion reached by Levene’s test). The second plot (on the right) 
is a Q-Q plot (see Chapter 5), which tells us something about the normality of residuals in 
the model. We want our residuals to be normally distributed, which means that the dots 
on the graph should cling lovingly to the diagonal line. Ours look like they have had a bit 
of an argument with the diagonal line, which suggests that we may not be able to assume 
normality of errors and should perhaps use a robust version of ANOVA instead (which will 
be explained sooner than you might like). 


10.6.6.2. When variances are not equal across groups 


If Levene’s test is significant then it is reasonable to assume that population variances are 
different across groups.⁸ 

⁸ It’s worth reminding you that any significance test depends on sample size: in small samples there won’t be 
power to detect differences across groups, and in large samples even small differences in variances might be 
deemed significant. As such, don’t place too much weight on Levene’s test if it’s non-significant in a small sample, 
or significant in a large sample. 

[Figure 10.13 panels: Residuals vs Fitted (left); Normal Q-Q, x-axis: Theoretical Quantiles (right); model: aov(libido ~ dose)] 

In this case, if our distributions are as they should be, we can apply 
Welch’s F to the data, which makes adjustments for differences in group variances. This 
test is produced by the oneway.test() function, which is built into R. The format of this test 
is the same as aov(): 

oneway.test(outcome ~ predictor, data = dataframe) 

Therefore, we can get the output for Welch’s F for the current data by executing: 

oneway.test(libido ~ dose, data = viagraData) 

Output 10.6 shows Welch’s F-ratio. For our data we didn’t need this test because our 
Levene’s test was not significant, indicating that our population variances were similar. 
However, when homogeneity of variance has been violated you should look at this F-ratio 
instead of the ones in the previous section. If you’re interested in how these values are 
calculated then look at Oliver Twisted, but to be honest it’s not that much fun and you’d 
probably enjoy yourself more if you spent the time sticking jellyfish down your pants. 
You’re much better off just trusting that R has done what it was supposed to do. Note that 
the error degrees of freedom have been adjusted - you should remember this when you 
report the values. For these data, Welch’s F(2, 7.94) = 4.32, p = .054, which is just about 
non-significant. If we were using this test it would imply that the mean libido did not differ 
significantly across different doses of Viagra. 

One-way analysis of means (not assuming equal variances) 
data: libido and dose 

F = 4.3205, num df = 2.000, denom df = 7.943, p-value = 0.05374 

Output 10.6 
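
For the curious, the adjustment that Welch's F makes can be sketched in base R. This is an illustration only, assuming the standard Welch formula in which each group mean is weighted by its sample size divided by its variance; the object names (w, lambda, f_welch and so on) are invented for the example, and the result is compared with oneway.test().

```r
libido <- c(3, 2, 1, 1, 4, 5, 2, 4, 2, 3, 7, 4, 5, 3, 6)
dose <- gl(3, 5, labels = c("Placebo", "Low Dose", "High Dose"))
viagraData <- data.frame(dose, libido)

welch <- oneway.test(libido ~ dose, data = viagraData)

# The same test by hand
n <- tapply(libido, dose, length)
m <- tapply(libido, dose, mean)
v <- tapply(libido, dose, var)
k <- nlevels(dose)

w <- n / v                               # groups with small variance get more weight
weighted_mean <- sum(w * m) / sum(w)
lambda <- sum((1 - w / sum(w))^2 / (n - 1))

f_welch <- (sum(w * (m - weighted_mean)^2) / (k - 1)) /
  (1 + 2 * (k - 2) * lambda / (k^2 - 1))
df_denom <- (k^2 - 1) / (3 * lambda)     # adjusted error degrees of freedom
p_welch <- pf(f_welch, k - 1, df_denom, lower.tail = FALSE)

print(c(F = f_welch, df1 = k - 1, df2 = df_denom, p = p_welch))
```

The non-integer error degrees of freedom in Output 10.6 come from the df_denom line: the more the group variances differ, the further this value falls below the usual 12.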


10.6.6.3. Robust ANOVA - it’s not for the weak of heart 


FIGURE 10.13 Plots of an ANOVA model 

Wilcox (2005) describes a set of robust procedures for conducting one-way ANOVA. Load 
these functions using the instructions in section 5.8.4. Having done this, we now have 
access to Wilcox’s functions. The first issue with using these functions is that most of them 
require the data to be in wide format rather than the long format that we have been using 
so far in this chapter. We can convert the data to wide format using the unstack() command 
(see section 3.9.4), which has the general form: 

newDataFrame<-unstack(oldDataFrame, scores ~ columns) 

In this case our scores are stored in the variable libido and we want to make different 
columns for each group, so our columns variable is dose. Therefore, we can reformat the 
data by executing: 

viagraWide<-unstack(viagraData, libido ~ dose) 

This command creates a new dataframe called viagraWide, which is our Viagra data but in 
wide format, so each column represents a different group: 


  Placebo Low.Dose High.Dose
1       3        5         7
2       2        2         4
3       1        4         5
4       1        2         3
5       4        3         6
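
The conversion is easy to check, and to undo, in base R. The sketch below rebuilds the long dataframe from the scores given earlier, converts it to wide format, and then uses stack() to go back again; viagraLong is a name invented for the illustration.

```r
libido <- c(3, 2, 1, 1, 4, 5, 2, 4, 2, 3, 7, 4, 5, 3, 6)
dose <- gl(3, 5, labels = c("Placebo", "Low Dose", "High Dose"))
viagraData <- data.frame(dose, libido)

# Long to wide: one column per dose group
viagraWide <- unstack(viagraData, libido ~ dose)
print(viagraWide)

# Wide back to long: stack() returns the scores in 'values'
# and the group labels in 'ind'
viagraLong <- stack(viagraWide)
print(head(viagraLong))
```

Note that the spaces in the group labels become full stops in the wide column names (Low.Dose, High.Dose), because dataframe column names are made syntactically valid by default.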


This is the format that Wilcox’s functions expect. The first robust function, t1way(), is 
based on a trimmed mean. It takes the general form: 


t1way(dataFrame, tr = .2, grp = c(x, y, ..., z)) 


in which, 


• dataFrame is the name of the dataframe to be analysed. 

• tr is the proportion of trimming to be done. The default is .2 or 20%, and you need 
to use this option only if you want to specify an amount other than 20%. 

• grp can be used to specify particular groups by referring to their column in the
dataframe; for example, if we wanted to analyse only the placebo and high-dose
groups, we could do this using grp = c(1,3).

As such, for an ANOVA of the Viagra data based on 20% trimmed means we simply execute:

t1way(viagraWide)

If we wanted to trim only 10% of the data then we could execute:

t1way(viagraWide, tr = .1)

If you execute this command you will see Output 10.7, which shows that, based on this 
robust test, there is not a significant difference in libido scores across the three dose groups, 
F_t(2, 7.94) = 4.32, p = .054.
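To see what the tr option does to a group of five scores, here is a small sketch using base R’s mean() with its trim argument. It illustrates trimming only and is not Wilcox’s t1way() code; the scores are the placebo group from the Viagra data.

```r
placebo <- c(3, 2, 1, 1, 4)

mean(placebo)              # ordinary mean of all five scores
mean(placebo, trim = 0.2)  # drops the lowest and highest score, then averages the rest
mean(placebo, trim = 0.1)  # 10% of 5 scores rounds down to 0, so nothing is dropped
```

With only five scores per group, asking for 10% trimming therefore removes no scores at all, whereas 20% trimming removes one score from each tail.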

We can also compare medians rather than means using med1way(), which takes the
general form:

med1way(dataFrame, grp = c(x, y, ..., z))

in which dataFrame is the name of the dataframe to be analysed and grp is used in the
same way as in t1way(). As such, for an ANOVA of the Viagra data based on medians we
simply execute:

med1way(viagraWide)

If you execute this command you will see Output 10.7, which shows that, based on this
robust test, there is not a significant difference in median libido scores across the three dose
groups, F_m = 4.78, p = .07.

A final method is to add a bootstrap to the trimmed mean method using t1waybt(). This
function has the general form:

t1waybt(dataFrame, tr = .2, alpha = .05, grp = c(x, y, z), nboot = 599)

which is the same as t1way() except that we have two additional options. The first is alpha,
which sets the Type I error rate. The default is .05, which is fairly standard, so unless you
want something different you don’t need to use this option. The second is nboot, which
specifies the number of bootstrap samples to be used. The default is 599, which, if anything,
you might want to increase (but it’s probably not necessary to use more than 2000).
As such, for an ANOVA of the Viagra data based on 20% trimmed means, with 599
bootstrap samples, we execute:

t1waybt(viagraWide)

However, if we wanted, for example, a 5% trimmed mean with 2000 bootstrap samples
we would execute:

t1waybt(viagraWide, tr = .05, nboot = 2000)

If you execute the t1waybt() function with the default settings you will see Output 10.7,
which shows that, based on this robust test, there is not a significant difference in trimmed
mean libido scores across the three dose groups, F = 3, p = .089. In short, all three robust
methods suggest that dose does not have a significant impact on libido.
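The idea of taking bootstrap samples can be sketched in base R. This is only an illustration of resampling a trimmed mean, not the algorithm inside t1waybt(); the seed and the object names are arbitrary choices made for the example.

```r
set.seed(42)               # arbitrary seed so the sketch is reproducible
placebo <- c(3, 2, 1, 1, 4)
nboot <- 599               # the same number of samples as the t1waybt() default

# Resample the group with replacement and recompute the 20% trimmed mean each time
bootMeans <- replicate(nboot, mean(sample(placebo, replace = TRUE), trim = 0.2))

length(bootMeans)                      # one trimmed mean per bootstrap sample
quantile(bootMeans, c(.025, .975))     # spread of the bootstrap distribution
```

Increasing nboot gives a smoother bootstrap distribution at the cost of more computing time.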


t1way() output

$TEST
[1] 4.320451

$nu1
[1] 2

$nu2
[1] 7.943375

$siglevel
[1] 0.05373847


med1way() output

$TEST
[1] 4.782879

$crit.val
[1] 5.472958

$p.value
[1] 0.07


t1waybt() output

$test
[1] 3

$p.value
[1] 0.0886076




Output 10.7 


10.6.7. Planned contrasts using R


To do planned comparisons in R we have to set the contrast attribute of our grouping
variable using the contrasts() function and then re-create our ANOVA model using aov(). By
default, dummy coding is used, which was explained in section 10.2.3. We can see this if
we summarize our existing viagraModel using the summary.lm() function rather than
summary(). By using summary.lm() we are asking for a summary of the parameters of the linear
model (rather than the overall ANOVA). Assuming you still have the viagraModel object (if
not, re-create it) execute this command:

summary.lm(viagraModel)

You should get Output 10.8. Note that this is basically the same as Output 10.1, which we
used to explain how dummy coding works. So, the ‘low dose’ effect is the effect of low
dose compared to placebo and is non-significant (t = 1.13, p = .282), whereas the effect of
high dose compared to the placebo group is significant (t = 3.16, p = .008).


Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)     2.2000     0.6272   3.508  0.00432 **
doseLow Dose    1.0000     0.8869   1.127  0.28158
doseHigh Dose   2.8000     0.8869   3.157  0.00827 **

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.402 on 12 degrees of freedom
Multiple R-squared: 0.4604,     Adjusted R-squared: 0.3704
F-statistic: 5.119 on 2 and 12 DF,  p-value: 0.02469


Output 10.8 
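A short sketch shows where the dummy-coded estimates come from. It assumes the Viagra scores used throughout this chapter; the dataframe is rebuilt here with gl() purely so that the example is self-contained, and groupMeans is a name invented for the example.

```r
# Rebuild the Viagra data (construction via gl() is an assumption for illustration)
libido <- c(3, 2, 1, 1, 4,  5, 2, 4, 2, 3,  7, 4, 5, 3, 6)
dose <- gl(3, 5, labels = c("Placebo", "Low Dose", "High Dose"))
viagraData <- data.frame(dose, libido)

viagraModel <- aov(libido ~ dose, data = viagraData)
groupMeans <- tapply(viagraData$libido, viagraData$dose, mean)

contrasts(viagraData$dose)           # the default dummy (treatment) coding
coef(viagraModel)                    # intercept and the two dummy-variable estimates
groupMeans - groupMeans["Placebo"]   # each group mean minus the placebo mean
```

The intercept is the mean of the baseline (placebo) group, and each remaining estimate is the difference between a group mean and that baseline.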

This is all very well, but what if we do not want dummy coding, but want to use our own 
planned comparisons, use another built-in comparison, or do a trend analysis? In general, 
we do this by resetting the contrast attribute associated with our predictor variable (in this 
case dose), using the following general command: 

contrasts(predictor variable)<-contrast instructions 

The contrast instructions can be either a set of weights for the contrasts that you want
to do, or one of the built-in contrasts listed in Table 10.6. These built-in functions
can be:

contr.helmert(n)
contr.poly(n)
contr.treatment(n, base = x)
contr.SAS(n)

In all cases, n is the number of groups in the predictor variable (for dose, this value will be
3). The contr.treatment() function has an additional option, base, which allows you to specify
the group that you want to use as a baseline. Therefore, if you want dummy coding (i.e., the
first category is the baseline) you would use contr.treatment(n, base = 1). The function
contr.SAS() is the same as using contr.treatment() when you select the last category as the baseline.
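It can help to print the weights that each built-in function generates. The sketch below simply calls the four functions for a three-group predictor; all of them are part of base R’s stats package.

```r
n <- 3  # number of groups in the predictor (three dose groups)

contr.helmert(n)               # Helmert contrasts
contr.poly(n)                  # polynomial (trend) contrasts: linear and quadratic
contr.treatment(n, base = 1)   # dummy coding with the first group as baseline
contr.SAS(n)                   # dummy coding with the last group as baseline
```

Each column of a printed matrix is one contrast, and each row gives the weight assigned to one group.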

To put this all together, if we wanted to set the contrast property of dose to be a Helmert 
contrast then we would execute: 

contrasts(viagraData$dose)<-contr.helmert(3) 

Note that the 3 is the number of groups present in the dose variable. We’re not going to
use this contrast, though; we’re going to specify our own.


10.6.7.1. Your own contrasts


To conduct the planned comparisons described in section 10.4, we follow the general
procedure just described. We need to tell R what weights to assign to each group. The first step
is to decide which comparisons you want to do and then what weights must be assigned to
each group for each of the contrasts. We have already gone through this process in section
10.4.2, so we know that the weights for contrast 1 were -2 (placebo group), +1 (low-dose
group) and +1 (high-dose group). If we wanted to express these weights we could create a
new object called contrast1 and use the function c() to list the weights:

contrast1<-c(-2,1,1)

This variable indicates that the first group has a weight of -2, and the second and third groups
a weight of 1. The order of the numbers is important because it corresponds to the order of
groups in your predictor variable. In the Viagra data, remember that the order of groups was:
placebo (because it was coded with the lowest value, 1), low dose (because it was coded using
the next lowest number, 2), and high dose (because it was coded with the highest number, 3).
As such, contrast1 has the weights for placebo, low dose and high dose, in that order.

We can do the same for the second contrast. We know from section 10.4.2 that the
weights for contrast 2 were: 0 (placebo group), -1 (low-dose group) and +1 (high-dose
group). Remembering that the first weight we enter will be for the placebo group, we must
enter the value 0 as the first weight, then -1 for the low-dose group and finally 1 for the
high-dose group. It is imperative that you remember to input zero weights for any groups
that are not in the contrast. We can specify this contrast by executing:

contrast2<-c(0,-1,1)

which creates a variable called contrast2 that contains the weights for the second contrast.

Having created these variables we now need to bind them together using cbind(), which
literally binds two columns of data together, and set them as the contrast attached to our
predictor variable, dose. We can do this by executing:

contrasts(viagraData$dose)<-cbind(contrast1, contrast2)

This command sets the contrast property of dose to contain the weights for the two
contrasts that we want to conduct.⁹ If you have a look at the dose variable by executing:

viagraData$dose 

You’ll see this: 

 [1] Placebo   Placebo   Placebo   Placebo   Placebo   Low Dose  Low Dose
 [8] Low Dose  Low Dose  Low Dose  High Dose High Dose High Dose
[14] High Dose High Dose
attr(,"contrasts")
          contrast1 contrast2
Placebo          -2         0
Low Dose          1        -1
High Dose         1         1
Levels: Placebo Low Dose High Dose

Note that the variable now has a contrast attribute that contains the weights that we just
specified. This is very useful to look at to check that you have entered the weights
correctly. Remember that when we do planned comparisons we arrange the weights such that
we compare any group with a positive weight against any group with a negative weight.
Therefore, the table of weights shows that contrast 1 compares the placebo group against
the two experimental groups, and contrast 2 compares the low-dose group to the
high-dose group. These are the contrasts we wanted. Happy days.

Once we have set the contrast attribute we create a new model using aov(), in exactly the 
same way as we did before, by executing: 

viagraPlanned<-aov(libido ~ dose, data = viagraData) 

If you use the summary() command you’ll see that the model is the same as the viagraModel 
that we created earlier. However, to access the contrasts we need the model parameters, 
which are obtained by executing: 

summary.lm(viagraPlanned)
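The whole procedure can be run as one self-contained sketch. The dataframe is rebuilt with gl() so that the example stands alone (an assumption about how the data were entered); the check that the weights are orthogonal is an extra step added for illustration.

```r
# Rebuild the Viagra data (construction via gl() is an assumption for illustration)
libido <- c(3, 2, 1, 1, 4,  5, 2, 4, 2, 3,  7, 4, 5, 3, 6)
dose <- gl(3, 5, labels = c("Placebo", "Low Dose", "High Dose"))
viagraData <- data.frame(dose, libido)

contrast1 <- c(-2, 1, 1)   # placebo vs. both Viagra groups
contrast2 <- c(0, -1, 1)   # low dose vs. high dose
sum(contrast1 * contrast2) # products of the weights sum to zero: the contrasts are orthogonal

contrasts(viagraData$dose) <- cbind(contrast1, contrast2)
viagraPlanned <- aov(libido ~ dose, data = viagraData)

coef(viagraPlanned)        # intercept (grand mean) plus one estimate per contrast
summary.lm(viagraPlanned)  # adds standard errors, t-values and two-tailed p-values
```

Because the weights are set before aov() is called, the model picks them up automatically; nothing in the aov() command itself changes.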


9 I think that creating the contrast1 and contrast2 variables makes what we’re doing a bit easier to understand, but
in reality I would normally create these contrasts by executing this single command:

contrasts(viagraData$dose)<-cbind(c(-2,1,1), c(0,-1,1))

The resulting Output 10.9 is the same as Output 10.2, which we looked at earlier when
explaining how these contrasts work. Re-read that earlier material to see from where the 
values of the parameters come. The table gives the standard error of each contrast and a 
t-statistic. The significance value of the contrast is given in the final column, and this value 
is two-tailed. Using the first contrast as an example, if we had used this contrast to test the 
general hypothesis that the experimental groups would differ from the placebo group, then 
we should use this two-tailed value. However, in reality we tested the hypothesis that the 
experimental groups would increase libido above the levels seen in the placebo group: this 
hypothesis is one-tailed. Provided the means for the groups bear out the hypothesis we can 
divide the significance values by 2 to obtain the one-tailed probability (i.e., .0293/2 = .0147). 
Hence, for contrast 1, we can say that taking Viagra significantly increased libido compared 
to the control group (p = .0147). For contrast 2 we also had a one-tailed hypothesis (that a 
high dose of Viagra would increase libido significantly more than a low dose) and the means 
bear this hypothesis out. The significance of contrast 2 tells us that a high dose of Viagra 
increased libido significantly more than a low dose (p(one-tailed) = .0652/2 = .0326). Notice 
that had we not had a specific hypothesis regarding which group would have the highest 
mean, then we would have had to conclude that the dose of Viagra had no significant effect 
on libido. For this reason it can be important as scientists that we generate hypotheses before 
collecting any data, because this method of scientific discovery is more powerful. 
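The halving of the two-tailed probability can be reproduced from the t-statistic itself. This sketch uses the rounded t-value and the residual degrees of freedom reported in Output 10.9; tValue, dfResid and the two p objects are names invented for the example.

```r
tValue  <- 2.474   # t for contrast 1, as reported in Output 10.9
dfResid <- 12      # residual degrees of freedom of the model

pTwoTailed <- 2 * pt(abs(tValue), dfResid, lower.tail = FALSE)  # both tails
pOneTailed <- pt(tValue, dfResid, lower.tail = FALSE)           # upper tail only

round(c(twoTailed = pTwoTailed, oneTailed = pOneTailed), 4)
```

The one-tailed value is appropriate only when the direction of the effect was predicted in advance and the means fall in that direction.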

In summary, the planned contrasts revealed that taking Viagra significantly increased
libido compared to a control group, t(12) = 2.47, p < .05, and taking a high dose
significantly increased libido compared to a low dose, t(12) = 2.03, p < .05 (one-tailed).

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.4667     0.3621   9.574 5.72e-07 ***
dose1         0.6333     0.2560   2.474   0.0293 *
dose2         0.9000     0.4435   2.029   0.0652 .

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.402 on 12 degrees of freedom
Multiple R-squared: 0.4604,     Adjusted R-squared: 0.3704
F-statistic: 5.119 on 2 and 12 DF,  p-value: 0.02469

Output 10.9 

10.6.7.2. Trend analysis

To conduct a trend analysis we can use contr.poly(). It is important that we have coded
the predictor variable groups in a meaningful order. We expect libido to be smallest in the
placebo group, to increase in the low-dose group and then to increase again in the
high-dose group. To detect a meaningful trend, we need to have coded these groups in
ascending order. We have done this by coding the placebo group with the lowest value 1, the
low-dose group with the middle value 2 and the high-dose group with the highest coding
value of 3. If we coded the groups differently, this would influence both whether a trend is
detected and, if a trend is detected, whether it is statistically meaningful.

To obtain a trend analysis we follow the general procedure of setting the contrast
attribute of the predictor variable, which in this case we can do by executing:

contrasts(viagraData$dose)<-contr.poly(3) 

The ‘3’ just tells contr.poly() how many groups there are in the predictor variable. Having 
set the contrast we again create a new model using aov(), by executing: 

viagraTrend<-aov(libido ~ dose, data = viagraData) 
To access the contrasts we need the model parameters, which are obtained by executing:

summary.lm(viagraTrend)

The resulting Output 10.10 breaks down the experimental effect to see whether it can
be explained by either a linear (dose.L) or a quadratic (dose.Q) relationship in the data.
First, let’s look at the linear component. This comparison tests whether the means increase
across groups in a linear way. The most important things to note are the value of t and
the corresponding significance value. For the linear trend t = 3.16 and this value is
significant at p = .008. Therefore, we can say that as the dose of Viagra increased from nothing
to a low dose to a high dose, libido increased proportionately.

Moving on to the quadratic trend, this comparison is testing whether the pattern of
means is curvilinear (i.e., is represented by a curve that has one bend). The error bar graph
of the data suggests that the means cannot be represented by a curve and the results for the
quadratic trend bear this out: t = 0.52 with p = .612, which is not
very significant at all.
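The trend estimates are simply the group means weighted by the polynomial contrast codes, which the following sketch makes explicit. The dataframe is rebuilt with gl() so that the example is self-contained (an assumption about how the data were entered), and w and groupMeans are names invented for the example.

```r
# Rebuild the Viagra data (construction via gl() is an assumption for illustration)
libido <- c(3, 2, 1, 1, 4,  5, 2, 4, 2, 3,  7, 4, 5, 3, 6)
dose <- gl(3, 5, labels = c("Placebo", "Low Dose", "High Dose"))
viagraData <- data.frame(dose, libido)

contrasts(viagraData$dose) <- contr.poly(3)
viagraTrend <- aov(libido ~ dose, data = viagraData)

w <- contr.poly(3)   # column .L holds the linear weights, .Q the quadratic weights
groupMeans <- tapply(viagraData$libido, viagraData$dose, mean)

coef(viagraTrend)          # intercept, dose.L and dose.Q
crossprod(w, groupMeans)   # weights multiplied by group means and summed, per trend
```

The linear weights run from negative to positive across the ordered groups, which is why the order in which the groups are coded matters.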


Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.4667     0.3621   9.574 5.72e-07 ***
dose.L        1.9799     0.6272   3.157  0.00827 **
dose.Q        0.3266     0.6272   0.521  0.61201

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.402 on 12 degrees of freedom
Multiple R-squared: 0.4604,     Adjusted R-squared: 0.3704
F-statistic: 5.119 on 2 and 12 DF,  p-value: 0.02469

Output 10.10 


10.6.8. Post hoc tests using R


How you conduct post hoc tests in R depends on which test you’d like to do. Bonferroni
and related methods (such as Holm and Benjamini-Hochberg) are done using the
pairwise.t.test() function, which is part of the R base system. However, Tukey and Dunnett’s
test (and some others that we’re not going to look at) can be done using the glht() function
in the multcomp package. Finally, Wilcox (2005) has some robust methods implemented
in his functions lincon() and mcppb20(). This section is divided according to these different
methods.


10.6.8.1. Bonferroni and related methods


Bonferroni and related methods (e.g., Holm, Benjamini-Hochberg, Hommel, Hochberg)
can be implemented using the pairwise.t.test() function that is built into R. This function
takes the general form:

pairwise.t.test(outcome, predictor, paired = FALSE, p.adjust.method = "method")

in which:

• outcome is the name of your outcome variable (in this case it will be libido,
i.e., viagraData$libido).

• predictor is the name of your grouping variable (in this case it will be dose,
i.e., viagraData$dose).

• paired is a logical statement that by default is FALSE but can be set to TRUE (the
capital letters matter). This specifies whether you want paired t-tests or not. For these
data we have independent groups so we do not want paired t-tests and the default of
FALSE is fine, but we’ll revisit this option in Chapter 13.

• p.adjust.method is a string that specifies which correction you would like to apply to
your p-values. You can replace “method” in the command above with “bonferroni”,
“holm”, “hochberg”, “hommel”, “BH” (which produces the Benjamini-Hochberg
method), “BY” (which produces the more recent Benjamini-Yekutieli method), “fdr”
(the general false discovery rate method), and “none” (you don’t correct the p-value
at all, you just do lots of t-tests - not advisable).

As such, we can obtain Bonferroni and Benjamini-Hochberg post hoc tests for the current
data by executing these two commands:

pairwise.t.test(viagraData$libido, viagraData$dose, p.adjust.method = "bonferroni")

pairwise.t.test(viagraData$libido, viagraData$dose, p.adjust.method = "BH")

Both commands specify libido as the outcome variable, and dose as the grouping
variable, but they differ in the method that is set for correcting the p-values. The results can
be seen in Output 10.11. Both methods produce a grid of p-values for all combinations of
the groups. First of all, let’s look at the Bonferroni corrected values: the placebo group is
compared to the low-dose group and reveals a non-significant difference (.845 is greater
than .05), but when compared to the high-dose group there is a significant difference (.025
is less than .05).



SELF-TEST

Our planned comparison showed that any dose of
Viagra produced a significant increase in libido, yet
the post hoc tests indicate that a low dose does not.
Why is there this contradiction?


In section 10.4.2, I explained that the first planned comparison would compare the 
experimental groups to the placebo group. Specifically, it would compare the average of 
the two group means of the experimental groups ((3.2 + 5.0)/2 = 4.1) to the mean of 
the placebo group (2.2). So, it was assessing whether the difference between these values 
(4.1 - 2.2 = 1.9) was significant. In the post hoc tests, when the low dose is compared
to the placebo, the contrast is testing whether the difference between the means of these 
two groups is significant. The difference in this case is only 1, compared to a difference 
of 1.9 for the planned comparison. This explanation illustrates how it is possible to have 
apparently contradictory results from planned contrasts and post hoc comparisons. More 
important, it illustrates how careful we must be in interpreting planned contrasts. 

The final comparison is the low-dose group compared to the high-dose group, which
is not significant (because 0.196 is greater than .05). This result contradicts the planned
comparisons (remember that contrast 2 compared these groups and found a significant
difference).



SELF-TEST

Why does the post hoc test show a non-significant
difference between high and low dose, when
the planned comparison showed a significant
difference?


This contradiction occurs for two possible reasons. First, post hoc tests by their nature
are two-tailed (you use them when you have made no specific hypotheses and you cannot
predict the direction of hypotheses that don’t exist!) and contrast 2 was significant only
when considered as a one-tailed hypothesis. Second, even at the two-tailed level the
planned comparison was closer to significance than the post hoc test, and this fact illustrates
that post hoc procedures are more conservative (i.e., have less power to detect true effects)
than planned comparisons.

Looking now at the BH corrected tests, we find the same pattern of results as for
Bonferroni: placebo is significantly different from a high dose (because .025 is less than
.05), but not a low dose (.282 is greater than .05) and low and high doses did not
significantly differ (.098 is greater than .05).


Bonferroni

Pairwise comparisons using t tests with pooled SD

data:  viagraData$libido and viagraData$dose

          Placebo Low Dose
Low Dose  0.845   -
High Dose 0.025   0.196

P value adjustment method: bonferroni


BH

Pairwise comparisons using t tests with pooled SD

data:  viagraData$libido and viagraData$dose

          Placebo Low Dose
Low Dose  0.282   -
High Dose 0.025   0.098

P value adjustment method: BH


Output 10.11 
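The corrected values in Output 10.11 can be traced back to the uncorrected p-values using base R’s p.adjust() function. In this sketch the uncorrected values are copied from Outputs 10.8 and 10.9 (which report the same pooled-SD comparisons), and pRaw is a name invented for the example.

```r
# Uncorrected two-tailed p-values for the three pairwise comparisons
pRaw <- c(lowVsPlacebo = 0.28158, highVsPlacebo = 0.00827, highVsLow = 0.0652)

# Bonferroni: multiply each p-value by the number of comparisons (capped at 1)
round(p.adjust(pRaw, method = "bonferroni"), 3)

# Benjamini-Hochberg: the multiplier depends on the rank of each p-value
round(p.adjust(pRaw, method = "BH"), 3)
```

The smallest p-value is treated identically by both methods, which is why the placebo versus high-dose comparison has the same corrected value in both halves of the output.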


10.6.8.2. Tukey and Dunnett

Tukey and Dunnett can be implemented using the glht() function that is part of the
multcomp package (so remember to install and load it). This function takes the general form:

newModel<-glht(aov.Model, linfct = mcp(predictor = "method"), base = x)

in which:

• newModel is an object containing the information from the post hoc tests. To see
this information we can use summary(newModel) for the basic post hoc tests and
confint(newModel) to see the confidence intervals.

• aov.Model is the name of a model that has already been created with the aov()
function (in this case it will be viagraModel).

• predictor is the name of your grouping variable (in this case it will be dose,
i.e., viagraData$dose).

• linfct = mcp(predictor = “method”) specifies which correction you would like to apply
to your p-values. You can replace “method” in the command above with “Dunnett”,
“Tukey”, “Sequen”, “AVE”, “Changepoint”, “Williams”, “Marcus”, “McDermott”,
“UmbrellaWilliams”, and “GrandMean”.

• base is used only when “Dunnett” is specified. This option allows you to specify
the baseline group using a group number. In this case if we wanted the placebo as
the baseline we would use base = 1, but if we wanted the high-dose group we could
specify base = 3.


For the Viagra data, we can obtain Tukey post hoc tests by executing: 

postHocs<-glht(viagraModel, linfct = mcp(dose = "Tukey")) 

summary(postHocs) 

confint(postHocs) 

The first command creates an object (which I’ve called postHocs) that is based on the
viagraModel that we created in section 10.6.6.1. The linfct command is set to perform Tukey
tests on the variable dose (the reason why we can type ‘dose’ rather than ‘viagraData$dose’
is because the function will look for ‘dose’ within viagraModel, which has been specified
within the function). To access the information within postHocs we execute summary() to
get the post hoc tests (Output 10.12) and confint() to get the corresponding confidence
intervals (Output 10.13).

Output 10.12 shows the three comparisons (low dose vs. placebo, high dose vs.
placebo, high dose vs. low dose), the estimate (which is the difference between the group
means), the standard error associated with the difference between means, the t-value
(which is simply the difference between means divided by the standard error, so for the
first contrast it is 1/0.8869 = 1.127), and its associated p-value. As with the tests in the
previous section, this output confirms significant differences between the high dose and
placebo groups, t = 3.16, p < .05, but not between the low-dose group and the placebo,
t = 1.13, p = .52, and high dose, t = 2.03, p = .15, groups. The confidence intervals
(Output 10.13) also confirm this because they do not cross zero for the comparison of
the high dose and placebo group, which means that the true difference between group
means is likely not to be zero (i.e., no difference); conversely, for the other contrasts the
confidence intervals cross zero, implying that the true difference between means could
be zero.
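If the multcomp package is not available, the estimates and t-values described above can be checked with base R. The sketch swaps in TukeyHSD(), which is a different function from glht(); for a balanced design such as this one it should produce comparable Tukey-adjusted results, but treat it as a cross-check under that assumption rather than a replacement. The dataframe is rebuilt with gl() so that the example is self-contained.

```r
# Rebuild the Viagra data (construction via gl() is an assumption for illustration)
libido <- c(3, 2, 1, 1, 4,  5, 2, 4, 2, 3,  7, 4, 5, 3, 6)
dose <- gl(3, 5, labels = c("Placebo", "Low Dose", "High Dose"))
viagraData <- data.frame(dose, libido)
viagraModel <- aov(libido ~ dose, data = viagraData)

tukey <- TukeyHSD(viagraModel, "dose")
tukey$dose                            # differences, confidence limits, adjusted p-values

# Standard error of a difference between two group means (5 scores per group)
mse <- sum(residuals(viagraModel)^2) / df.residual(viagraModel)
se  <- sqrt(mse * (1/5 + 1/5))
tukey$dose[, "diff"] / se             # t-value: difference divided by its standard error
```

The differences and t-values computed this way match the Estimate and t value columns described in the text.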


Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts

Fit: aov(formula = libido ~ dose, data = viagraData)

Linear Hypotheses:
                          Estimate Std. Error t value Pr(>|t|)
Low Dose - Placebo == 0     1.0000     0.8869   1.127   0.5162
High Dose - Placebo == 0    2.8000     0.8869   3.157   0.0208 *
High Dose - Low Dose == 0   1.8000     0.8869   2.029   0.1474

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)


Output 10.12 
Simultaneous Confidence Intervals

Multiple Comparisons of Means: Tukey Contrasts

Fit: aov(formula = libido ~ dose, data = viagraData)

Quantile = 2.6671
95% family-wise confidence level

Linear Hypotheses:
                          Estimate     lwr     upr
Low Dose - Placebo == 0     1.0000 -1.3656  3.3656
High Dose - Placebo == 0    2.8000  0.4344  5.1656
High Dose - Low Dose == 0   1.8000 -0.5656  4.1656


Output 10.13 

We can obtain Dunnett post hoc tests for the Viagra data by executing: 

postHocs<-glht(viagraModel, linfct = mcp(dose = "Dunnett"), base = 1) 

summary(postHocs) 

confint(postHocs) 

The first command is the same as before, except that we have replaced “Tukey” with “Dunnett”.
We have also added the base command (because we’re using Dunnett) to specify which group
to use as the control group. We have used base = 1, which means ‘use the first group’, which
in this case is the placebo group. To access the information we again execute summary() and
confint(). The results are in Output 10.14. I won’t labour the point because the conclusions are
the same as for Tukey; all I will say is that you should note that Dunnett’s test compares groups
to a baseline so we end up with two tests rather than three. In this case we asked every group to
be compared to the placebo group, so there is no comparison of the high and low-dose groups.


Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Dunnett Contrasts

Fit: aov(formula = libido ~ dose, data = viagraData)

Linear Hypotheses:
                         Estimate Std. Error t value Pr(>|t|)
Low Dose - Placebo == 0    1.0000     0.8869   1.127   0.4459
High Dose - Placebo == 0   2.8000     0.8869   3.157   0.0152 *

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)


Simultaneous Confidence Intervals

Multiple Comparisons of Means: Dunnett Contrasts

Fit: aov(formula = libido ~ dose, data = viagraData)

Quantile = 2.5023
95% family-wise confidence level

Linear Hypotheses:
                         Estimate     lwr     upr
Low Dose - Placebo == 0    1.0000 -1.2194  3.2194
High Dose - Placebo == 0   2.8000  0.5806  5.0194


Output 10.14 
10.6.8.3. Run for cover - it’s robust post hoc tests


As with the robust ANOVA, to run robust post hoc tests we need to (1) source Rand Wilcox’s
functions (see section 10.6.6.3 for how to do this); and (2) input data in the wide format -
therefore, we’ll use the object viagraWide that we created in section 10.6.6.3. We are going
to use two functions: lincon(), which is based on trimmed means; and mcppb20(), which
uses a percentile bootstrap to compute p-values as well as trimming the group means. The
latter method, in particular, seems good at controlling the Type I error rate. The general
forms of these functions are similar to t1way() and t1waybt(), which we encountered
earlier in the chapter:¹⁰

lincon(dataframe, tr = .2, grp = c(x, y, ..., z)) 
mcppb20(dataframe, tr = .2, nboot = 2000, grp = c(x, y, ..., z)) 

The options for each function are the same as described in section 10.6.6.3. Note that 
these functions take the same parameters, except that mcppb20() has an additional nboot 
command to control the number of bootstrap samples (the default is 2000, which is fine). 
Trimming on the means defaults to 20% (tr = .2). If you are happy with the default values 
then we can execute these commands on the viagraWide dataframe as follows: 

lincon(viagraWide) 

mcppb20(viagraWide) 

It’s as easy as that. Output 10.15 comes from lincon(). Note that the confidence intervals 
are corrected for the number of tests, but the p-values are not. As such, we should ascertain 
significance from whether or not the confidence intervals cross zero. In this case they all 
do, which implies that none of the groups are significantly different. This is different from 
what we found when we did not trim the means (see the previous two sections). 
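The confidence limits that lincon() reports can be reconstructed by hand, which shows why every interval crosses zero. In this sketch the critical value and standard error are copied from Output 10.15 rather than computed, and viagraWide is rebuilt directly so that the example is self-contained.

```r
# Wide-format Viagra data, one column per group
viagraWide <- data.frame(Placebo   = c(3, 2, 1, 1, 4),
                         Low.Dose  = c(5, 2, 4, 2, 3),
                         High.Dose = c(7, 4, 5, 3, 6))

# 20% trimmed mean of each group
trimmedMeans <- sapply(viagraWide, mean, trim = 0.2)

# Placebo vs. low dose: difference between the trimmed means (psihat)
psihat <- unname(trimmedMeans["Placebo"] - trimmedMeans["Low.Dose"])

crit <- 3.74       # critical value copied from Output 10.15
se   <- 1.154701   # standard error copied from Output 10.15

c(lower = psihat - crit * se, upper = psihat + crit * se)
```

A large critical value combined with only four degrees of freedom makes the interval wide, so even the placebo versus high-dose difference fails to exclude zero.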



SELF-TEST

Repeat the analysis with 10% trimmed means. How
do your conclusions differ?


[1] "Note: confidence intervals are adjusted to control FWE"
[1] "But p-values are not adjusted to control FWE"

$test
     Group Group      test crit       se df
[1,]     1     2 0.8660254 3.74 1.154701  4
[2,]     1     3 2.5980762 3.74 1.154701  4
[3,]     2     3 1.7320508 3.74 1.154701  4

$psihat
     Group Group psihat ci.lower ci.upper    p.value
[1,]     1     2     -1 -5.31858  3.31858 0.43533094
[2,]     1     3     -3 -7.31858  1.31858 0.06016985
[3,]     2     3     -2 -6.31858  2.31858 0.15830242

Output 10.15






10 They actually have a few extra options, but I’m keeping things simple. 
Output 10.16 comes from mcppb20(). Unlike lincon(), both the confidence intervals and
p-values are corrected for the number of tests. The main table lists three contrasts. To make
sense of these we have to look at the contrast codes listed under $con. These are like the
contrast weights that we looked at earlier in the chapter, so groups with positive weights
are compared to those with negative weights. From the contrast codes we can see that
contrast 1 compares groups 1 and 2 (i.e., placebo vs. low dose), contrast 2 compares groups 1
and 3 (i.e., placebo vs. high dose), and contrast 3 compares groups 2 and 3 (i.e., low dose
vs. high dose).

Looking at the confidence intervals, it’s clear that only the interval for contrast 2 does
not cross zero, implying a significant difference between the high dose and placebo group
(which is confirmed by the associated p-value, which is smaller than .05). For the other two
comparisons the confidence intervals cross zero (and the ps are greater than .05), implying
non-significant differences in libido between the low-dose group and both placebo
(contrast 1) and high-dose (contrast 3) groups. Essentially, this profile of results is consistent
with what we found using non-robust post hoc tests.
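To see how the contrast codes under $con translate into the psihat values, the codes can be multiplied by the trimmed group means. This sketch types in the contrast matrix from Output 10.16 by hand and uses base R only; it does not call mcppb20() and does not reproduce the bootstrap.

```r
# Wide-format Viagra data and its 20% trimmed means
viagraWide <- data.frame(Placebo   = c(3, 2, 1, 1, 4),
                         Low.Dose  = c(5, 2, 4, 2, 3),
                         High.Dose = c(7, 4, 5, 3, 6))
trimmedMeans <- sapply(viagraWide, mean, trim = 0.2)

# Contrast codes as listed under $con: one column per contrast, one row per group
con <- cbind(c(1, -1, 0),    # contrast 1: placebo vs. low dose
             c(1, 0, -1),    # contrast 2: placebo vs. high dose
             c(0, 1, -1))    # contrast 3: low dose vs. high dose

# Each psihat is the sum of (code x trimmed mean) within a column
psihat <- as.vector(t(con) %*% trimmedMeans)
psihat
```

Each value is negative because, in every contrast, the group with the positive code has the smaller trimmed mean.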


[1] "Taking bootstrap samples. Please wait."
$psihat
     con.num psihat       se  ci.lower   ci.upper p-value
[1,]       1     -1 1.154701 -3.333333  1.3333333  0.3250
[2,]       2     -3 1.154701 -5.333333 -0.3333333  0.0055
[3,]       3     -2 1.154701 -4.333333  0.6666667  0.0840

$crit.p.value
[1] 0.017

$con
     [,1] [,2] [,3]
[1,]    1    1    0
[2,]   -1    0    1
[3,]    0   -1   -1

Output 10.16 



CRAMMING SAM’S TIPS 


One-way ANOVA 


• The one-way independent ANOVA compares several means, when those means have come from different groups of 
people; for example, if you have several experimental conditions and have used different participants in each condition. 

• When you have generated specific hypotheses before the experiment use planned comparisons, but if you don’t have 
specific hypotheses use post hoc tests. 

• There are lots of different post hoc tests: when you have equal sample sizes and homogeneity of variance is met, use 
Tukey’s HSD. If there is any doubt about the underlying assumptions then use a robust method. 

• Test for homogeneity of variance using Levene’s test. Find the table with this label: if the p-value is less than .05 then the 
assumption is violated. If homogeneity of variance has been met (the significance of Levene’s test is greater than .05), 
run a normal ANOVA. If, however, the assumption is violated (the significance of Levene’s test is less than .05) compute 
Welch’s F instead of the normal ANOVA, or use a robust method based on trimmed means and/or a bootstrap. 

• In the main ANOVA, if the value of p is less than .05 then the means of the groups are significantly different. 

• For contrasts and post hoc tests, look at the confidence intervals and p-values to discover if your comparisons are
significant. If the confidence intervals do not contain zero or the p-value is less than .05 then the effect is significant.
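
The choice between the ordinary F and Welch's F described in these tips can be sketched in base R. This is only an illustration: the scores below are assumed values chosen to reproduce the group means and standard deviations quoted in this chapter (they are not read from the book's data file), and the object names are hypothetical. In oneway.test(), var.equal = TRUE gives the ordinary F and var.equal = FALSE gives Welch's F.

```r
# Illustrative libido scores (assumed; not loaded from the book's data file)
libido <- c(3, 2, 1, 1, 4,    # placebo
            5, 2, 4, 2, 3,    # low dose
            7, 4, 5, 3, 6)    # high dose
dose <- gl(3, 5, labels = c("Placebo", "Low Dose", "High Dose"))

# Ordinary one-way ANOVA F (assumes homogeneity of variance)
classic <- oneway.test(libido ~ dose, var.equal = TRUE)

# Welch's F (does not assume homogeneity of variance)
welch <- oneway.test(libido ~ dose, var.equal = FALSE)

print(classic)
print(welch)
```

Which of the two results to report would depend on the outcome of Levene's test, as described above.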

Scraping the barrel? ©


Gallup, G. G. J., et al. (2003). Evolution and Human Behavior, 24, 277-289. 

Evolution has endowed us with many beautiful things (cats, dolphins, the Great Barrier Reef, etc.), all selected 
to fit their ecological niche. Given evolution’s seemingly limitless capacity to produce beauty, it’s something of a 
wonder how it managed to produce such a monstrosity as the human penis. One theory is that the penis evolved 
into the shape that it is because of sperm competition. Specifically, the human penis has an unusually large 
glans (the ‘bell end’, as it’s affectionately known) compared to other primates, and this may have evolved so that 
the penis can displace seminal fluid from other males by ‘scooping it out’ during intercourse. To put this idea to
the test, Gordon Gallup and his colleagues came up with an ingenious study (Gallup et al., 2003). Armed with 
various female masturbatory devices from Hollywood Exotic Novelties, an artificial vagina from California Exotic 
Novelties, and some water and cornstarch to make fake sperm, they loaded the artificial vagina with 2.6 ml of fake 
sperm and inserted one of three female sex toys into it before withdrawing it. Over several trials, three different 
female sex toys were used: a control phallus that had no coronal ridge (i.e., no bell end), a phallus with a minimal 
coronal ridge (small bell end) and a phallus with a coronal ridge. 

They measured sperm displacement as a percentage using the following equation (included here because it 
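is more interesting than all of the other equations in this book):

$$\text{sperm displacement (\%)} = \frac{\text{weight of vagina with semen} - \text{weight of vagina following insertion and removal of phallus}}{\text{weight of vagina with semen} - \text{weight of empty vagina}} \times 100$$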


As such, 100% means that all of the sperm was displaced by the phallus, and 0% means that none of the 
sperm was displaced. If the human penis evolved as a sperm displacement device, then Gallup et al. predicted: 
(1) that having a bell end would displace more sperm than not; and (2) the phallus with the larger coronal ridge 
would displace more sperm than the phallus with the minimal coronal ridge. The conditions are ordered (no ridge, 
minimal ridge, normal ridge) so we might also predict a linear trend. The data can be found in the file Gallup 
et al.csv. Draw an error bar graph of the means of the three conditions. Conduct a one-way ANOVA with 
planned comparisons to test the two hypotheses above. What did Gallup et al. find? 

Answers are in the additional material on the companion website (or look at pages 280-281 in the 
original article). 


10.7. Calculating the effect size © 


One thing you will notice is that R doesn’t routinely provide an effect size for one-way 
independent ANOVA. However, we saw in equation (7.4) that:

$$R^2 = \frac{SS_M}{SS_T}$$


We can actually get this value from the main ANOVA by using summary.lm() on the object
you create with aov(). For example, for the viagraModel this function gives us Output
10.8, at the bottom of which we see that r² = .46. For some bizarre reason, in the context
of ANOVA, r² is usually called eta squared, η². It is then a simple matter to take the square
root of this value to give us the effect size, r (√.46 = .68). Using the benchmarks for effect
sizes this represents a large effect (it is above the .5 threshold for a large effect). Therefore,
the effect of Viagra on libido is a substantive finding.
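
As a quick check on this arithmetic, eta squared can also be computed directly from the sums of squares. The sketch below assumes the values used elsewhere in this section for the Viagra model (SS_M = 20.13 and SS_T = 43.73); the variable names are hypothetical.

```r
# Sums of squares for the Viagra model, as quoted in this section
ss_m <- 20.13   # model sum of squares
ss_t <- 43.73   # total sum of squares

eta_squared <- ss_m / ss_t      # proportion of variance explained
r <- sqrt(eta_squared)          # effect size r

round(c(eta_squared = eta_squared, r = r), 2)
```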

However, this measure of effect size is slightly biased because it is based purely on sums
of squares from the sample and no adjustment is made for the fact that we’re trying to
estimate the effect size in the population. Therefore, we often use a slightly more complex
measure called omega squared (ω²). This effect size estimate is still based on the sums of
squares that we’ve met in this chapter, but like the F-ratio it uses the variance explained
by the model, and the error variance (in both cases the average variance, or mean squared
error, is used):

$$\omega^2 = \frac{SS_M - (df_M)MS_R}{SS_T + MS_R}$$

All of these values can be found in Output 10.5 (although SS_T is not in the output, it is easily
calculated as SS_T = SS_M + SS_R). In this example we’d get:

$$\omega^2 = \frac{20.13 - (2)1.97}{43.73 + 1.97} = \frac{16.19}{45.70} = .35$$

$$\omega = .60$$


As you can see, this has led to a slightly lower estimate than using r, and in general ω is a
more accurate measure. Although in the sections on ANOVA I will use ω as my effect size
measure, think of it as you would r (because it’s basically an unbiased estimate of r anyway).
People normally report ω², and it has been suggested that values of .01, .06 and .14 represent
small, medium and large effects respectively (Kirk, 1996). Remember, though, that
these are rough guidelines and that effect sizes need to be interpreted within the context
of the research literature.
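
The omega squared calculation can be wrapped in a small function. The version below is a minimal sketch written for this example (the name omega_squared and its arguments are hypothetical, and it is not the function from the online material); it takes SS_R = 43.73 - 20.13 = 23.60 from the relationship SS_T = SS_M + SS_R.

```r
# Hypothetical helper: omega squared from sums of squares and degrees of freedom
omega_squared <- function(ss_m, ss_r, df_m, df_r) {
  ms_r <- ss_r / df_r     # mean squared error
  ss_t <- ss_m + ss_r     # total sum of squares
  (ss_m - df_m * ms_r) / (ss_t + ms_r)
}

w2 <- omega_squared(ss_m = 20.13, ss_r = 23.60, df_m = 2, df_r = 12)
round(c(omega_squared = w2, omega = sqrt(w2)), 2)
```

Returning the value, rather than printing it, means the result can be square-rooted or rounded afterwards as needed.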



OLIVER TWISTED 

Please Sir, can I have some 
more ... omega? 


‘There’s no place like omega’, chants Oliver as he clicks the heels
of his red shoes together. Much as you want to wake up in Kansas,
Oliver, you’re going to find yourself in bubo-infested Dickensian
London. If you’d like to join him there, read the online material,
which shows you how to write a function to calculate ω² in R. I
think you’ll agree it’s not entirely different from a bubo infestation.


Most of the time it isn’t that interesting to have effect sizes for the overall ANOVA
because it’s testing a general hypothesis. Instead, we really want effect sizes for the
differences between pairs of groups. We can obtain these using the mes() function of the
compute.es package. This function takes the general form:

mes(mean_group1, mean_group2, sd_group1, sd_group2, n_group1, n_group2)

In other words, we simply input the mean, standard deviation ( sd ) and sample size ( n ) of 
the two groups that we want to compare. We have this information in Output 10.3. For 
example, if we want to compare the placebo and low-dose group we would execute: 

mes(2.2, 3.2, 1.3038405, 1.3038405, 5, 5)

We have entered the mean of the placebo group (2.2), the mean of the low-dose group
(3.2), the standard deviation of the placebo group (1.3038), the standard deviation of the 
low-dose group (also 1.3038), and both groups have a sample size of 5. Similarly, we can 
get effect sizes for the difference between the placebo and high-dose group by executing: 

mes(2.2, 5, 1.3038405, 1.5811388, 5, 5) 

Finally, the difference between the low- and high-dose groups can be quantified by 
executing: 

mes(3.2, 5, 1.3038405, 1.5811388, 5, 5) 

The outputs of these commands are shown in Output 10.17 (I have edited them to show
only the effect sizes d and r). The difference between the placebo and low-dose group is a
medium-sized effect (the means are about three-quarters of a standard deviation different),
d = -0.77, r = -.36; the difference between the placebo and high-dose group is a very large
effect (a difference between the group means of almost 2 standard deviations), d = -1.93,
r = -.69; finally, the difference between the low- and high-dose groups is a largish effect
(more than a standard deviation difference between the group means), d = -1.24, r = -.53.

Placebo vs. Low Dose:

$MeanDifference
         d      var.d          g      var.g
-0.7669650  0.4294118 -0.6927426  0.3503214

$Correlation
          r       var.r
-0.35805743  0.07113067

Placebo vs. High Dose:

$MeanDifference
         d      var.d          g      var.g
-1.9321836  0.5866667 -1.7451981  0.4786126

$Correlation
          r       var.r
-0.69480834  0.02029603

Low Dose vs. High Dose:

$MeanDifference
         d      var.d          g      var.g
-1.2421180  0.4771429 -1.1219130  0.3892612

$Correlation
          r       var.r
-0.52758935  0.04482986

Output 10.17
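
To see where values like these come from, d and r can be computed by hand from the same summary statistics. The sketch below uses standard textbook formulas (the pooled standard deviation for d, and a conversion from d to r that allows for group sizes); the function name d_and_r is hypothetical, and this is not necessarily how mes() is implemented internally.

```r
# Hypothetical helper: Cohen's d and r from means, SDs and sample sizes
d_and_r <- function(m1, m2, sd1, sd2, n1, n2) {
  # Pooled standard deviation of the two groups
  sd_pooled <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))
  d <- (m1 - m2) / sd_pooled
  # Convert d to r; the term a allows for unequal group sizes
  a <- (n1 + n2)^2 / (n1 * n2)
  r <- d / sqrt(d^2 + a)
  c(d = d, r = r)
}

placebo_low  <- d_and_r(2.2, 3.2, 1.3038405, 1.3038405, 5, 5)
placebo_high <- d_and_r(2.2, 5.0, 1.3038405, 1.5811388, 5, 5)
low_high     <- d_and_r(3.2, 5.0, 1.3038405, 1.5811388, 5, 5)

round(rbind(placebo_low, placebo_high, low_high), 2)
```

With these inputs the rounded values agree with the d and r reported in Output 10.17.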

An alternative is to compute effect sizes for the orthogonal contrasts. We can use the
same equation as in section 9.5.2.8:

$$r_{\text{contrast}} = \sqrt{\frac{t^2}{t^2 + df}}$$

We could write a function (see R’s Souls’ Tip 6.2) to do this computation for us in R:

rcontrast<-function(t, df)
{r<-sqrt(t^2/(t^2 + df))
	print(paste("r = ", r))
	}

Executing this command creates a function called rcontrast. First, we tell R that we want
to be able to input t and df into the function (these are specified in brackets). This means
that to use the function we have to input these values in brackets in the correct order.
The rest of the function uses these values to compute r and then print the result. The first
command takes the values of t and df entered into the function and places them into the
equation written above in R-speak (because of how I have labelled everything in the function
you should be able to compare the command directly with the equation above) to get
a value of r. The second command prints some text (in speech marks) followed by the value of r.
If you can’t be bothered to write out the command, you should be able to use it directly if
you have the package associated with this book, DSUR, loaded (see section 3.4.5).

Having executed this function, we can use it to calculate r for the contrasts. Output 10.9
gives us the value of t for each contrast (2.474 and 2.029). The degrees of freedom can be
calculated as in normal regression (see section 7.2.4) as N - p - 1, in which N is the total
sample size (in this case 15), and p is the number of predictors (in this case 2, the two
contrast variables). Therefore, the degrees of freedom are 15 - 2 - 1 = 12. Therefore, we can
execute the following commands:

rcontrast(2.474, 12) 
rcontrast(2.029, 12) 

The resulting values of r are 

[1] "r = 0.581182458413787" 

[1] "r = 0.505407970122564" 

Both effects are fairly large. 
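
If you would rather have the values as numbers than as printed text, a variant of the function can return r instead of printing it. This is a sketch only: rcontrast_value is a hypothetical name, not part of the DSUR package. Because R's arithmetic is vectorized, both contrasts can be handled in one call.

```r
# Hypothetical variant of rcontrast() that returns r rather than printing it
rcontrast_value <- function(t, df) {
  sqrt(t^2 / (t^2 + df))
}

n <- 15                    # total sample size
p <- 2                     # number of predictors (the two contrast variables)
df_resid <- n - p - 1      # residual degrees of freedom: 12

r_values <- rcontrast_value(c(2.474, 2.029), df_resid)
round(r_values, 2)
```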


10.8. Reporting results from one-way 
independent ANOVA © 


When we report an ANOVA, we have to give details of the F-ratio and the degrees of freedom
from which it was calculated. For the experimental effect in these data the F-ratio was
derived by dividing the mean squares for the effect by the mean squares for the residual.
Therefore, the degrees of freedom used to assess the F-ratio are the degrees of freedom for
the effect of the model (df_M = 2) and the degrees of freedom for the residuals of the model
(df_R = 12). Therefore, the correct way to report the main finding would be:

• There was a significant effect of Viagra on levels of libido, F(2, 12) = 5.12, p < .05,
ω = .60.

Notice that the value of the F-ratio is preceded by the values of the degrees of freedom 
for that effect. Also, we rarely state the exact significance value of the F-ratio: instead we 
report that the significance value, p, was less than the criterion value of .05 and include an 
effect size measure. The linear contrast can be reported in much the same way: 

• There was a significant linear trend, F(1, 12) = 9.97, p < .01, ω = .62, indicating that
as the dose of Viagra increased, libido increased proportionately.

Notice that the degrees of freedom have changed to reflect how the F-ratio was calculated.
I’ve also included an effect size measure (have a go at calculating this as we did for the main 
F-ratio and see if you get the same value). Also, we have now reported that the F-value was 
significant at a value less than the criterion value of .01. We can also report our planned 
contrasts or group comparisons: 

• Planned contrasts revealed that taking any dose of Viagra significantly increased
libido compared to having a placebo, t(12) = 2.47, p < .05 (one-tailed), and that taking
a high dose significantly increased libido compared to taking a low dose, t(12) =
2.03, p < .05 (one-tailed).

• Despite fairly large effect sizes, Bonferroni tests revealed non-significant differences
between the low-dose group and both the placebo, p = .845, d = -0.77, and high-dose,
p = .196, d = -1.24, groups. The high-dose group, however, had a mean almost
2 standard deviations bigger than the placebo group, p = .025, d = -1.93.
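
The effect size reported for the linear trend can be checked with the omega squared formula from section 10.7. The sketch below rests on one assumption that is not spelled out in the text: because the trend has 1 degree of freedom, its sum of squares can be recovered as F × MS_R. The values MS_R = 1.97 and SS_T = 43.73 are the ones quoted in section 10.7, and the variable names are hypothetical.

```r
f_linear  <- 9.97     # F-ratio for the linear trend
df_linear <- 1        # degrees of freedom for the trend
ms_r      <- 1.97     # residual mean squares
ss_t      <- 43.73    # total sum of squares

# F = MS_effect / MS_R, so SS_effect = F * MS_R * df_effect (assumption noted above)
ss_linear <- f_linear * ms_r * df_linear

omega_sq <- (ss_linear - df_linear * ms_r) / (ss_t + ms_r)
round(sqrt(omega_sq), 2)
```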




What have I discovered about statistics? © 


This chapter has introduced you to analysis of variance (ANOVA), which is the topic of
the next few chapters also. One-way independent ANOVA is used in situations when you
want to compare several means, and you’ve collected your data using different participants
in each condition. I started off explaining that if we just do lots of t-tests on the
same data then our Type I error rate becomes inflated. Hence we use ANOVA instead.
I looked at how ANOVA can be conceptualized as a general linear model (GLM) and
so is in fact the same as multiple regression. Like multiple regression, there are three
important measures that we use in ANOVA: the total sum of squares, SS_T (a measure
of the variability in our data), the model sum of squares, SS_M (a measure of how much
of that variability can be explained by our experimental manipulation), and SS_R (a measure
of how much variability can’t be explained by our experimental manipulation). We
discovered that, crudely speaking, the F-ratio is just the ratio of variance that we can
explain to the variance that we can’t. We also discovered that a significant F-ratio tells
us only that our groups differ, not how they differ. To find out where the differences lie
we have two options: specify specific contrasts to test hypotheses (planned contrasts), or
test every group against every other group (post hoc tests). The former are used when
we have generated hypotheses before the experiment, whereas the latter are for exploring
data when no hypotheses have been made. Finally, we discovered how to implement
these procedures in R.
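
The three sums of squares and the F-ratio summarized above can be computed by hand in a few lines. This is an illustrative sketch: the scores are assumed values chosen to reproduce the group means and standard deviations quoted in this chapter (they are not read from the book's data file), and the variable names are hypothetical.

```r
# Illustrative libido scores (assumed; not loaded from the book's data file)
libido <- c(3, 2, 1, 1, 4,    # placebo
            5, 2, 4, 2, 3,    # low dose
            7, 4, 5, 3, 6)    # high dose
dose <- gl(3, 5, labels = c("Placebo", "Low Dose", "High Dose"))

grand_mean  <- mean(libido)
group_means <- ave(libido, dose)    # each score replaced by its group mean

ss_t <- sum((libido - grand_mean)^2)        # total variability
ss_m <- sum((group_means - grand_mean)^2)   # variability explained by the model
ss_r <- sum((libido - group_means)^2)       # variability left unexplained

df_m <- nlevels(dose) - 1
df_r <- length(libido) - nlevels(dose)

# F is the ratio of the explained to the unexplained mean squares
f_ratio <- (ss_m / df_m) / (ss_r / df_r)

round(c(SS_T = ss_t, SS_M = ss_m, SS_R = ss_r, F = f_ratio), 2)
```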

We also saw that my life was changed by a letter that popped through the letterbox 
one day saying only that I could go to the local grammar school if I wanted to. When 
my parents told me, rather than being in celebratory mood, they were very downbeat; 
they knew how much it meant to me to be with my friends and how I had got used to my 
apparent failure. Sure enough, my initial reaction was to say that I wanted to go to the 
local school. I was unwavering in this view. Unwavering, that is, until my brother convinced
me that being at the same school as him would be really cool. It’s hard to measure
how much I looked up to him, and still do, but the fact that I willingly subjected myself 
to a lifetime of social dysfunction just to be with him is a measure of sorts. As it turned 
out, being at school with him was not always cool - he was bullied for being a boffin (in 
a school of boffins) and being the younger brother of a boffin made me a target. Luckily, 
unlike my brother, I was not a boffin and played football, which seemed to be good 
enough reasons for them to leave me alone. Most of the time.

R packages used in this chapter

car
compute.es
ggplot2
multcomp
pastecs
Rcmdr
WRS


R functions used in this chapter

aov()
by()
cbind()
contrasts()
contr.helmert()
contr.poly()
contr.SAS()
contr.treatment()
gl()
glht()
levene.test()
lincon()
lm()
mcppb20()
med1way()
mes()
oneway.test()
pairwise.t.test()
read.csv()
read.delim()
stat.desc()
summary()
summary.lm()
t1way()
t1waybt()
unstack()

Key terms that I’ve discovered 


Analysis of variance (ANOVA) 
Bonferroni correction 
Cubic trend 
Eta squared, η²
Experimentwise error rate 
Familywise error rate 
Grand variance 
Harmonic mean 
Helmert contrast 
Independent ANOVA 
Omega squared (ω²)


Orthogonal 
Pairwise comparisons 
Planned contrasts 
Polynomial contrast 
Post hoc tests 
Quadratic trend 
Quartic trend 
Treatment contrast 
Weights 
Welch’s F 


Smart Alex’s tasks 



• Task 1: Imagine that I was interested in how different teaching methods affected
students’ knowledge. I noticed that some lecturers were aloof and arrogant in their
teaching style and humiliated anyone who asked them a question, while others
were encouraging and supporting of questions and comments. I took three statistics
courses where I taught the same material. For one group of students I wandered
around with a large cane and beat anyone who asked daft questions or got questions
wrong (punish). In the second group I used my normal teaching style, which is to
encourage students to discuss things that they find difficult and to give anyone working
hard a nice sweet (reward). The final group I remained indifferent to and neither
punished nor rewarded students’ efforts (indifferent). As the dependent measure I
took the students’ exam marks (percentage). Based on theories of operant conditioning,
we expect punishment to be a very unsuccessful way of reinforcing learning, but
we expect reward to be very successful. Therefore, one prediction is that reward will
produce the best learning. A second hypothesis is that punishment should actually
retard learning such that it is worse than an indifferent approach to learning. The
data are in the file Teach.dat. Carry out a one-way ANOVA and use planned comparisons
to test the hypotheses that: (1) reward results in better exam results than either
punishment or indifference; and (2) indifference will lead to significantly better exam
results than punishment. ©

• Task 2: Earlier in this chapter we encountered some data relating to children’s injuries
while wearing superhero costumes. Children reporting to the emergency centre
at hospitals had the severity of their injury (injury) assessed (on a scale from 0, no
injury, to 100, death). In addition, a note was taken of which superhero costume they
were wearing (hero): Spiderman, Superman, the Hulk or a Teenage Mutant Ninja
Turtle. Use one-way ANOVA and multiple comparisons to test the hypotheses that
different costumes are associated with more severe injuries. ©

• Task 3: In Chapter 15 (section 15.6) there are some data looking at whether eating
soya meals reduces your sperm count. Have a look at this section, access the data for 
that example, but analyse them with ANOVA. What’s the difference between what 
you find and what is found in section 15.6.4? Why do you think this difference has 
arisen? © 

• Task 4: Students (and lecturers for that matter) love their mobile phones, which is
rather worrying given some recent controversy about links between mobile phone
use and brain tumours. The basic idea is that mobile phones emit microwaves, and
so holding one next to your brain for large parts of the day is a bit like sticking your
brain in a microwave oven and hitting the ‘cook until well done’ button. If we wanted
to test this experimentally, we could get six groups of people and strap a mobile
phone to their heads (so that they can’t remove it). Then, by remote control, we turn
the phones on for a certain amount of time each day. After 6 months, we measure the
size of any tumour (in mm³) close to the site of the phone antenna (just behind the
ear). The six groups experienced 0, 1, 2, 3, 4 or 5 hours per day of phone microwaves
for 6 months. The data are in Tumour.dat (from Field & Hole, 2003, so there is a
very detailed answer in there). ©

• Task 5: Using the Glastonbury data from Chapter 7 (GlastonburyFestivalRegression.dat),
carry out a one-way ANOVA on the data to see if the change in hygiene (change)
is significantly different across people with different musical tastes (music). Do a
contrast to compare each group against ‘No Affiliation’. Compare the results to those
described in section 7.12. ©

• Task 6: Labcoat Leni’s Real Research 15.2 describes an experiment (Çetinkaya &
Domjan, 2006) on quails with fetishes for terrycloth objects (really, it does). In this
example, you are asked to analyse two of the variables that they measured with a
Kruskal-Wallis test. However, there were two other outcome variables (time spent
near the terrycloth object and copulatory efficiency). These data can be analysed
with one-way ANOVA. Read Labcoat Leni’s Real Research 15.2 to get the full story,
then carry out two one-way ANOVAs and Bonferroni post hoc tests on the aforementioned
outcome variables. ©

Answers can be found on the companion website. 



Further reading 


Howell, D. C. (2006). Statistical methods for psychology (6th ed.). Belmont, CA: Duxbury. (Or you 
might prefer his Fundamental statistics for the behavioral sciences, also in its 6th edition, 2007. 
Both are excellent texts that provide very detailed coverage of the standard variance approach to 
ANOVA but also the GLM approach that I have discussed.) 

Iversen, G. R., & Norpoth, H. (1987). ANOVA (2nd ed.). Sage University Paper Series on Quantitative 
Applications in the Social Sciences, 07-001. Newbury Park, CA: Sage. (Quite high level, but a 
good read for those with a mathematical brain.) 

Klockars, A. J., & Sax, G. (1986). Multiple comparisons. Sage University Paper Series on Quantitative
Applications in the Social Sciences, 07-061. Newbury Park, CA: Sage. (High-level but thorough 
coverage of multiple comparisons - in my opinion this book is better than Toothaker for planned 
comparisons.) 

Rosenthal, R., Rosnow, R. L., & Rubin, D. B. (2000). Contrasts and effect sizes in behavioural
research: A correlational approach. Cambridge: Cambridge University Press. (Fantastic book on 
planned comparisons by three of the great writers on statistics.) 

Rosnow, R. L., & Rosenthal, R. (2005). Beginning behavioral research: A conceptual primer (5th ed.).
Upper Saddle River, NJ: Pearson/Prentice Hall. (Look, they wrote another great book!) 

Toothaker, L. E. (1993). Multiple comparison procedures. Sage University Paper Series on Quantitative 
Applications in the Social Sciences, 07-089. Newbury Park, CA: Sage. (Also high level, but gives 
an excellent precis of post hoc procedures.) 

Wright, D. B., & London, K. (2009). First steps in statistics (2nd ed.). London: Sage. (If this chapter
is too complex then Wright and London’s book is a very readable basic introduction to ANOVA.) 


Interesting real research 


Davies, P., Surridge, J., Hole, L., & Munro-Davies, L. (2007). Superhero-related injuries in paediatrics:
A case series. Archives of Disease in Childhood, 92(3), 242-243.

Gallup, G. G. J., Burch, R. L., Zappieri, M. L., Parvez, R., Stockwell, M., & Davis, J. A. (2003).
The human penis as a semen displacement device. Evolution and Human Behavior, 24, 277-289.

FIGURE 11.1 Davey Murray (guitarist from Iron Maiden) and me backstage in London in 1986; my
grimace reflects the utter terror I was feeling at meeting my hero


11.1. What will this chapter tell me? © 


My road to rock stardom had taken a bit of a knock with my unexpected entry to an all-boys 
grammar school (rock bands and grammar schools really didn’t go together). I needed to be 
inspired and I turned to the masters: Iron Maiden. I first heard Iron Maiden at the age of 
11 when a friend of mine lent me Piece of Mind and told me to listen to ‘The Trooper’. It 
was, to put it mildly, an epiphany. I became their smallest (I was 11) biggest fan and started 
to obsess about them in the unhealthiest way possible. I started stalking the man who ran 
their fan club with letters, and, bless him, he replied. Eventually this stalking paid off and 
he arranged for me to go backstage when they played the Hammersmith Odeon in London
(now the Apollo Hammersmith) on 5 November 1986 (Somewhere on Tour, in case you’re
interested). Not only was it the first time that I had seen them live, but also I got to meet
them. It’s hard to put into words how bladder-splittingly exciting this was. I was so utterly
awe-struck that I managed to say precisely no words to them. As usual, then, a social situation
provoked me to make an utter fool of myself.¹ When it was over I was in no doubt that
this was the best day of my life. In fact, I thought, I should just kill myself there and then
because nothing would ever be as good as that again.² This may be true, but I have subsequently
had many other very nice experiences, so who is to say that they were not better? I
could compare experiences to see which one is the best, but there is an important confound:
my age. At the age of 13, meeting Iron Maiden was bowel-weakeningly exciting, but adulthood
(sadly) dulls your capacity for this kind of unqualified joy of life. Therefore, to really
see which experience was best, I would have to take account of the variance in enjoyment
that is attributable to my age at the time. This will give me a purer measure of how much
variance in my enjoyment is attributable to the event itself. This chapter describes analysis of
covariance, which extends the basic idea of ANOVA from the previous chapter to situations
when we want to factor in other variables that influence the outcome variable.


11.2. What is ANCOVA? © 


In the previous chapter we saw how one-way ANOVA could be characterized in 
terms of a multiple regression equation that used dummy variables to code group 
membership. In addition, in Chapter 7 we saw how multiple regression could incorporate
several continuous predictor variables. It should, therefore, be no surprise
that the regression equation for ANOVA can be extended to include one or more 
continuous variables that predict the outcome (or dependent variable). Continuous 
variables such as these, that are not part of the main experimental manipulation but 
have an influence on the dependent variable, are known as covariates and they can 
be included in an ANOVA analysis. When we measure covariates and include them 
in an analysis of variance we call it analysis of covariance (or ANCOVA for short).
This chapter focuses on this technique.

In the previous chapter we used an example looking at the effects of Viagra on libido. 
Let’s think about things other than Viagra that might influence libido: well, the obvious 
one is the libido of the participant’s sexual partner (after all, ‘it takes two to tango’), but 
there are other things too such as medication (antidepressants or the contraceptive pill) 
and fatigue that suppress libido. If these variables (the covariates) are measured, then it is 
possible to control for the influence they have on the dependent variable by including them 
in the regression model. From what we know of hierarchical regression (see Chapter 7), it 
should be clear that if we enter the covariate into the regression model first, and then enter 
the dummy variables representing the experimental manipulation, we can see what effect 
an independent variable has after the effect of the covariate. As such, we partial out the 
effect of the covariate. There are two reasons for including covariates in ANOVA: 

• To reduce within-group error variance: In the discussion of ANOVA and t-tests we
got used to the idea that we assess the effect of an experiment by comparing the
amount of variability in the data that the experiment can explain against the variability
that it cannot explain. If we can explain some of this ‘unexplained’ variance (SS_R)
in terms of other variables (covariates), then we reduce the error variance, allowing
us to more accurately assess the effect of the independent variable (SS_M).



¹ In my teens I stalked many bands, and Iron Maiden are by far the nicest of the bands I’ve met.

² Apart from my wedding day as it turned out.

• Elimination of confounds: In any experiment, there may be unmeasured variables
that confound the results (i.e., variables other than the experimental manipulation
that affect the outcome variable). If any variables are known to influence the dependent
variable being measured, then ANCOVA is ideally suited to remove the bias of
these variables. Once a possible confounding variable has been identified, it can be
measured and entered into the analysis as a covariate.

There are other reasons for including covariates in ANOVA but because I do not intend to 
describe the computation of ANCOVA in any detail I recommend that the interested reader 
consult my favourite sources on the topic (Stevens, 2002; Wildt & Ahtola, 1978). 

Imagine that the researcher who conducted the Viagra study in the previous chapter 
suddenly realized that the libido of the participants’ sexual partners would affect the
participants’ own libido (especially because the measure of libido was behavioural). Therefore,
they repeated the study on a different set of participants, but this time took a measure 
of the partners’ libido. The partners’ libido was measured in terms of how often they 
tried to initiate sexual contact. In the previous chapter, we saw that this experimental 
scenario could be characterized in terms of equation (10.2). Think back to what we know 
about multiple regression (Chapter 7) and you can hopefully see that this equation can be 
extended to include this covariate as follows: 


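$$\begin{aligned}
\text{libido}_i &= b_0 + b_3\,\text{covariate}_i + b_2\,\text{high}_i + b_1\,\text{low}_i + \varepsilon_i \\
\text{libido}_i &= b_0 + b_3\,\text{partner's libido}_i + b_2\,\text{high}_i + b_1\,\text{low}_i + \varepsilon_i
\end{aligned} \tag{11.1}$$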
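
Equation (11.1) is just a regression model, so it can be sketched with lm(). The data below are simulated purely for illustration (the effect sizes, sample size and variable names are all assumptions, not the book's data); the point is only to show the covariate being entered first and the dummy-coded grouping variable being added afterwards.

```r
set.seed(42)                      # simulated data, for illustration only
n_per_group <- 10
dose <- gl(3, n_per_group, labels = c("Placebo", "Low Dose", "High Dose"))

# Covariate generated independently of group membership
partner_libido <- rnorm(3 * n_per_group, mean = 3, sd = 1.5)

# Outcome depends on the covariate and on the (assumed) group effects
group_effect <- c(0, 0.5, 1.5)[as.integer(dose)]
libido <- 2 + 0.4 * partner_libido + group_effect + rnorm(3 * n_per_group)

# Step 1: covariate only; Step 2: covariate plus the experimental manipulation
covariate_only <- lm(libido ~ partner_libido)
ancova_model   <- lm(libido ~ partner_libido + dose)

anova(covariate_only, ancova_model)   # effect of dose after the covariate
coef(ancova_model)                    # intercept, covariate slope, two dummy variables
```

With R's default treatment coding the four coefficients play the roles of b0, b3, b1 and b2 in equation (11.1), with the placebo group as the baseline.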


11.3. Assumptions and issues in ANCOVA ® 


ANCOVA has the same assumptions as ANOVA except that there are two important additional
considerations: (1) independence of the covariate and treatment effect, and (2)
homogeneity of regression slopes.


11.3.1. Independence of the covariate and treatment effect


I said in the previous section that one use of ANCOVA is to reduce within-group error variance
by allowing the covariate to explain some of this error variance. However, for this to
be true the covariate must be independent of the experimental effect.

Figure 11.2 shows three different scenarios. Part A shows a basic ANOVA and is similar
to Figure 10.4; it shows that the variance in the outcome (in our example, libido) can be
partitioned into two parts that represent the experimental or treatment effect (in this case the
administration of Viagra) and the error or unexplained variance (i.e., factors that affect
libido that we haven’t measured). Part B shows the ideal scenario for ANCOVA in which
the covariate shares its variance only with the bit of libido that is currently unexplained.
In other words, it is completely independent of the treatment effect (it does not overlap
with the effect of Viagra at all). This scenario is the only one in which ANCOVA is appropriate.
Part C shows a situation in which people often use ANCOVA when they should
not. In this situation the effect of the covariate overlaps with the experimental effect. In
other words, the experimental effect is confounded with the effect of the covariate. In this
situation, the covariate will reduce (statistically speaking) the experimental effect because
it explains some of the variance that would otherwise be attributable to the experiment.
When the covariate and the experimental effect (independent variable) are not independent,
the treatment effect is obscured, spurious treatment effects can arise and at the very
least the interpretation of the ANCOVA is seriously compromised (Wildt & Ahtola, 1978).

FIGURE 11.2 The role of the covariate in ANCOVA (see text for details)

The problem of the covariate and treatment sharing variance is common and is ignored 
or misunderstood by many people (Miller & Chapman, 2001). In a very readable review,
Miller and Chapman cite many situations in which people misapply ANCOVA, and I recommend
reading this paper. To summarize the main issue, when treatment groups differ
on the covariate, putting the covariate into the analysis will not ‘control for’ or ‘balance 
out’ those differences (Lord, 1967, 1969). This situation arises mostly when participants 
are not randomly assigned to experimental treatment conditions. For example, anxiety and 
depression are closely correlated (anxious people tend to be depressed) so if you wanted 
to compare an anxious group of people against a non-anxious group on some task, the 
chances are that the anxious group would also be more depressed than the non-anxious 
group. You might think that by adding depression as a covariate into the analysis you can 
look at the ‘pure’ effect of anxiety, but you can’t. This would be the situation in part C 
of Figure 11.2; the effect of the covariate (depression) would contain some of the vari¬ 
ance from the effect of anxiety. Statistically speaking all that we know is that anxiety and 
depression share variance; we cannot separate this shared variance into ‘anxiety variance’ 
and ‘depression variance’, it will always just be ‘shared’. Another common example is if 
you happen to find that your experimental groups differ in their ages. Placing age into the 
analysis as a covariate will not solve this problem - it is still confounded with the experi¬ 
mental manipulation. ANCOVA is not a magic solution to this problem. 
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This problem can be avoided by randomizing participants to experimental groups, or by 
matching experimental groups on the covariate (in our anxiety example, you could try to 
find participants for the low anxious group who score high on depression). We can check 
whether this problem is likely to be an issue by checking whether experimental groups dif¬ 
fer on the covariate before we run the ANCOVA. To use our anxiety example again, we 
could test whether our high and low anxious groups differ on levels of depression (with a 
t-test or ANOVA). If the groups do not significantly differ then we can use depression as 
a covariate. 


11.3.2. Homogeneity of regression slopes


When an ANCOVA is conducted we look at the overall relationship between the outcome 
(dependent variable) and the covariate: we fit a regression line to the entire data set,
ignoring to which group a person belongs. In fitting this overall model we, therefore, assume
that this overall relationship is true for all groups of participants. For example, if there’s a 
positive relationship between the covariate and the outcome in one group, we assume that 
there is a positive relationship in all of the other groups too. If, however, the relationship 
between the outcome (dependent variable) and covariate differs across the groups then 
the overall regression model is inaccurate (it does not represent all of the groups). This 
assumption is very important and is called the assumption of homogeneity of regression 
slopes. The best way to think of this assumption is to imagine plotting a scatterplot for each 
experimental condition with the covariate on one axis and the outcome on the other. If 
you then calculated, and drew, the regression line for each of these scatterplots you should 
find that the regression lines look more or less the same (i.e., the values of b in each group 
should be equal). 

Let’s try to make this concept a bit more concrete. The main example in this chapter 
leads on from the example in the previous chapter in which we explored whether different 
doses of Viagra affect libido. Imagine that we repeated this experiment, but measured
partner’s libido as well and wanted to include this variable as a covariate. The homogeneity of
regression slopes assumption means that we assume that the relationship between the
outcome (dependent variable) and the covariate is the same in each of our treatment groups.
Figure 11.3 shows a scatterplot that displays this relationship (i.e., the relationship between 
partner’s libido, the covariate, and participant’s libido, the outcome) for each of the three 
experimental conditions. Each symbol represents the data from a particular participant, 
and the type of symbol tells us the group (circles = placebo, triangles = low dose, squares = 
high dose). The lines are the regression slopes for the particular group; they summarize the 
relationship between libido and partner’s libido shown by the dots (black = placebo group, 
light blue = low-dose group, dark blue = high-dose group). 

It should be clear that there is a positive relationship (the regression line slopes upwards
from left to right) between partner’s libido and participant’s libido in both the placebo
and low-dose conditions. In fact, the slopes of the lines for these two groups (black and
light blue) are very similar, showing that the relationship between libido and partner’s
libido is very similar in these two groups. This situation is an example of homogeneity
of regression slopes (the regression slopes in the two groups are similar). However, in
the high-dose condition there appears to be no relationship at all between participant’s
libido and that of their partner (the squares are fairly randomly scattered and the
regression line is very flat and shows a slightly negative relationship). The slope of this line is
very different from the other two, and this difference gives us cause to doubt whether
there is homogeneity of regression slopes (because the relationship between participant’s
libido and that of their partner is different in the high-dose group than in the other two
groups).

FIGURE 11.3 Scatterplot and regression lines of libido against partner's libido for each of the
experimental conditions
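One way to put numbers on what the scatterplot suggests is to fit the regression of libido on partner’s libido separately within each group and compare the slopes. The block below is only a sketch, not part of the original analysis: it assumes the 30 scores from ViagraCovariate.dat (Table 11.1), typed in by hand so that the block stands alone, it uses only base R, and the name slopesByGroup is made up for the illustration.

```r
# Scores from ViagraCovariate.dat (Table 11.1), typed in so the block stands alone
libido <- c(3, 2, 5, 2, 2, 2, 7, 2, 4, 7, 5, 3, 4, 4, 7, 5, 4,
            9, 2, 6, 3, 4, 4, 4, 6, 4, 6, 2, 8, 5)
partnerLibido <- c(4, 1, 5, 1, 2, 2, 7, 4, 5, 5, 3, 1, 2, 2, 6, 4, 2,
                   1, 3, 5, 4, 3, 3, 2, 0, 1, 3, 0, 1, 0)
dose <- factor(rep(1:3, times = c(9, 8, 13)), levels = 1:3,
               labels = c("Placebo", "Low Dose", "High Dose"))
viagraData <- data.frame(dose, libido, partnerLibido)

# Fit libido ~ partnerLibido within each dose group and keep only the slope (b)
# slopesByGroup is a hypothetical name used for this illustration
slopesByGroup <- sapply(split(viagraData, viagraData$dose), function(groupData) {
  coef(lm(libido ~ partnerLibido, data = groupData))[["partnerLibido"]]
})

print(round(slopesByGroup, 2))
```

Two similar positive slopes and one slope close to zero would be the numerical counterpart of the pattern described for Figure 11.3. Eyeballing slopes is not a formal test; the interaction term described in step 6 of the general procedure does that job.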

Although, in a traditional ANCOVA, heterogeneity of regression slopes is a bad thing,
there are situations where you might actually expect regression slopes to differ across
groups and where this is, in itself, an interesting hypothesis. When research is conducted
across different locations, you might reasonably expect the effects you get to differ slightly
across those locations. For example, if you had a new treatment for backache, you might
get several physiotherapists to try it out in different hospitals. You might expect the effect
of the treatment to differ across these hospitals (because therapists will differ in expertise,
the patients they see will have different problems and so on). Heterogeneity of regression
slopes is not a bad thing per se. If you have violated the assumption of homogeneity
of regression slopes, or if the variability in regression slopes is an interesting hypothesis
in itself, then you can explicitly model this variation using multilevel linear models (see
Chapter 19).


11.4. ANCOVA using R


In the previous section I said that we would develop the example from the previous chapter 
(which looked at the effect of Viagra on libido), but covary the effect of partner’s libido. 
Let’s now look at the data and run the analysis. 


11.4.1. Packages for ANCOVA in R


In this chapter, you will need the packages car (for Levene’s test, Type III sums of squares),
compute.es (for effect sizes), effects (for adjusted means), ggplot2 (for graphs), multcomp
(for post hoc tests), pastecs (for descriptive statistics), and WRS (for robust tests). If you do
not have these packages installed (some should be installed from previous chapters), you
can install them by executing the following commands:

install.packages("car"); install.packages("compute.es"); install.packages("effects");
install.packages("ggplot2"); install.packages("multcomp"); install.packages("pastecs");
install.packages("WRS", repos="http://R-Forge.R-project.org")

You then need to load these packages by executing these commands:

library(car); library(compute.es); library(effects); library(ggplot2);
library(multcomp); library(pastecs); library(WRS)


11.4.2. General procedure for ANCOVA


To conduct ANCOVA you should follow this general procedure: 



1 Enter data: I’m stating the obvious again. 

2 Explore your data: begin by graphing your data and computing some descriptive 
statistics. You should also check distributional assumptions and use Levene’s test to 
check for homogeneity of variance (see Chapter 5). 

3 Check that the covariate and any independent variables are independent: you need 
to run an ANOVA with the covariate as the outcome and any independent variables 
as predictors to check that the covariate does not differ significantly across levels of 
these variables. If you get a significant result then stop the analysis here. You have 
basically entered a bottomless pit of despair from which there is no escape. 

4 Do the ANCOVA: assuming all was fine in steps 2 and 3, run the main analysis of 
covariance. Depending on what you found with step 2, you might need to run a 
robust version of the test. 

5 Compute contrasts or post hoc tests: you can try to follow up the analysis to see 
which groups differ. 

6 Check for homogeneity of regression slopes: rerun the ANCOVA, including the
interaction between the independent variable and the covariate. If this interaction is
significant then you cannot assume homogeneity of regression slopes.


We will work through these steps in turn. 
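Steps 4 and 6 differ by a single term in the model formula, which a short sketch can make explicit. This is a minimal illustration in base R, assuming the scores from ViagraCovariate.dat (Table 11.1) typed in by hand; the names ancovaModel, slopesModel and slopesTest are invented for the example, and lm() stands in for aov() purely so that anova() can compare the two nested models directly.

```r
# Scores from ViagraCovariate.dat (Table 11.1), typed in so the block stands alone
libido <- c(3, 2, 5, 2, 2, 2, 7, 2, 4, 7, 5, 3, 4, 4, 7, 5, 4,
            9, 2, 6, 3, 4, 4, 4, 6, 4, 6, 2, 8, 5)
partnerLibido <- c(4, 1, 5, 1, 2, 2, 7, 4, 5, 5, 3, 1, 2, 2, 6, 4, 2,
                   1, 3, 5, 4, 3, 3, 2, 0, 1, 3, 0, 1, 0)
dose <- factor(rep(1:3, times = c(9, 8, 13)), levels = 1:3,
               labels = c("Placebo", "Low Dose", "High Dose"))
viagraData <- data.frame(dose, libido, partnerLibido)

# Step 4: the ANCOVA itself, with the covariate entered before the independent variable
ancovaModel <- lm(libido ~ partnerLibido + dose, data = viagraData)

# Step 6: the same model plus the covariate-by-group interaction
# (partnerLibido * dose expands to partnerLibido + dose + partnerLibido:dose)
slopesModel <- lm(libido ~ partnerLibido * dose, data = viagraData)

# Comparing the nested models tests whether the interaction improves the fit
slopesTest <- anova(ancovaModel, slopesModel)
print(slopesTest)
```

The second row of the comparison is the test of the interaction: a small p-value there would mean that the slopes cannot be assumed equal across groups.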


11.4.3. Entering data


The data for the main example are in Table 11.1 and can be found in the file
ViagraCovariate.dat. Table 11.1 shows the participant’s libido and their partner’s libido, and
Table 11.2 shows the means and standard deviations of these data. You can load this data file
by setting your working directory to the location of the file (see section 3.4.4) and executing:

viagraData<-read.delim("ViagraCovariate.dat", header = TRUE) 

In essence, if you’re entering the data in an external package such as Excel then the
data should be laid out more or less as they are in Table 11.1. So, create a coding variable
called Dose and, as in Chapter 10, let’s use 1 = placebo, 2 = low dose, 3 = high dose. There
were different numbers of participants in each condition, so you need to enter nine values
of 1 into this column (so that the first nine rows contain the value 1), followed by eight
rows containing the value 2, followed by 13 rows containing the value 3. At this point,
you should have one column with 30 rows of data entered. Next, create a second variable
called libido and enter the 30 scores that correspond to the person’s libido. Finally,
create a third variable called partnerLibido. Then, enter the 30 scores that correspond to the
partner’s libido.


TABLE 11.1 Data from ViagraCovariate.dat

We could enter the data directly into R by executing the following code:

libido<-c(3,2,5,2,2,2,7,2,4,7,5,3,4,4,7,5,4,9,2,6,3,4,4,4,6,4,6,2,8,5)
partnerLibido<-c(4,1,5,1,2,2,7,4,5,5,3,1,2,2,6,4,2,1,3,5,
4,3,3,2,0,1,3,0,1,0)
dose<-c(rep(1,9),rep(2,8), rep(3,13))


These commands create a variable called libido with the 30 libido scores contained within 
it, a variable called partnerLibido containing the libido scores for the corresponding part¬ 
ners, and a variable called dose which uses the rep() function to repeat the number 1 nine 
times, the number 2 eight times and the number 3 thirteen times (see the data below). We 
need to convert the numeric variable dose into a factor (i.e., categorical variable) and we 
can do this, as we did in the last chapter, by executing: 

dose<-factor(dose, levels = c(1:3), labels = c("Placebo", "Low Dose", "High Dose"))

Remember that we have specified that the levels of dose are 1, 2 and 3 (levels = c(1:3)),
and that we want to label these levels as Placebo, Low Dose and High Dose (labels =
c("Placebo", "Low Dose", "High Dose")). Finally, we can merge these variables into a
dataframe called viagraData by executing:

viagraData<-data.frame(dose, libido, partnerLibido)

The resulting data look like this:

        dose libido partnerLibido
1    Placebo      3             4
2    Placebo      2             1
3    Placebo      5             5
4    Placebo      2             1
5    Placebo      2             2
6    Placebo      2             2
7    Placebo      7             7
8    Placebo      2             4
9    Placebo      4             5
10  Low Dose      7             5
11  Low Dose      5             3
12  Low Dose      3             1
13  Low Dose      4             2
14  Low Dose      4             2
15  Low Dose      7             6
16  Low Dose      5             4
17  Low Dose      4             2
18 High Dose      9             1
19 High Dose      2             3
20 High Dose      6             5
21 High Dose      3             4
22 High Dose      4             3
23 High Dose      4             3
24 High Dose      4             2
25 High Dose      6             0
26 High Dose      4             1
27 High Dose      6             3
28 High Dose      2             0
29 High Dose      8             1
30 High Dose      5             0

SELF-TEST

Use R to find out the means and standard
deviations of both the participant’s libido and the
partner’s libido in the three groups. (Answers are in
Table 11.2.)


Table 11.2 Means (and standard deviations) from ViagraCovariate.dat 
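One possible way to obtain the group means and standard deviations is the base R aggregate() function. The sketch below is an illustration, not the book's own solution: it assumes the scores from Table 11.1 typed in by hand, and the names groupMeans and groupSDs are invented for the example.

```r
# Scores from ViagraCovariate.dat (Table 11.1), typed in so the block stands alone
libido <- c(3, 2, 5, 2, 2, 2, 7, 2, 4, 7, 5, 3, 4, 4, 7, 5, 4,
            9, 2, 6, 3, 4, 4, 4, 6, 4, 6, 2, 8, 5)
partnerLibido <- c(4, 1, 5, 1, 2, 2, 7, 4, 5, 5, 3, 1, 2, 2, 6, 4, 2,
                   1, 3, 5, 4, 3, 3, 2, 0, 1, 3, 0, 1, 0)
dose <- factor(rep(1:3, times = c(9, 8, 13)), levels = 1:3,
               labels = c("Placebo", "Low Dose", "High Dose"))
viagraData <- data.frame(dose, libido, partnerLibido)

# Mean of each variable within each dose group (one row per group)
groupMeans <- aggregate(cbind(libido, partnerLibido) ~ dose,
                        data = viagraData, FUN = mean)

# Standard deviation of each variable within each dose group
groupSDs <- aggregate(cbind(libido, partnerLibido) ~ dose,
                      data = viagraData, FUN = sd)

print(groupMeans)
print(groupSDs)
```

The rows come out in the order of the factor levels (placebo, low dose, high dose), so the two tables can be read side by side.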




11.4.4. ANCOVA using R Commander


There is no menu in R Commander that relates directly to ANCOVA. However, because
ANCOVA is simply regression, you could theoretically run it through the Statistics ⇒ Fit
models ⇒ Linear regression... menu. I don’t recommend using R Commander for
ANCOVA, though, because it doesn’t deal very well with categorical predictors, and you can’t
control the order in which variables are entered (which is pretty important as we shall see). For
these reasons I’m going to force you to use commands in this chapter. You could, however, 
use R Commander for some of the preliminary analyses; if you want to do this then see the 
previous chapter (section 10.6.4). 


11.4.5. Exploring the data


We’ll begin with some graphs. To look at the spread of data it’s useful to look at boxplots 
for each group both for libido and partner’s libido. In addition, it is helpful to look at the 
relationship between the outcome variable and the covariate within each group (this tells 
us about homogeneity of regression slopes). In this section, we’ll look at some boxplots. 



SELF-TEST

Use ggplot2 to produce boxplots for the Viagra data.
Try to re-create Figure 11.4.



Figure 11.4 shows boxplots for the levels of libido in both participants and their
partners across the three doses of Viagra. Levels of libido seem to increase for participants as
the dose of Viagra increases but the opposite is true for their partners. Also, the scores
are more spread out for the participants than for their partners.

FIGURE 11.4

If you completed the earlier self-test, then you will already have some descriptive
statistics for the data; if not, we can use the by() function and combine it with the stat.desc()
function in the pastecs package to get descriptive statistics for each group separately (see
Chapter 5 for more detail). Execute this command for both libido and partnerLibido:

by(viagraData$libido, viagraData$dose, stat.desc)
by(viagraData$partnerLibido, viagraData$dose, stat.desc)

The resulting output should confirm the means and standard deviations in Table 11.2 
(amongst other things). 

The final thing to do at this stage is to compute Levene’s test (see Chapter 5 and
section 10.3.1). We encountered the leveneTest() function from the car package in Chapter 5,
and we can again use it here. If we want to do Levene’s test to see whether the variance in 
libido (the outcome) varies across groups that received different doses of the drug (dose), 
we can execute: 

leveneTest(viagraData$libido, viagraData$dose, center = median) 

The output (Output 11.1) shows that Levene’s test is very non-significant, F(2, 27) =
0.33, p = .72. This means that for these data the variances are very similar (hence the high 
probability value). Had this test been significant, we could instead conduct and report a 
robust version of ANOVA, which we’ll cover later in this chapter. 

Levene's Test for Homogeneity of Variance
      Df F value Pr(>F)
group  2  0.3256 0.7249
      27

Output 11.1 
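It can help to see what a median-centred Levene’s test amounts to: a one-way ANOVA on the absolute deviations of each score from its own group median. The sketch below reproduces that idea in base R without the car package. It is an illustration under the assumption that the Table 11.1 scores are typed in by hand; the names absoluteDeviations and leveneByHand are made up for the example.

```r
# Libido scores and dose groups from ViagraCovariate.dat (Table 11.1)
libido <- c(3, 2, 5, 2, 2, 2, 7, 2, 4, 7, 5, 3, 4, 4, 7, 5, 4,
            9, 2, 6, 3, 4, 4, 4, 6, 4, 6, 2, 8, 5)
dose <- factor(rep(1:3, times = c(9, 8, 13)), levels = 1:3,
               labels = c("Placebo", "Low Dose", "High Dose"))

# Absolute deviation of every score from the median of its own group
groupMedians <- ave(libido, dose, FUN = median)
absoluteDeviations <- abs(libido - groupMedians)

# One-way ANOVA on those deviations: the F for dose is the Levene statistic
leveneByHand <- anova(lm(absoluteDeviations ~ dose))
print(leveneByHand)
```

The F-ratio and degrees of freedom in the dose row should agree with the values in Output 11.1.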

A good double-check of Levene’s test is to look at the highest and lowest variances. For 
our three groups we have standard deviations of 1.79 (placebo), 1.46 (low dose) and 2.12 
(high dose); see Table 11.2. If we square these values we get variances of 3.20 (placebo),
2.13 (low dose) and 4.49 (high dose). We then take the largest variance and divide it by the 
smallest: in this case 4.49/2.13 = 2.11. If we look at Figure 5.8 we can get the approximate 
critical value when comparing three variances and with 10 people per group (we have 
unequal groups, but this will do as an approximation). The critical value in this situation 
is approximately 5. Our observed value of 2.11 is less than this critical value of 5 so we 
probably don’t need to worry too much about the differences in variances. 
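The variance-ratio check described above takes only a couple of lines in R. This is a sketch in base R using the Table 11.1 scores typed in by hand; groupVariances and varianceRatio are names invented for the illustration. Because it works from unrounded variances, the ratio may differ from the hand calculation in the second decimal place.

```r
# Libido scores and dose groups from ViagraCovariate.dat (Table 11.1)
libido <- c(3, 2, 5, 2, 2, 2, 7, 2, 4, 7, 5, 3, 4, 4, 7, 5, 4,
            9, 2, 6, 3, 4, 4, 4, 6, 4, 6, 2, 8, 5)
dose <- factor(rep(1:3, times = c(9, 8, 13)), levels = 1:3,
               labels = c("Placebo", "Low Dose", "High Dose"))

# Variance of libido within each dose group
groupVariances <- tapply(libido, dose, var)

# Largest variance divided by the smallest
varianceRatio <- max(groupVariances) / min(groupVariances)

print(round(groupVariances, 2))
print(round(varianceRatio, 2))
```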
11.4.6. Are the predictor variable and covariate independent?


In section 11.3.1, I mentioned that before including a covariate in an analysis we should 
check that it is independent of the experimental manipulation. In this case, the proposed 
covariate is partner’s libido, and we need to check that this variable was roughly equal 
across levels of our independent variable. In other words, is the mean level of partner’s 
libido roughly equal across our three Viagra groups? We can test this by running an ANOVA 
with partnerLibido as the outcome and dose as the predictor. 



SELF-TEST

Conduct an ANOVA to test whether partner’s libido
(our covariate) is independent of the dose of Viagra
(our independent variable).



Output 11.2 shows the results of such an ANOVA. The main effect of dose is not significant,
F(2, 27) = 1.98, p = .16, which shows that the average level of partner’s libido was
roughly the same in the three Viagra groups. In other words, the means for partner’s libido 
in Table 11.2 are not significantly different in the placebo, low- and high-dose groups. This 
result means that it is appropriate to use partner’s libido as a covariate in the analysis. 

            Df Sum Sq Mean Sq F value Pr(>F)
dose         2 12.769  6.3847  1.9793 0.1577
Residuals   27 87.097  3.2258

Output 11.2 
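For readers who want to check their self-test answer, a model of this kind can be fitted with aov() as in the previous chapter. The block below is a sketch that assumes the Table 11.1 scores typed in by hand; the object names covariateCheck and covariateTable are invented for the illustration.

```r
# Scores from ViagraCovariate.dat (Table 11.1), typed in so the block stands alone
partnerLibido <- c(4, 1, 5, 1, 2, 2, 7, 4, 5, 5, 3, 1, 2, 2, 6, 4, 2,
                   1, 3, 5, 4, 3, 3, 2, 0, 1, 3, 0, 1, 0)
dose <- factor(rep(1:3, times = c(9, 8, 13)), levels = 1:3,
               labels = c("Placebo", "Low Dose", "High Dose"))
viagraData <- data.frame(dose, partnerLibido)

# The covariate is the outcome here; dose is the only predictor
covariateCheck <- aov(partnerLibido ~ dose, data = viagraData)

# summary() of an aov object returns a list; the first element is the ANOVA table
covariateTable <- summary(covariateCheck)[[1]]
print(covariateTable)
```

A non-significant effect of dose in this table is what licenses using partner’s libido as a covariate.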


11.4.7. Fitting an ANCOVA model


To create an ANCOVA model we can use the aov() function that we discovered in the 
previous chapter (see section 10.6.6.1). Remember that the aov() function is just the lm() 
function in disguise, so we can use what we learnt in Chapter 7 to add new variables into 
our ANOVA model. Remember that to add a predictor, we simply write ‘+ variableName’ 
into the model. So, in Chapter 10 our ANOVA model was: 

viagraModel<-aov(libido ~ dose, data = viagraData) 

To add the predictor partnerLibido, we could simply change the model to this: 

viagraModel<-aov(libido ~ dose + partnerLibido, data = viagraData) 

Note that we have simply added ‘+ partnerLibido’ to the list of predictors. In essence, this 
is all there is to it. We could simply execute this command, sit back, crack open a cool drink 
and admire our handiwork. However, just as we were starting to enjoy a wave of smugness 
at having conducted an ANCOVA, the sinister shadow of humility would slap us on the 
face and point out that we need to think about the order of our predictors. If we use
the aov() function alone then we’ll get different results if we specify our model as ‘libido ~
dose + partnerLibido’ than if we specify ‘libido ~ partnerLibido + dose’ (note the order of 
predictors). This is curious and is something to which we need to give some thought (see 
R’s Souls’ Tip 11.1). 



R’s Souls’ Tip 11.1

Order matters


The order in which we enter predictors into a model makes a difference to the effects in the overall ANOVA, which
is very confusing. Luckily it does not affect the model parameters (i.e., the bs). Let’s look at an example.

First, let’s fit the ANCOVA model with partnerLibido entered first and then dose. To create this model (called 
covariateFirst), we specify the model as libido ~ partnerLibido + dose. We can see the ANOVA table for this model 
by executing the following commands: 


covariateFirst<-aov(libido ~ partnerLibido + dose, data = viagraData) 
summary(covariateFirst) 

The resulting ANOVA table is: 


              Df Sum Sq Mean Sq F value  Pr(>F)
partnerLibido  1  6.734  6.7344  2.2150 0.14870
dose           2 25.185 12.5926  4.1419 0.02745 *
Residuals     26 79.047  3.0403

This model implies a non-significant effect of the covariate (partnerLibido) on the participant’s libido, but a 
significant effect of dose. 

Let’s now redo the model but specifying the predictors in the opposite order. To create this model (called 
doseFirst), we specify the model as libido ~ dose + partnerLibido. Note that all we have done is change the order 
of the predictors. We can see the ANOVA table for this model by executing the following commands: 


doseFirst<-aov(libido ~ dose + partnerLibido, data = viagraData) 
summary(doseFirst) 

The resulting ANOVA table is: 


              Df Sum Sq Mean Sq F value  Pr(>F)
dose           2 16.844  8.4219  2.7701 0.08117
partnerLibido  1 15.076 15.0757  4.9587 0.03483
Residuals     26 79.047  3.0403

This model implies the complete opposite of the previous one: a significant effect of the covariate (partnerLibido) 
on the participant's libido, but a nonsignificant effect of dose. 

This is strange, isn’t it? The reason is that when R computes the fit of the model it uses Type I, or sequential, 
sums of squares by default. This means that any predictor entered into the model is evaluated after predictors 
before it in the model. Hence, order matters: in our first model partnerLibido is evaluated as the only term in the 
model, whereas in the second model it is evaluated after dose has already been entered and evaluated. 

An alternative (adopted by many statistics packages) is to use Type III sums of squares. For our first model
(covariateFirst), we could get the Type III sums of squares by executing (see text for details):

Anova(covariateFirst, type = "III")

For our second model (doseFirst), we could execute:

Anova(doseFirst, type = "III")

The outputs for both models are below:

Model: libido ~ partnerLibido + dose
              Df Sum of Sq     RSS    AIC F value   Pr(F)
<none>                      79.047 37.065
partnerLibido  1    15.076  94.123 40.302  4.9587 0.03483 *
dose           2    25.185 104.232 41.363  4.1419 0.02745 *

Model: libido ~ dose + partnerLibido
              Df Sum of Sq     RSS    AIC F value   Pr(F)
<none>                      79.047 37.065
dose           2    25.185 104.232 41.363  4.1419 0.02745 *
partnerLibido  1    15.076  94.123 40.302  4.9587 0.03483 *

Note that even though the predictors have been entered in the opposite order the results are now consistent in 
the two models. 
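The logic behind the tip can be checked with nested model comparisons in base R: each sum of squares is simply the drop in residual sum of squares when a term is added to whatever is already in the model. The block below is a sketch, not the method used in the text. It assumes the Table 11.1 scores typed in by hand, and the object names (covariateOnly, doseOnly, fullModel and so on) are invented for the illustration. It also uses drop1() from the stats package, which removes each term in turn from the full model.

```r
# Scores from ViagraCovariate.dat (Table 11.1), typed in so the block stands alone
libido <- c(3, 2, 5, 2, 2, 2, 7, 2, 4, 7, 5, 3, 4, 4, 7, 5, 4,
            9, 2, 6, 3, 4, 4, 4, 6, 4, 6, 2, 8, 5)
partnerLibido <- c(4, 1, 5, 1, 2, 2, 7, 4, 5, 5, 3, 1, 2, 2, 6, 4, 2,
                   1, 3, 5, 4, 3, 3, 2, 0, 1, 3, 0, 1, 0)
dose <- factor(rep(1:3, times = c(9, 8, 13)), levels = 1:3,
               labels = c("Placebo", "Low Dose", "High Dose"))
viagraData <- data.frame(dose, libido, partnerLibido)

covariateOnly <- lm(libido ~ partnerLibido, data = viagraData)
doseOnly <- lm(libido ~ dose, data = viagraData)
fullModel <- lm(libido ~ partnerLibido + dose, data = viagraData)

# Effect of dose over and above partnerLibido (what covariateFirst reports for dose)
doseAfterCovariate <- anova(covariateOnly, fullModel)

# Effect of partnerLibido over and above dose (what doseFirst reports for partnerLibido)
covariateAfterDose <- anova(doseOnly, fullModel)

print(doseAfterCovariate)
print(covariateAfterDose)

# Dropping each term from the full model gives both 'adjusted' effects at once,
# whatever order the predictors were typed in
print(drop1(fullModel, test = "F"))
```

With no interaction in the model, dropping each term in turn evaluates every effect after all of the others, which is why the order of entry stops mattering.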


Before we get carried away creating our ANCOVA model, we need to think about two 
related questions. First, how should we compute the sums of squares? Second, which
contrasts do we want to do? The answer to the second question depends, to some extent, on
the answer to the first. The first issue is complex. Essentially we have the choice between
evaluating our model using Type I, II or III sums of squares. For an explanation of the
difference between these sums of squares and their relative merits, see Jane Superbrain Box 11.1.



JANE SUPERBRAIN 11.1 

Types of sums of squares

We can compute sums of squares in four different ways, 
which gives rise to what are known as Type I, II, III and IV 
sums of squares. To explain these, we need an example. 
Let’s imagine that we’re predicting libido from
partnerLibido (the covariate), dose (the independent variable)
and their interaction (partnerLibido x dose).

The simplest explanation of Type I sums of squares
is that they are like doing a hierarchical regression in
which we put one predictor into the model first, and then
enter the second predictor. This second predictor will be
evaluated after the first. If we entered a third predictor
then this would be evaluated after the first and second,
and so on. In other words the order that we enter the
predictors matters. Therefore, if we entered our variables in
the order partnerLibido, dose and then partnerLibido
x dose, then dose would be evaluated after the effect
of partnerLibido and partnerLibido x dose would be
evaluated after the effects of both partnerLibido and
dose. R’s Souls’ Tip 11.1 demonstrates Type I sums of
squares in more detail.

Type III sums of squares differ from Type I in that all
effects are evaluated taking into consideration all other
effects in the model (not just the ones entered before).
This process is comparable to doing a forced entry
regression including the covariate(s) and predictor(s)
in the same block. Therefore, in our example, the effect
of dose would be evaluated after the effects of both
partnerLibido and partnerLibido x dose; the effect of
partnerLibido would be evaluated after the effects of both
dose and partnerLibido x dose; finally, partnerLibido x
dose would be evaluated after the effects of both dose
and partnerLibido.



Type II sums of squares are somewhere in between 
Type I and III in that all effects are evaluated taking into 
consideration all other effects in the model except for 
higher-order effects that include the effect being evaluated. 
In our example, this would mean that the effect of dose 
would be evaluated after the effect of partnerLibido (note 
that unlike Type III sums of squares, the interaction term 
is not considered); similarly, the effect of partnerLibido 
would be evaluated after only the effect of dose. Finally, 
because there is no higher-order interaction that includes 
partnerLibido x dose, this effect would be evaluated 
after the effects of both dose and partnerLibido. In other 
words, for the highest-order term Type II and Type III sums 
of squares are the same. Type IV sums of squares are 
essentially the same as Type III but are designed for situations
in which there are missing data.

The obvious question is which type of sums of squares 
should you use: 

• Type I: Unless the variables are completely independent
of each other (which is unlikely to be the case) then
Type I sums of squares cannot really evaluate the true
main effect of each variable. For example, if we enter
partnerLibido first, its sums of squares are computed
ignoring dose; therefore any variance in libido that is
shared by dose and partnerLibido will be attributed
to partnerLibido (i.e., variance that it shares with dose
is attributed solely to it). The sums of squares for dose
will then be computed excluding any variance that has
already been ‘given over’ to partnerLibido. As such the
sums of squares won’t reflect the true effect of dose
because variance in libido that dose shares with
partnerLibido is not attributed to it because it has already
been ‘assigned’ to partnerLibido. Consequently, Type I
sums of squares tend not to be used to evaluate
hypotheses about main effects and interactions because the
order of predictors will affect the results.

• Type II: If you’re interested in main effects then you
should use Type II sums of squares. Unlike Type III
sums of squares, Type IIs give you an accurate picture
of a main effect because they are evaluated ignoring
the effect of any interactions involving the main effect
under consideration. Therefore, variance from a main
effect is not ‘lost’ to any interaction terms containing
that effect. If you are interested in main effects and do
not predict an interaction between your main effects
then these tests will be the most powerful. However, if
an interaction is present, then Type II sums of squares
cannot reasonably evaluate main effects (because
variance from the interaction term is attributed to
them). Then again, if there is an interaction then you
shouldn’t really be interested in main effects anyway.
One advantage of Type II sums of squares is that they
are not affected by the type of contrast coding used to
specify the predictor variables.

• Type III: Type III sums of squares tend to get used as
the default in many statistical packages. They have the
advantage over Type IIs that when an interaction is
present, the main effects associated with that interaction are
still meaningful (because they are computed taking the
interaction into account). Perversely, this advantage is
a disadvantage too because it’s pretty silly to entertain
‘main effects’ as meaningful in the presence of an
interaction. Type III sums of squares encourage people to do
daft things like get excited about main effects that are
superseded by a higher-order interaction. Type III sums
of squares are preferable to other types when sample
sizes are unequal; however, they work only when
predictors are encoded with orthogonal contrasts.

Hopefully, it should be clear that the main choice in 
ANOVA designs is between Type II and Type III sums of 
squares. The choice depends on your hypotheses and 
which effects are important in your particular situation. If 
your main hypothesis is around the highest-order interaction
then it doesn’t matter which you choose (you’ll get the
same results); if you don’t predict an interaction and are
interested in main effects then Type II will be most powerful;
and if you have an unbalanced design then use Type
III. This advice is, of course, a simplified version of reality; 
be aware that there is (often heated) debate about which 
sums of squares are appropriate to a given situation. 


If we want Type I sums of squares, then in ANCOVA we enter the covariate(s) first, and 
the independent variable(s) second. So, we would need to specify the model not as we did 
above, but as: 

viagraModel<-aov(libido ~ partnerLibido + dose, data = viagraData) 

Note that the order of predictors in the model is the covariate (partnerLibido) followed
by the independent variable (dose), which means that the effect of dose is evaluated after
the effect of partnerLibido. If we specify the predictors in the opposite order we could get
completely different results (R’s Souls’ Tip 11.1).

We can get Type II and III sums of squares by using the Anova() function in the car
package. This function takes the general form:

Anova(modelName, type = "III") 

Note that the function needs a capital letter at the beginning (otherwise you’ll use a
function that does something different than what you want), and we replace modelName with
the name of the model for which we want Type III sums of squares. The type option
defaults to type = "II" (Type II sums of squares), but we can change it to type = "III" to get
Type III sums of squares.

The second question we asked was about which contrasts to select. This issue goes back 
to our discussion of planned comparisons in Chapter 10. By default, R will use dummy 
coding on the dose variable (it will compare each group to the first group). This is a
non-orthogonal contrast. If we want to use an orthogonal contrast such as a Helmert contrast
or set our own contrast then we need to use the contrasts() function to set the contrast for
dose before we create the model (see section 10.6.7). The reason why the answer to this
question depends on which sums of squares we use is because to calculate Type III sums of
squares properly we must specify orthogonal contrasts. By default R will use a
non-orthogonal contrast (dummy coding); therefore, if we do not change the contrast or contrasts to
be orthogonal the Type III sums of squares computed will be wrong. We must, therefore,
either set a Helmert contrast by executing:

contrasts(viagraData$dose)<-contr.helmert(3) 

or set our own contrast codes as we did in section 10.4. To remind you, we chose some
planned contrasts in Chapter 10, in which the first contrast compared the placebo group to
all doses of Viagra, and the second contrast then compared the high and low doses (see
section 10.4). We saw in sections 10.4 and 10.6.7 that to do this in R we had to enter certain
numbers to code these contrasts. For the first contrast we discovered an appropriate set of
codes would be -2 for the placebo group and then 1 for both the high- and low-dose groups.
For the second contrast the codes would be 0 for the placebo group, -1 for the low-dose
group and 1 for the high-dose group (see Table 10.4). If you want to do these contrasts for
ANCOVA, then you enter these codes into the contrasts() function for dose just as we did in
section 10.6.7:

contrasts(viagraData$dose)<-cbind(c(-2,1,1), c(0,-1,1)) 
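
A quick way to see why the Helmert codes and our planned codes count as orthogonal, whereas dummy coding does not, is to check the codes directly. The helper isOrthogonal() below is a hypothetical name written for this sketch (it is not part of R or of any package), and it uses the usual balanced-design definition: every column of codes sums to zero and the products of every pair of columns sum to zero.

```r
helmert <- contr.helmert(3)
planned <- cbind(c(-2, 1, 1), c(0, -1, 1))
dummy <- contr.treatment(3)

# Hypothetical helper: TRUE when each column sums to zero and the
# cross-products between different columns are all zero.
isOrthogonal <- function(codes) {
  crossProducts <- crossprod(codes)
  offDiagonal <- crossProducts[upper.tri(crossProducts)]
  all(colSums(codes) == 0) && all(offDiagonal == 0)
}

isOrthogonal(helmert)
isOrthogonal(planned)
isOrthogonal(dummy)
```

The dummy codes fail the check because their columns do not sum to zero, which is the property that matters for Type III sums of squares.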

We will use these contrasts; therefore, to run the ANCOVA (with Type III sums of 
squares) we would execute: 

contrasts(viagraData$dose)<-cbind(c(-2,1,1), c(0,-1,1))
viagraModel<-aov(libido ~ partnerLibido + dose, data = viagraData)
Anova(viagraModel, type = "III")

The first line sets the contrasts for dose, the second line creates the ANCOVA model, and 
the third line prints the model summary with Type III sums of squares. 
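
For readers without the car package, the drop1() function mentioned in footnote 3 gives the same kind of table in base R. The sketch below uses simulated data (all names and values are assumptions made for illustration, and lm() is used in place of aov(), which fits the same linear model). It also checks what the test for each term means: the sum of squares for a term is the change in residual sum of squares when that term alone is removed from the full model.

```r
# Simulated (hypothetical) data with a covariate and a three-level factor
set.seed(1)
group <- factor(rep(c("a", "b", "c"), each = 10))
covariate <- rnorm(30, mean = as.numeric(group))
outcome <- 1 + 0.5 * covariate + as.numeric(group) + rnorm(30)
simData <- data.frame(outcome, covariate, group)
contrasts(simData$group) <- contr.helmert(3)

simModel <- lm(outcome ~ covariate + group, data = simData)

# Each row tests one term given that every other term is already in the model
typeIII <- drop1(simModel, test = "F")
typeIII

# The same sum of squares from an explicit comparison of nested models
withoutGroup <- update(simModel, . ~ . - group)
ssGroup <- anova(withoutGroup, simModel)[2, "Sum of Sq"]
```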


11.4.8. Interpreting the main ANCOVA model


Output 11.3 shows the main ANCOVA. Looking first at the significance values, it is clear 
that the covariate significantly predicts the dependent variable, because the significance 
value is less than .05. Therefore, the person’s libido is influenced by their partner’s libido. 


3 You can also use the drop1() function to get Type III sums of squares. For ANOVA and ANCOVA, this takes the form:
drop1(modelName, test = "F")
What’s more interesting is that when the effect of partner’s libido is removed, the effect of
Viagra is significant (p is .027, which is less than .05).

Anova Table (Type III tests)

Response: libido
              Sum Sq Df F value    Pr(>F)
(Intercept)   76.069  1 25.0205 3.342e-05
partnerLibido 15.076  1  4.9587   0.03483
dose          25.185  2  4.1419   0.02745
Residuals     79.047 26

Output 11.3





Looking back at the group means from Table 11.2 for the libido data, it seems pretty
clear that the significant ANCOVA reflects a difference between the placebo group and the
two experimental groups (because the low- and high-dose groups have very similar means,
4.88 and 4.85, whereas the placebo group mean is much lower at 3.22). Actually, we can’t
interpret these group means because they have not been adjusted for the effect of the
covariate. These original means tell us nothing about the group differences reflected by the
significant ANCOVA. To get the adjusted means we need to use the effect() function in the
effects package. This produces a summary table of means for a specified effect in a model
created by aov() or lm(), but adjusted for other variables in the model (so-called marginal
means). The function takes the general form:

object<-effect("name of effect", modelName, se=TRUE) 

summary(object) 

object$se 



Note that we create an object that contains information about a given effect. 
The “name of effect” should be replaced with the effect in the model that interests 
you (in the current example we want the effect of dose). We also have to tell the 
function the name of the model (so we would replace modelName with the name 
of the ANCOVA model, in this case, viagraModel). Finally, if we want to see the 
standard errors associated with each mean, then we need to include the option 
se=TRUE. 

The effect object we created with this command contains various bits of information,
but to print the adjusted means and confidence intervals we can just apply the
summary() function to the newly created object. The standard errors are stored as
a variable called se within the effect object; therefore, to see the standard errors we need
to execute object$se.

To put all of this into practice for the Viagra data, to see the adjusted means we should 
execute: 


adjustedMeans<-effect("dose", viagraModel, se=TRUE) 

summary(adjustedMeans) 

adjustedMeans$se 

Output 11.4 shows the adjusted means (and their confidence intervals) and also the standard
errors. Unlike the means in Table 11.2, these adjusted means for the low-dose and
high-dose groups are fairly different. In other words, when the means are adjusted for
the effect of the covariate it looks very much as though libido increases as dose increases (from
2.93 in the placebo group, to 4.71 in the low-dose group and 5.15 in the high-dose
group). The standard errors for each group appear after the adjustedMeans$se command:
0.60 for the placebo group, 0.62 for the low-dose group and 0.50 for the high-dose
group.
 dose effect
dose
  Placebo  Low Dose High Dose
 2.926370  4.712050  5.151251

 Lower 95 Percent Confidence Limits
dose
  Placebo  Low Dose High Dose
 1.700854  3.435984  4.118076

 Upper 95 Percent Confidence Limits
dose
  Placebo  Low Dose High Dose
 4.151886  5.988117  6.184427

> adjustedMeans$se
       31        32        33
0.5962045 0.6207971 0.5026323

Output 11.4 
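
To make the idea of an adjusted mean concrete, the sketch below reproduces the logic in base R with simulated data (the data and all object names are hypothetical, and this is not the internal code of the effects package). It assumes that an adjusted mean is the model's prediction for each group with the covariate held at its overall mean, which is how adjusted means are usually defined for a model like this one.

```r
# Simulated (hypothetical) data with the same structure as the example
set.seed(7)
doseLevels <- c("Placebo", "Low Dose", "High Dose")
group <- factor(rep(doseLevels, each = 10), levels = doseLevels)
covariate <- rnorm(30, mean = 3)
outcome <- 2 + 0.4 * covariate + as.numeric(group) + rnorm(30)
simData <- data.frame(outcome, covariate, group)

simModel <- lm(outcome ~ covariate + group, data = simData)

# Predict each group's outcome with the covariate fixed at its overall mean
atMeanCovariate <- data.frame(
  covariate = mean(simData$covariate),
  group = factor(doseLevels, levels = doseLevels)
)
adjustedMeans <- predict(simModel, newdata = atMeanCovariate)
names(adjustedMeans) <- doseLevels

# Unadjusted group means, for comparison
rawMeans <- tapply(simData$outcome, simData$group, mean)
rbind(raw = rawMeans, adjusted = adjustedMeans)
```

Each raw mean differs from its adjusted mean by the covariate's slope multiplied by how far that group's covariate mean lies from the overall covariate mean.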
11.4.9. Planned contrasts in ANCOVA


The overall ANCOVA does not tell us which means differ, so to break down the overall
effect of dose we need to look at the contrasts that we specified before we created the
ANCOVA model. To see these contrasts we can use the summary.lm() function on the
ANCOVA model (viagraModel):

summary.lm(viagraModel)

Output 11.5 shows the model parameters, which correspond to the contrasts that we
specified for the variable dose. The first dummy variable (dose1) compares the placebo
group with the low- and high-dose groups. As such, it compares the adjusted mean of the
placebo group (2.93) with the average of the adjusted means for the low- and high-dose
groups ((4.71 + 5.15)/2 = 4.93). The b-value for the first dummy variable should therefore
be the difference between these values: 4.93 - 2.93 = 2. However, we also discovered in a
rather complex and boring bit of section 10.4.2 that this value gets divided by the number
of groups within the contrast (i.e., 3) and so will be 2/3 = .67 (as it is in the output). The
associated t-statistic is significant, indicating that the placebo group was significantly
different from the combined mean of the Viagra groups.

The second dummy variable (dose2) compares the low- and high-dose groups, and so the
b-value should be the difference between the adjusted means of these groups: 5.15 - 4.71 =
0.44. We again discovered in section 10.4.2 that this value also gets divided by the number of
groups within the contrast (i.e., 2) and so will be 0.44/2 = 0.22 (as in the output). The
associated t-statistic is not significant (its significance is .59, which is greater than .05), indicating that
the high-dose group did not produce a significantly higher libido than the low-dose group.

The final thing to notice is the value of b for the covariate (0.416). This value tells us
that, other things being equal, if a partner’s libido increases by one unit, then the person’s
libido should increase by just under half a unit (although there is nothing to suggest a causal
link between the two). The sign of this coefficient tells us the direction of the relationship
between the covariate and the outcome. So, in this example, because the coefficient is
positive it means that partner’s libido has a positive relationship with the participant’s libido:
as one increases so does the other. A negative coefficient would mean the opposite: as one
increases, the other decreases.
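
The arithmetic linking the adjusted means to the two contrast b-values can be checked in a few lines. The adjusted means below are copied from Output 11.4; the object names are just labels chosen for this sketch.

```r
# Adjusted means copied from Output 11.4
adjusted <- c(placebo = 2.926370, low = 4.712050, high = 5.151251)

# Contrast 1: placebo vs. the average of the two Viagra groups, divided by 3
b1 <- (mean(adjusted[c("low", "high")]) - adjusted["placebo"]) / 3

# Contrast 2: high dose vs. low dose, divided by 2
b2 <- (adjusted["high"] - adjusted["low"]) / 2

round(unname(c(b1, b2)), 4)
```

Working from the unrounded adjusted means gives values that match the estimates for dose1 and dose2 in Output 11.5 to four decimal places.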
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)     3.1260     0.6250   5.002 3.34e-05 ***
partnerLibido   0.4160     0.1868   2.227  0.03483 *
dose1           0.6684     0.2400   2.785  0.00985 **
dose2           0.2196     0.4056   0.541  0.59284

Output 11.5


11.4.10. Interpreting the covariate



I’ve already mentioned that the parameter estimates tell us how to interpret the covariate.
If the b-value for the covariate is positive then it means that the covariate and the outcome
variable have a positive relationship (as the covariate increases, so does the outcome). If
the b-value is negative it means the opposite: that the covariate and the outcome variable
have a negative relationship (as the covariate increases, the outcome decreases). For these
data the b-value was positive, indicating that as the partner’s libido increases, so does the
participant’s libido. Another way to discover the same thing is simply to draw a scatterplot
of the covariate against the outcome.



SELF-TEST

Plot a scatterplot of partnerLibido against libido.


FIGURE 11.5 Scatterplot of partner’s libido against libido (x-axis: Partner’s Libido)


Figure 11.5 shows the resulting scatterplot for these data and confirms what we already
know: the effect of the covariate is that as partner’s libido increases, so does the
participant’s libido (as shown by the slope of the regression line).
11.4.11. Post hoc tests in ANCOVA


It is also possible to obtain post hoc tests as we did for ANOVA (see section 10.6.8). 
However, because we want to test differences between the adjusted means, we can use only 
the glht() function; the pairwise.t.test() function will not test the adjusted means. As such, 
we are limited to using Tukey or Dunnett’s post hoc tests. Remember from Chapter 10 that 
to use this function we enter our model (in this case the ANCOVA model) into it and then 
use the summary() and confint() functions to see the post hoc tests in the console. For the 
viagraModel, we could therefore execute: 

postHocs<-glht(viagraModel, linfct = mcp(dose = "Tukey")) 

summary(postHocs) 

confint(postHocs) 

Output 11.6 shows the three comparisons (low dose vs. placebo, high dose vs. placebo,
high dose vs. low dose). Note that the estimate in each case is the difference between
the adjusted group means (Output 11.4): the estimate for the low dose vs. placebo is
4.71 - 2.93 = 1.78; for high dose vs. placebo it is 5.15 - 2.93 = 2.22; and for high vs.
low dose it is 5.15 - 4.71 = 0.44. The output also gives us the standard error associated with
the difference between adjusted means, the t-test (which is simply the difference between
means divided by the standard error), and its associated p-value. This output suggests a
significant difference between the high-dose and placebo groups (t = 2.77, p < .05), but not
between the low-dose and placebo groups (t = 2.10, p = .11) or between the high-dose and
low-dose groups (t = 0.54, p = .85). The confidence intervals (Output 11.7) confirm this
conclusion: the interval for the comparison of the high-dose and placebo groups does not
cross zero, which means that the true difference between group means is unlikely to be zero;
conversely, for the other contrasts the confidence intervals cross zero, implying that the
true difference between means could be zero.

Simultaneous Tests for General Linear Hypotheses
Multiple Comparisons of Means: Tukey Contrasts

Fit: aov(formula = libido ~ partnerLibido + dose, data = viagraData)

Linear Hypotheses:
                          Estimate Std. Error t value Pr(>|t|)
Low Dose - Placebo == 0     1.7857     0.8494   2.102   0.1088
High Dose - Placebo == 0    2.2249     0.8028   2.771   0.0264 *
High Dose - Low Dose == 0   0.4392     0.8112   0.541   0.8516

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

Output 11.6


Simultaneous Confidence Intervals
Multiple Comparisons of Means: Tukey Contrasts

Fit: aov(formula = libido ~ partnerLibido + dose, data = viagraData)

Quantile = 2.4856
95% family-wise confidence level

Linear Hypotheses:
                          Estimate lwr     upr
Low Dose - Placebo == 0    1.7857  -0.3255  3.8968
High Dose - Placebo == 0   2.2249   0.2294  4.2204
High Dose - Low Dose == 0  0.4392  -1.5772  2.4556


Output 11.7 
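
The t-values in Output 11.6 and the confidence limits in Output 11.7 can be rebuilt from the estimates, their standard errors and the quantile that Output 11.7 reports. The sketch below simply copies those published numbers (the object names are ours); small discrepancies in the fourth decimal place are due to rounding in the printed output.

```r
# Values copied from Outputs 11.6 and 11.7
estimate <- c(lowVsPlacebo = 1.7857, highVsPlacebo = 2.2249, highVsLow = 0.4392)
stdError <- c(0.8494, 0.8028, 0.8112)
quantile95 <- 2.4856

# t is the difference between adjusted means divided by its standard error
tValue <- estimate / stdError

# Family-wise 95% limits: estimate plus or minus quantile times standard error
lower <- estimate - quantile95 * stdError
upper <- estimate + quantile95 * stdError

round(cbind(tValue, lower, upper), 3)
```

Only the high dose vs. placebo interval has both limits above zero, which is the same conclusion as the adjusted p-values.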


11.4.12. Plots in ANCOVA


We saw in the previous chapter that the aov() function automatically generates some plots 
that we can use to test the assumptions. We can see these graphs by executing: 

plot(viagraModel)

The results are in Figure 11.6. You will actually see four graphs, but the first two are the
most important. The first graph (on the left of the figure) can be used for testing
homogeneity of variance. We encountered this kind of plot in Chapter 7: if it has a funnel shape
then we’re in trouble. The plot we have does show funnelling (the spread of scores is wider
at some points than at others), which implies that the residuals might be heteroscedastic
(a bad thing). The second plot (on the right) is a Q-Q plot (see Chapter 5), which tells us
about the normality of residuals in the model. We want our residuals to be normally
distributed, which means that the dots on the graph should hover around the diagonal line.
On ours, it looks like the diagonal line has not washed for several weeks and the dots are
running away from the smell. Again, this is not good news for the model. These plots
suggest that a robust version of ANCOVA might be in order.


FIGURE 11.6 Plots of an ANCOVA model (left: Residuals vs Fitted; right: Normal Q-Q; model: aov(libido ~ partnerLibido + dose))


11.4.13. Some final remarks


This example illustrates how ANCOVA can help us to exert stricter experimental control 
by taking account of confounding variables to give us a ‘purer’ measure of effect of the 
experimental manipulation. 













SELF-TEST

Run a one-way ANOVA to see whether the three groups differ in their levels of libido.



Output 11.8 shows (for illustrative purposes) the ANOVA table for these data when the 
covariate is not included. It is clear from the significance value, which is greater than .05, 
that Viagra seems to have no significant effect on libido. Therefore, without taking account 
of the libido of the participants’ partners we would have concluded that Viagra had no 
significant effect on libido, yet it does. 

            Df Sum Sq Mean Sq F value Pr(>F)
dose         2 16.844  8.4219  2.4159 0.1083
Residuals   27 94.123  3.4860

Output 11.8 
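
It can help to put the two analyses side by side. The sketch below rebuilds the F-ratio and p-value for dose from the sums of squares printed in Output 11.8 (no covariate) and Output 11.3 (covariate included); the numbers are copied from those outputs and the object names are ours.

```r
# F = (SS for dose / df for dose) / (residual SS / residual df)
anovaF <- (16.844 / 2) / (94.123 / 27)   # Output 11.8, covariate ignored
ancovaF <- (25.185 / 2) / (79.047 / 26)  # Output 11.3, covariate included

anovaP <- pf(anovaF, df1 = 2, df2 = 27, lower.tail = FALSE)
ancovaP <- pf(ancovaF, df1 = 2, df2 = 26, lower.tail = FALSE)

round(c(anovaF = anovaF, anovaP = anovaP, ancovaF = ancovaF, ancovaP = ancovaP), 4)
```

Including the covariate both shrinks the residual mean square (from about 3.49 to about 3.04) and increases the sum of squares attributed to dose, and together these changes move the effect of dose from non-significant to significant.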


11.4.14. Testing for homogeneity of regression slopes


We saw earlier in the chapter that the assumption of homogeneity of regression slopes means 
that the relationship between the covariate and outcome variable (in this case partnerLibido 
and libido) should be similar at different levels of the predictor variable (in this case in the 
three dose groups). Figure 11.3 showed scatterplots of the relationship between partnerLibido 
and libido in the three groups. This scatterplot showed that although this relationship was 
comparable in the low-dose and placebo groups, it appeared different in the high-dose group. 


SELF-TEST

Use ggplot2 to re-create Figure 11.3.




To test the assumption of homogeneity of regression slopes we need to run the ANCOVA
again, but include the interaction between the covariate and predictor variable. We can do this
in three ways. The first is to re-specify the whole model from scratch. We can include interaction
terms by linking variable names with a colon. For example, the interaction of partnerLibido
and dose would be written in R as partnerLibido:dose (or indeed dose:partnerLibido, it doesn’t
matter). Therefore, to include this interaction in an ANCOVA model we could execute:

hoRS<-aov(libido ~ partnerLibido + dose + dose:partnerLibido, data = 
viagraData) 

This command creates a model called hoRS (short for homogeneity of regression slopes), 
which includes the covariate, the independent variable and their interaction. 

The second way is to use the fact that you can include variables and their interactions in the
same model by specifying variable1*variable2 as the predictor. Doing so will enter not just the
interaction but also the effects of the individual variables. So, for example, this command:

hoRS<-aov(libido ~ partnerLibido*dose, data = viagraData) 

does exactly the same thing as the previous command. 










The final method is to update our original ANCOVA model (viagraModel) to include
the interaction term using the update() function (see R’s Souls’ Tip 7.2). The viagraModel
already includes partnerLibido and dose, so all we need to do is add the interaction term
by including ‘+ partnerLibido:dose’ as follows:

hoRS<-update(viagraModel, .~. + partnerLibido:dose)

The .~. simply means ‘keep the same outcome variable and predictors as before’ and the
‘+ partnerLibido:dose’ means ‘add the interaction term’. This method is, as you can see,
the quickest. Execute one of these commands to create the hoRS object and then use the
Anova() function to get the Type III sums of squares by executing (see footnote 4):

Anova(hoRS, type = "III")
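
The claim that the three specifications produce the same model can be checked on simulated data. The data and object names below are hypothetical and stand in for the Viagra data; the comparison uses the residual sum of squares and fitted values, which do not depend on how the terms happen to be labelled.

```r
# Simulated (hypothetical) data: one covariate and one three-level factor
set.seed(11)
group <- factor(rep(c("a", "b", "c"), each = 10))
covariate <- rnorm(30)
outcome <- 1 + 0.5 * covariate + as.numeric(group) + rnorm(30)
simData <- data.frame(outcome, covariate, group)

mainEffects <- aov(outcome ~ covariate + group, data = simData)

# Three ways of adding the covariate by group interaction
spelledOut <- aov(outcome ~ covariate + group + group:covariate, data = simData)
starred <- aov(outcome ~ covariate * group, data = simData)
updated <- update(mainEffects, . ~ . + covariate:group)

# Identical residual sums of squares indicate the same fitted model
c(deviance(spelledOut), deviance(starred), deviance(updated))
```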

Output 11.9 shows the main summary table for the ANCOVA including the interaction
term. The effects of the dose of Viagra and the partner’s libido are still significant, but the
main thing in which we’re interested is the interaction term, so look at the significance
value of the covariate by predictor interaction (partnerLibido:dose). If this effect is
significant then the assumption of homogeneity of regression slopes has been broken. The
effect here is significant (p < .05); therefore the assumption is not tenable. Although this
finding is not surprising given the pattern of relationships shown in Figure 11.3, it does
raise concern about the main analysis. This example illustrates why it is important to test
assumptions and not to just blindly accept the results of an analysis.

Anova Table (Type III tests)

Response: libido
                   Sum Sq Df F value    Pr(>F)
(Intercept)        53.542  1 21.9207 9.323e-05 ***
partnerLibido      17.182  1  7.0346  0.013947 *
dose               36.558  2  7.4836  0.002980 **
partnerLibido:dose 20.427  2  4.1815  0.027667 *
Residuals          58.621 24


Output 11.9 

11.5. Robust ANCOVA


As with one-way ANOVA, Wilcox (2005) describes a set of robust procedures for
conducting one-way ANCOVA. To access these we again need to load the WRS package (see section
5.8.4). There are two functions that we will look at which can be used to compare trimmed
means between two groups including a covariate: ancova() and ancboot(). These methods
work on the same principle. To free the analysis from the restrictions of homogeneity
of regression slopes, as well as the other distributional assumptions, these tests compare
trimmed means at different points along the covariate. In other words, rather than assuming
that the relationship between the covariate and outcome variable is constant in the two
groups, the method finds five points where the slopes are the same (i.e., five values of the
covariate for which the relationship between the outcome and covariate is roughly the same in
both groups). It then compares the trimmed means at these five points to see whether


4 We could also use Type II sums of squares here: because we’re interested only in the highest-order 
interaction, Type II and III sums of squares will give us exactly the same results (see Jane Superbrain 
Box 11.1.). 
they differ. This process is quite useful because it gives you an idea of how the differences
between group means change as a function of the covariate.

The function ancova() does the analysis just described, and ancboot() does the same but
uses a bootstrap-t method to compute confidence intervals. These functions have the
following basic form:

ancova(covGrp1, dvGrp1, covGrp2, dvGrp2, tr = .2)
ancboot(covGrp1, dvGrp1, covGrp2, dvGrp2, tr = .2, nboot = 599)

In these commands covGrp1 is a variable that contains the data for the covariate from
the first group; dvGrp1 is a variable that contains the data for the outcome variable (i.e.,
dependent variable) from the first group; covGrp2 is a variable that contains the data
for the covariate from the second group; and dvGrp2 is a variable that contains the data for
the outcome variable from the second group. The level of trimming is by default 20%, but
can be changed by including the tr = option. The second command also includes the nboot
option to control the number of bootstrap samples (the default is 599).

Let’s take a look at a new example. Two news stories caught my eye that related to some
physics research (Di Falco, Ploschner, & Krauss, 2010). In the first headline (November 2010)
the Daily Mirror (a UK newspaper) reported ‘Scientists make Harry Potter’s invisible cloak’.
I’m not really a Harry Potter aficionado (see footnote 5), so it wasn’t his mention that caught my
attention, but the idea of being able to don a cloak that would render me invisible and able to get up to
mischief was very exciting indeed. Where could I buy one? By February 2011 the same
newspaper was reporting on a different piece of research (Chen et al., 2011), but it came with a
slightly more sedate headline: ‘Harry Potter-style “invisibility cloak” built by scientists’.

Needless to say, scientists hadn’t actually made Harry Potter’s cloak of invisibility. Di 
Falco et al. had created a flexible material (Metaflex) that had optical properties that meant 
that if you layered it up you might be able to create something around which light would 
bend. Not exactly a cloak in the clothing sense of the word, but easier to wear than, say, a 
slab of granite. Chen et al. also hadn’t made a ‘cloak of invisibility’ in the clothing sense, 
but had created a calcite lump of invisibility. This could hide small objects (centimetres 
and millimeters in scale): you could conceal my brain but little else. Nevertheless, with a 
suitably large piece of calcite in tow, I could theoretically hide my whole body (although 
people might get suspicious of the apparently autonomous block of calcite manoeuvring 
itself around the room on a trolley). 

Although the newspapers probably overstated the case a little, these are two very exciting
pieces of research that bring the possibility of a cloak of invisibility closer to a reality. So, I
imagine a future in which we have some cloaks of invisibility to test out. As a psychologist
(with his own slightly mischievous streak) I might be interested in the effect that wearing a
cloak of invisibility has on people’s tendency to mischief. I took 80 participants and placed
them in an enclosed community. The community was riddled with hidden cameras so that we
could record mischievous acts. We recorded how many mischievous acts everyone conducted
in the first 3 weeks (mischief1). After 3 weeks we told about half of the sample (n = 34) that
we were switching the cameras off so that no one would be able to see what they were getting
up to; the remainder (n = 46) were given a cloak of invisibility. These people with cloaks were
told not to tell anyone else about their cloak and that they could wear it whenever they liked.
We recorded the number of mischievous acts over the next 3 weeks (mischief2). The variable
cloak records whether a person was given a cloak (cloak = 2) or not (cloak = 1). These
data are in the file CloakofInvisibility.dat. Load this file into a dataframe called
invisibilityData by setting your working directory to the correct folder and executing:

invisibilityData<-read.delim("CloakofInvisibility.dat", header = TRUE) 


5 Though perhaps I should be, given that another UK newspaper once dubbed me ‘the Harry Potter of the social 
sciences’ (http://www.discoveringstatistics.com/docs/thes_170909.pdf). I wasn’t sure whether this made me a 
heroic wizard battling against the evil forces of statistics, or an adult with a mental age of 11. 
We can convert the numeric variable cloak into a factor (i.e., categorical variable) by executing:

invisibilityData$cloak<-factor(invisibilityData$cloak, levels = c(1:2), labels = c("No Cloak", "Cloak"))

We have specified that the levels of cloak are 1 and 2 (levels = c(1:2)), and that we want to
label these levels as No Cloak and Cloak (labels = c("No Cloak", "Cloak")).




SELF-TEST

Use ggplot2 to produce boxplots for the invisibility data. Try to re-create Figure 11.7.


FIGURE 11.7 

Boxplots of the 
invisibility data 
Figure 11.7 shows boxplots for the number of mischievous acts before and after the
cloaks were given out by whether or not the person was given a cloak. Levels of mischief 
are comparable at baseline, and increase in both groups (not surprising given that those 
without cloaks were told that the cameras were being switched off). The whiskers show 
that the spread of scores is greater for the participants who received cloaks. 




SELF-TEST

Create a standard ANCOVA model of these data. What conclusions can you draw?


The main difficulty in running robust regression is getting the data into the right format.
Figure 11.8 shows the data from the invisibilityData dataframe (edited to save space).
You can see that the groups have been stacked and the covariates and dependent variable
(outcome) are in different columns. The functions for robust ANCOVA require us to create
four variables, which I have labelled as follows in the functions:

• covGrp1: This variable contains scores for the covariate (mischief1) for the first
group (in this case the ‘No Cloak’ group of the cloak variable). This is the upper left
block of data in Figure 11.8.

• dvGrp1: This variable contains scores for the dependent variable/outcome (mischief2)
for the first group (in this case the ‘No Cloak’ group of the cloak variable). This is the
upper right block of data in Figure 11.8.

• covGrp2: This variable contains scores for the covariate (mischief1) for the second
group (in this case the ‘Cloak’ group of the cloak variable). This is the lower left
block of data in Figure 11.8.

• dvGrp2: This variable contains scores for the dependent variable/outcome (mischief2)
for the second group (in this case the ‘Cloak’ group of the cloak variable). This is the
lower right block of data in Figure 11.8.


FIGURE 11.8 Extracting data for robust ANCOVA (the invisibilityData dataframe with the two groups stacked in rows; covGrp1 and dvGrp1 are the mischief1 and mischief2 columns for Group 1, and covGrp2 and dvGrp2 are the same columns for Group 2)
SELF-TEST

Can you use what you have learnt about R to create the four variables covGrp1, dvGrp1,
covGrp2, dvGrp2?


To create these variables, we could start by splitting the dataframe into two new
dataframes: one for the cloak group and the other for the no-cloak group. We can achieve this
by executing these commands:

noCloak<-subset(invisibilityData, cloak=="No Cloak") 
invisCloak<-subset(invisibilityData, cloak=="Cloak") 

Note that we have created two new dataframes (named noCloak and invisCloak). In both
cases we have used the subset() function (section 3.9.2), specified the original dataframe
(invisibilityData), and set a condition on which to select rows (this condition is that the value
of the variable cloak is equal to ‘No Cloak’ for the first dataframe and ‘Cloak’ for the
second).

We can now create the four variables by selecting the appropriate columns (i.e.,
variables) from these new dataframes. Execute these four commands:

covGrp1<-invisCloak$mischief1
dvGrp1<-invisCloak$mischief2
covGrp2<-noCloak$mischief1
dvGrp2<-noCloak$mischief2

The first command creates a variable called covGrp1 which contains the values of the
mischief1 variable within the invisCloak dataframe; the second creates a variable called dvGrp1
which contains the values of the mischief2 variable within the invisCloak dataframe; the
third and fourth commands do the same but using the noCloak dataframe (see footnote 6).
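
The four variables need to be plain numeric vectors rather than dataframes (the point made in footnote 6). The difference is easy to see on a tiny made-up dataframe; toyData and its values are invented for this sketch and are not the invisibility data.

```r
# A tiny hypothetical dataframe with the same column types as the example
toyData <- data.frame(
  cloak = factor(c("No Cloak", "No Cloak", "Cloak", "Cloak")),
  mischief1 = c(4, 5, 3, 6)
)

# Selecting a column with $ returns a plain numeric vector
asVector <- subset(toyData, cloak == "Cloak")$mischief1

# Using select = inside subset() returns a one-column dataframe instead
asDataFrame <- subset(toyData, cloak == "Cloak", select = mischief1)

is.numeric(asVector)
is.data.frame(asDataFrame)

# The column can still be pulled out of the dataframe afterwards
identical(asDataFrame$mischief1, asVector)
```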

Having created these variables, we can input them into the robust ANCOVA commands 
(note that I have also changed the number of bootstrap samples to 2000) and execute them: 

ancova(covGrp1, dvGrp1, covGrp2, dvGrp2)
ancboot(covGrp1, dvGrp1, covGrp2, dvGrp2, nboot = 2000)

Output 11.10 shows the results of the ancova() function and Output 11.11 shows
the results from ancboot(). Both of these outputs can be interpreted in the same way.
The X column indicates five values for the covariate (in this case 2, 4, 5, 6, 7) for which
the relationship between baseline mischief and post-cloak mischief is comparable in
the two groups. At these points we are told the number of cases in the data for the two
groups (n1 and n2) that have a covariate value close to x (not exactly x, but close to
it). Based on these two samples, trimmed means (20% by default) are computed and
the difference between them tested. This difference is stored in the column DIF and its
estimated standard error in the se column. The test statistic comparing the difference is


6 The astute amongst you might wonder why we don’t create these variables directly from the original dataframe.
For example, we could create covGrp1 by executing:

covGrp1<-subset(invisibilityData, cloak=="Cloak", select = mischief1)

However, if we used this command, covGrp1 would be a dataframe and not a variable. The robust ANCOVA
commands don’t seem to like dataframes, which is why we don’t use this quicker method.
in the TEST column (and is just the difference divided by the standard error). The
confidence interval of the difference between trimmed means is included (these are corrected
to control for the fact that we have done five tests). Note that the confidence intervals
are the first things to be different in the two outputs: this is because in the output from
ancboot() these confidence intervals are based on bootstrapping. Finally, we are told a
p-value for the test of the difference between trimmed means. If this value is less than
.05 we conclude that there is a significant difference between the trimmed means when
adjusting for the covariate.

[1] "NOTE: Confidence intervals are adjusted to control the probability"
[1] "of at least one Type I error."
[1] "But p-values are not"
$output
 X n1 n2    DIF   TEST      se    ci.low  ci.hi  p.value crit.val
 2 21 17 1.4056 1.8882 0.74441 -0.673383 3.4846 0.072261  2.79278
 4 31 26 1.7336 3.1302 0.55382  0.226996 3.2401 0.003720  2.72031
 5 32 26 1.0125 1.6767 0.60388 -0.639430 2.6644 0.104360  2.73551
 6 29 24 1.1711 2.3109 0.50675 -0.205854 2.5480 0.027304  2.71716
 7 24 17 1.3750 2.6145 0.52591 -0.079079 2.8291 0.015021  2.76490

Output 11.10 
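
The relationships between the columns of Output 11.10 can be checked by hand. The sketch below copies the printed values into vectors (the vector names are made up for illustration) and assumes that each confidence interval is DIF plus or minus crit.val times se, which reproduces the printed limits to within rounding: 

```r
# Values copied from Output 11.10 (one entry per design point)
x       <- c(2, 4, 5, 6, 7)
dif     <- c(1.4056, 1.7336, 1.0125, 1.1711, 1.3750)
se      <- c(0.74441, 0.55382, 0.60388, 0.50675, 0.52591)
critVal <- c(2.79278, 2.72031, 2.73551, 2.71716, 2.76490)
pValue  <- c(0.072261, 0.003720, 0.104360, 0.027304, 0.015021)

# TEST is the difference divided by its standard error
testStat <- dif / se

# Assumed form of the adjusted interval: DIF +/- crit.val * se
ciLow <- dif - critVal * se
ciHi  <- dif + critVal * se

# Flag the design points whose (unadjusted) p-value is below .05
check <- data.frame(x,
                    TEST = round(testStat, 4),
                    ci.low = round(ciLow, 4),
                    ci.hi = round(ciHi, 4),
                    significant = pValue < .05)
print(check)
```

Because the p-values are not adjusted but the confidence intervals are, the significant column and the intervals need not agree at every design point. 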

[1] "Note: confidence intervals are adjusted to control FWE" 
[1] "But p-values are not adjusted to control FWE" 
[1] "Taking bootstrap samples. Please wait." 

$output 
     X n1 n2      DIF     TEST      ci.low    ci.hi p.value 
[1,] 2 21 17 1.405594 1.888193 -0.63118033 3.442369  0.0800 
[2,] 4 31 26 1.733553 3.130180  0.21825768 3.248848  0.0050 
[3,] 5 32 26 1.012500 1.676646 -0.63977784 2.664778  0.1140 
[4,] 6 29 24 1.171053 2.310930 -0.21544474 2.557550  0.0270 
[5,] 7 24 17 1.375000 2.614530 -0.06392599 2.813926  0.0115 

$crit 
[1] 2.736084 

Output 11.11 

Going by the p-values, Outputs 11.10 and 11.11 show significant differences between trimmed 
means at three of the five design points (covariate values around 4, 6 and 7); the difference 
at 2 falls just short of significance (p = .072 in Output 11.10 and .080 in Output 11.11). In 
other words, in most cases the groups differ significantly in their mean level of mischief 
after the intervention (adjusted for baseline levels of mischievousness). We didn’t get a 
significant difference for values of the covariate around 5 (the middle of the five design 
points tested), where the difference between trimmed means was also smallest. This seems to 
suggest that having an invisibility cloak increased mischievousness in those who were 
ordinarily not very mischievous (baseline scores around 4, and perhaps 2) or ordinarily 
highly mischievous (baseline scores around 6 and 7), but not in the ‘averagely mischievous’ person. 

The robust ANCOVA function also produces a plot (Figure 11.9) of the covariate plotted 
against the outcome variable. Two regression splines are fitted (one for each group) 
but note that these are not straight lines (i.e., the slopes are not assumed to be linear). We 
can use this graph to help interpret the results by looking at the spread of data points for 
the two groups at each of the design points in the robust analysis (i.e., values of x = 2, 4, 
5, 6, 7). Notice that the circles are usually higher than the crosses. The one exception is 
when x = 5, where there is a cross at the highest point and a circle at the lowest point. This 
probably explains why we found no significant group differences at this design point in 
the robust analysis (it is the one point where it is not obvious that the circles are generally 
higher than the crosses). 

FIGURE 11.9 

Muris, P., et al. (2008). Child Psychiatry and Human Development, 39, 469-480. 


Anxious people tend to interpret ambiguous information in a negative way. For example, being highly anxious 
myself, if I overheard a student saying ‘Andy Field’s lectures are really different’ I would assume that ‘different’ 
meant ‘rubbish’, but it could also mean ‘refreshing’ or ‘innovative’. One current mystery is how these interpretational 
biases develop in children. Peter Muris and his colleagues addressed this issue in an ingenious study. 
Children did a task in which they imagined that they were astronauts who had discovered a new planet. Although 
the planet was similar to Earth, some things were different. They were given some scenarios about their time 
on the planet (e.g., ‘On the street, you encounter a spaceman. He has a sort of toy handgun and he fires at you 
...’) and the child had to decide which of two outcomes occurred. One outcome was positive (‘You laugh: it is a 
water pistol and the weather is fine anyway’) and the other negative (‘Oops, this hurts! The pistol produces a red 
beam which burns your skin!’). After each response the child was told whether their choice was correct. Half of 
the children were always told that the negative interpretation was correct, and the remainder were told that the 
positive interpretation was correct. 

Over 30 scenarios children were trained to interpret their experiences on the planet as negative or positive. 
Muris et al. then gave children a standard measure of interpretational biases in everyday life to see whether the 
training had created a bias to interpret things negatively. In doing so, they could ascertain whether children learn 
interpretational biases through feedback (e.g., from parents) about how to disambiguate ambiguous situations. 

The data from this study are in the file Muris et al (2008).dat. The independent variable is Training (positive 
or negative) and the outcome was the child’s interpretational bias score (Interpretational_Bias) - a high score 
reflects a tendency to interpret situations negatively. It is important to factor in the Age and Gender of the child 
and also their natural anxiety level (which they measured with a standard questionnaire of child anxiety called the 
SCARED) because these things affect interpretational biases also. Labcoat Leni wants you to carry out a one-way 
ANCOVA on these data to see whether Training significantly affected children’s Interpretational_Bias 
using Age, Gender and SCARED as covariates. What can you conclude? 

Answers are in the additional material on the companion website (or look at pages 475-476 in the 
original article). 

ANCOVA 


• Analysis of covariance (ANCOVA) compares several means, but adjusting for the effect of one or more other variables (called 
covariates); for example, if you have several experimental conditions and want to adjust for the age of the participants. 

• Before the analysis you should check that the independent variables and covariate(s) are independent. You can do this using 
ANOVA or a t-test to check that levels of the covariate do not differ significantly across groups. 

• You need to decide whether to use Type I or Type III sums of squares. If you use Type III you must use orthogonal contrasts 
rather than non-orthogonal ones. 

• If you have generated specific hypotheses before the experiment use planned comparisons. You set these contrasts using 
the contrasts() function. 

• In the resulting output from the ANCOVA, look at the p-value for both the covariate and the independent variable. If the value 
is less than .05 then for the covariate it means that this variable has a significant relationship to the outcome variable; for the 
independent variable it means that the means are significantly different across the experimental conditions after partialling 
out the effect that the covariate has on the outcome. 

• If you don’t have specific hypotheses you can use post hoc tests by using the glht() function. 

• For contrasts and post hoc tests, again look to the p-values to discover if your comparisons are significant (they will be if the 
significance value is less than .05). 

• Test the same assumptions as for ANOVA, but in addition you should test the assumption of homogeneity of regression 
slopes. This has to be done by customizing the ANCOVA model to look at the independent variable × covariate interaction. 
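
The last tip can be sketched in R. The example below uses simulated data, so the variable names (outcome, covariate, group) and every number in it are made up for illustration; it is not the chapter's dataset. It fits the ANCOVA with aov() and then uses update() to add the covariate × group interaction. Sequential (Type I) sums of squares are used here for simplicity, which is an assumption of this sketch; a non-significant interaction is consistent with homogeneity of regression slopes: 

```r
# Simulated data, purely for illustration (not the chapter's data)
set.seed(42)
n <- 30
dat <- data.frame(
  group = gl(3, n / 3, labels = c("Placebo", "Low", "High")),
  covariate = rnorm(n, mean = 3, sd = 1.5)
)
dat$outcome <- 4 + 0.4 * dat$covariate + 0.8 * as.numeric(dat$group) + rnorm(n)

# The basic ANCOVA: covariate entered first, then the grouping variable
ancovaModel <- aov(outcome ~ covariate + group, data = dat)

# Customize the model by adding the covariate x group interaction
slopesModel <- update(ancovaModel, . ~ . + covariate:group)

# Pull out the p-value for the interaction term
slopesTable <- summary(slopesModel)[[1]]
interactionRow <- grep("covariate:group", rownames(slopesTable), fixed = TRUE)
interactionP <- slopesTable[interactionRow, "Pr(>F)"]

print(slopesTable)
print(interactionP)
```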


11.6. Calculating the effect size © 


We saw in the previous chapter that we can use eta squared, $\eta^2$, as an effect size measure 
in ANOVA. This effect size is just $r^2$ by another name and is calculated by dividing the 
effect of interest, $SS_M$, by the total amount of variance in the data, $SS_T$. As such, it is 
the proportion of total variance explained by an effect. In ANCOVA (and some of the 
more complex ANOVAs that we’ll encounter in future chapters), we have more than 
one effect; therefore, we could calculate eta squared for each effect. However, we can 
also use an effect size measure called partial eta squared (partial $\eta^2$). This differs from eta 
squared in that it looks not at the proportion of total variance that a variable explains, 
but at the proportion of variance that a variable explains that is not explained by other 
variables in the analysis. Let’s look at this with our example; say we want to know the 
effect size of the dose of Viagra. Partial eta squared is the proportion of variance in libido 
that the dose of Viagra shares that is not attributed to partner’s libido (the covariate). If 
you think about the variance that the covariate cannot explain, there are two sources: 
it cannot explain the variance attributable to the dose of Viagra, $SS_{Viagra}$, and it cannot 
explain the error variability, $SS_R$. Therefore, we use these two sources of variance instead 
of the total variability, $SS_T$, in the calculation. The difference between eta squared and 
partial eta squared is shown as: 




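$$\eta^2 = \frac{SS_{\text{Effect}}}{SS_{\text{Total}}} \qquad \text{partial } \eta^2 = \frac{SS_{\text{Effect}}}{SS_{\text{Effect}} + SS_{\text{Residual}}} \tag{11.2}$$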


To calculate it for our Viagra example, we need to use the sums of squares in Output 11.3 
for the effect of dose (25.19), the covariate (15.08) and the error (79.05): 

$$\text{partial } \eta^2_{\text{Dose}} = \frac{SS_{\text{Dose}}}{SS_{\text{Dose}} + SS_{\text{Residual}}} = \frac{25.19}{25.19 + 79.05} = \frac{25.19}{104.24} = .24$$

$$\text{partial } \eta^2_{\text{Partner Libido}} = \frac{SS_{\text{Partner Libido}}}{SS_{\text{Partner Libido}} + SS_{\text{Residual}}} = \frac{15.08}{15.08 + 79.05} = \frac{15.08}{94.13} = .16$$


These values show that dose explained a bigger proportion of the variance not attributable 
to other variables than partnerLibido. 
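
The same arithmetic can be done in R. This is a minimal sketch: the helper partialEtaSq() is a hypothetical name introduced here for illustration, and the sums of squares are typed in by hand from the values quoted above: 

```r
# Sums of squares quoted in the text (from Output 11.3)
ssDose          <- 25.19
ssPartnerLibido <- 15.08
ssResidual      <- 79.05

# Hypothetical helper: effect SS divided by (effect SS + residual SS)
partialEtaSq <- function(ssEffect, ssResidual) {
  ssEffect / (ssEffect + ssResidual)
}

etaDose    <- partialEtaSq(ssDose, ssResidual)
etaPartner <- partialEtaSq(ssPartnerLibido, ssResidual)

round(c(dose = etaDose, partnerLibido = etaPartner), 2)
```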

As with ANOVA, you can also use omega squared ($\omega^2$). However, as we saw in section 
10.7, this measure can be calculated only when we have equal numbers of participants in 
each group (which is not the case in this example). So, we’re a bit stumped. 

However, all is not lost because, as I’ve said many times already, the overall effect size is 
not nearly as interesting as the effect size for more focused comparisons. If we think about 
the planned contrasts that we did, we can use the same equation as in section 9.6.3.8: 



$$r_{\text{contrast}} = \sqrt{\frac{t^2}{t^2 + df}}$$


Remember that in section 10.7 we wrote a function to compute this for us called 
rcontrast() (which you should be able to use if you have the package associated with this 
book, DSUR, loaded - see section 3.4.5). All we need are the values of t and df. 

Output 11.5 gives us the value of t for the covariate (2.227) and our contrasts comparing 
different groups. 7 The degrees of freedom can be calculated as in normal regression 
(see section 7.2.4) as N - p - 1, in which N is the total sample size (in this case 30), and p 
is the number of predictors (in this case 3, the two contrast variables and the covariate). 
Therefore, the degrees of freedom are 26. 

Therefore, first, create a variable (I’ve called it t ) containing the three values of t from 
Output 11.5, and another called df that is the value of the degrees of freedom: 

t<-c(2.227, 2.785, 0.541) 
df<-26 

We can print the corresponding effect sizes for the three t-values to the console by placing 
these variables t and df into the rcontrast() function and executing: 

rcontrast(t, df) 

Having executed this command R will print the resulting values to the console: 

effect_size r 

1 r = 0.400 

2 r = 0.479 

3 r = 0.106 
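
If the DSUR package is not to hand, the same values can be obtained directly from the formula. The sketch below is a stand-in, not the source of rcontrast(): the function name rFromT and the variable names are hypothetical, and it simply takes the square root of t squared over t squared plus df, with df worked out as N - p - 1: 

```r
# t-values from Output 11.5: covariate, then the two contrasts
tValues <- c(2.227, 2.785, 0.541)

# Degrees of freedom: N - p - 1 with N = 30 and p = 3 predictors
N <- 30
p <- 3
dfResidual <- N - p - 1

# Hypothetical stand-in for rcontrast(): r = sqrt(t^2 / (t^2 + df))
rFromT <- function(t, df) {
  sqrt(t^2 / (t^2 + df))
}

rValues <- rFromT(tValues, dfResidual)
round(rValues, 3)
```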

If you think back to our benchmarks for effect sizes, the effect of the covariate (.400) and 
the difference between the combined dose groups and the placebo (.479) both represent 
medium to large effect sizes (they’re both between .4 and .5). Therefore, as well as being 
statistically significant, these effects are substantive findings. The difference between the 
high- and low-dose groups (.106) was a fairly small effect. 


7 We should use a slightly more elaborate procedure when groups are unequal. It’s a bit beyond the 
scope of this book, but Rosnow, Rosenthal, and Rubin (2000) give a very clear account. 

An alternative is to calculate effect sizes between all combinations of groups, just as we 
did for ANOVA. We could again use the mes() function from the compute.es package: 

mes(mean_group1, mean_group2, sd_group1, sd_group2, n_group1, n_group2) 

We know the adjusted means (Output 11.4) and sample sizes; the problem is that we 
don’t know the adjusted standard deviations. We could either use the unadjusted standard 
deviations as an approximation, or we could estimate them from the standard errors of 
the adjusted means (Output 11.4). We discovered in Chapter 2 that the standard error is 
the standard deviation divided by the square root of the sample size on which the mean is 
based. If we rearrange this equation we get: 

$s = \sigma_{\bar{X}}\sqrt{N}$ 

In other words, the standard deviation is the standard error multiplied by the square root 
of the sample size. 8 We already have the standard errors for the adjusted means stored in 
the variable adjustedMeans$se (see section 11.4.8). If we create a variable, n, containing 
the three groups’ sample sizes by executing: 

n<-c(9,8,13) 

then we can approximate the standard deviations by multiplying the square root of this variable 
(sqrt(n) in R-speak) by the corresponding standard errors (stored in adjustedMeans$se). 
Therefore, to print the standard deviations to the console, execute: 

adjustedMeans$se*sqrt(n) 

You should find that the values are: 

1.788613 1.755879 1.812267 

Now we have all the information we need to use the mes() function. For example, if we 
want to compare the low-dose group with the placebo we would execute: 

mes(5.988117, 4.151886, 1.755879, 1.788613, 8, 9) 

We have entered the mean of the low-dose group (5.988117), the mean of the placebo 
group (4.151886), the corresponding standard deviations (1.755879 and 1.788613), and 
the sample sizes (8 and 9). 

Similarly we can get effect sizes for the difference between the high-dose and placebo 
groups by executing: 

mes(6.184427, 4.151886, 1.812267, 1.788613, 13, 9) 

Finally, the difference between the high- and low-dose groups can be quantified by 
executing: 

mes(6.184427, 5.988117, 1.812267, 1.755879, 13, 8) 

The outputs of these commands are shown in Output 11.12 (I have edited the outputs to 
show only the effect sizes d and r). The difference between the low-dose and placebo group 
is a large effect (the adjusted means are about a standard deviation different), d = 1.04, r = 
.46; the difference between the high-dose and placebo groups is also a large effect (over a 
standard deviation difference between the adjusted group means), d = 1.13, r = .48; finally, 
the difference between the high- and low-dose groups is a very small effect (the adjusted 
means are about a tenth of a standard deviation different), d = 0.11, r = .05. 


8 Strictly speaking, this is true only when the sample size is greater than about 30, which is not the case here. 

Low Dose vs. Placebo: 

$MeanDifference 
        d     var.d         g     var.g 
1.0354225 0.2676435 0.9827739 0.2411175 

$Correlation 
         r      var.r 
0.45912390 0.03277639 

High Dose vs. Placebo: 

$MeanDifference 
        d     var.d         g     var.g 
1.1274090 0.2169217 1.0845960 0.2007595 

$Correlation 
         r      var.r 
0.48480943 0.02347250 

High Dose vs. Low Dose: 

$MeanDifference 
        d     var.d         g     var.g 
0.1095664 0.2022089 0.1051837 0.1863557 

$Correlation 
         r      var.r 
0.05313258 0.04728373 

Output 11.12 
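
To see where the d and r values in Output 11.12 come from, they can be approximated by hand in base R. This sketch rests on two assumptions that happen to reproduce the printed values: that d divides the mean difference by the pooled standard deviation, and that r is obtained from d with a correction for unequal group sizes. The helper names cohenD and dToR are hypothetical and are not part of compute.es: 

```r
# Hypothetical helper: d from two means, SDs and sample sizes,
# using the pooled standard deviation (an assumption of this sketch)
cohenD <- function(m1, m2, sd1, sd2, n1, n2) {
  sdPooled <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))
  (m1 - m2) / sdPooled
}

# Hypothetical helper: convert d to r, allowing for unequal group sizes
dToR <- function(d, n1, n2) {
  a <- (n1 + n2)^2 / (n1 * n2)
  d / sqrt(d^2 + a)
}

# Adjusted means, approximated SDs and sample sizes from the text
dLowPlacebo  <- cohenD(5.988117, 4.151886, 1.755879, 1.788613, 8, 9)
dHighPlacebo <- cohenD(6.184427, 4.151886, 1.812267, 1.788613, 13, 9)
dHighLow     <- cohenD(6.184427, 5.988117, 1.812267, 1.755879, 13, 8)

rLowPlacebo  <- dToR(dLowPlacebo, 8, 9)
rHighPlacebo <- dToR(dHighPlacebo, 13, 9)
rHighLow     <- dToR(dHighLow, 13, 8)

round(c(dLowPlacebo, dHighPlacebo, dHighLow), 2)
round(c(rLowPlacebo, rHighPlacebo, rHighLow), 2)
```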


11.7. Reporting results © 


Reporting ANCOVA is much the same as reporting ANOVA, except we now have to report 
the effect of the covariate as well. For the covariate and the experimental effect we give 
details of the F-ratio and the degrees of freedom from which it was calculated. In both 
cases, the F-ratio was derived from dividing the mean squares for the effect by the mean 
squares for the residual. Therefore, the degrees of freedom used to assess the F-ratio are 
the degrees of freedom for the effect of the model ($df_M$ = 1 for the covariate and 2 for the 
experimental effect) and the degrees of freedom for the residuals of the model ($df_R$ = 26 for 
both the covariate and the experimental effect) - see Output 11.3. Therefore, the correct 
way to report the main findings would be: 

• The covariate, partner’s libido, was significantly related to the participant’s libido, 
F(1, 26) = 4.96, p < .05, r = .40. There was also a significant effect of the dose of 
Viagra on levels of libido after controlling for the effect of partner’s libido, F(2, 26) 
= 4.14, p < .05, partial $\eta^2$ = .24. 

We can also report some contrasts (see Output 11.5): 

• Planned contrasts revealed that taking a high or low dose of Viagra significantly 
increased libido compared to taking a placebo, t(26) = 2.79, p < .01, r = .48; there 
was no significant difference between the high and low doses of Viagra, t(26) = 0.54, 
p = .59, r = .11. 

Post hoc tests could be reported as follows (see Output 11.6): 

• Tukey post hoc tests revealed that the covariate-adjusted mean of the high-dose group 
was significantly greater than that of the placebo (difference = 2.22, t = 2.77, p < .05, 
d = 1.13). However, there was no significant difference between the low-dose and placebo 
groups (difference = 1.79, t = 2.10, p = .11, d = 1.04) or between the low-dose 
and high-dose groups (difference = 0.44, t = 0.54, p = .85, d = 0.11). Despite the lack 
of significance between the low-dose and placebo groups, the effect size was quite large. 



What have I discovered about statistics? © 


This chapter has shown you how the general linear model that was described in Chapter 
10 can be extended to include additional variables. The advantages of doing so are that 
we can remove the variance in our outcome that is attributable to factors other than our 
experimental manipulation. This gives us tighter experimental control, and may also 
help us to explain some of our error variance, and, therefore, give us a purer measure 
of the experimental manipulation. We didn’t go into too much theory about ANCOVA; 
we just looked conceptually at how the regression model can be expanded to include 
these additional variables (covariates). Instead we jumped straight into an example, 
which was to look at the effect of Viagra on libido (as in Chapter 10) but including 
partner’s libido as a covariate. I explained how to do the analysis using R and interpret 
the results. We also looked at an additional assumption that has to be considered when 
doing ANCOVA: the assumption of homogeneity of regression slopes. This just means 
that the relationship between the covariate and the outcome variable should be the same 
in all of your experimental groups. We finished off by looking at some very state-of-the-art 
robust versions of ANCOVA for when our data are up to mischief; we also learnt 
that this would be more likely if they possessed an invisibility cloak. The moral here is 
never to give your data set an invisibility cloak. 

Having seen Iron Maiden in all of their glory, I was inspired. Although I had briefly 
been deflected from my destiny by the shock of grammar school, I was back on track. I 
had to form a band. There was just one issue: no one else played a musical instrument. 
The solution was easy: through several months of covert subliminal persuasion I convinced 
my two best friends (both called Mark, oddly enough) that they wanted nothing 
more than to start learning the drums and bass guitar. A power trio was in the making! 


R packages used in this chapter 

car 
compute.es 
effects 
ggplot2 
multcomp 
pastecs 
WRS 

R functions used in this chapter 

ancboot() 
ancova() 
Anova() 
aov() 
by() 
confint() 
contrasts() 
drop1() 
effect() 
factor() 
ggplot() 
glht() 
leveneTest() 
lm() 
mes() 
plot() 
reshape() 
stat.desc() 
subset() 
summary() 
summary.lm() 
update() 


Key terms that I’ve discovered 


Adjusted mean 
Analysis of covariance (ANCOVA) 
Covariate 
Homogeneity of regression slopes 
Partial eta squared (partial $\eta^2$) 
Partial out 


Smart Alex’s tasks 



• Task 1: Stalking is a very disruptive and upsetting (for the person being stalked) experience 
in which someone (the stalker) constantly harasses or obsesses about another 
person. It can take many forms, from being sent intensely disturbing letters threatening 
to boil your cat if you don’t reciprocate the stalker’s undeniable love for you, to following 
you around your local area in a desperate attempt to see which CD you buy on a 
Saturday. A psychologist, who’d had enough of being stalked by people, decided to try 
two different therapies on different groups of stalkers (25 stalkers in each group - this 
variable is called Group). To the first group of stalkers he gave what he termed cruel-to-be-kind 
therapy. This therapy was based on punishment for stalking behaviours: every 
time the stalkers followed him around, or sent him a letter, the psychologist attacked 
them with a cattle prod. The second therapy was psychodyshamic therapy, which is a 
recent development on psychodynamic therapy that acknowledges its limited empirical 
support (you could say it’s based on Fraudian theory). In keeping with Freud’s ideas the 
therapist would discuss the stalker’s penis (or lack of it if they were a woman), the penis 
of their father, their dog’s penis, the penis of the cat down the road and anyone else’s 
penis that sprang to mind. At the end of therapy, the psychologist measured the number 
of hours in the week that the stalker spent stalking their prey (stalk2). The therapist 
believed that the success of therapy might well depend on how bad the problem was to 
begin with, so had measured the number of hours that the patient spent stalking prior 
to treatment (stalk1). The data are in the file Stalker.dat. Analyse the effect of therapy 
on stalking behaviour after therapy, controlling for the amount of stalking behaviour 
before therapy. Also try conducting a robust ANCOVA. © 

• Task 2: A marketing manager for a certain well-known drinks manufacturer was 
interested in the therapeutic benefit of certain soft drinks for curing hangovers. He 
took 15 people out on the town one night and got them drunk. The next morning as 
they awoke, dehydrated and feeling as though they’d licked a camel’s sandy feet clean 
with their tongue, he gave five of them water to drink, five of them Lucozade (in case 
this isn’t sold outside the UK, it’s a very nice glucose-based drink) and the remaining 
five a leading brand of cola (this variable is called drink). He then measured how 
well they felt (on a scale from 0 = I feel like death to 10 = I feel really full of beans 
and healthy) two hours later (this variable is called well). He wanted to know which 
drink produced the greatest level of wellness. However, he realized it was important 
to control for how drunk the person got the night before, and so he measured this on 
a scale from 0 = as sober as a nun to 10 = flapping about like a haddock out of water 
on the floor in a puddle of their own vomit. The data are in the file HangoverCure.dat. 
Conduct an ANCOVA to see whether people felt better after different drinks 
when controlling for how drunk they were the night before. © 


• Task 3: The annual elephant football (soccer) event in Nepal 9 is the highlight of the 
elephant calendar. However, in recent years a heated argument has arisen between 
the African and Asian elephants. It started in 2010 when the president of the Asian 
Elephant Football Association, an elephant named Boji, claimed that Asian elephants 
were more talented than their African counterparts. The head of the African Elephant 
Soccer Association, an elephant called Tunc, replied in a press statement that read ‘I 
make it a matter of personal pride never to take seriously any remark made by something 
that looks like an enormous scrotum’. I was called in to settle things. I collected 
data from the two types of elephants (elephant) over a season. For each elephant, I 
measured how many goals they scored in the season (goals) and how many years of 
experience they had (experience). The data are in Elephant Football.dat. Analyse the 
effect of the type of elephant on goal scoring, controlling for the amount of football 
experience the elephant has. Also try conducting a robust ANCOVA. © 

The answers are on the companion website, and task 1 has a full interpretation in Field 
and Hole (2003). 



Further reading 


Howell, D. C. (2006). Statistical methods for psychology (6th ed.). Belmont, CA: Duxbury. (Or you 
might prefer his Fundamental statistics for the behavioral sciences, also in its 6th edition, 2007.) 

Miller, G. A., & Chapman, J. P. (2001). Misunderstanding analysis of covariance. Journal of Abnormal 
Psychology, 110, 40-48. 

Rutherford, A. (2000). Introducing ANOVA and ANCOVA: A GLM approach. London: Sage. 

Muris, P., Huijding, J., Mayer, B., & Hameetman, M. (2008). A space odyssey: Experimental manipulation 
of threat perception and anxiety-related interpretation bias in children. Child Psychiatry 
and Human Development, 39(4), 469-480. 

9 http://news.bbc.co.uk/1/hi/8435112.stm 

FIGURE 12.1 Andromeda coming to a living room near you in 1988 (from left to right: Malcolm, me and the two Marks) 



12.1. What will this chapter tell me? © 


After persuading my two friends (Mark and Mark) to learn the bass and drums, I took the 
rather odd decision to stop playing the guitar. I didn’t stop, as such, but I focused on singing 
instead. In retrospect, I’m not sure why because I am not a good singer. Mind you, I’m 
not a good guitarist either. The upshot was that a classmate, Malcolm, ended up as our 
guitarist. I really can’t remember how or why we ended up in this configuration, but we 
called ourselves Andromeda, we learnt several Queen and Iron Maiden songs and we were 
truly awful. I have some tapes somewhere to prove just what a cacophony of tuneless drivel 
we produced, but the chances of these recordings appearing on the companion website are 
slim at best. Suffice it to say, you’d be hard pushed to recognize which Iron Maiden and 
Queen songs we were trying to play. I try to comfort myself with the fact that we were only 
14 or 15 at the time, but even youth does not excuse the depths of ineptitude to which 
we sank. Still, we garnered a reputation for being too loud in school assembly and we 
did a successful tour of our friends’ houses (much to their parents’ amusement I’m sure). 
We even started to write a few songs (I wrote one called ‘Escape from Inside’, about the 
film The Fly, that contained the wonderful rhyming couplet of ‘I am a fly, I want to die’: 
genius!). The only thing that we did that resembled the activities of a ‘proper’ band was 
to split up due to ‘musical differences’; these differences being that Malcolm wanted to 
write 15-part symphonies about a boy’s journey to worship electricity pylons and discover 
a mythical beast called the cuteasauros, whereas I wanted to write songs about flies and 
dying (preferably both). When we could not agree on a musical direction the split became 
inevitable. We could have tested empirically the best musical direction for the band by writing 
and performing two songs: Malcolm his 15-part symphony and me my 3-minute song 
about a fly. If we played these songs to various people and measured their screams of agony 
then we could ascertain the best musical direction to gain popularity. We have two variables 
that predict screams: whether Malcolm or I wrote the song (songwriter), and whether the 
song was a 15-part symphony or a song about a fly (song type). The one-way ANOVA that 
we encountered in Chapter 10 cannot deal with two predictor variables - this is a job for 
factorial ANOVA. 


12.2. Theory of factorial ANOVA (independent design) © 


In the previous two chapters we have looked at situations in which we’ve tried to test for 
differences between groups when there has been a single independent variable (i.e., one 
variable has been manipulated). However, at the beginning of Chapter 10 I said that one of 
the advantages of ANOVA was that we could look at the effects of more than one independent 
variable (and how these variables interact). This chapter extends what we already know 
about ANOVA to look at situations where there are two (or more) independent variables. 
We’ve already seen in the previous chapter that it’s very easy to incorporate a second variable 
into the ANOVA framework when that variable is a continuous variable (i.e., not split 
into groups), but now we’ll move on to situations where there is a second independent variable 
that has been systematically manipulated by assigning people to different conditions. 


12.2.1. Factorial designs © 


In Chapters 10 and 11 we have looked at the effects of a single independent variable on 
some outcome. However, independent variables often get lonely and want to have friends. 
Scientists are obliging individuals and often put a second (or third) independent variable 
into their designs to keep the others company. When an experiment has two or more independent 
variables it is known as a factorial design (this is because, as we have seen, variables 
are sometimes referred to as factors). There are several types of factorial design: 

• Independent factorial design: In this type of experiment there are several independent 
variables or predictors and each has been measured using different entities (between 
groups). We discuss this design in this chapter. 

• Repeated-measures (related) factorial design: This is an experiment 
in which several independent variables or predictors have been measured, 
but the same entities have been used in all conditions. This design is discussed 
in Chapter 13. 

• Mixed design: This is a design in which several independent variables 
or predictors have been measured; some have been measured with different 
entities, whereas others used the same entities. This design is discussed in 
Chapter 14. 

As you might imagine, analysing these types of experiments can get quite complicated. 
Fortunately, we can extend the ANOVA model that we encountered 
in the previous two chapters to deal with these more complicated situations. When we 
use ANOVA to analyse a situation in which there are two or more independent variables 
it is sometimes called factorial ANOVA; however, the specific names attached to different 
ANOVAs reflect the experimental design that they are being used to analyse (see Jane 
Superbrain Box 12.1). This section extends the one-way ANOVA model to the factorial 
case (specifically when there are two independent variables). In subsequent chapters we 
will look at repeated-measures designs, factorial repeated-measures designs and finally 
mixed designs. 




JANE SUPERBRAIN 12.1 

Naming ANOVAs © 

ANOVAs can be quite confusing because there appear 
to be lots of them. When you read research articles 
you’ll quite often come across phrases like ‘a two-way 
independent ANOVA was conducted’, or ‘a three-way 
repeated-measures ANOVA was conducted'. These 
names may look confusing but they are quite easy if 
you break them down. All ANOVAs have two things in 
common: they involve some quantity of independent 
variables, and these variables can be measured using 
either the same or different participants. If the same participants 
are used we typically use the term repeated 
measures, and if different participants are used we 
use the term independent. When there are two or more 
independent variables, it’s possible that some variables 
use the same participants whereas others use different 
participants. In this case we use the term mixed. When 
we name an ANOVA, we are simply telling the reader how 
many independent variables we used and how they were 
measured. In general terms we could write the name of 
an ANOVA as: 

• (number of independent variables)-way (how these 
variables were measured) ANOVA. 

By remembering this you can understand the name of 
any ANOVA you come across. Look at these examples 
and try to work out how many variables were used and 
how they were measured: 

• one-way independent ANOVA; 

• two-way repeated-measures ANOVA; 

• two-way mixed ANOVA; 

• three-way independent ANOVA. 

The answers you should get are: 

• one independent variable measured using different 
participants; 

• two independent variables both measured using the 
same participants; 

• two independent variables: one measured using different
participants and the other measured using the
same participants;

• three independent variables all of which are measured 
using different participants. 
12.3. Factorial ANOVA as regression


12.3.1. An example with two independent variables


Throughout this chapter we’ll use an example that has two independent variables. This is 
known as a two-way ANOVA (see Jane Superbrain Box 12.1). I’ll look at an example with 
two independent variables because this is the simplest extension of the ANOVAs that we 
have already encountered. 

An anthropologist was interested in the effects of alcohol on mate selection at nightclubs.
Her rationale was that after alcohol had been consumed, subjective perceptions
of physical attractiveness would become more inaccurate (the well-known beer-goggles
effect). She was also interested in whether this effect was different for men and women.
She picked 48 students: 24 male and 24 female. She then took groups of eight participants
to a nightclub and gave them no alcohol (participants received placebo drinks of
alcohol-free lager), 2 pints of strong lager, or 4 pints of strong lager. At the end of the evening she
took a photograph of the person that the participant was chatting up. She then got a pool
of independent judges to assess the attractiveness of the person in each photograph (out of
100). The data are in Table 12.1 and goggles.csv.


Table 12.1 Data for the beer-goggles effect

Alcohol     None               2 Pints            4 Pints
Gender      Female   Male      Female   Male      Female   Male
            65       50        70       45        55       30
            70       55        65       60        65       30
            60       80        60       85        70       30
            60       65        70       65        55       55
            60       70        65       70        55       35
            55       75        60       70        60       20
            60       75        60       80        50       45
            55       65        50       60        50       40
Total       485      535       500      535       460      285
Mean        60.625   66.875    62.50    66.875    57.50    35.625
Variance    24.55    106.70    42.86    156.70    50.00    117.41
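If you want to check the summary rows of Table 12.1 for yourself, here is a minimal sketch. It assumes you type the scores in by hand rather than loading goggles.csv, and the object names (cellTotals, cellMeans, cellVars) are my own inventions for illustration.

```r
# Scores from Table 12.1, entered one cell (eight people) at a time.
attractiveness <- c(
  65, 70, 60, 60, 60, 55, 60, 55,   # Female, None
  70, 65, 60, 70, 65, 60, 60, 50,   # Female, 2 Pints
  55, 65, 70, 55, 55, 60, 50, 50,   # Female, 4 Pints
  50, 55, 80, 65, 70, 75, 75, 65,   # Male, None
  45, 60, 85, 65, 70, 70, 80, 60,   # Male, 2 Pints
  30, 30, 30, 55, 35, 20, 45, 40    # Male, 4 Pints
)
# gl() generates factor levels in regular blocks: 24 per gender, 8 per dose.
gender  <- gl(2, 24, labels = c("Female", "Male"))
alcohol <- gl(3, 8, 48, labels = c("None", "2 Pints", "4 Pints"))

# Gender-by-alcohol tables of totals, means and variances.
cellTotals <- tapply(attractiveness, list(gender, alcohol), sum)
cellMeans  <- tapply(attractiveness, list(gender, alcohol), mean)
cellVars   <- tapply(attractiveness, list(gender, alcohol), var)

print(cellTotals)
print(round(cellMeans, 3))
print(round(cellVars, 2))
```

Each printed table has gender in the rows and alcohol in the columns, so it is laid out slightly differently from Table 12.1 but should contain the same numbers.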



12.3.2. Extending the regression model


We saw in section 10.2.3 that one-way ANOVA could be conceptualized as a regression 
equation (a general linear model). In this section we’ll consider how we extend this linear 
model to incorporate two independent variables. To keep things as simple as possible I 
want you to imagine that we have only two levels of the alcohol variable in our example 
(none and 4 pints). As such, we have two predictor variables, each with two levels. All of
the general linear models we’ve considered in this book take the general form of:

$$\text{outcome}_i = (\text{model}) + \text{error}_i$$

For example, when we encountered multiple regression in Chapter 7 we saw that this
model was written as (see equation (7.9)):

$$Y_i = (b_0 + b_1 X_{1i} + b_2 X_{2i} + \dots + b_n X_{ni}) + \varepsilon_i$$

Also, when we came across one-way ANOVA, we adapted this regression model to conceptualize
our Viagra example, as (see equation (10.2)):

$$\text{libido}_i = (b_0 + b_2\,\text{high}_i + b_1\,\text{low}_i) + \varepsilon_i$$

In this model, the high and low variables were dummy variables (i.e., variables that can
take only values of 0 or 1). In our current example, we have two variables: gender (male or
female) and alcohol (none and 4 pints). We can code each of these with zeros and ones; for
example, we could code gender as male = 0, female = 1, and we could code the alcohol
variable as 0 = none, 1 = 4 pints. We could then directly copy the model we had in
one-way ANOVA:

$$\text{attractiveness}_i = (b_0 + b_1\,\text{gender}_i + b_2\,\text{alcohol}_i) + \varepsilon_i$$

However, this model does not consider the interaction between gender and alcohol. If we
want to include this term too, then the model simply extends to become (first expressed
generally and then in terms of this specific example):

$$\text{attractiveness}_i = (b_0 + b_1 A_i + b_2 B_i + b_3 AB_i) + \varepsilon_i$$

$$\text{attractiveness}_i = (b_0 + b_1\,\text{gender}_i + b_2\,\text{alcohol}_i + b_3\,\text{interaction}_i) + \varepsilon_i \tag{12.1}$$

The question is: how do we code the interaction term? The interaction term represents the 
combined effect of alcohol and gender; to get any interaction term in regression you simply 
multiply the variables involved. This is why you see interaction terms written as gender ×
alcohol, because in regression terms the interaction variable literally is the two variables
multiplied by each other. Table 12.2 shows the resulting variables for the regression (note 
that the interaction variable is simply the value of the gender dummy variable multiplied 
by the value of the alcohol dummy variable). So, for example, a male receiving 4 pints of 
alcohol would have a value of 0 for the gender variable, 1 for the alcohol variable and 0 
for the interaction variable. The group means for the various combinations of gender and 
alcohol are also included because they’ll come in useful in due course. 


Table 12.2 Coding scheme for factorial ANOVA

Gender   Alcohol   Dummy (Gender)   Dummy (Alcohol)   Interaction   Mean
Male     None      0                0                 0             66.875
Male     4 Pints   0                1                 0             35.625
Female   None      1                0                 0             60.625
Female   4 Pints   1                1                 1             57.500
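The coding scheme in Table 12.2 can be sketched in a few lines of R. This is only an illustration: the dataframe name coding and the column names genderDummy and alcoholDummy are assumptions of mine, not names used in the data files.

```r
# The four gender-by-alcohol combinations from Table 12.2.
coding <- data.frame(
  gender  = c("Male", "Male", "Female", "Female"),
  alcohol = c("None", "4 Pints", "None", "4 Pints")
)

# Dummy variables: male = 0, female = 1; none = 0, 4 pints = 1.
coding$genderDummy  <- ifelse(coding$gender == "Female", 1, 0)
coding$alcoholDummy <- ifelse(coding$alcohol == "4 Pints", 1, 0)

# The interaction is literally the product of the two dummy variables.
coding$interaction <- coding$genderDummy * coding$alcoholDummy

print(coding)
```

Only the row in which both dummy variables are 1 (females who had 4 pints) ends up with an interaction value of 1.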
To work out what the b-values represent in this model we can do the same as we did for
the t-test and one-way ANOVA; that is, look at what happens when we insert values of
our predictors (gender and alcohol). To begin with, let’s see what happens when we look
at men who had no alcohol. In this case, the value of gender is 0, the value of alcohol is 0
and the value of interaction is also 0. The outcome we predict (as with one-way ANOVA)
is the mean of this group (66.875), so our model becomes:

$$
\begin{aligned}
\text{attractiveness}_i &= (b_0 + b_1\,\text{gender}_i + b_2\,\text{alcohol}_i + b_3\,\text{interaction}_i) + \varepsilon_i \\
\bar{X}_{\text{Men,None}} &= b_0 + (b_1 \times 0) + (b_2 \times 0) + (b_3 \times 0) \\
b_0 &= \bar{X}_{\text{Men,None}} \\
b_0 &= 66.875
\end{aligned}
$$

So, the constant $b_0$ in the model represents the mean of the group for which all variables
are coded as 0. As such it’s the mean value of the base category (in this case men who had
no alcohol).

Now, let’s see what happens when we look at females who had no alcohol. In this case,
the gender variable is 1 and the alcohol and interaction variables are still 0. Also remember
that $b_0$ is the mean of the men who had no alcohol. The outcome is the mean for women
who had no alcohol. Therefore, the equation becomes:

$$
\begin{aligned}
\bar{X}_{\text{Women,None}} &= b_0 + (b_1 \times 1) + (b_2 \times 0) + (b_3 \times 0) \\
\bar{X}_{\text{Women,None}} &= b_0 + b_1 \\
\bar{X}_{\text{Women,None}} &= \bar{X}_{\text{Men,None}} + b_1 \\
b_1 &= \bar{X}_{\text{Women,None}} - \bar{X}_{\text{Men,None}} \\
b_1 &= 60.625 - 66.875 \\
b_1 &= -6.25
\end{aligned}
$$

So, $b_1$ in the model represents the difference between men and women who had no alcohol.
More generally we can say it’s the effect of gender for the base category of alcohol (the base
category being the one coded with 0, in this case no alcohol).

Now let’s look at males who had 4 pints of alcohol. In this case, the gender variable is 0,
the alcohol variable is 1 and the interaction variable is still 0. We can also replace $b_0$ with
the mean of the men who had no alcohol. The outcome is the mean for men who had 4
pints. Therefore, the equation becomes:

$$
\begin{aligned}
\bar{X}_{\text{Men,4 Pints}} &= b_0 + (b_1 \times 0) + (b_2 \times 1) + (b_3 \times 0) \\
\bar{X}_{\text{Men,4 Pints}} &= b_0 + b_2 \\
\bar{X}_{\text{Men,4 Pints}} &= \bar{X}_{\text{Men,None}} + b_2 \\
b_2 &= \bar{X}_{\text{Men,4 Pints}} - \bar{X}_{\text{Men,None}} \\
b_2 &= 35.625 - 66.875 \\
b_2 &= -31.25
\end{aligned}
$$

So, $b_2$ in the model represents the difference between having no alcohol and 4 pints in
men. Put more generally, it’s the effect of alcohol in the base category of gender (i.e., the
category of gender that was coded with a 0, in this case men).
Finally, we can look at females who had 4 pints of alcohol. In this case, gender is 1, alcohol is
1 and interaction is also 1. We can also replace $b_0$, $b_1$ and $b_2$ with what we now know they
represent. The outcome is the mean for women who had 4 pints. Therefore, the equation becomes:

$$
\begin{aligned}
\bar{X}_{\text{Women,4 Pints}} &= b_0 + (b_1 \times 1) + (b_2 \times 1) + (b_3 \times 1) \\
\bar{X}_{\text{Women,4 Pints}} &= b_0 + b_1 + b_2 + b_3 \\
\bar{X}_{\text{Women,4 Pints}} &= \bar{X}_{\text{Men,None}} + (\bar{X}_{\text{Women,None}} - \bar{X}_{\text{Men,None}}) + (\bar{X}_{\text{Men,4 Pints}} - \bar{X}_{\text{Men,None}}) + b_3 \\
\bar{X}_{\text{Women,4 Pints}} &= \bar{X}_{\text{Women,None}} + \bar{X}_{\text{Men,4 Pints}} - \bar{X}_{\text{Men,None}} + b_3 \\
b_3 &= \bar{X}_{\text{Men,None}} - \bar{X}_{\text{Women,None}} + \bar{X}_{\text{Women,4 Pints}} - \bar{X}_{\text{Men,4 Pints}} \\
b_3 &= 66.875 - 60.625 + 57.500 - 35.625 \\
b_3 &= 28.125
\end{aligned}
$$


So, $b_3$ in the model really compares the difference between men and women in the no-alcohol
condition to the difference between men and women in the 4 pints condition. Put another
way, it compares the effect of gender after no alcohol to the effect of gender after 4 pints.[1]
If you think about it in terms of an interaction graph, this makes perfect sense. For example,
the top left-hand side of Figure 12.2 shows the interaction graph for these data. Now imagine
we calculated the difference between men and women for the no-alcohol groups. This would
be the difference between the lines on the graph for the no-alcohol group (the difference
between group means, which is 6.25). If we then do the same for the 4 pints group, we find
that the difference between men and women is −21.875. If we plotted these two values as
a new graph we’d get a line connecting 6.25 to −21.875 (see the bottom left-hand side of
Figure 12.2). This reflects the difference between the effect of gender after no alcohol compared
to after 4 pints. We know that beta values represent gradients of lines, and in fact $b_3$ in
our model is the gradient of this line (this is 6.25 − (−21.875) = 28.125).

FIGURE 12.2 Breaking down what an interaction represents (panels: Interaction and No
Interaction; x-axis: Alcohol Consumption, None and 4 Pints; lines: Gender, Male and Female)

[1] In fact, if you rearrange the terms in the equation you’ll see that you can also phrase the interaction the opposite
way around: it represents the effect of alcohol in men compared to women.

Let’s also see what happens if there isn’t an interaction effect: the right-hand side of Figure
12.2 shows the same data except that the mean for the females who had 4 pints has been
changed to 30. If we calculate the difference between men and women after no alcohol we
get the same as before: 6.25. If we calculate the difference between men and women after 4
pints we now get 5.625. If we again plot these differences on a new graph, we find a virtually
horizontal line. So, when there’s no interaction, the line connecting the effect of gender after no
alcohol and after 4 pints is flat and the resulting $b_3$ in our model would be close to 0 (remember
that a zero gradient means a flat line). In fact its actual value would be 6.25 − 5.625 = 0.625.
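The arithmetic of the last few pages can be retraced from nothing more than the four group means in Table 12.2. The sketch below is only a check on the hand calculations; the helper function bValues() and its argument names are hypothetical, made up for this illustration.

```r
# Work out the four b-values from the four group means, following the
# substitutions in the text (men/no alcohol is the base category).
bValues <- function(menNone, men4, womenNone, women4) {
  b0 <- menNone                      # mean of the base category
  b1 <- womenNone - menNone          # effect of gender after no alcohol
  b2 <- men4 - menNone               # effect of alcohol in men
  b3 <- women4 - (b0 + b1 + b2)      # what is left over: the interaction
  c(b0 = b0, b1 = b1, b2 = b2, b3 = b3)
}

# The real data: an interaction is present.
withInteraction <- bValues(menNone = 66.875, men4 = 35.625,
                           womenNone = 60.625, women4 = 57.500)
print(withInteraction)

# The altered data in which the mean for females after 4 pints is 30.
noInteraction <- bValues(menNone = 66.875, men4 = 35.625,
                         womenNone = 60.625, women4 = 30)
print(noInteraction)
```

In the first call b3 should come out as 28.125, and in the second as 0.625, matching the gradients of the two lines described above.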



SELF-TEST

The file GogglesRegression.dat contains the
dummy variables used in this example. Just to prove
that all of this works, use this file and run a multiple
regression on the data.


The resulting table of coefficients is in Output 12.1. The important thing to note is 
that the beta value for the interaction (28.125) is the same as we’ve just calculated, which 
should hopefully convince you that factorial ANOVA is - as is everything, it would seem - 
just regression dressed up in a different costume. 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   66.875      3.055  21.890  < 2e-16
gender        -6.250      4.320  -1.447    0.159
alcohol      -31.250      4.320  -7.233 7.13e-08
interaction   28.125      6.110   4.603 8.20e-05

Output 12.1

What I hope to have shown you in this example is how even complex ANOVAs are just 
forms of regression (a general linear model). You’ll be pleased to know (I’ll be pleased to 
know for that matter) that this is the last I’m going to say about ANOVA as a general linear 
model. I hope I’ve given you enough background so that you get a sense of the fact that we 
can just keep adding independent variables into our model. All that happens is these new 
variables just get added into a multiple regression equation with an associated beta value 
(just like the regression chapter). Interaction terms can also be added simply by multiplying 
the variables that interact. These interaction terms will also have an associated beta value. 
So, any ANOVA (no matter how complex) is just a form of multiple regression. 
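If you don’t have GogglesRegression.dat to hand, the following sketch rebuilds the dummy variables from the scores in Table 12.1 (the none and 4 pints groups only) and fits the regression with lm(). The dataframe name gogglesReg is an assumption of mine, and I have typed the data in rather than reading the file, so treat this as an approximation of the self-test rather than the exact commands behind Output 12.1.

```r
# Attractiveness scores for the four groups used in the regression example.
attractiveness <- c(
  50, 55, 80, 65, 70, 75, 75, 65,   # Male, None
  30, 30, 30, 55, 35, 20, 45, 40,   # Male, 4 Pints
  65, 70, 60, 60, 60, 55, 60, 55,   # Female, None
  55, 65, 70, 55, 55, 60, 50, 50    # Female, 4 Pints
)
gogglesReg <- data.frame(
  attractiveness = attractiveness,
  gender  = rep(c(0, 1), each = 16),        # male = 0, female = 1
  alcohol = rep(c(0, 1, 0, 1), each = 8)    # none = 0, 4 pints = 1
)
# The interaction variable is the product of the two dummy variables.
gogglesReg$interaction <- gogglesReg$gender * gogglesReg$alcohol

# Fit the model in equation (12.1) as an ordinary multiple regression.
regModel <- lm(attractiveness ~ gender + alcohol + interaction,
               data = gogglesReg)
coefTable <- summary(regModel)$coefficients
print(round(coefTable, 3))
```

The Estimate column should reproduce the b-values worked out by hand (66.875, −6.25, −31.25 and 28.125).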


12.4. Two-way ANOVA: behind the scenes


Now that we have a good conceptual understanding of factorial ANOVA as an extension
of the basic idea of a linear model, we will turn our attention to some of the specific
calculations that go on behind the scenes. The reason for doing this is that it should help
you to understand what the output of the analysis means.

Two-way ANOVA is conceptually very similar to one-way ANOVA. Basically, we still
find the total sum of squared errors ($SS_T$) and break this variance down into variance that
can be explained by the experiment ($SS_M$) and variance that cannot be explained ($SS_R$).
However, in two-way ANOVA, the variance explained by the experiment is made up of not
one experimental manipulation but two. Therefore, we break the model sum of squares
down into variance explained by the first independent variable ($SS_A$), variance explained
by the second independent variable ($SS_B$) and variance explained by the interaction of these
two variables ($SS_{A \times B}$) - see Figure 12.3.


FIGURE 12.3 Breaking down the variance in two-way ANOVA: $SS_T$ (total variability) divides
into $SS_M$ (variance explained by the experiment) and $SS_R$ (unexplained variability); $SS_M$
divides in turn into $SS_A$ (variance explained by variable A), $SS_B$ (variance explained by
variable B) and $SS_{A \times B}$ (variance explained by the interaction of A and B)


12.4.1. Total sums of squares ($SS_T$)


We start off in the same way as we did for a one-way ANOVA. That is, we calculate how
much variability there is between scores when we ignore the experimental condition from
which they came. Remember from one-way ANOVA (equation (10.4)) that $SS_T$ is calculated
using the following equation:

$$SS_T = \sum_{i=1}^{N} (x_i - \bar{x}_{\text{grand}})^2 = s^2_{\text{grand}}(N - 1)$$
The grand variance is simply the variance of all scores when we ignore the group to which
they belong. So if we treated the data as one big group it would look as follows:

65   50   70   45   55   30
70   55   65   60   65   30
60   80   60   85   70   30
60   65   70   65   55   55
60   70   65   70   55   35
55   75   60   70   60   20
60   75   60   80   50   45
55   65   50   60   50   40

Grand mean = 58.33


If we calculate the variance of all of these scores, we get 190.78 (try this on your calculator
if you don’t trust me). We used 48 scores to generate this value, and so N is 48. As such
the equation becomes:

$$
\begin{aligned}
SS_T &= s^2_{\text{grand}}(N - 1) \\
&= 190.78(48 - 1) \\
&= 8966.66
\end{aligned}
$$

The degrees of freedom for this SS will be N − 1, or 47.
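If you would rather trust R than your calculator, here is a small sketch of the same calculation. The scores are typed in from Table 12.1, and the object names (grandVar, SST and so on) are my own.

```r
# All 48 attractiveness scores, ignoring group membership.
attractiveness <- c(
  65, 70, 60, 60, 60, 55, 60, 55,   70, 65, 60, 70, 65, 60, 60, 50,
  55, 65, 70, 55, 55, 60, 50, 50,   50, 55, 80, 65, 70, 75, 75, 65,
  45, 60, 85, 65, 70, 70, 80, 60,   30, 30, 30, 55, 35, 20, 45, 40
)

N         <- length(attractiveness)   # 48 scores
grandMean <- mean(attractiveness)     # about 58.33
grandVar  <- var(attractiveness)      # about 190.78

# Total sum of squares, calculated two equivalent ways.
SST       <- grandVar * (N - 1)
SSTdirect <- sum((attractiveness - grandMean)^2)
dfT       <- N - 1

print(round(c(grandMean = grandMean, grandVar = grandVar,
              SST = SST, dfT = dfT), 2))
```

Because R does not round the variance before multiplying, its answer (about 8966.67) may differ from the hand calculation in the second decimal place.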


12.4.2. The model sum of squares ($SS_M$)


The next step is to work out the model sum of squares. As I suggested earlier, this sum of
squares is then further broken into three components: variance explained by the first independent
variable ($SS_A$), variance explained by the second independent variable ($SS_B$) and
variance explained by the interaction of these two variables ($SS_{A \times B}$).

Before we break down the model sum of squares into its component parts, we must first
calculate its value. We know we have 8966.66 units of variance to be explained, and our
first step is to calculate how much of that variance is explained by our experimental manipulations
overall (ignoring which of the two independent variables is responsible). When we
did one-way ANOVA we worked out the model sum of squares by looking at the difference
between each group mean and the overall mean (see section 10.2.6). We can do the same
here. We effectively have six experimental groups if we combine all levels of the two independent
variables (three doses for the male participants and three doses for the females).
So, given that we have six groups of different people we can then apply the equation for the
model sum of squares that we used for one-way ANOVA (equation (10.5)):

$$SS_M = \sum_{g=1}^{k} n_g(\bar{x}_g - \bar{x}_{\text{grand}})^2$$
The grand mean is the mean of all scores (we calculated this above as 58.33) and n is the
number of scores in each group (i.e., the number of participants in each of the six experimental
groups; eight in this case). Therefore, the equation becomes:

$$
\begin{aligned}
SS_M &= 8(60.625 - 58.33)^2 + 8(66.875 - 58.33)^2 + 8(62.5 - 58.33)^2 \\
&\quad + 8(66.875 - 58.33)^2 + 8(57.5 - 58.33)^2 + 8(35.625 - 58.33)^2 \\
&= 8(2.295)^2 + 8(8.545)^2 + 8(4.17)^2 + 8(8.545)^2 + 8(-0.83)^2 + 8(-22.705)^2 \\
&= 42.1362 + 584.1362 + 139.1112 + 584.1362 + 5.5112 + 4124.1362 \\
&= 5479.167
\end{aligned}
$$

The degrees of freedom for this SS will be the number of groups used, k, minus 1. We used
six groups and so df = 5.

At this stage we know that the model (our experimental manipulations) can explain
5479.167 units of variance out of the total of 8966.66 units. The next stage is to further
break down this model sum of squares to see how much variance is explained by our independent
variables separately.
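As a rough check on this calculation, the sketch below starts from the six group totals in Table 12.1 (which is all we need to get the group means). The matrix name cellTotals and the other object names are assumptions for illustration.

```r
# Group totals from Table 12.1: rows are gender, columns are alcohol.
cellTotals <- matrix(c(485, 500, 460,
                       535, 535, 285),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("Female", "Male"),
                                     c("None", "2 Pints", "4 Pints")))
n <- 8                                   # scores per group

cellMeans <- cellTotals / n              # the six group means
grandMean <- sum(cellTotals) / (n * length(cellTotals))

# Model sum of squares: n times each squared deviation from the grand mean.
SSM <- sum(n * (cellMeans - grandMean)^2)
dfM <- length(cellTotals) - 1            # six groups minus 1

print(round(c(grandMean = grandMean, SSM = SSM, dfM = dfM), 3))
```

The result should agree with the hand calculation (5479.167) to three decimal places.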


12.4.2.1. The main effect of gender ($SS_A$)

To work out the variance accounted for by the first independent variable (in this case,
gender) we need to group the scores in the data set according to the gender to which they
belong. So, basically we ignore the amount of drink that has been drunk, and we just place
all of the male scores into one group and all of the female scores into another. So, the data
will look like this (note that the first box contains the three female columns from our original
table and the second box contains the male columns):

A_1: Female

65   70   55
70   65   65
60   60   70
60   70   55
60   65   55
55   60   60
60   60   50
55   50   50

Mean Female = 60.21

A_2: Male

50   45   30
55   60   30
80   85   30
65   65   55
70   70   35
75   70   20
75   80   45
65   60   40

Mean Male = 56.46

We can then apply the equation for the model sum of squares that we used to calculate the
overall model sum of squares:

$$SS_A = \sum_{g=1}^{k} n_g(\bar{x}_g - \bar{x}_{\text{grand}})^2$$
The grand mean is the mean of all scores (above) and n is the number of scores in each group
(i.e., the number of males and females; 24 in this case). Therefore, the equation becomes:

$$
\begin{aligned}
SS_{\text{gender}} &= 24(60.21 - 58.33)^2 + 24(56.46 - 58.33)^2 \\
&= 24(1.88)^2 + 24(-1.87)^2 \\
&= 84.8256 + 83.9256 \\
&= 168.75
\end{aligned}
$$

The degrees of freedom for this SS will be the number of groups used, k, minus 1. We 
used two groups (males and females) and so df = 1. To sum up, the main effect of gender 
compares the mean of all males against the mean of all females (regardless of which alcohol 
group they were in). 
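Here is a minimal sketch of the same calculation, again starting from the group totals in Table 12.1. The object names are mine, chosen only for this illustration.

```r
# Group totals from Table 12.1: rows are gender, columns are alcohol.
cellTotals <- matrix(c(485, 500, 460,
                       535, 535, 285),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("Female", "Male"),
                                     c("None", "2 Pints", "4 Pints")))
nPerGender <- 24                          # 3 alcohol groups of 8 people

# Collapse across alcohol: one mean for females and one for males.
genderMeans <- rowSums(cellTotals) / nPerGender
grandMean   <- sum(cellTotals) / 48

# Sum of squares for the main effect of gender.
SSA <- sum(nPerGender * (genderMeans - grandMean)^2)
dfA <- length(genderMeans) - 1

print(round(genderMeans, 2))
print(c(SSA = SSA, dfA = dfA))
```

The two means should print as 60.21 and 56.46, and the sum of squares as 168.75.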

12.4.2.2. The main effect of alcohol ($SS_B$)

To work out the variance accounted for by the second independent variable (in this case, 
alcohol) we need to group the scores in the data set according to how much alcohol was 
consumed. So, basically we ignore the gender of the participant, and we just place all of the 
scores after no drinks in one group, the scores after 2 pints in another group and the scores 
after 4 pints in a third group. So, the data will look like this: 


B_1: None

65   50
70   55
60   80
60   65
60   70
55   75
60   75
55   65

Mean None = 63.75

B_2: 2 Pints

70   45
65   60
60   85
70   65
65   70
60   70
60   80
50   60

Mean 2 pints = 64.6875

B_3: 4 Pints

55   30
65   30
70   30
55   55
55   35
60   20
50   45
50   40

Mean 4 pints = 46.5625


We can then apply the same equation for the model sum of squares that we used for the
overall model sum of squares and for the main effect of gender:

$$SS_B = \sum_{g=1}^{k} n_g(\bar{x}_g - \bar{x}_{\text{grand}})^2$$

The grand mean is the mean of all scores (58.33 as before) and n is the number of scores
in each group (i.e., the number of scores in each of the boxes above, in this case 16).
Therefore, the equation becomes:

$$
\begin{aligned}
SS_{\text{alcohol}} &= 16(63.75 - 58.33)^2 + 16(64.6875 - 58.33)^2 + 16(46.5625 - 58.33)^2 \\
&= 16(5.42)^2 + 16(6.3575)^2 + 16(-11.7675)^2 \\
&= 470.0224 + 646.6849 + 2215.5849 \\
&= 3332.292
\end{aligned}
$$

The degrees of freedom for this SS will be the number of groups used, k, minus 1 (see section
10.2.6). We used three groups and so df = 2. To sum up, the main effect of alcohol
compares the means of the no-alcohol, 2-pints and 4-pints groups (regardless of whether
the scores come from men or women).
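The same check works for alcohol if we collapse the table of group totals across gender instead. As before, this is only a sketch and the object names are assumptions.

```r
# Group totals from Table 12.1: rows are gender, columns are alcohol.
cellTotals <- matrix(c(485, 500, 460,
                       535, 535, 285),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("Female", "Male"),
                                     c("None", "2 Pints", "4 Pints")))
nPerDose <- 16                            # 8 females plus 8 males per dose

# Collapse across gender: one mean for each amount of alcohol.
alcoholMeans <- colSums(cellTotals) / nPerDose
grandMean    <- sum(cellTotals) / 48

# Sum of squares for the main effect of alcohol.
SSB <- sum(nPerDose * (alcoholMeans - grandMean)^2)
dfB <- length(alcoholMeans) - 1

print(alcoholMeans)
print(round(c(SSB = SSB, dfB = dfB), 3))
```

The three means should match the boxes above (63.75, 64.6875 and 46.5625).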

12.4.2.3. The interaction effect ($SS_{A \times B}$)

The final stage is to calculate how much variance is explained by the interaction of the two
variables. The simplest way to do this is to remember that the $SS_M$ is made up of three components
($SS_A$, $SS_B$ and $SS_{A \times B}$). Therefore, given that we know $SS_A$ and $SS_B$ we can calculate
the interaction term using subtraction:

$$SS_{A \times B} = SS_M - SS_A - SS_B$$

Therefore, for these data, the value is:

$$
\begin{aligned}
SS_{A \times B} &= SS_M - SS_A - SS_B \\
&= 5479.167 - 168.75 - 3332.292 \\
&= 1978.125
\end{aligned}
$$

The degrees of freedom can be calculated in the same way, but are also the product of the
degrees of freedom for the main effects (either method works):

$$
\begin{aligned}
df_{A \times B} &= df_M - df_A - df_B & df_{A \times B} &= df_A \times df_B \\
&= 5 - 1 - 2 & &= 1 \times 2 \\
&= 2 & &= 2
\end{aligned}
$$
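The subtraction can be sketched in R, and for a design like this one with equal group sizes we can also get the interaction sum of squares directly from the group means, as a cross-check. The direct formula (each group mean minus its gender mean, minus its alcohol mean, plus the grand mean) is a standard result for balanced designs rather than something derived in this chapter, and the object names are mine.

```r
# Group totals from Table 12.1: rows are gender, columns are alcohol.
cellTotals <- matrix(c(485, 500, 460,
                       535, 535, 285),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("Female", "Male"),
                                     c("None", "2 Pints", "4 Pints")))
n <- 8
cellMeans    <- cellTotals / n
genderMeans  <- rowMeans(cellMeans)       # equal n, so means of means are fine
alcoholMeans <- colMeans(cellMeans)
grandMean    <- mean(cellMeans)

SSM <- sum(n * (cellMeans - grandMean)^2)
SSA <- sum(3 * n * (genderMeans - grandMean)^2)
SSB <- sum(2 * n * (alcoholMeans - grandMean)^2)

# Method 1: subtraction, as in the text.
SSAxB <- SSM - SSA - SSB

# Method 2: direct calculation. Remove the gender and alcohol means
# from each group mean and add the grand mean back in.
interactionDev <- sweep(sweep(cellMeans, 1, genderMeans), 2, alcoholMeans) +
  grandMean
SSAxBdirect <- n * sum(interactionDev^2)

print(c(bySubtraction = SSAxB, direct = SSAxBdirect))
```

Both methods should give 1978.125 for these data.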


12.4.3. The residual sum of squares ($SS_R$)


The residual sum of squares is calculated in the same way as for one-way ANOVA (see section
10.2.7) and again represents individual differences in performance or the variance that can’t
be explained by factors that were systematically manipulated. We saw in one-way ANOVA
that the value is calculated by taking the squared error between each data point and its corresponding
group mean. An alternative way to express this was as (see equation (10.7)):

$$
\begin{aligned}
SS_R &= \sum_{g=1}^{k} s^2_g(n_g - 1) \\
&= s^2_{\text{group 1}}(n_1 - 1) + s^2_{\text{group 2}}(n_2 - 1) + s^2_{\text{group 3}}(n_3 - 1) + \dots + s^2_{\text{group }k}(n_k - 1)
\end{aligned}
$$

So, we use the individual variances of each group and multiply them by one less than
the number of people within the group (n). We have the individual group variances in our
original table of data (Table 12.1) and there were eight people in each group (therefore,
n = 8) and so the equation becomes:

$$
\begin{aligned}
SS_R &= s^2_{\text{group 1}}(n_1 - 1) + s^2_{\text{group 2}}(n_2 - 1) + s^2_{\text{group 3}}(n_3 - 1) + s^2_{\text{group 4}}(n_4 - 1) \\
&\quad + s^2_{\text{group 5}}(n_5 - 1) + s^2_{\text{group 6}}(n_6 - 1) \\
&= 24.55(8 - 1) + 106.7(8 - 1) + 42.86(8 - 1) + 156.7(8 - 1) + 50(8 - 1) + 117.41(8 - 1) \\
&= (24.55 \times 7) + (106.7 \times 7) + (42.86 \times 7) + (156.7 \times 7) + (50 \times 7) + (117.41 \times 7) \\
&= 171.85 + 746.9 + 300 + 1096.9 + 350 + 821.87 \\
&= 3487.52
\end{aligned}
$$
The degrees of freedom for each group will be one less than the number of scores per group (i.e.,
7). Therefore, if we add the degrees of freedom for each group, we get a total of 6 × 7 = 42.
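To see where this value comes from without rounding the group variances first, here is a sketch that works from the raw scores in Table 12.1. The object names are assumptions, and because R keeps all the decimal places its answer (3487.5) differs very slightly from the hand calculation.

```r
# Scores from Table 12.1, entered one group (eight people) at a time.
attractiveness <- c(
  65, 70, 60, 60, 60, 55, 60, 55,   # Female, None
  70, 65, 60, 70, 65, 60, 60, 50,   # Female, 2 Pints
  55, 65, 70, 55, 55, 60, 50, 50,   # Female, 4 Pints
  50, 55, 80, 65, 70, 75, 75, 65,   # Male, None
  45, 60, 85, 65, 70, 70, 80, 60,   # Male, 2 Pints
  30, 30, 30, 55, 35, 20, 45, 40    # Male, 4 Pints
)
gender  <- gl(2, 24, labels = c("Female", "Male"))
alcohol <- gl(3, 8, 48, labels = c("None", "2 Pints", "4 Pints"))

# Variance and size of each of the six groups.
cellVars <- tapply(attractiveness, list(gender, alcohol), var)
cellN    <- tapply(attractiveness, list(gender, alcohol), length)

# Residual sum of squares: each variance times (n - 1), then summed.
SSR <- sum(cellVars * (cellN - 1))
dfR <- sum(cellN - 1)

print(round(cellVars, 2))
print(c(SSR = SSR, dfR = dfR))
```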


12.4.4. The F-ratios


Each effect in a two-way ANOVA (the two main effects and the interaction) has its own
F-ratio. To calculate these we have to first calculate the mean squares for each effect by
taking the sum of squares and dividing by the respective degrees of freedom (think back to
section 10.2.8). We also need the mean squares for the residual term. So, for this example
we’d have four mean squares calculated as follows:

$$
\begin{aligned}
MS_A &= \frac{SS_A}{df_A} = \frac{168.75}{1} = 168.75 \\
MS_B &= \frac{SS_B}{df_B} = \frac{3332.292}{2} = 1666.146 \\
MS_{A \times B} &= \frac{SS_{A \times B}}{df_{A \times B}} = \frac{1978.125}{2} = 989.062 \\
MS_R &= \frac{SS_R}{df_R} = \frac{3487.52}{42} = 83.036
\end{aligned}
$$


The F-ratios for the two independent variables and their interactions are then calculated 
by dividing their mean squares by the residual mean squares. Again, if you think back to 
one-way ANOVA this is exactly the same process. 


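$$
\begin{aligned}
F_A &= \frac{MS_A}{MS_R} = \frac{168.75}{83.036} = 2.032 \\
F_B &= \frac{MS_B}{MS_R} = \frac{1666.146}{83.036} = 20.065 \\
F_{A \times B} &= \frac{MS_{A \times B}}{MS_R} = \frac{989.062}{83.036} = 11.911
\end{aligned}
$$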


Each of these F-ratios can be compared against critical values (based on their degrees of freedom,
which can be different for each effect) to tell us whether these effects are likely to reflect
data that have arisen by chance, or reflect an effect of our experimental manipulations (these
critical values can be found in the Appendix). If an observed F exceeds the corresponding critical
value then it is significant. R will calculate each of these F-ratios and their exact significance,
but what I hope to have shown you in this section is that two-way ANOVA is basically the same
as one-way ANOVA except that the model sum of squares is partitioned into three parts: the
effect of each of the independent variables and the effect of how these variables interact.
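To tie the whole section together, the sketch below goes from the raw scores to the three F-ratios by hand and then compares them with what the built-in aov() function reports. The data are typed in from Table 12.1 and all object names are assumptions; it is meant as a check on the logic of this section, not as the analysis we run later in the chapter.

```r
attractiveness <- c(
  65, 70, 60, 60, 60, 55, 60, 55,   70, 65, 60, 70, 65, 60, 60, 50,
  55, 65, 70, 55, 55, 60, 50, 50,   50, 55, 80, 65, 70, 75, 75, 65,
  45, 60, 85, 65, 70, 70, 80, 60,   30, 30, 30, 55, 35, 20, 45, 40
)
gender  <- gl(2, 24, labels = c("Female", "Male"))
alcohol <- gl(3, 8, 48, labels = c("None", "2 Pints", "4 Pints"))
grandMean <- mean(attractiveness)

# Sum of squares between the means of any grouping of the scores.
ssBetween <- function(groups) {
  groupMeans <- tapply(attractiveness, groups, mean)
  groupN     <- tapply(attractiveness, groups, length)
  sum(groupN * (groupMeans - grandMean)^2)
}
SSM   <- ssBetween(list(gender, alcohol))   # all six groups
SSA   <- ssBetween(gender)
SSB   <- ssBetween(alcohol)
SSAxB <- SSM - SSA - SSB
SSR   <- sum((attractiveness - grandMean)^2) - SSM

# Mean squares and F-ratios.
dfEffects <- c(gender = 1, alcohol = 2, interaction = 2)
dfR <- 42
MS  <- c(gender = SSA, alcohol = SSB, interaction = SSAxB) / dfEffects
MSR <- SSR / dfR
Fratios <- MS / MSR
pValues <- pf(Fratios, dfEffects, dfR, lower.tail = FALSE)
print(round(cbind(MS, F = Fratios, p = pValues), 4))

# The same F-ratios from R's built-in ANOVA function.
anovaTable <- summary(aov(attractiveness ~ gender * alcohol))[[1]]
Fbuiltin   <- anovaTable[["F value"]][1:3]
print(round(Fbuiltin, 3))
```

Because the groups are all the same size, the order in which the effects are entered does not matter here, and the hand-calculated F-ratios should match those from aov().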


12.5. Factorial ANOVA using R


12.5.1. Packages for factorial ANOVA in R


If you’re using commands (which we recommend), then you will need the packages car
(for Levene’s test), compute.es (for effect sizes), ggplot2 (for graphs), multcomp (for post
hoc tests), pastecs (for descriptive statistics), reshape (for reshaping the data) and WRS (for
robust tests). If you do not have these packages installed (some should be installed from
previous chapters), you can install them by executing the following commands:

install.packages("car")
install.packages("compute.es")
install.packages("ggplot2")
install.packages("multcomp")
install.packages("pastecs")
install.packages("reshape")
install.packages("WRS", repos="http://R-Forge.R-project.org")

You then need to load these packages by executing these commands:

library(car); library(compute.es); library(ggplot2); library(multcomp);
library(pastecs); library(reshape); library(WRS)
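If you are not sure which of these packages you already have, a short sketch like the one below can tell you before you install anything. The function name missingPackages() is hypothetical (I made it up for this illustration), and I have left WRS out of the list because it is installed from a different repository.

```r
# Packages from CRAN that this chapter uses.
neededPackages <- c("car", "compute.es", "ggplot2", "multcomp",
                    "pastecs", "reshape")

# Return the names in pkgs that are not in the list of installed packages.
missingPackages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

toInstall <- missingPackages(neededPackages)
if (length(toInstall) > 0) {
  message("Not yet installed: ", paste(toInstall, collapse = ", "))
  # Uncomment the next line to install whatever is missing:
  # install.packages(toInstall)
} else {
  message("All of the CRAN packages for this chapter are installed.")
}
```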


12.5.2. General procedure for factorial ANOVA


To conduct factorial ANOVA you should follow this general procedure: 

1 Enter data: you’ve probably gathered this much by now. 

2 Explore your data: as always, we’ll begin by graphing the data and computing descriptive
statistics. You should check distributional assumptions and use Levene’s test to
check for homogeneity of variance (see Chapter 5).

3 Construct or choose contrasts: you need to decide what contrasts to do and to specify 
them appropriately for all of the independent variables in your analysis. If you want 
to use Type III sums of squares, these contrasts must be orthogonal. 

4 Compute the ANOVA: you can then run the main analysis of variance. Depending on 
what you found in the previous step, you might need to run a robust version of the 
test. 

5 Compute contrasts or post hoc tests: having conducted the main ANOVA, you can 
follow it up with post hoc tests or look at the results of your contrasts. Again, the 
exact methods you choose will depend upon what you unearth in step 2. 

We will work through these steps in turn. 


12.5.3. Factorial ANOVA using R Commander


Running factorial ANOVA using commands gives you much more versatility than R 
Commander. However, you can do a basic factorial ANOVA using R Commander. First 
load the data from the file goggles.csv by using the Data=>Import data=>from text file, 
clipboard, or URL... menu (see section 3.7.3). Note that this file is a comma-separated 
(not a tab-delimited) file. This data set has three variables: gender, which is entered as text 
(‘Male’ and ‘Female’), alcohol, which is also entered as text (‘None’, ‘2 Pints’ and ‘4 Pints’), 
and attractiveness, which is the outcome variable. I have called the dataframe gogglesData. 
Note that because gender and alcohol contained text strings, rather than numbers, R has 
assumed that these variables are factors. 

We can explore the data by getting some descriptive statistics and testing the assumptions.
This is explained in Chapter 5. Levene’s test looks at whether variances across conditions
are equal. Use the Statistics=>Variances=>Levene’s test... menu to run the analysis
as in Chapters 5 and 10. You will need to run separate tests for alcohol and gender (as you
will see, by using commands we can also run the test for the interaction of these variables).

To do the ANOVA, use the Statistics=>Means=>Multi-way ANOVA... menu. The resulting
dialog box is fairly self-explanatory (Figure 12.4). You need to enter a name for the
model that you’re going to create (I have chosen gogglesModel) in the box labelled Enter
name for model, select any factors from the list labelled Factors (in this case we have two
factors, alcohol and gender) and select the outcome variable (in this case attractiveness)
from the list labelled Response Variable. You cannot do planned comparisons or post hoc
tests using this menu. Click on OK to run the analysis. The resulting output is described
in section 12.5.8. Note that R Commander uses Type II sums of squares when computing
a factorial ANOVA, which may or may not be what you want (see Jane Superbrain Box
11.1 in the previous chapter).




FIGURE 12.4 Factorial-way ANOVA using R Commander


12.5.4. Entering the data


The data for the example can be found in the file goggles.csv. You can load this data file by 
setting your working directory and executing: 

gogglesData<-read.csv("goggles.csv", header = TRUE) 



Note that we have used the read.csv() function because the data are stored in a
comma-separated values file (.csv). If we look at the data in R we will see that levels of the
between-group variables have been entered in single columns.



   gender alcohol attractiveness
1  Female    None             65
2  Female    None             70
3  Female    None             60
4  Female    None             60
5  Female    None             60
6  Female    None             55
7  Female    None             60
8  Female    None             55
9  Female 2 Pints             70
10 Female 2 Pints             65
11 Female 2 Pints             60
12 Female 2 Pints             70
13 Female 2 Pints             65
14 Female 2 Pints             60
15 Female 2 Pints             60
16 Female 2 Pints             50
17 Female 4 Pints             55
18 Female 4 Pints             65
19 Female 4 Pints             70
20 Female 4 Pints             55
21 Female 4 Pints             55
22 Female 4 Pints             60
23 Female 4 Pints             50
24 Female 4 Pints             50
25   Male    None             50
26   Male    None             55
27   Male    None             80
28   Male    None             65
29   Male    None             70
30   Male    None             75
31   Male    None             75
32   Male    None             65
33   Male 2 Pints             45
34   Male 2 Pints             60
35   Male 2 Pints             85
36   Male 2 Pints             65
37   Male 2 Pints             70
38   Male 2 Pints             70
39   Male 2 Pints             80
40   Male 2 Pints             60
41   Male 4 Pints             30
42   Male 4 Pints             30
43   Male 4 Pints             30
44   Male 4 Pints             55
45   Male 4 Pints             35
46   Male 4 Pints             20
47   Male 4 Pints             45
48   Male 4 Pints             40


These data were originally entered in Excel, and as you can see we need two different coding
variables to represent gender and alcohol consumption. Therefore, in Excel, I created
a variable called gender into which I typed ‘Female’ or ‘Male’; because I have used words
rather than numbers, when R imports the data it guesses that this variable is a factor (i.e.,
we don’t need to explicitly convert it to a factor like we would had I used numbers to represent
males and females). R will code this factor with the levels in alphabetical order (so,
females will be level 1 and males level 2 of gender, which coincidentally is the same order
as in the data file).
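
A word of caution about that guessing: it depends on your version of R. From R 4.0.0 onwards, read.csv() leaves text columns as plain character variables unless you ask for factors with stringsAsFactors = TRUE. The sketch below illustrates this with a tiny made-up file (three invented rows written to a temporary location, not the real goggles.csv); the names tmp and demoData are hypothetical.

```r
# A tiny stand-in for goggles.csv: three made-up rows, for illustration only
tmp <- tempfile(fileext = ".csv")
writeLines(c("gender,alcohol,attractiveness",
             "Female,None,65",
             "Male,2 Pints,45",
             "Male,4 Pints,30"), tmp)

# Ask explicitly for text columns to be imported as factors
demoData <- read.csv(tmp, header = TRUE, stringsAsFactors = TRUE)

is.factor(demoData$gender)   # check that gender was imported as a factor
levels(demoData$gender)      # levels are sorted, so Female comes before Male
```

If is.factor() returns FALSE for your own import, adding stringsAsFactors = TRUE (or wrapping the column in factor() afterwards) should bring you back in line with the description above.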

Next, I created a variable called alcohol and entered ‘None’, ‘2 Pints’ or ‘4 Pints’. Again,
R guesses that this variable is a factor when it imports the data, and organizes the levels of
this variable alphabetically. The alphabetic ordering means that R has imported this factor
with the groups ordered as ‘2 Pints’, ‘4 Pints’ and ‘None’. This is because numbers (e.g., 2
and 4) are deemed to come before letters in the alphabet. Ideally, we might like the groups
to be ordered as they are in the data (i.e., ‘None’, ‘2 Pints’ and ‘4 Pints’). To reorder the
groups, we can use the levels option of the factor() function. All we need to do is type the
levels in the order that we want them. So, by executing:

gogglesData$alcohol<-factor(gogglesData$alcohol, levels = c("None", "2 Pints", "4 Pints"))

we take the variable alcohol from the gogglesData dataframe, and we reorder the levels of
the factor as ‘None’, ‘2 Pints’ and ‘4 Pints’ (levels = c("None", "2 Pints", "4 Pints")).
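
To see what the reordering does, here is a minimal sketch on a toy vector (drinks is a made-up name, not part of the goggles data). It assumes nothing beyond base R.

```r
# A toy factor: by default the levels are sorted, so "None" comes last
drinks <- factor(c("None", "2 Pints", "4 Pints", "None"))
levels(drinks)

# Re-create the factor with the levels in the order we actually want
drinks <- factor(drinks, levels = c("None", "2 Pints", "4 Pints"))
levels(drinks)

# The underlying integer codes now follow the new order (None = 1)
as.integer(drinks)
```

Only the order of the levels changes; the data values themselves are untouched.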

You can see from the data that there are 24 females followed by 24 males, and within 
these groups there are 8 people who had no alcohol, 8 who had two pints and 8 who 
consumed four pints. Finally, I created a variable called attractiveness into which I put the 
scores (out of 100) representing the attractiveness of each participant’s date.

If we wanted to enter the data directly into R, we would need to assign group codes for 
the gender and alcohol variables. We might code gender as 1 for females and 2 for males, 
and we might code alcohol as no alcohol = 1, 2 pints = 2 and 4 pints = 3. The way this 
coding works is as follows: 


Gender   Alcohol   Participant was
1        1         Female who consumed no alcohol
1        2         Female who consumed 2 pints
1        3         Female who consumed 4 pints
2        1         Male who consumed no alcohol
2        2         Male who consumed 2 pints
2        3         Male who consumed 4 pints


We can create these two coding variables very quickly by using the gl() function (Chapter 3).
Remember that this function takes the general form:

factor<-gl(number of levels, cases in each level, total cases, labels =
c("label1", "label2"...))

This function creates a factor variable called factor ; you specify the number of levels or 
groups of the factor, how many cases are in each level/group, optionally the total number 
of cases (the default is to multiply the number of groups by the number of cases per group), 
and you can also use the labels option to list names for each level/group. For gender, we 
want 24 females followed by 24 males, so we can specify it as: 

gender<-gl(2, 24, labels = c("Female", "Male")) 

The numbers in the function tell R that we want 2 groups of 24 cases; the labels option
then specifies the names to attach to these two groups. To create the alcohol variable we
want 3 groups that each contain 8 cases. This will create 24 cases (3 × 8 = 24), or, put
another way, it will create the codes for the first gender group (i.e., females). However, we
want this pattern to be repeated for the second gender group also; we can do this by adding 
a third value to the function that is the total number of cases (i.e., 48). By specifying the 
total number of cases, the gl() function will repeat the pattern of 24 codes until it reaches 
this total number of cases - in other words if we specify 48 as the limit, it will repeat the 
pattern twice. 

alcohol<-gl(3, 8, 48, labels = c("None", "2 Pints", "4 Pints")) 

We can add the attractiveness values by creating a numeric variable in the usual way: 

attractiveness<-c(65,70,60,60,60,55,60,55,70,65,60,70,65,60,
60,50,55,65,70,55,55,60,50,50,50,55,80,65,70,75,75,65,45,60,85,65,70,
70,80,60,30,30,30,55,35,20,45,40)

Finally, we can merge these variables into a dataframe called gogglesData by executing:

gogglesData<-data.frame(gender, alcohol, attractiveness)
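
It is worth checking that the two gl() commands really do produce the design we want. A quick way (a sketch using only base R; cellCounts is a made-up name) is to cross-tabulate the two factors and confirm that every combination of gender and alcohol contains 8 cases:

```r
# Re-create the two coding variables exactly as in the text
gender <- gl(2, 24, labels = c("Female", "Male"))
alcohol <- gl(3, 8, 48, labels = c("None", "2 Pints", "4 Pints"))

# Cross-tabulate: each of the 2 x 3 cells should contain 8 cases
cellCounts <- table(gender, alcohol)
cellCounts
```

If any cell shows a count other than 8, the pattern of codes is not repeating as intended (for example, because the total number of cases was left out of the second command).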

12.5.5. Exploring the data


As ever, we’ll look at some graphs first. Let’s start with the means across the different 
conditions. 




SELF-TEST 

Use ggplot2 to plot a line graph (with error bars)
of the attractiveness of the date with alcohol
consumption on the x-axis and different-coloured
lines to represent males and females.


The resulting plot (shown later in the chapter in Figure 12.8) is what is known as an interaction
graph. These graphs are useful for interpreting significant interaction effects (should
the analysis throw one up).

We can also look at boxplots for attractiveness scores for men and women at each level 
of alcohol consumption. 




SELF-TEST 

Use ggplot2 to plot boxplots of the attractiveness of
the date at each level of alcohol consumption on the
x-axis and different panels to represent males and
females.


Figure 12.5 shows boxplots for these data. For females, the median score (the horizontal 
line in the middle of each box) does not change much across the doses of alcohol, and also 
the spread of their scores is relatively narrow; however, for males, the spread of scores is 
wider than for females, and the median attractiveness seems to fall dramatically after 4 pints. 


FIGURE 12.5 Boxplots of the beer-goggles data

We have used the by() and stat.desc() functions before to get descriptive statistics for
separate groups (see section 10.6.5 for more detail). Therefore, if we wanted to explore 
the effects of alcohol and gender on the attractiveness of the dates selected, we could do so 
by executing separate commands: 

by(gogglesData$attractiveness, gogglesData$gender, stat.desc) 
by(gogglesData$attractiveness, gogglesData$alcohol, stat.desc) 

The resulting output is useful for interpreting the main effects of alcohol and gender 
on the attractiveness of mates. However, we are also interested in how these variables 
interact. This requires obtaining statistics for all combinations of alcohol and gender. 
To do this we need to use the list() function to create a list of variables that we can
then feed into the by() function. If, for example, we execute list(gogglesData$alcohol,
gogglesData$gender) we create a list (just like a shopping list) that contains the variables
alcohol and gender. If we place this list within the by() function, then we will get descriptive
statistics for all combinations of levels of the variables within the list. To see what I
mean, execute:

by(gogglesData$attractiveness, list(gogglesData$alcohol,
gogglesData$gender), stat.desc)

The resulting (edited) output is in Output 12.2. Notice that the descriptive statistics 
are split by every combination of gender and alcohol, resulting in six different groups of 
information. So, for example, we can see that in the no-alcohol condition, males typically 
chatted up a female who was rated at about 67% on the attractiveness scale, whereas 
females selected a male who was rated as 61% on that scale. These means will be useful in 
interpreting the direction of any effects that emerge in the analysis. 

: None
: Female
 median    mean SE.mean CI.mean.0.95     var std.dev coef.var
 60.000  60.625   1.752        4.143  24.554  4.9551   0.0817

: 2 Pints
: Female
 median    mean SE.mean CI.mean.0.95     var std.dev coef.var
 62.500  62.500   2.315        5.473  42.857   6.547    0.105

: 4 Pints
: Female
 median    mean SE.mean CI.mean.0.95     var std.dev coef.var
 55.000  57.500   2.500        5.912  50.000   7.071    0.123

: None
: Male
 median    mean SE.mean CI.mean.0.95     var std.dev coef.var
 67.500  66.875   3.652        8.636 106.696  10.329    0.154

: 2 Pints
: Male
 median    mean SE.mean CI.mean.0.95     var std.dev coef.var
 67.500  66.875   4.426       10.465 156.696  12.518    0.187

: 4 Pints
: Male
 median    mean SE.mean CI.mean.0.95     var std.dev coef.var
 32.500  35.625   3.831        9.059 117.411  10.836    0.304


Output 12.2 
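
If all you want are the six cell means (rather than the full set of descriptives), a compact alternative is the base R function tapply(), which takes the same kind of list of grouping variables. This is only a sketch that rebuilds the data from the values given earlier; cellMeans is a made-up name.

```r
# Rebuild the beer-goggles data from the values listed earlier in the chapter
gender <- gl(2, 24, labels = c("Female", "Male"))
alcohol <- gl(3, 8, 48, labels = c("None", "2 Pints", "4 Pints"))
attractiveness <- c(65,70,60,60,60,55,60,55,70,65,60,70,65,60,60,50,
                    55,65,70,55,55,60,50,50,50,55,80,65,70,75,75,65,
                    45,60,85,65,70,70,80,60,30,30,30,55,35,20,45,40)
gogglesData <- data.frame(gender, alcohol, attractiveness)

# Mean attractiveness for every combination of alcohol (rows) and gender (columns)
cellMeans <- tapply(gogglesData$attractiveness,
                    list(gogglesData$alcohol, gogglesData$gender), mean)
cellMeans
```

The six values in the resulting table should match the means reported in Output 12.2.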

The final thing to do at this stage is to compute Levene’s test (see Chapter 5 and section
10.3.1). We can again use the leveneTest() function from the car package here. If we want
to do a Levene’s test to see whether the variance in attractiveness differs across different
gender and alcohol groups separately, we can simply execute:

leveneTest(gogglesData$attractiveness, gogglesData$gender, center = median) 
leveneTest(gogglesData$attractiveness, gogglesData$alcohol, center = 
median) 

However, as with the descriptive statistics, we’re primarily interested in the interaction
of these variables, so we would ideally like to know whether the variances differ across
all six groups (not just the two gender groups and three alcohol groups). To do this, we
can add the interaction() option to the leveneTest() function, which will compute Levene’s
test across any combination of groups for the variables specified within interaction(). In
this case, we want to know whether the variances differ across all six groups that result
from the combination of gender and alcohol (i.e., female_none, female_2 pints, female_4
pints, male_none, male_2 pints, male_4 pints). Therefore, we specify both variables within
interaction(), that is, interaction(gogglesData$alcohol, gogglesData$gender). The resulting
command that we need to execute is therefore:

leveneTest(gogglesData$attractiveness, interaction(gogglesData$alcohol, 
gogglesData$gender), center = median) 

Output 12.3 shows the results of Levene’s test. We have encountered Levene’s test 
numerous times before, so you should know that it tests whether there are any significant 
differences between group variances and so a non-significant result like the one we have 
here, F(5, 42) = 1.425, p = .235, is indicative of the assumption being met. 

Levene's Test for Homogeneity of Variance
      Df F value Pr(>F)
group  5  1.4252 0.2351
      42

Output 12.3 
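
If you are curious about what this test is doing, the median-centred version of Levene’s test is essentially a one-way ANOVA on how far each score lies from the median of its own group. The sketch below reproduces that idea in base R, without the car package; the names group, groupMedian, absDev and leveneByHand are made up for the illustration, and this is not a substitute for leveneTest().

```r
# Rebuild the beer-goggles data from the values listed earlier in the chapter
gender <- gl(2, 24, labels = c("Female", "Male"))
alcohol <- gl(3, 8, 48, labels = c("None", "2 Pints", "4 Pints"))
attractiveness <- c(65,70,60,60,60,55,60,55,70,65,60,70,65,60,60,50,
                    55,65,70,55,55,60,50,50,50,55,80,65,70,75,75,65,
                    45,60,85,65,70,70,80,60,30,30,30,55,35,20,45,40)
gogglesData <- data.frame(gender, alcohol, attractiveness)

# One factor coding the six alcohol-by-gender groups
group <- interaction(gogglesData$alcohol, gogglesData$gender)

# Absolute deviation of every score from the median of its own group
groupMedian <- ave(gogglesData$attractiveness, group, FUN = median)
absDev <- abs(gogglesData$attractiveness - groupMedian)

# A one-way ANOVA on those deviations gives the Levene F-ratio
leveneByHand <- anova(lm(absDev ~ group))
leveneByHand
```

The F-ratio and degrees of freedom in the first row should agree with Output 12.3.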


12.5.6. Choosing contrasts


We saw in Chapter 10 that it’s useful to follow up ANOVA with contrasts that break 
down the main effects and tell us where the differences between groups lie. For one-way 
ANOVA, we entered codes that define the contrasts we want to do. We can follow the 
same procedure for factorial ANOVA except that we have to define contrasts for all of the 
independent variables. One very important consideration here is that if we want to look at 
Type III sums of squares (see Jane Superbrain Box 11.1) then we must use an orthogonal 
contrast for these sums of squares to be computed correctly. 

We encountered an orthogonal contrast in Table 10.6: the Helmert contrast. This contrast
will give you what you want in many different situations; however, if it doesn’t and
you want to define your own contrasts then this can be done in the same way as we discussed
in Chapter 10 (see Oliver Twisted).

The effect of gender has only two levels, so we could code an orthogonal contrast as
simply -1 (females) and 1 (males). Remember that when we code contrasts anything with
a positive sign is compared to anything with a negative sign, so this contrast will compare
males to females.

OLIVER TWISTED

Please Sir, can I have some more ... contrasts?

‘This example is too similar to the one in Chapter 10’, sulks Oliver
as he stamps his feet on the floor. ‘It smells of rotting cabbage.’
I think actually, Oliver, the stench of rotting cabbage is probably
because you stood your Dickensian self under a window when
someone emptied his or her toilet bucket into the street. On the website
I’ve prepared a different (slightly more complicated) example of
how to specify your own contrasts to give you a bit more practice.


The effect of alcohol has three levels: none, 2 pints and 4 pints. The no-alcohol group 
is a control, so, following the advice from Chapter 10, our first contrast might compare 
the no-alcohol group to the remaining categories (that is, all of the groups that had some 
alcohol). We need a second contrast then to separate the two alcohol groups. The resulting 
codes are in Table 12.3; this scenario is basically the same as the Viagra data in Chapter 10 
so reread that chapter if you don’t understand the values in the table. 


Table 12.3 Orthogonal contrasts for the alcohol variable

Group         Contrast 1   Contrast 2
No Alcohol        -2            0
2 Pints            1           -1
4 Pints            1            1


Setting contrasts for the two variables will also produce parameter estimates for the 
interaction term. So, in this case, we’ll get not only a contrast comparing no alcohol to the 
combined effect of 2 and 4 pints, but also one that tests whether this effect is different in 
men and women. Similarly, contrast 2 tests whether the 2- and 4-pints groups differ, but 
we will also get a parameter estimate that tests whether the difference between the 2- and 
4-pints groups is affected by the gender of the participant. To set the orthogonal contrasts 
we execute: 

contrasts(gogglesData$alcohol)<-cbind(c(-2, 1, 1), c(0, -1, 1))
contrasts(gogglesData$gender)<-c(-1, 1)

The first command sets the two contrasts for alcohol, just as we did in Chapter 10; the
second sets a single contrast for gender. We can check that we have set the contrasts correctly
by executing the name of the variable and looking at the contrast attribute:

> gogglesData$alcohol

attr(,"contrasts")
        [,1] [,2]
None      -2    0
2 Pints    1   -1
4 Pints    1    1
Levels: None 2 Pints 4 Pints

> gogglesData$gender

attr(,"contrasts")
       [,1]
Female   -1
Male      1
Levels: Female Male

Remembering that positive numbers are compared with negative and a zero means that 
the group is not involved at all, we can see that for alcohol we have set the first contrast 
to compare ‘none’ with the 2- and 4-pints groups (combined) and a second contrast that 
ignores the no-alcohol group and compares only the 2-pints against the 4-pints group. 
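
Because Type III sums of squares rely on the contrasts being orthogonal, it can be reassuring to check the codes numerically. The sketch below (base R only; alcoholContrasts is a made-up name) verifies the two properties we need: the weights within each contrast sum to zero, and the products of the weights across the two contrasts also sum to zero.

```r
# The two alcohol contrasts from Table 12.3, one per column
alcoholContrasts <- cbind(c(-2, 1, 1), c(0, -1, 1))
rownames(alcoholContrasts) <- c("None", "2 Pints", "4 Pints")

# Property 1: the weights in each contrast sum to zero
colSums(alcoholContrasts)

# Property 2: the cross-products of the two contrasts sum to zero
sum(alcoholContrasts[, 1] * alcoholContrasts[, 2])
```

Both checks should return zeros; a non-zero value would mean that the contrasts are not orthogonal.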


12.5.7. Fitting a factorial ANOVA model


To create a factorial ANOVA model we can use the aov() function that we have used in
the previous two chapters (see section 10.6.6.1). Remember that the aov() function is just
the lm() function in disguise, so we can use what we learnt in Chapter 7 to add new variables
into our ANOVA model. Remember that to add a predictor, we simply write ‘+ variableName’
into the model. In the current model we wish to predict attractiveness scores
from both gender and alcohol so our model is simply ‘attractiveness ~ gender + alcohol’,
isn’t it? Actually, it’s not, because we also need to include the interaction term. To specify an
interaction term we link variable names with a colon. For example, the interaction of gender
and alcohol would be written in R as gender:alcohol (or indeed alcohol:gender, it doesn’t
matter). Therefore, to specify the model including the interaction term, we could execute:

gogglesModel<-aov(attractiveness ~ gender + alcohol + gender:alcohol, data 
= gogglesData) 

This command creates a model called gogglesModel , which includes the two independent 
variables and their interaction. 

The above method is good because it makes very explicit the predictors in the model 
(and is a useful reminder that we’re simply using a linear model, as we have throughout the 
book so far). However, there is a quicker method. You can include two variables and their
interaction in a model by specifying variable1*variable2 as the predictor. Doing so will
enter not just the interaction but also the effects of the individual variables as well. So, for 
example, this command: 

gogglesModel<-aov(attractiveness ~ alcohol*gender, data = gogglesData) 

does exactly the same thing as the previous command (see R’s Souls’ Tip 12.1). 
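
You don’t have to take my word for it that the two ways of writing the model are equivalent. The base R function terms() will list the terms that a formula expands to, without fitting anything. In the sketch below, longForm and shortForm are made-up names; I have written the short form as gender*alcohol so that the terms come out in the same order as in the long form.

```r
# The explicit formula and the shorthand version
longForm  <- attractiveness ~ gender + alcohol + gender:alcohol
shortForm <- attractiveness ~ gender*alcohol

# List the terms that each formula expands to
attr(terms(longForm), "term.labels")
attr(terms(shortForm), "term.labels")
```

Both calls should list the two main effects followed by the interaction. Writing alcohol*gender instead gives the same model with the terms listed in a different order.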

We had a fairly lengthy discussion about sums of squares in the previous chapter (see 
Jane Superbrain Box 11.1) and I refer you back there if what I’m about to say doesn’t make 
any sense. If we want to look at the Type III sums of squares for the model, we need to also 
execute this command after we have created the model: 

Anova(gogglesModel, type="III")

This takes the model that we have just created (gogglesModel ) but, rather than displaying 
the Type I sums of squares (the default), it will show us the Type III sums of squares. 
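
As an aside, the beer-goggles design is balanced (there are 8 cases in every cell), and in a balanced design the different types of sums of squares coincide. So, for these particular data, the ordinary summary() of the model should show the same F-ratios as the Type III table. The sketch below uses only base R and rebuilds the data from the values given earlier; sequentialTable is a made-up name. Don’t rely on this shortcut for unbalanced designs.

```r
# Rebuild the beer-goggles data from the values listed earlier in the chapter
gender <- gl(2, 24, labels = c("Female", "Male"))
alcohol <- gl(3, 8, 48, labels = c("None", "2 Pints", "4 Pints"))
attractiveness <- c(65,70,60,60,60,55,60,55,70,65,60,70,65,60,60,50,
                    55,65,70,55,55,60,50,50,50,55,80,65,70,75,75,65,
                    45,60,85,65,70,70,80,60,30,30,30,55,35,20,45,40)
gogglesData <- data.frame(gender, alcohol, attractiveness)

# Fit the factorial model and extract the sequential (Type I) ANOVA table
gogglesModel <- aov(attractiveness ~ gender*alcohol, data = gogglesData)
sequentialTable <- summary(gogglesModel)[[1]]
sequentialTable   # rows: gender, alcohol, gender:alcohol, Residuals
```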


12.5.8. Interpreting factorial ANOVA


R’s Souls’ Tip 12.1 Specifying more complex designs

It follows that if you have three independent variables then you can simply add the third variable into the model in
the same way. For example, if we had also measured whether the lighting at the club was dim or bright (which
would affect how well you could see your date), then we could specify the model as:

gogglesModel<-aov(attractiveness ~ gender*alcohol*lighting, data = gogglesData)

Note that we have used ‘gender*alcohol*lighting’ as the predictors, which will add in the three main effects
but also all of the interactions between these variables.


Output 12.4 tells us whether any of the independent variables have had an effect on the
dependent variable. The important things to look at in the table are the significance values
of the independent variables. The first thing to notice is that there is a significant main
effect of alcohol (because the significance value is less than .05). The F-ratio is highly
significant, indicating that the amount of alcohol consumed significantly affected whom
the participant would try to chat up. This means that overall, when we ignore whether the
participant was male or female, the amount of alcohol influenced their mate selection. The
best way to see what this means is to look at a bar chart of the average attractiveness at each
level of alcohol (ignore gender completely). This graph displays the means in Output 12.2
that we calculated in section 12.4.2.2.

Anova Table (Type III tests)

Response: attractiveness
               Sum Sq Df   F value    Pr(>F)
(Intercept)    163333  1 1967.0251 < 2.2e-16 ***
gender            169  1    2.0323    0.1614
alcohol          3332  2   20.0654 7.649e-07 ***
gender:alcohol   1978  2   11.9113 7.987e-05 ***
Residuals        3488 42


Output 12.4 



SELF-TEST 

Plot error bar graphs of the main effects of alcohol
and gender.



Figure 12.6 clearly shows that when you ignore gender the overall attractiveness of the 
selected mate is very similar when no alcohol has been drunk and when 2 pints have been 
drunk (the means of these groups are approximately equal). Hence, this significant main 
effect is likely to reflect the drop in the attractiveness of the selected mates when 4 pints 
have been drunk. This finding seems to indicate that a person is willing to accept a less 
attractive mate after 4 pints. 

FIGURE 12.6 Graph showing the main effect of alcohol (x-axis: Alcohol Consumption; None, 2 Pints, 4 Pints)

Output 12.4 also tells us about the main effect of gender. This time the F-ratio is not
significant (p = .161, which is larger than .05). This effect means that overall, when we ignore
how much alcohol had been drunk, the gender of the participant did not influence the
attractiveness of the partner that the participant selected. In other words, other things being
equal, males and females selected equally attractive mates. The bar chart (that you have
hopefully produced from the self-test) of the average attractiveness of mates for men and
women (ignoring how much alcohol had been consumed) reveals the meaning of this main
effect. Figure 12.7 plots the means in Output 12.2 that we calculated in section 12.4.2.1.
This graph shows that the average attractiveness of the partners of male and female participants
was fairly similar (the means are different by only 4%). Therefore, this non-significant
effect reflects the fact that the mean attractiveness was similar. We can conclude from this
that, other things being equal, men and women chose equally attractive partners.

Finally, Output 12.4 tells us about the interaction between the effect of
gender and the effect of alcohol. The F-value is highly significant (because the p-value
is less than .05). What this actually means is that the effect of alcohol on mate selection
was different for male participants than it was for females. In the presence of
this significant interaction it makes no sense to interpret the main effects. Figure 12.8
shows the plot that we produced earlier as a self-test task; this graph tells us something
about the nature of this interaction effect.

Figure 12.8 shows that for women, alcohol has very little effect: the attractiveness
of their selected partners is quite stable across the three conditions (as shown
by the near-horizontal line). However, for the men, the attractiveness of their partners
is stable when only a small amount has been drunk, but rapidly declines when 4 pints
have been drunk. Non-parallel lines usually indicate a significant interaction effect. In this
particular graph the lines actually cross, which indicates a fairly large interaction between
independent variables. The lines tell us that alcohol has little effect on mate selection until
4 pints have been drunk and that the effect of alcohol is prevalent only in male participants.
In short, the results show that women maintain high standards in their mate selection
regardless of alcohol, whereas men have a few beers and then try to get off with anything
on legs. One interesting point that these data demonstrate is this: we earlier concluded that
alcohol significantly affected how attractive a mate was selected (the alcohol main effect);
however, the interaction effect tells us that this is true only in males (females appear unaffected).
This shows why main effects should not be interpreted when a significant interaction
involving those main effects exists.

FIGURE 12.7 Graph to show the main effect of gender on mate selection (x-axis: Gender; Female, Male)

FIGURE 12.8 Graph of the interaction of gender and alcohol consumption in mate selection (x-axis: Alcohol Consumption; None, 2 Pints, 4 Pints)

12.5.9. Interpreting contrasts


To see the output for the contrasts that we specified, execute:

summary.lm(gogglesModel)

Doing so will display the parameter estimates for the model (Output 12.5). Let’s look at 
each effect in the analysis in turn: 

• gender1: This is the contrast for the main effect of gender; because gender has only
two groups this is the same as the effect of gender from Output 12.4. (Quite literally,
in fact: the t- and F-statistics are directly related by F = t². Our t-value for this contrast
is -1.426, and the value of F for the effect of gender is (-1.426)² = 2.03.)

• alcohol1: This contrast compares the no-alcohol group to the two alcohol groups.
This tests whether the mean of the no-alcohol group (63.75) is different from the
mean of the 2-pints and 4-pints groups combined ((64.69 + 46.56)/2 = 55.625). This
is a difference of -8.125 (55.625 - 63.75). As explained in Chapter 10, the estimate
for this difference is this difference divided by the number of groups involved in the
contrast (-8.125/3 = -2.708). The p-value is .006, which is smaller than .05, indicating
a significant difference. So we could conclude that the effect of alcohol is that any
amount of alcohol reduces the attractiveness of the dates selected compared to when
no alcohol is drunk. Of course this is misleading because, in fact, the means for the
no-alcohol and 2-pints groups are fairly similar (63.75 and 64.69), so 2 pints of alcohol
don’t reduce the attractiveness of selected dates. The comparison is significant because
it’s testing the combined effect of 2 and 4 pints; 4 pints has such a drastic effect that it
drags down the overall mean. This example shows why you need to be careful about
how you interpret contrasts: you need to have a look at the next contrast as well.

• alcohol2: This contrast tests whether the mean of the 2-pints group (64.69) is different
from the mean of the 4-pints group (46.56). This is a difference of -18.13
(46.56 - 64.69); as explained in Chapter 10, the estimate is this value divided by the
number of groups involved in the contrast (-18.13/2 = -9.06). The p-value is less than .001,
which is smaller than .05, and therefore indicates a significant difference between the
groups. We can conclude that having 4 pints significantly reduced the attractiveness
of selected dates compared to having only 2 pints.

• gender1:alcohol1: This contrast tests whether the effect of alcohol1 described above
is different in men and women. It answers the question: is the effect of alcohol compared
to no alcohol on the attractiveness of dates comparable in men and women?
The p-value is .010, which is significant, so the answer is no, the extent to which
alcohol vs. no alcohol has an effect on date attractiveness is different in men and
women. Figure 12.9 (left) shows what this contrast is testing. The ‘Alcohol’ group is
the combined 2- and 4-pints groups. For the women, the difference in means between
the no-alcohol group and the other groups combined is 60 - 60.625 = -0.625 (the
line is flat, reflecting this small difference). For the men, the difference between the
two means is 51.25 - 66.875 = -15.625 (the line for males on the graph slopes
down, reflecting this decrease). This contrast tests whether -0.625 (the difference for
females) is significantly different from -15.625 (the difference for males). In terms of
the graph, it tests whether the lines for males and females have different slopes.

• gender1:alcohol2: This contrast tests whether the effect of alcohol2 described above
is different in men and women. It answers the question: is the effect of 2 pints compared
to 4 pints on the attractiveness of dates comparable in men and women? The
p-value is less than .001, which is significant, so the answer is no, the extent to which 2 vs. 4
pints has an effect on date attractiveness is different in men and women. Figure 12.9
(right) shows what this contrast is testing. For the women, the difference in means
between the 2- and 4-pints groups is 57.50 - 62.50 = -5 (the line slopes down
slightly). For the men, the difference between the two means is 35.625 - 66.875 =
-31.25 (the line for males on the graph slopes down much more than for females).
This contrast tests whether -5 (the difference for females) is significantly different
from -31.25 (the difference for males). In terms of the graph, it tests whether the
lines for males and females have different slopes.

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)        58.333      1.315  44.351  < 2e-16 ***
gender1            -1.875      1.315  -1.426 0.161382
alcohol1           -2.708      0.930  -2.912 0.005727 **
alcohol2           -9.062      1.611  -5.626 1.37e-06 ***
gender1:alcohol1   -2.500      0.930  -2.688 0.010258 *
gender1:alcohol2   -6.562      1.611  -4.074 0.000201 ***


Output 12.5 
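
The arithmetic described for the two alcohol contrasts can be checked directly against the fitted model. The sketch below uses lm() (which, as noted earlier, is what aov() uses behind the scenes) and base R only; contrastModel, alcoholMeans, byHand1 and byHand2 are made-up names, and the data are rebuilt from the values listed earlier.

```r
# Rebuild the beer-goggles data and set the orthogonal contrasts
gender <- gl(2, 24, labels = c("Female", "Male"))
alcohol <- gl(3, 8, 48, labels = c("None", "2 Pints", "4 Pints"))
attractiveness <- c(65,70,60,60,60,55,60,55,70,65,60,70,65,60,60,50,
                    55,65,70,55,55,60,50,50,50,55,80,65,70,75,75,65,
                    45,60,85,65,70,70,80,60,30,30,30,55,35,20,45,40)
gogglesData <- data.frame(gender, alcohol, attractiveness)
contrasts(gogglesData$alcohol) <- cbind(c(-2, 1, 1), c(0, -1, 1))
contrasts(gogglesData$gender) <- c(-1, 1)

contrastModel <- lm(attractiveness ~ gender*alcohol, data = gogglesData)

# Mean attractiveness at each level of alcohol (ignoring gender)
alcoholMeans <- sapply(split(gogglesData$attractiveness, gogglesData$alcohol), mean)

# Contrast 1: (mean of the two alcohol groups - no-alcohol mean) / 3 groups
byHand1 <- (mean(alcoholMeans[c("2 Pints", "4 Pints")]) - alcoholMeans[["None"]]) / 3
# Contrast 2: (4-pints mean - 2-pints mean) / 2 groups
byHand2 <- (alcoholMeans[["4 Pints"]] - alcoholMeans[["2 Pints"]]) / 2

c(byHand1, byHand2)                          # estimates computed by hand
coef(contrastModel)[c("alcohol1", "alcohol2")]   # estimates from the model

# F = t squared for the two-group gender contrast
summary(contrastModel)$coefficients["gender1", "t value"]^2
```

The hand-computed values should match the alcohol1 and alcohol2 rows of Output 12.5, and the squared t-value should match the F-ratio for gender in Output 12.4.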


FIGURE 12.9 Graphical displays of the contrasts for the beer-goggles data (left panel: Contrast 1, No Alcohol vs. Alcohol; right panel: Contrast 2, 2 Pints vs. 4 Pints; x-axis: Alcohol Consumption; separate lines for Female and Male)


12.5.10. Simple effects analysis


A popular way to break down an interaction term is to use a technique called simple effects
analysis. This analysis looks at the effect of one independent variable at individual levels of
the other independent variable. So, for example, in our beer-goggles data we could do simple
effects analysis looking at the effect of gender at each level of alcohol. This would mean taking
the average attractiveness of the date selected by men and comparing it to that for women
after no drinks, then making the same comparison for 2 pints and then finally for 4 pints.
Another way of looking at this is to say we would compare each black dot to the corresponding
blue dot in Figure 12.8: based on the graph, we might expect to find no difference after
no alcohol and after 2 pints (in both cases the black and blue dots are located in about the
same position) but we would expect a difference after 4 pints (because the black and blue
dots are quite far apart). The alternative way to do it would be to compare the mean attractiveness
after no alcohol, 2 pints and 4 pints for men and then in a separate analysis do the
same but for women. (This would be a bit like doing a one-way ANOVA on the effect of alcohol
in men, and then doing a different one-way ANOVA for the effect of alcohol in women.)
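
Before running any formal tests, you can get a feel for the simple effects of gender by looking at the raw gap between the male and female means at each level of alcohol. This is only a descriptive sketch in base R (cellMeans and genderGap are made-up names, and the data are rebuilt from the values listed earlier); it does not tell us whether any of the gaps is significant.

```r
# Rebuild the beer-goggles data from the values listed earlier in the chapter
gender <- gl(2, 24, labels = c("Female", "Male"))
alcohol <- gl(3, 8, 48, labels = c("None", "2 Pints", "4 Pints"))
attractiveness <- c(65,70,60,60,60,55,60,55,70,65,60,70,65,60,60,50,
                    55,65,70,55,55,60,50,50,50,55,80,65,70,75,75,65,
                    45,60,85,65,70,70,80,60,30,30,30,55,35,20,45,40)

# Cell means: alcohol in rows, gender in columns
cellMeans <- tapply(attractiveness, list(alcohol, gender), mean)

# Male mean minus female mean at each level of alcohol
genderGap <- cellMeans[, "Male"] - cellMeans[, "Female"]
genderGap
```

The gaps after no alcohol and after 2 pints are small, whereas the gap after 4 pints is large and negative, which is the pattern we anticipated from Figure 12.8.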

OLIVER TWISTED

Please Sir, can I have some more ... simple effects?


‘I want to impress my friends by doing a simple effects analysis by hand’, 
boasts Oliver. You don’t really need to know how simple effects analyses 
are calculated to run them, Oliver, but seeing as you asked, it is explained 
in the additional material available from the companion website. 


FIGURE 12.10 

Schematic 
representation of 
the contrasts and 
codes for simple 
effects analysis on 
the goggles data 

Simple effects analysis in R


Unfortunately, simple effects are not that easy to do in R. The first thing we need to do is create a variable in the
dataframe that merges the variables of interest into a single factor. In other words, rather than have alcohol and
gender as separate variables, we want a new variable that simply codes the six groups that result from combining
all levels of alcohol and gender. We can do this using the gl() function to add a variable (simple) to the
dataframe that is six groups each containing eight observations:


gogglesData$simple<-gl(6,8) 


We can then use the factor() function to specify labels for these six groups:

gogglesData$simple<-factor(gogglesData$simple, levels = c(1:6), labels = c("F_None",
"F_2pints", "F_4pints", "M_None", "M_2pints", "M_4pints"))
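
As a possible shortcut, gl() accepts a labels option (as we saw when creating gender and alcohol), so the two steps above could in principle be collapsed into one. The sketch below shows this on a free-standing variable called simple rather than on the dataframe; it assumes, as in the text, that the cases are ordered as 8 observations per group with females first.

```r
# Six groups of eight cases, labelled in the order they appear in the data
simple <- gl(6, 8, labels = c("F_None", "F_2pints", "F_4pints",
                              "M_None", "M_2pints", "M_4pints"))

# Check: eight cases in each of the six groups
table(simple)

# Check: the first case of each block of eight carries the expected label
as.character(simple[c(1, 9, 17, 25, 33, 41)])
```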


The data now look like this (I’ve edited out cases to save space): 



   gender alcohol   alcohol2 attractiveness   simple
1  Female    None No Alcohol             65   F_None
2  Female    None No Alcohol             70   F_None
9  Female 2 Pints    Alcohol             70 F_2pints
10 Female 2 Pints    Alcohol             65 F_2pints
17 Female 4 Pints    Alcohol             55 F_4pints
18 Female 4 Pints    Alcohol             65 F_4pints
25   Male    None No Alcohol             50   M_None
26   Male    None No Alcohol             55   M_None
33   Male 2 Pints    Alcohol             45 M_2pints
34   Male 2 Pints    Alcohol             60 M_2pints
47   Male 4 Pints    Alcohol             45 M_4pints
48   Male 4 Pints    Alcohol             40 M_4pints


Note that we have added the variable simple, which codes whether a person was male or female and how much 
alcohol they had in a single variable. 

Next, we create contrasts that break these six groups up using the standard rules for planned contrasts. 
Figure 12.10 shows how we would break the groups up into five contrasts to do a simple effects analysis of 
gender. The first contrast compares no alcohol to alcohol (2 or 4 pints combined). Remember that these two 
‘chunks’ of variation are made up of the different gender groups and so need to be broken down further. For 
example, the no-alcohol group is made up of the males that had no alcohol (‘0 M’) and the females that had no 
alcohol (‘0 F’), and the alcohol chunk contains the males and females that had 2 pints (‘2 M’ and ‘2 F’) and the
males and females that had 4 pints (‘4 M’ and ‘4 F’). The second contrast breaks down the ‘alcohol’ chunk to
compare 2 pints against 4 pints. Again, remember that both chunks at this stage are made up of the two corresponding
gender groups. The third contrast takes the no-alcohol ‘chunk’ and compares the two gender groups
contained within it. This contrast is the simple effect of gender when no alcohol was consumed. The fourth contrast
takes the 2-pint ‘chunk’ and breaks the variance down to compare the two gender groups contained within it.
This contrast is the simple effect of gender when 2 pints were consumed. Finally, the fifth contrast takes the 4-pint
‘chunk’ and compares the two gender groups contained within it. This contrast is the simple effect of gender
when 4 pints were consumed. If you look back to Chapter 10 you’ll see that these contrasts conform to the rules
of orthogonal contrasts, and that the codes in Figure 12.10 specify the contrasts.

To create these contrasts in R we can create five variables (one for each contrast) that contain the codes for 
the respective groups. (Bear in mind that in the dataframe the groups are ordered as: female none, female 2 
pints, female 4 pints, male none, male 2 pints, male 4 pints, and we have to order the codes accordingly.) I have 
also labelled the contrasts in a way that tells us something about what they represent: 

alcEffect1<-c(-2, 1, 1, -2, 1, 1)
alcEffect2<-c(0, -1, 1, 0, -1, 1)
gender_none<-c(-1, 0, 0, 1, 0, 0)
gender_twoPint<-c(0, -1, 0, 0, 1, 0)
gender_fourPint<-c(0, 0, -1, 0, 0, 1)

To tidy things up let's merge these variables into an object called simpleEff:

simpleEff<-cbind(alcEffect1, alcEffect2, gender_none, gender_twoPint, gender_fourPint)

We can now set the contrasts for the variable simple to be this object: 
contrasts(gogglesData$simple)<-simpleEff 
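If you want to convince yourself that these five sets of codes really are orthogonal, a quick check is possible. This is only a sketch of one way to do it: each set of codes should sum to zero, and the cross-products of every pair of contrasts should be zero too.

```r
# The five contrasts from the text, one column per contrast
simpleEff <- cbind(
  alcEffect1      = c(-2,  1,  1, -2,  1,  1),
  alcEffect2      = c( 0, -1,  1,  0, -1,  1),
  gender_none     = c(-1,  0,  0,  1,  0,  0),
  gender_twoPint  = c( 0, -1,  0,  0,  1,  0),
  gender_fourPint = c( 0,  0, -1,  0,  0,  1)
)

colSums(simpleEff)    # each contrast's codes sum to zero
crossprod(simpleEff)  # off-diagonal entries are zero for orthogonal contrasts
```

The diagonal of the cross-product matrix just holds the sum of squared codes for each contrast; it is the off-diagonal zeros that tell us the contrasts are independent.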

We then create a new model in which attractiveness is predicted from simple (which, remember, contains both 
the effects of alcohol and gender but coded so that the contrasts give us simple effects): 

simpleEffectModel<-aov(attractiveness ~ simple, data = gogglesData) 

To see the contrasts we use summary.lm() on the newly created model: 

summary.lm(simpleEffectModel)

The resulting output contains the parameter estimates for the five contrasts. Looking at the significance values for 
each simple effect, it appears that there was no significant difference between men and women when they drank 
no alcohol, p = .177, or when they drank 2 pints, p = .34, but there was a very significant difference, p < .001, 
when 4 pints were consumed (which, judging from the interaction graph, reflects the fact that the mean for men 
is considerably lower than for women). 
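To see where these parameter estimates come from, the whole procedure can be rerun on a self-contained copy of the scores. The sketch below assumes the 48 attractiveness scores are as listed in the wide-format printout later in this chapter, stacked in the order female none, female 2 pints, female 4 pints, male none, male 2 pints, male 4 pints; gogglesToy is a made-up name for this stand-in dataframe.

```r
# Scores stacked in the group order used in the text (8 per group)
attractiveness <- c(
  65, 70, 60, 60, 60, 55, 60, 55,   # female, none
  70, 65, 60, 70, 65, 60, 60, 50,   # female, 2 pints
  55, 65, 70, 55, 55, 60, 50, 50,   # female, 4 pints
  50, 55, 80, 65, 70, 75, 75, 65,   # male, none
  45, 60, 85, 65, 70, 70, 80, 60,   # male, 2 pints
  30, 30, 30, 55, 35, 20, 45, 40    # male, 4 pints
)
gogglesToy <- data.frame(
  attractiveness = attractiveness,
  simple = gl(6, 8, labels = c("F_None", "F_2pints", "F_4pints",
                               "M_None", "M_2pints", "M_4pints"))
)

# Attach the five simple-effects contrasts to the six-level factor
contrasts(gogglesToy$simple) <- cbind(
  alcEffect1      = c(-2,  1,  1, -2,  1,  1),
  alcEffect2      = c( 0, -1,  1,  0, -1,  1),
  gender_none     = c(-1,  0,  0,  1,  0,  0),
  gender_twoPint  = c( 0, -1,  0,  0,  1,  0),
  gender_fourPint = c( 0,  0, -1,  0,  0,  1)
)

simpleEffectModel <- aov(attractiveness ~ simple, data = gogglesToy)
estimates <- coef(simpleEffectModel)
round(estimates, 3)   # compare with the Estimate column of the output
```

Because the design is balanced and the contrasts are orthogonal, each gender estimate is half the difference between the male and female means at that dose, which is why the 4-pint estimate is large and negative.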


Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)             58.333      1.315  44.351  < 2e-16
simplealcEffect1        -2.708      0.930  -2.912  0.00573
simplealcEffect2        -9.062      1.611  -5.626 1.37e-06
simplegender_none        3.125      2.278   1.372  0.17742
simplegender_twoPint     2.188      2.278   0.960  0.34243
simplegender_fourPint  -10.938      2.278  -4.801 2.02e-05


12.5.11. Post hoc analysis


The variable alcohol has three levels and so you might want to perform post hoc tests to see 
where the differences between groups lie. I want to stress again that the significant main 
effect of alcohol that we observed should not be interpreted given the significant interaction
with gender. Therefore, I’m covering post hoc tests here for illustrative purposes: if
this was a real piece of research I would focus on the interaction effect and not perform 
post hoc tests on alcohol. 
We saw in Chapter 10 that we can specify Bonferroni post hoc tests using the
pairwise.t.test() function and Tukey tests using glht(). Refer back to that chapter for details 
of these functions, but for the present example we could obtain post hoc tests for alcohol 
by executing either of these commands: 

pairwise.t.test(gogglesData$attractiveness, gogglesData$alcohol, p.adjust.method = "bonferroni")

postHocs<-glht(gogglesModel, linfct = mcp(alcohol = "Tukey")) 

summary(postHocs) 

confint(postHocs) 

The resulting post hoc tests are shown in Outputs 12.6 (Bonferroni) and 12.7 (Tukey); 
they both break down the main effect of alcohol and can be interpreted as if a one-way 
ANOVA had been conducted on the alcohol variable (i.e., the reported effects for alcohol 
are collapsed with regard to gender). The Bonferroni and Tukey tests show the same pattern
of results: when participants had drunk no alcohol or 2 pints of alcohol, they selected
equally attractive mates. However, after 4 pints had been consumed, participants selected 
significantly less attractive mates than after both 2 pints (p < .001) and no alcohol (p < 
.001). It is interesting to note that the mean attractiveness of partners after no alcohol and 
2 pints was so similar that the probability of the obtained difference between those means 
is 1 (i.e., completely probable). 
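The value of exactly 1 is a by-product of how the Bonferroni correction works: each uncorrected p-value is multiplied by the number of comparisons and then capped at 1. A minimal sketch, using made-up uncorrected p-values (these are not the values from the beer-goggles data):

```r
# Hypothetical uncorrected p-values for three pairwise comparisons
p_raw <- c(none_vs_2 = 0.70, none_vs_4 = 0.00008, two_vs_4 = 0.00004)

# Bonferroni correction as applied by pairwise.t.test()
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# The same thing by hand: multiply by the number of tests, cap at 1
p_hand <- pmin(1, p_raw * length(p_raw))

p_bonf
all.equal(p_bonf, p_hand)
```

Any uncorrected p-value above 1/3 will therefore be reported as 1 when three comparisons are made.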

Pairwise comparisons using t tests with pooled SD 
data: gogglesData$attractiveness and gogglesData$alcohol 

None 2 Pints 
2 Pints 1.00000 - 
4 Pints 0.00024 0.00011 


P value adjustment method: bonferroni 

Output 12.6 


Simultaneous Tests for General Linear Hypotheses 
Multiple Comparisons of Means: Tukey Contrasts 

Fit: aov(formula = attractiveness ~ gender + alcohol + gender:alcohol, 
data = gogglesData) 

Linear Hypotheses: 

Estimate Std. Error t value Pr(>|t|) 

2 Pints - None == 0 0.9375 3.2217 0.291 0.954 

4 Pints - None == 0 -17.1875 3.2217 -5.335 1.01e-05 *** 

4 Pints - 2 Pints == 0 -18.1250 3.2217 -5.626 < 1e-05 *** 

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

(Adjusted p values reported -- single-step method) 

> confint(postHocs) 

Simultaneous Confidence Intervals 


Multiple Comparisons of Means: Tukey Contrasts 

Fit: aov(formula = attractiveness ~ gender + alcohol + gender:alcohol, 
data = gogglesData) 
Quantile = 2.4303 

95% family-wise confidence level 

Linear Hypotheses: 
                        Estimate      lwr      upr
2 Pints - None == 0       0.9375  -6.8921   8.7671
4 Pints - None == 0     -17.1875 -25.0171  -9.3579
4 Pints - 2 Pints == 0  -18.1250 -25.9546 -10.2954


Output 12.7 


12.5.12. Overall conclusions


In summary, we should conclude that alcohol has an effect on the attractiveness of selected 
mates. Overall, after a relatively small dose of alcohol (2 pints) humans are still in control 
of their judgements and the attractiveness levels of chosen partners are consistent with 
a control group (no alcohol consumed). However, after a greater dose of alcohol, the 
attractiveness of chosen mates decreases significantly. This effect is what is referred to as 
the ‘beer-goggles effect’. More interesting, the interaction shows a gender difference in 
the beer-goggles effect. Specifically, it looks as though men are significantly more likely to 
pick less attractive mates when drunk. Women, in comparison, manage to maintain their 
standards despite being drunk. What we still don’t know is whether women will become 
susceptible to the beer-goggles effect at higher doses of alcohol. 


12.5.13. Plots in factorial ANOVA


We saw in the previous two chapters that the aov() function automatically generates some 
plots that we can use to test the assumptions. We can see these graphs by executing: 

plot(gogglesModel) 

The results are in Figure 12.11. The first graph (on the left) can be used for testing homogeneity
of variance: if it has a funnel shape then we’re in trouble. The plot we have does
show funnelling (the spread of scores is wider at some points than at others), which implies 
that the residuals might be heteroscedastic (a bad thing). The second plot (on the right) is 
a Q-Q plot (see Chapter 5), which tells us about the normality of residuals in the model. 
We want our residuals to be normally distributed, which means that the dots on the graph 
should hover around the diagonal line. On our plot this is the case, suggesting that we can 
assume normality of our residuals/errors. 
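If you would rather put a number on the funnelling than eyeball it, one rough option is to compare the spread of the residuals across the six cells. The sketch below is not part of the analysis above; it assumes the scores are as listed in the wide-format printout later in this chapter, and the factor names are local stand-ins for the columns of gogglesData.

```r
# Scores stacked by cell: female then male, each None, 2 Pints, 4 Pints
attractiveness <- c(
  65, 70, 60, 60, 60, 55, 60, 55,
  70, 65, 60, 70, 65, 60, 60, 50,
  55, 65, 70, 55, 55, 60, 50, 50,
  50, 55, 80, 65, 70, 75, 75, 65,
  45, 60, 85, 65, 70, 70, 80, 60,
  30, 30, 30, 55, 35, 20, 45, 40
)
gender  <- gl(2, 24, labels = c("Female", "Male"))
alcohol <- gl(3, 8, 48, labels = c("None", "2 Pints", "4 Pints"))

model <- aov(attractiveness ~ gender + alcohol + gender:alcohol)

# Standard deviation of the residuals within each gender x alcohol cell;
# very unequal values are the numeric counterpart of a funnel shape
cell_sd <- tapply(resid(model), list(gender, alcohol), sd)
round(cell_sd, 2)
```

Executing plot(model, which = 1:2) on this toy model draws the same two diagnostic plots as described above.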


12.6. Interpreting interaction graphs


Interactions are very important, and the key to understanding them is being able to interpret
interaction graphs. We’ve already had a look at one interaction graph when we interpreted
the analysis in this chapter. We used Figure 12.8 to conclude that the interaction
probably reflected the fact that men and women chose equally attractive dates after no 
alcohol and 2 pints, but that at 4 pints men’s standards dropped significantly more than 
FIGURE 12.11 Plots of the beer-goggles model: Residuals vs Fitted (left) and Normal Q-Q (right) for aov(attractiveness ~ gender + alcohol + gender:alcohol)



CRAMMING SAM’S TIPS 


Two-way independent ANOVA 


• Two-way independent ANOVA compares several means when there are two independent variables and different participants 
have been used in all experimental conditions. For example, if you wanted to know whether different teaching methods 
worked better for different subjects, you could take students from four courses (Psychology, Geography, Management and 
Statistics) and assign them to either lecture-based or book-based teaching. The two variables are course and method of 
teaching. The outcome might be the end of year mark (as a percentage). 

• Test for homogeneity of variance using Levene’s test. If the p-value is less than .05 then the assumption is violated. 

• A ‘main effect’ is the effect of a variable in isolation, whereas an ‘interaction’ represents the combined effect of two or more 
variables. 

• In the main analysis you’ll get a summary table containing a main effect of each predictor variable and an effect of the interaction
between the two variables; if the p-value is less than .05 then the effect is significant. For main effects consult post
hoc tests to see which groups differ, and for the interaction look at contrasts, an interaction graph or conduct simple effects 
analysis. If the interaction effect is significant it makes little sense to interpret or do further analysis on the main effects. 

• For post hoc tests, look at the p-value of each test to discover if your comparisons are significant (they will be if the significance
value is less than .05).

• Test the same assumptions as for one-way independent ANOVA (see Chapter 10). 


women’s. Imagine we’d got the profile of results shown in Figure 12.12; do you think we 
would’ve still got a significant interaction effect? 

This profile of data probably would also give rise to a significant interaction term 
because, although the attractiveness of men and women’s dates is similar after no alcohol 
and 4 pints of alcohol, there is a big difference after 2 pints. This reflects a scenario in 
which the beer-goggles effect is equally big in men and women after 4 pints (and doesn’t 
exist after no alcohol) but kicks in quicker for men: the attractiveness of their dates plummets
after 2 pints, whereas women maintain their standards until 4 pints (at which point
they’d happily date an unwashed skunk). Let’s try another example. Is there a significant 
interaction in Figure 12.13? 
FIGURE 12.12 Another interaction graph

FIGURE 12.13 A ‘lack of’ interaction graph

For the data in Figure 12.13 there is unlikely to be a significant interaction because 
the effect of alcohol is the same for men and women. So, for both men and women, the 
attractiveness of their dates after no alcohol is quite high, but after 2 pints all types drop by 
a similar amount (the slope of the male and female lines is about the same). After 4 pints 
there is a further drop and, again, this drop is about the same in men and women (the 
lines again slope at about the same angle). The fact that the line for males is lower than for 
females just reflects the fact that across all conditions, men have lower standards than their 
female counterparts: this reflects a main effect of gender (i.e., males generally chose less 
attractive dates than females at all levels of alcohol). There are two general points that we 
can make from these examples: 


• Non-parallel lines on an interaction graph suggest that there may be an interaction. However,
it’s important to remember that non-parallel lines do not automatically
mean that the interaction is significant: whether the interaction is significant
will depend on the degree to which the lines are not parallel.

• If the lines on an interaction graph cross then obviously they are not parallel and this
can give away a possible significant interaction. However, contrary to popular belief,
it isn’t always the case that if the lines of the interaction graph cross then the interaction
is significant.
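One way to make ‘degree of non-parallelism’ concrete is to look at the gap between the two lines at each dose: perfectly parallel lines would give the same gap everywhere. The sketch below computes the cell means and the female-minus-male gap for the beer-goggles scores, assuming the scores are as listed in the wide-format printout later in this chapter; the variable names are local stand-ins.

```r
attractiveness <- c(
  65, 70, 60, 60, 60, 55, 60, 55,   # female, none
  70, 65, 60, 70, 65, 60, 60, 50,   # female, 2 pints
  55, 65, 70, 55, 55, 60, 50, 50,   # female, 4 pints
  50, 55, 80, 65, 70, 75, 75, 65,   # male, none
  45, 60, 85, 65, 70, 70, 80, 60,   # male, 2 pints
  30, 30, 30, 55, 35, 20, 45, 40    # male, 4 pints
)
gender  <- gl(2, 24, labels = c("Female", "Male"))
alcohol <- gl(3, 8, 48, labels = c("None", "2 Pints", "4 Pints"))

# Mean attractiveness in each gender x alcohol cell (the plotted points)
cell_means <- tapply(attractiveness, list(gender, alcohol), mean)
cell_means

# Vertical gap between the female and male lines at each dose
gap <- cell_means["Female", ] - cell_means["Male", ]
gap
```

The gap is small and of similar size at no alcohol and 2 pints, but much larger (and in the opposite direction) at 4 pints, which is the non-parallelism the interaction term picks up. Whether that degree of non-parallelism is significant is still a matter for the test, not the picture.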


FIGURE 12.14 Bar charts showing interactions between two variables (panels (a) to (d); x-axis: alcohol consumption or gender)




A further complication is that sometimes people draw bar charts rather than line charts. 
Figure 12.14 shows some bar charts of interactions between two independent variables. 
Panels (a) and (b) actually display the data from the example used in this chapter (in fact, 
why not have a go at plotting them). As you can see, there are two ways to present the 
same data: panel (a) shows the data when levels of alcohol are placed along the x-axis 
and different-coloured bars are used to show means for males and females, and panel 
(b) shows the opposite scenario in which gender is plotted on the x-axis and different 
colours distinguish the dose of alcohol. Both of these graphs show an interaction effect. 
What you’re looking for is for the differences between coloured bars to be different at 
different points along the x-axis. So, for panel (a) you’d look at the difference between 
the light and dark blue bars for no alcohol, and then look to 2 pints and ask, ‘Is the difference
between the bars different than when I looked at no alcohol?’ In this case the
dark and light blue bars look the same at no alcohol as they do at 2 pints - hence, no 
interaction. However, we’d then move on to look at 4 pints, and we’d again ask, ‘Is the 
difference between the light and dark blue bars different than what it has been in any of 
the other conditions?’ In this case the answer is yes: for no alcohol and 2 pints, the light 
and dark blue bars were about the same height, but at 4 pints the dark blue bar is much 
higher than the light one. This shows an interaction: the pattern of responses changes at 
4 pints. Panel (b) shows the same thing but plotted the other way around. Again we look 
at the pattern of responses. So, first we look at the men and see that the pattern is that 
the first two bars are the same height, but the last bar is much shorter. The interaction 
effect is shown up by the fact that for the women there is a different pattern: all three 
bars are about the same height. 



SELF-TEST

What about panels (c) and (d): do you think there is an interaction?


Again, they display the same data in two different ways, but it’s different data than what 
we’ve used in this chapter. First let’s look at panel (c): for the no-alcohol data, the dark 
bar is a little bit bigger than the light one; moving on to the 2-pints data, the dark bar is 
also a little bit taller than the light bar; and finally for the 4-pints data the dark bar is again 
higher than the light one. In all conditions the same pattern is shown - the dark blue bar 
is a bit higher than the light blue one (i.e., females pick more attractive dates than men 
regardless of alcohol consumption) - therefore, there is no interaction. Looking at panel 
(d), we see a similar result. For men, the pattern is that attractiveness ratings fall as more 
alcohol is drunk (the bars decrease in height) and then for the women we see the same pattern:
ratings fall as more is drunk. This again is indicative of no interaction: the change in
attractiveness due to alcohol is similar in men and women. 


12.7. Robust factorial ANOVA


As with one-way ANOVA, Wilcox (2005) describes robust procedures for conducting factorial
ANOVA. To access these we need to load the WRS package (see section 5.8.4). There
are four functions that we will look at: 

• t2way(): This performs a two-way independent ANOVA on trimmed means. 

• mcp2atm(): This performs post hoc tests for a two-way independent design based on 
trimmed means. 

• pbad2way(): This performs a two-way independent ANOVA using M-measures of 
location (e.g., the median) and a bootstrap. 

• mcp2a(): This performs post hoc tests for the above function. 








The first problem we have is that these functions need the data to be in wide format 
rather than long (see Chapter 3). Figure 12.15 shows the existing data format (long) and 
how we need it to look (wide). Essentially we want levels of our two factors to be represented
in different columns. Therefore, rather than a dataframe with three columns and 48
rows, we want one with six columns and eight rows. 

We could re-enter the data in the wide format (which is very tempting when you’ve spent 
half an hour trying to work out how to get R to restructure it for you), but we’re going to 
look at how to use melt() and cast() to do the restructuring for us. To get the restructuring 
to work, we need to add a variable to our dataframe that identifies the rows in the wide 
format. Notice in Figure 12.15 that the data are made up of six chunks that represent the 
combinations of gender and alcohol, and each chunk contains eight rows. We want to 
move these chunks from being stacked on top of each other to being beside each other. To 
do this, R needs to know what row a particular score will end up in when we move each 
block of scores from the stack into the columns. The easiest approach is simply to create a 
variable (called row) that identifies within each chunk the row number of a given score. In 
other words, it will be a value from 1 to 8 telling us whether the score is the first, second, 
third, etc. score within the chunk. At the moment, the chunks are stacked on top of each 
other so we want a variable that is the sequence of numbers 1 to 8 repeated for each of the 
six chunks. We can add this variable to the dataframe by executing: 

gogglesData$row<-rep(1:8, 6)


This command uses the rep() function to create a variable row in the dataframe gogglesData,
that is, the numbers 1 to 8 repeated six times (rep(1:8, 6)). The dataframe now looks
like this (edited):



   gender alcohol attractiveness row
1  Female    None             65   1
2  Female    None             70   2
3  Female    None             60   3
4  Female    None             60   4
5  Female    None             60   5
6  Female    None             55   6
7  Female    None             60   7
8  Female    None             55   8
9  Female 2 Pints             70   1
10 Female 2 Pints             65   2
11 Female 2 Pints             60   3
12 Female 2 Pints             70   4
13 Female 2 Pints             65   5
14 Female 2 Pints             60   6
15 Female 2 Pints             60   7
16 Female 2 Pints             50   8

Note that the structure is the same as before, but with the added variable
row, which identifies the scores within each combination of gender and alcohol as a value
from 1 to 8.
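The rep(1:8, 6) trick relies on the chunks being stacked in order with exactly eight scores each. If you are not sure that is true of your own data, a more defensive sketch is to number the scores within each group explicitly. Here cell is a made-up stand-in for the six gender-by-alcohol groups:

```r
# Stand-in grouping variable: six cells of eight scores, stacked in order
cell <- gl(6, 8)

# Approach from the text: repeat the sequence 1 to 8 six times
row_rep <- rep(1:8, 6)

# Alternative: count upwards within each cell, whatever its size or position
row_ave <- ave(seq_along(cell), cell, FUN = seq_along)

identical(row_rep, row_ave)
```

For the balanced, ordered data in this chapter the two approaches agree, so rep() is the simpler choice.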

Now we have changed the data set we need to make it molten so that we can cast the data 
into the wide format. To do this we use the melt() function (see section 3.9.4). Remember 
that in this function we differentiate variables that identify attributes of the scores (in this 
case, gender, alcohol, and row all tell us about a given attractiveness score, for example, 
that it was the first score in the male group who drank 2 pints) from the scores or measured 
variables themselves. Attributes are specified with the id option, and scores with the measured
option. Therefore, we can create a molten dataframe called gogglesMelt by executing:

gogglesMelt<-melt(gogglesData, id = c("row", "gender", "alcohol"), measured = c("attractiveness"))
FIGURE 12.15 Restructuring the data for robust factorial ANOVA: from long format (‘molten’, with columns Gender, Alcohol and attractiveness) to wide format


Having melted the data, we want to cast it in the wide format using cast(). To do this we 
use a formula in the form: variables specifying the rows ~ variables specifying the columns. 
In this case, row tells us in which row to place a score, and we want the alcohol and gender 
variables split across different columns, so we’d use the formula: row ~ gender + alcohol. 
Therefore, we can make a wide dataframe called gogglesWide by executing: 

gogglesWide<-cast(gogglesMelt, row ~ gender + alcohol) 

Note that we have applied this command to the molten data set (gogglesMelt). The result 
is that the data have been transformed from the long format to the wide format. However, 
because we added the variable row to the dataframe, our new dataframe also contains this 
variable, and for the analysis we want only the alcohol and gender variables, therefore, we 
want to remove row. We can do this by executing: 

gogglesWide$row<-NULL 
which basically zaps the variable row into oblivion. If you look at the dataframe you’ll see 
a lovely wide format set of data: 

gogglesWide 

  F_None F_2 Pints F_4 Pints M_None M_2 Pints M_4 Pints
      65        70        55     50        45        30
      70        65        65     55        60        30
      60        60        70     80        85        30
      60        70        55     65        65        55
      60        65        55     70        70        35
      55        60        60     75        70        20
      60        60        50     75        80        45
      55        50        50     65        60        40


It’s important to note the order of the columns because this affects how we specify the 
robust analysis. In this case, the hierarchy of the independent variables is gender followed 
by alcohol. In other words, we have taken the six groups and first divided them into male 
and female, then within the male and female groups we have subdivided according to the 
amount of alcohol they drank. We would say that gender is factor A and alcohol factor B. If 
this idea is not clear then Figure 12.15 might help you to visualize it. As such, the order of 
the columns reflects a 2 x 3 design (2 levels of gender divided up into 3 levels of alcohol). 
If the columns were ordered as F_None, M_None, F_2 Pints, M_2 Pints, F_4 Pints, M_4 
Pints, then we would have a 3 x 2 design (3 levels of alcohol each divided up into 2 levels 
of gender). In this case factor A would be alcohol and factor B gender. 
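Because the long data are stacked in six chunks of eight, the same 2 x 3 column layout can also be sketched in base R without melt() and cast(). This is an alternative to the approach above rather than what the text does; it only works because every chunk has the same number of scores and the chunks are already in the right order.

```r
# The 48 scores in long-format order: female none, female 2 pints,
# female 4 pints, male none, male 2 pints, male 4 pints
attractiveness <- c(
  65, 70, 60, 60, 60, 55, 60, 55,
  70, 65, 60, 70, 65, 60, 60, 50,
  55, 65, 70, 55, 55, 60, 50, 50,
  50, 55, 80, 65, 70, 75, 75, 65,
  45, 60, 85, 65, 70, 70, 80, 60,
  30, 30, 30, 55, 35, 20, 45, 40
)

# matrix() fills column by column, so each chunk of 8 becomes one column;
# the column order (gender first, alcohol within gender) gives a 2 x 3 design
wide <- matrix(
  attractiveness, nrow = 8,
  dimnames = list(NULL, c("F_None", "F_2 Pints", "F_4 Pints",
                          "M_None", "M_2 Pints", "M_4 Pints"))
)
wide
```

The result is a matrix rather than a dataframe, but its columns are in the same order as gogglesWide, so factor A is still gender and factor B alcohol.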

The function t2way() takes the general form: 

t2way(levels of factor A, levels of factor B, data, tr = .2, alpha = .05) 

As with other functions we’ve encountered, the level of trimming is by default 20% (tr = 
.2) but can be changed by including the tr = option. Also the default alpha level is .05 but 
can be changed by including the alpha = option. Assuming we are happy with the default 
level of trimming, we need only specify the dataframe (gogglesWide) and the levels of factor 
A (2 in this case as explained above) and factor B (3 in this case). Therefore, we can do a 
robust two-way factorial ANOVA based on trimmed means by executing: 

t2way(2,3, gogglesWide) 

The function pbad2way() has a similar format: 

pbad2way(levels of factor A, levels of factor B, data, est = mom, nboot = 
2000) 

The main differences are an option to control the number of bootstrap samples (nboot), 
although the default of 2000 is fine, and an option est to control the M-estimator that you 
want to use. You can use est = median (to use the median) or est = mom (to use a method 
based on identifying and removing outliers). In smaller samples you might find that est = 
mom throws up an error message, in which case switch to est = median. If we’re happy 
with 2000 bootstrap samples and using mom rather than median then we can run the 
analysis for the current data by executing: 2 

pbad2way(2,3, gogglesWide) 

2 If you want to compare medians then execute: 
pbad2way(2,3, gogglesWide, est = median) 

The output of both of these commands is shown in Output 12.8. For t2way() (left-hand 
side of Output 12.8) we are given a test statistic for factor A ($Qa), factor B ($Qb) and 
their interaction ($Qab) as well as the corresponding p-value ($A.p.value, $B.p.value, and 
$AB.p.value respectively). Remember that factor A was gender and factor B alcohol; therefore, 
we could conclude that there was no significant main effect of gender, Q = 1.67, p 
= .209, but there was a significant main effect of alcohol, Q = 48.28, p = .001, and a 
significant gender x alcohol interaction, Q = 26.26, p = .001. The bottom of the output 
shows the trimmed means on which these results are based: factor A (gender) is represented
by rows, and factor B (alcohol) by columns. So, for example, the trimmed mean of 
the attractiveness score for females who drank 2 pints was 63.3. 
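The trimmed means in the output can be checked with the base R mean() function and its trim argument. The sketch below rebuilds the wide-format scores as a matrix (wide is a local stand-in for gogglesWide) and computes a 20% trimmed mean for each column:

```r
wide <- cbind(
  "F_None"    = c(65, 70, 60, 60, 60, 55, 60, 55),
  "F_2 Pints" = c(70, 65, 60, 70, 65, 60, 60, 50),
  "F_4 Pints" = c(55, 65, 70, 55, 55, 60, 50, 50),
  "M_None"    = c(50, 55, 80, 65, 70, 75, 75, 65),
  "M_2 Pints" = c(45, 60, 85, 65, 70, 70, 80, 60),
  "M_4 Pints" = c(30, 30, 30, 55, 35, 20, 45, 40)
)

# 20% trimming with 8 scores drops the lowest and highest score per group
tm <- apply(wide, 2, mean, trim = 0.2)

# Arrange as in $means: rows are gender (factor A), columns alcohol (factor B)
matrix(tm, nrow = 2, byrow = TRUE,
       dimnames = list(c("Female", "Male"), c("None", "2 Pints", "4 Pints")))
```

With eight scores per group, 20% of 8 is 1.6, which is rounded down, so one score is removed from each tail before averaging the remaining six.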

The output of pbad2way() (right-hand side of Output 12.8) tells us much the same things 
but we get only p-values and no test statistics: there was no significant main effect of gender,
p = .171, but there was a significant main effect of alcohol, p < .001, and a significant 
gender x alcohol interaction, p < .001. 


t2way() 

$Qa 
[1] 1.666667 

$A.p.value 
[1] 0.209 

$Qb 
[1] 48.2845 

$B.p.value 
[1] 0.001 

$Qab 
[1] 26.25718 

$AB.p.value 
[1] 0.001 

$means 
     [,1]     [,2]     [,3]
[1,] 60.0 63.33333 56.66667
[2,] 67.5 67.50000 35.00000

pbad2way() 

$sig.levelA 
[1] 0.171 

$sig.levelB 
[1] 0 

$sig.levelAB 
[1] 5e-04 

Output 12.8 

The post hoc tests for each analysis are conducted using the same command structure. 
That is, we define the number of levels of factor A, then factor B, then indicate the dataframe.
Therefore, to run post hoc tests based on a 20% trimmed mean, we execute: 3 

mcp2atm(2,3, gogglesWide) 

To conduct post hoc tests based on an M-estimator we execute: 4 
mcp2a(2,3, gogglesWide) 

3 Obviously if you changed the level of trim for the main analysis you would need to do the same here. For 
example, for 10% trimmed means: 

t2way(2,3, gogglesWide, tr = .1) 

mcp2atm(2,3, gogglesWide, tr = .1) 

4 Remember that if you chose the median as your M-estimator then you would need to execute: 
mcp2a(2,3, gogglesWide, est = median) 

Output 12.9 shows the post hoc tests based on trimmed means (mcp2atm). The main 
effect of gender is tested by $Factor.A$test and $Factor.A$psihat. We have two choices. 
The first is to interpret the value in the column labelled test against the critical value (crit): 
if the test value is larger than the critical value then the test is significant (at p < .05). In 
this case, 1.29 is smaller than 2.06 so the result is non-significant. The second choice is to 
interpret psihat and its confidence interval and p-value. We should focus on interpreting 
the confidence interval because (unlike the p-value) it is corrected for the number of tests. 
In this case the confidence interval crosses zero, which indicates a non-significant result. 
These tests of gender, because it contains only two levels, basically just confirm what we 
already know from the main analysis. 

The effect of alcohol ( $Factor.B$test and $Factor.B$psihat) is more interesting because 
it breaks down the main effect of alcohol. There are three contrasts to interpret, but 
how do we know what they mean? To interpret these contrasts we need to look at the 
contrast codes for factor A, B and the interaction at the bottom of the output. The rows 
labelled [1,] ... [6,] relate to the six columns of data. In other words they are: F_None, 
F_2 Pints, F_4 Pints, M_None, M_2 Pints, M_4 Pints. Remembering that groups with 
positive codes are compared against groups with negative codes, $conA tells us that A 
was the effect of gender (you have the three female groups coded with 1 and the three 
male groups coded with —1). Similarly, $conB tells us that B was the effect of alcohol 
split into three contrasts. Each contrast is in a separate column. We could rewrite this 
matrix as: 

          Con1 Con2 Con3
F_None       1    1    0
F_2 Pints   -1    0    1
F_4 Pints    0   -1   -1
M_None       1    1    0
M_2 Pints   -1    0    1
M_4 Pints    0   -1   -1


Remembering that 0 means that the group is not involved, and that positives are compared 
to negatives, the first contrast (column 1) compares 2 pints to none, the second contrast 
(column 2) is 4 pints compared to none, and the third (column 3) is 2 pints compared to 
4 pints. 

Finally, the codes for the interaction ($conAB) are the same as for the main effect of 
alcohol except that the plus and minus signs are reversed for males and females, which 
tests whether the effect of alcohol differs across gender. In other words, contrast 1 compares
whether the difference between 2 pints and no alcohol is different in men and 
women. 
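To see how the contrast codes turn into the psihat values reported in the output, you can multiply each set of codes by the six trimmed means and add up the results. The sketch below uses the trimmed means printed in Output 12.8 (rounded as shown there) and the codes from $conA, $conB and $conAB; the object names are local to this example.

```r
# Trimmed means from Output 12.8, in column order:
# F_None, F_2 Pints, F_4 Pints, M_None, M_2 Pints, M_4 Pints
tm <- c(60, 63.33333, 56.66667, 67.5, 67.5, 35)

# Contrast codes as printed at the bottom of the mcp2atm() output
conA <- matrix(c(1, 1, 1, -1, -1, -1), ncol = 1)
conB <- cbind(c(1, -1,  0, 1, -1,  0),
              c(1,  0, -1, 1,  0, -1),
              c(0,  1, -1, 0,  1, -1))

# Interaction codes: the alcohol codes with the signs flipped for males
conAB <- conB * c(1, 1, 1, -1, -1, -1)

# psihat is the sum of (code x trimmed mean) for each contrast
psiA  <- drop(t(conA)  %*% tm)
psiB  <- drop(t(conB)  %*% tm)
psiAB <- drop(t(conAB) %*% tm)

psiA
psiB
psiAB
```

These should agree, up to rounding, with the psihat columns for factor A, factor B and the interaction. Dividing each psihat by its standard error (the se column) gives the corresponding test value.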

For the main effect of alcohol, contrast 1 is not significant (-0.52 is smaller than 2.68 
and the confidence interval for psihat crosses zero), but contrasts 2 (5.75 is greater than 
2.65 and the confidence interval for psihat does not contain zero) and 3 (6.18 is greater 
than 2.64 and the confidence interval for psihat does not contain zero) are. This indicates 
a significant difference in attractiveness scores for 4 pints compared to both no alcohol and 
2 pints, but not between 2 pints and no alcohol. 

For the interaction term, we get the same profile of results: contrast 1 is not significant
(-0.52 is smaller than 2.68 and the confidence interval for psihat crosses zero), 
but contrasts 2 (-4.68 is greater than 2.65 - you can ignore the minus sign - and 
the confidence interval for psihat does not contain zero) and 3 (-4.08 is greater than 
2.64 and the confidence interval for psihat does not contain zero) are. These findings
indicate that the difference in attractiveness scores for 4 pints compared to both 
no alcohol and 2 pints differed in men and women, but that the lack of difference 
between 2 pints and no alcohol was similar for males and females. This profile of 
results tells the same story as the factorial ANOVA that we interpreted in the main part 
of the chapter. 

$Factor.A
$Factor.A$test
     con.num     test     crit       se       df
[1,]       1 1.290994 2.065879 7.745967 23.57301

$Factor.A$psihat
     con.num psihat  ci.lower ci.upper   p.value
[1,]       1     10 -6.002228 26.00223 0.2092233

$Factor.B
$Factor.B$test
     con.num       test     crit       se       df
[1,]       1 -0.5203149 2.678921 6.406377 14.50207
[2,]       2  5.7486837 2.647995 6.233311 16.11968
[3,]       3  6.1847459 2.636865 6.332785 16.81814

$Factor.B$psihat
     con.num    psihat  ci.lower ci.upper      p.value
[1,]       1 -3.333333 -20.49551 13.82885 6.106962e-01
[2,]       2 35.833333  19.32755 52.33911 2.905447e-05
[3,]       3 39.166667  22.46796 55.86537 1.047835e-05

$Factor.AB
$Factor.AB$test
     con.num       test     crit       se       df
[1,]       1 -0.5203149 2.678921 6.406377 14.50207
[2,]       2 -4.6791611 2.647995 6.233311 16.11968
[3,]       3 -4.0793005 2.636865 6.332785 16.81814

$Factor.AB$psihat
     con.num     psihat  ci.lower  ci.upper      p.value
[1,]       1  -3.333333 -20.49551  13.82885 0.6106961628
[2,]       2 -29.166667 -45.67245 -12.66089 0.0002466289
[3,]       3 -25.833333 -42.53204  -9.13463 0.0007964981

$All.Tests
[1] NA

$conA
     [,1]
[1,]    1
[2,]    1
[3,]    1
[4,]   -1
[5,]   -1
[6,]   -1

$conB
     [,1] [,2] [,3]
[1,]    1    1    0
[2,]   -1    0    1
[3,]    0   -1   -1
[4,]    1    1    0
[5,]   -1    0    1
[6,]    0   -1   -1

$conAB
     [,1] [,2] [,3]
[1,]    1    1    0
[2,]   -1    0    1
[3,]    0   -1   -1
[4,]   -1   -1    0
[5,]    1    0   -1
[6,]    0    1    1

Output 12.9
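
The decision rule used to read Output 12.9 can be sketched in a few lines of R. This is only an illustration: the numbers are typed in by hand from the $Factor.B$test part of the output, and the object names (test, crit, significant) are made up for this example rather than anything that mcp2atm() creates.

```r
# Values copied by hand from $Factor.B$test in Output 12.9
test <- c(-0.5203149, 5.7486837, 6.1847459)  # test statistics for contrasts 1-3
crit <- c(2.678921, 2.647995, 2.636865)      # corresponding critical values

# A contrast is significant when the absolute value of the test statistic
# exceeds its critical value (the sign only tells us the direction)
significant <- abs(test) > crit

print(data.frame(contrast = 1:3, test, crit, significant))
```

Swapping in the values from $Factor.AB$test should give the same pattern for the interaction contrasts, because only the absolute size of the test statistic matters.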



Output 12.10 shows the post hoc tests based on an M-estimator (mcp2a). The interpretation
of these results is exactly the same as for the trimmed means. If the value of sig.test
is less than the critical value (sig.crit) and the confidence interval does not cross zero then
the contrast is significant. For the main effect of alcohol we find a significant difference in
attractiveness scores for 4 pints compared to both no alcohol, ψ̂ = 35.80, p < .001, and
2 pints, ψ̂ = 40.80, p < .001, but not between 2 pints and no alcohol, ψ̂ = -5, p = .383.
Similarly, for the interaction term, males and females differed in terms of the difference
in attractiveness ratings between 4 pints compared to both no alcohol, ψ̂ = -32.23,
p < .001, and 2 pints, ψ̂ = -27.23, p < .01, but they did not differ in terms of the
difference between 2 pints and no alcohol, ψ̂ = -5, p = .318.

$FactorA
     con.num   psihat sig.test sig.crit  ci.lower ci.upper
[1,]       1 14.46429   0.1515    0.025 -10.08929 28.23214

$FactorB
     con.num   psihat sig.test sig.crit  ci.lower ci.upper
[1,]       1 -5.00000   0.3825    0.025 -18.83929 13.24405
[2,]       2 35.80357   0.0000    0.025  20.62500 51.84524
[3,]       3 40.80357   0.0000    0.025  21.25000 55.20833

$Interactions
     con.num    psihat sig.test sig.crit  ci.lower   ci.upper
[1,]       1  -5.00000   0.3180    0.025 -19.37500  12.500000
[2,]       2 -32.23214   0.0005    0.025 -45.20833 -13.750000
[3,]       3 -27.23214   0.0015    0.025 -41.96429  -9.583333

Output 12.10




OLIVER TWISTED 

Please Sir, can I have some more ... robust methods?

‘These robust tests are not nearly complicated enough’, salivates
Oliver with a maniacal look in his eye and suspiciously empty
bowl of additive-ridden sweets by his side. ‘I want to add in a
third independent variable, and then I want the magic number ferret
to lick the brains from my skull.’ Oh dear, he’s lost it. You can
lose it too by finding out how to do a robust three-way independent
ANOVA on the companion website. If you’re lucky you
might get a brain licking too or, at the very least, a headache.

12.8. Calculating effect sizes



As we saw in previous chapters (e.g., section 11.6), we can use omega squared (ω²) as an
effect size measure. The calculation of ω² becomes somewhat more cumbersome in factorial
designs (‘somewhat’ being one of my characteristic understatements!). Howell (2006),
as ever, does a wonderful job of explaining the complexities of it all (and has a nice table
summarizing the various components for a variety of situations). Condensing all of this
down, I’ll just say that we need to first compute a variance component for each of the
effects (the two main effects and the interaction term) and the error, and then use these to
calculate effect sizes for each. If we call the first main effect A, the second main effect B and
the interaction effect A × B, then the variance components for each of these are based on
the mean squares of each effect and the sample sizes on which they’re based:


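$$\hat{\sigma}^2_{\alpha} = \frac{(a-1)(MS_A - MS_R)}{nab}$$

$$\hat{\sigma}^2_{\beta} = \frac{(b-1)(MS_B - MS_R)}{nab}$$

$$\hat{\sigma}^2_{\alpha\beta} = \frac{(a-1)(b-1)(MS_{A \times B} - MS_R)}{nab}$$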


In these equations, a is the number of levels of the first independent variable, b is the number 
of levels of the second independent variable and n is the number of people per condition. 

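We also need to estimate the total variability, and this is just the sum of these other
variance components plus the residual mean squares:

$$\hat{\sigma}^2_{\text{total}} = \hat{\sigma}^2_{\alpha} + \hat{\sigma}^2_{\beta} + \hat{\sigma}^2_{\alpha\beta} + MS_R$$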

The effect size is then the variance estimate for the effect in which you’re interested 
divided by the total variance estimate: 


$$\omega^2_{\text{effect}} = \frac{\hat{\sigma}^2_{\text{effect}}}{\hat{\sigma}^2_{\text{total}}}$$


We can write a function in R to compute the effect sizes for us (see R’s Souls’ Tip 6.2). This 
process might seem like a faff, but remember that once you have the function written, you 
can use it again and again. Output 12.4 gives us the sums of squares for each effect and the 
interaction, so it would be nice to be able to enter these values to get the resulting omega 
squared. We can write and execute this function: 


omega_factorial<-function(n, a, b, SSa, SSb, SSab, SSr)
{
	# Convert sums of squares to mean squares
	MSa<-SSa/(a-1)
	MSb<-SSb/(b-1)
	MSab<-SSab/((a-1)*(b-1))
	MSr<-SSr/(a*b*(n-1))
	# Variance components for each effect, and the total
	varA<-((a-1)*(MSa-MSr))/(n*a*b)
	varB<-((b-1)*(MSb-MSr))/(n*a*b)
	varAB<-((a-1)*(b-1)*(MSab-MSr))/(n*a*b)
	varTotal<-varA + varB + varAB + MSr
	# Omega squared: each component divided by the total
	print(paste("Omega-Squared A: ", varA/varTotal))
	print(paste("Omega-Squared B: ", varB/varTotal))
	print(paste("Omega-Squared AB: ", varAB/varTotal))
}

This creates a function called omega_factorial().⁵ First, we tell R that we want to be able
to input n, a, b, SSa, SSb, SSab, and SSr into the function (these are specified in brackets).
This means that to use the function we have to input these values in brackets in the correct
order. The rest of the function uses these values to compute the various values of ω².
The first four commands take the sums of squares and convert them to mean squares by
dividing by the degrees of freedom (rather than have you input the degrees of freedom
by hand, we calculate them from n, a and b: the number of people per condition and the
number of levels of the two independent variables). The next four lines calculate the
variance estimates in the equations above; for example, varA computes $\hat{\sigma}^2_{\alpha}$
by writing out the equation above in R-speak (because of how I have labelled everything in
the function you should be able to compare directly the command in the function with the
equation above). The final three lines print some text (in speech marks) that describes
which ω² we’re calculating followed by each variance estimate divided by the total variance
estimate (i.e., $\hat{\sigma}^2_{\text{effect}}/\hat{\sigma}^2_{\text{total}}$).

Having executed this function we can use it to calculate ω² in the current data by using
the values of n (8 people per group), a (levels of gender = 2), b (levels of alcohol = 3) and
the four sums of squares from Output 12.4:

omega_factorial(8, 2, 3, 169, 3332, 1978, 3488)

Executing this command will print the following to the console: 


[1] "Omega-Squared A: 0.00949745068429" 

[1] "Omega-Squared B: 0.34982188991376" 

[1] "Omega-Squared AB: 0.200209417472152" 

For the main effect of gender we get ω²(gender) = 0.009; for the main effect of alcohol we get
ω²(alcohol) = 0.350; and for the interaction ω²(gender × alcohol) = 0.200.
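
If you would rather have the three effect sizes as numbers than as printed text, the same arithmetic can be wrapped up slightly differently. The following is a minimal sketch, not part of the DSUR package: the name omega_factorial_values() is made up for this example, and it simply returns a named vector using exactly the equations above.

```r
# Hypothetical variant of omega_factorial() that returns the values
# instead of printing them (the name is invented for this illustration)
omega_factorial_values <- function(n, a, b, SSa, SSb, SSab, SSr) {
  # Mean squares: sums of squares divided by their degrees of freedom
  MSa  <- SSa / (a - 1)
  MSb  <- SSb / (b - 1)
  MSab <- SSab / ((a - 1) * (b - 1))
  MSr  <- SSr / (a * b * (n - 1))

  # Variance components for each effect
  varA  <- ((a - 1) * (MSa - MSr)) / (n * a * b)
  varB  <- ((b - 1) * (MSb - MSr)) / (n * a * b)
  varAB <- ((a - 1) * (b - 1) * (MSab - MSr)) / (n * a * b)

  # Total variance estimate, then omega squared for each effect
  varTotal <- varA + varB + varAB + MSr
  c(A = varA / varTotal, B = varB / varTotal, AB = varAB / varTotal)
}

# Same inputs as in the text: n = 8, a = 2, b = 3, sums of squares from Output 12.4
omega <- omega_factorial_values(8, 2, 3, 169, 3332, 1978, 3488)
print(round(omega, 3))
```

Returning a vector means that the values can be rounded, stored or dropped into a report without retyping them.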

I have mentioned several times that it is perhaps more useful to quantify focused differences
(i.e., between two things) than overall effects. In the case of a factorial ANOVA when
there is a significant interaction, we might compute effect sizes for the simple effects
(section 12.5.10). In other words, compute the differences between means for one independent
variable at different levels of the other independent variable. In the current example,
we might compute effect sizes for the effect of gender at different levels of alcohol. We
could again use the mes() function from the compute.es package:

mes(mean_males, mean_females, sd_males, sd_females, n_males, n_females)

We have all the information we need to use the mes() function in Output 12.2. For example, 
if we want to compare men and women who drank no alcohol we would execute: 

mes(66.875, 60.625, 10.3293963, 4.95515604, 8, 8) 

We have entered the mean of the men who drank no alcohol (66.875), the mean of women 
who drank no alcohol (60.625), the corresponding standard deviations (10.329 and 4.955), 
and the sample sizes (both 8). 

Similarly we can get effect sizes for the difference between men and women who drank 
2 pints by executing: 

mes(66.875, 62.5, 12.5178444, 6.5465367, 8, 8) 


Finally, the difference between men and women who drank 4 pints can be quantified by 
executing: 

mes(35.625, 57.5, 10.8356225, 7.0710678, 8, 8) 

The (edited) outputs of these commands are shown in Output 12.11. The difference in
attractiveness scores between males and females who drank no alcohol is a medium effect
(the means are under a standard deviation different), d = 0.77, r = .36; the difference
between males and females who drank 2 pints is a fairly small effect (there is less than half
a standard deviation difference between the group means), d = 0.44, r = .21; finally, the
difference between males and females who drank 4 pints is a very large effect (the means
are more than 2 standard deviations apart), d = -2.39, r = -.77.

5 If you install the package DSUR, which we produced for this book, you can use this function without executing
these commands.

No Alcohol: Males vs. Females
$MeanDifference
         d      var.d          g      var.g
 0.7715168  0.2686012  0.7294340  0.2400984

$Correlation
          r       var.r
 0.35990788  0.04428981

2 Pints: Males vs. Females
$MeanDifference
         d      var.d          g      var.g
 0.4379891  0.2559948  0.4140988  0.2288298

$Correlation
         r      var.r
 0.2139249  0.0556082

4 Pints: Males vs. Females
$MeanDifference
          d       var.d           g       var.g
 -2.3909552   0.4286458  -2.2605394   0.3831598

$Correlation
            r        var.r
 -0.767030763  0.007475955

Output 12.11
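
To see where the d and r values in Output 12.11 come from, they can be recomputed by hand in base R. This is a sketch under assumptions: the helper names d_from_summary() and r_from_d() are invented for this example, and the formulas (a pooled standard deviation for d, and the usual conversion from d to r) are standard ones that reproduce the values above, which is not the same as a statement of how mes() is coded internally.

```r
# Cohen's d from two means and standard deviations, using the pooled SD
d_from_summary <- function(m1, m2, sd1, sd2, n1, n2) {
  sd_pooled <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))
  (m1 - m2) / sd_pooled
}

# Convert d to r; the correction term equals 4 when the groups are equal in size
r_from_d <- function(d, n1, n2) {
  d / sqrt(d^2 + (n1 + n2)^2 / (n1 * n2))
}

# Males vs. females after no alcohol and after 4 pints (values from Output 12.2)
d_none <- d_from_summary(66.875, 60.625, 10.3293963, 4.95515604, 8, 8)
d_four <- d_from_summary(35.625, 57.5, 10.8356225, 7.0710678, 8, 8)

print(round(c(d = d_none, r = r_from_d(d_none, 8, 8)), 2))
print(round(c(d = d_four, r = r_from_d(d_four, 8, 8)), 2))
```

The negative sign for the 4-pint comparison reflects only the order in which the means were entered (males first), not anything about the size of the effect.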


12.9. Reporting the results of two-way ANOVA


As with the other ANOVAs we’ve encountered, we have to report the details of the F-ratio
and the degrees of freedom from which it was calculated. For the various effects in these
data the F-ratio will be based on different degrees of freedom: it was derived from dividing
the mean squares for the effect by the mean squares for the residual. For the effects of
alcohol and the alcohol × gender interaction, the model degrees of freedom were 2 (dfM = 2),
but for the effect of gender the degrees of freedom were only 1 (dfM = 1). For all effects,
the degrees of freedom for the residuals were 42 (dfR = 42). We can, therefore, report the
three effects from this analysis as follows:

✓ There was a significant main effect of the amount of alcohol consumed at the nightclub
on the attractiveness of the mate selected, F(2, 42) = 20.07, p < .001,
ω² = .35. The Bonferroni post hoc tests revealed that the attractiveness of selected
dates was significantly lower after 4 pints than both after 2 pints and no alcohol
(both ps < .001). The attractiveness of dates after 2 pints and no alcohol was not
significantly different.

✓ There was a non-significant main effect of gender on the attractiveness of selected
mates, F(1, 42) = 2.03, p = .161, ω² = .009.

✓ There was a significant interaction effect between the amount of alcohol consumed
and the gender of the person selecting a mate, on the attractiveness of the partner
selected, F(2, 42) = 11.91, p < .001, ω² = .20. This indicates that males and females
were affected differently by alcohol. Specifically, the attractiveness of partners
was similar in males (M = 66.88, SD = 10.33) and females (M = 60.63, SD =
4.96) after no alcohol, d = 0.77; the attractiveness of partners was also similar in
males (M = 66.88, SD = 12.52) and females (M = 62.50, SD = 6.55) after 2 pints,
d = 0.44; however, attractiveness of partners selected by males (M = 35.63, SD =
10.84) was significantly lower than those selected by females (M = 57.50, SD =
7.07) after 4 pints, d = -2.39.
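
As a check on the values being reported, the degrees of freedom, F-ratios and p-values can be rebuilt from the sums of squares used earlier in the chapter. This is a rough sketch with made-up object names; because the sums of squares are entered as rounded whole numbers, the F-ratios may differ from the reported ones in the second decimal place.

```r
n <- 8  # people per condition
a <- 2  # levels of gender
b <- 3  # levels of alcohol

# Sums of squares as rounded in the text (taken from Output 12.4)
SS   <- c(gender = 169, alcohol = 3332, interaction = 1978)
SS_r <- 3488

# Degrees of freedom for each effect and for the residual
df_effect <- c(gender = a - 1, alcohol = b - 1, interaction = (a - 1) * (b - 1))
df_resid  <- a * b * (n - 1)

# F = mean squares for the effect / mean squares for the residual
F_ratio <- (SS / df_effect) / (SS_r / df_resid)
p_value <- pf(F_ratio, df_effect, df_resid, lower.tail = FALSE)

print(round(cbind(df_effect, df_resid, F_ratio, p_value), 3))
```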



Labcoat Leni’s Real Research 12.1 


Don’t forget your toothbrush?


Davey, G. C. L., et al. (2003). Journal of Behavior Therapy & Experimental Psychiatry, 34, 141-160. 


We have all experienced that feeling after we have left the house of wondering whether we locked the door, or 
closed the window, or whether we remembered to remove the bodies from the fridge in case the police turn 
up. This behaviour is normal; however, people with obsessive compulsive disorder (OCD) tend to check things 
excessively. They might, for example, check whether they have locked the door so often that it takes them an hour 
to leave their house. It is a very debilitating problem. 

One theory of this checking behaviour in OCD suggests that it is caused by a combination of the mood you 
are in (positive or negative) interacting with the rules you use to decide when to stop a task (do you continue until 
you feel like stopping, or until you have done the task as best as you can?). Davey, Startup, Zara, MacDonald, 
and Field (2003) tested this hypothesis by inducing a negative, positive or no mood in different people and then 
asking them to imagine that they were going on holiday and to generate as many things as they could that they 
should check before they went away. Within each mood group, half of the participants were instructed to generate
as many items as they could (known as an ‘as many as can’ stop rule), whereas the remainder were asked
to generate items for as long as they felt like continuing the task (known as a ‘feel like continuing’ stop rule). The 
data are in the file Davey2003.dat. 

Davey et al. hypothesized that people in negative moods, using an ‘as many as can’ stop rule, would generate 
more items than those using a ‘feel like continuing’ stop rule. Conversely, people in a positive mood would generate
more items when using a ‘feel like continuing’ stop rule compared to an ‘as many as can’ stop rule. Finally,
in neutral moods, the stop rule used shouldn’t affect the number of items generated. Draw an error bar 
chart of the data and then conduct the appropriate analysis to test Davey et al.’s hypotheses. 

Answers are in the additional material on the companion website (or look at pages 148-149 in the 
original article). 

What have I discovered about statistics?


This chapter has been a whistle-stop tour of factorial ANOVA. In fact we’ll come across 
more factorial ANOVAs in the next two chapters, but for the time being we’ve just looked 
at the situation where there are two independent variables, and different people have 
been used in all experimental conditions. We started off by discovering that even complex
ANOVAs are simply regression analyses in disguise. We moved on to look at how to
calculate the various sums of squares in this analysis, but, most important, we saw that 
we get three effects: two main effects (the effect of each of the independent variables) 
and an interaction effect. We moved on to see how this analysis is done using R and how 
the output is interpreted. Much of this was similar to the ANOVAs we’ve come across in 
previous chapters, but one big difference was the interaction term. We spent a bit of time 
exploring interactions (and especially interaction graphs) to see what an interaction looks 
like and how to spot it. The brave readers also found out how to follow up an interaction 
with simple effects analysis. Finally, we discovered that calculating effect sizes in factorial 
designs is a complete headache and should be attempted only by the criminally insane. 
So far we’ve steered clear of repeated-measures designs, but in the next chapter I have to 
resign myself to the fact that I can’t avoid explaining them for the rest of my life.

We also discovered that no sooner had I started my first band than it disintegrated. I 
went with drummer Mark to sing in a band called the Outlanders, who were much better
musically but were not, if the truth were told, metal enough for me. They also sacked
me after a very short period of time for not being able to sing like Bono (an insult at the 
time, but in retrospect ...). 


R packages used in this chapter 


car 

compute.es 

ggplot2 


multcomp 

pastecs 

reshape 

WRS 


R functions used in this chapter 


Anova()
aov()
by()
cast()
contrasts()
confint()
factor()
ggplot()
gl()
glht()
leveneTest()
list()
lm()
mcp2a()
mcp2atm()
melt()
mes()
pairwise.t.test()
pbad2way()
plot()
read.csv()
rep()
stat.desc()
summary()
summary.lm()
t2way()

Key terms that I’ve discovered


Beer-goggles effect 
Factorial ANOVA 
Independent factorial design 


Interaction graph 
Mixed design 
Related factorial design 
Simple effects analysis 


Smart Alex’s tasks 


• Task 1: People’s musical tastes change as they get older (my parents, for example,
after years of listening to relatively cool music when I was a kid, subsequently hit their
mid-forties and developed a worrying obsession with country and western music).
This worries me immensely because the future seems bleak if it is spent listening to
Garth Brooks and thinking ‘oh boy, did I underestimate Garth’s immense talent when
I was in my 20s’. So, I did some imaginary research to find out whether my fate really
was sealed, or whether it’s possible to be old and like good music too. First, I got
two groups of people (45 people in each group): one group contained young people
(which I arbitrarily decided was under 40 years of age) and the other group contained
more mature individuals (above 40 years of age). This is my first independent variable,
age. I then split each of these groups of 45 into three smaller groups of 15 and
assigned them to listen to Fugazi (who everyone knows are the coolest band on the
planet),⁶ ABBA or Barf Grooks (a less well-known country and western musician not
to be confused with anyone real who produces music that makes me want to barf).
This is my second independent variable, music. After listening to the music I got each
person to rate it on a scale ranging from -100 (please poke a pencil through my eardrum
so I don’t have to listen any more) through 0 (I am completely indifferent) to
+100 (I love this music so much, it gives me a tingle down my spine). This variable
is called liking. The data are in the file fugazi.dat. Conduct a two-way independent
ANOVA on them.

• Task 2: In Chapter 3 we used some data that related to men and women’s psychological
arousal levels when watching either Bridget Jones’s Diary or Memento
(ChickFlick.dat). Analyse these data to see whether men and women differ in their
reactions to different types of films.

• Task 3: At the start of this chapter I described a way of empirically researching
whether I wrote better songs than my old band mate Malcolm, and whether this
depended on the type of song (a symphony or song about flies). The outcome variable
would be the number of screams elicited by audience members during the songs.
These data are in the file Escape From Inside.dat. Draw an error bar graph (lines) and
analyse and interpret these data.

• Task 4: Using R’s Souls’ Tip 12.2, conduct a simple effects analysis of the effect of
alcohol at different levels of gender (which is the opposite to the example in the
chapter).

• Task 5: Back in 2008, hospitals were reporting an increase in injuries related
to playing Nintendo Wii
(http://www.telegraph.co.uk/news/uknews/1576244/Spate-of-injuries-blamed-on-Nintendo-Wii.html).
These injuries were attributed mainly to muscle and tendon strains. A researcher was
interested to see whether these injuries could be prevented. She hypothesized that a
stretching warm-up before playing Wii would help lower injuries, and that athletes
would be less susceptible to injuries because their regular activity makes them more
flexible. She took 60 athletes and 60 non-athletes (athlete), half of them played Wii
and half watched others playing as a control (wii), and within these groups half did a
5-minute stretch routine before playing/watching whereas the other half did not
(stretch). The outcome was a pain score out of 10 (where 0 is no pain, and 10 is severe
pain) after playing for 4 hours (injury). The data are in the file Wii.dat. Conduct a
three-way ANOVA to test whether athletes are less prone to injury, and whether the
prevention programme worked.

6 See http://www.dischord.com

The answers are on the companion website. Task 1 is an example from Field and Hole
(2003) and so has a more detailed answer if you feel like you want it.


Further reading 


Howell, D. C. (2006). Statistical methods for psychology (6th ed.). Belmont, CA: Duxbury. (Or you 
might prefer his Fundamental Statistics for the Behavioral Sciences, also in its 6th edition, 2007.) 
Rosenthal, R., Rosnow, R. L., & Rubin, D. B. (2000). Contrasts and effect sizes in behavioural research: 
A correlational approach. Cambridge: Cambridge University Press. (This is quite advanced but 
really cannot be bettered for contrasts and effect size estimation.) 

Rosnow, R. L., & Rosenthal, R. (2005). Beginning behavioral research: A conceptual primer (5th ed.). 
Upper Saddle River, NJ: Pearson/Prentice Hall. (Has some wonderful chapters on ANOVA, with 
a particular focus on effect size estimation, and some very insightful comments on what
interactions actually mean.)


Interesting real research 


Davey, G. C. L., Startup, H. M., Zara, A., MacDonald, C. B., & Field, A. P. (2003). Perseveration of
checking thoughts and mood-as-input hypothesis. Journal of Behavior Therapy & Experimental 
Psychiatry, 34, 141-160. 





13 Repeated-measures designs (GLM 4)




FIGURE 13.1 Scansion in the early days; I used to stare a lot (from left to right: me, Mark and Mark)


13.1. What will this chapter tell me?


At the age of 15, I was on holiday with my friend Mark (the drummer) in Cornwall. I had
a pretty decent mullet by this stage (nowadays I just wish I had enough hair to grow a mullet)
and had acquired a respectable collection of heavy metal T-shirts from going to various
gigs. We were walking along the cliff tops one evening at dusk reminiscing about our times
in Andromeda. We came to the conclusion that the only thing we hadn’t enjoyed about
that band was Malcolm and that maybe we should reform it with a different guitarist.¹ As I
was wondering who we could get to play guitar, Mark pointed out the blindingly obvious: I
played guitar. So, when we got home Scansion was born.² As the singer, guitarist and songwriter,
I set about writing some songs. I moved away from writing about flies and set my
sights on the pointlessness of existence, death, betrayal and so on. We had the dubious honour
of being reviewed in the music magazine Kerrang! (in a live review they called us ‘twee’,
which is really not what you want to be called if you’re trying to make music so heavy that
it ruptures the bowels of Satan). Our highlight, however, was playing a gig at the famous
Marquee Club in London (this club has closed now, not as a result of us playing there I hasten
to add, but in its day it started the careers of people like Jimi Hendrix, The Who, Iron
Maiden and Led Zeppelin).³ This was the biggest gig of our career and it was essential that
we played like we never had before. As it turned out, we did: I ran on stage, fell over and in
the process detuned my guitar beyond recognition and broke the zip on my trousers. I spent
the whole gig out of tune and spread-eagled to prevent my trousers falling down. Like I said,
I’d never played like that before. We used to get quite obsessed with comparing how we
played at different gigs. I didn’t know about statistics then (happy days) but if I had I would
have realized that we could rate ourselves and compare the mean ratings for different gigs;
because we would always be the ones doing the rating, this would be a repeated-measures
design, so we would need a repeated-measures ANOVA to compare these means. That’s
what this chapter is about; hopefully it won’t make our trousers fall down.

1 I feel bad about saying this because Malcolm was a very nice guy and, to be honest, at that age (and some would
argue beyond) I could be a bit of a cock.

13.2. Introduction to repeated-measures designs


Over the last three chapters we have looked at a procedure called ANOVA, which is used 
for testing differences between several means. So far we’ve concentrated on situations in 
which different entities contribute to different means; put another way, different people 
take part in different experimental conditions. Actually, it doesn’t have to be different people
(I tend to say people because I’m a psychologist and so spend my life torturing, I mean
testing, people in the name of science), it could be different plants, different companies, 
different plots of land, different viral strains, different goats or even different duck-billed 
platypuses (or whatever the plural is). Anyway, the point is that I’ve completely ignored 
situations in which the same people (plants, goats, hamsters, seven-eyed green galactic 
leaders from space, or whatever) contribute to the different means because explaining how 
to do it in R is a bit of an R-se. I’ve put it off long enough, and now I’m going to take you 
through what happens when we do ANOVA on repeated-measures data. 




SELF-TEST

✓ What is a repeated-measures design? (Clue: it is
described in Chapter 1.)


‘Repeated measures’ is a term used when the same entities participate in all conditions of
an experiment or provide data at multiple time points. For example, you might test the
effects of alcohol on enjoyment of a party. Some people can drink a lot of alcohol without
really feeling the consequences, whereas others, like myself, have only to sniff a pint
of lager and they start flapping around on the floor waving their arms and legs around
shouting ‘Look at me, I’m Andy, King of the lost world of the Haddocks’. Therefore, it
is important to control for individual differences in tolerance to alcohol, and this can be
achieved by testing the same people in all conditions of the experiment: participants could
be given a questionnaire assessing their enjoyment of the party after they had consumed 1
pint, 2 pints, 3 pints and 4 pints of lager.

2 Scansion is a term for the rhythm of poetry. We got the name by searching through a dictionary until we found
a word that we liked. Originally we didn’t think it was ‘metal’ enough, and we decided that any self-respecting
heavy metal band needed to have a big spiky ‘X’ in their name. So, for the first couple of years we spelt it
‘Scanxion’. Like I said, I could be a bit of a cock back then.

3 http://www.themarqueeclub.net

We saw in Chapter 1 that this type of design has several advantages; however, there is 
a big disadvantage if you’re going to use ANOVA to analyse your data. In Chapter 10 we 
saw that the accuracy of the F-test in ANOVA depends upon the assumption that scores in 
different conditions are independent (see section 10.3). When repeated measures are used 
this assumption is violated: scores taken under different experimental conditions are likely 
to be related because they come from the same participants. As such, the conventional 
F-test will lack accuracy. The relationship between scores in different treatment conditions 
means that an additional assumption has to be made and, put simplistically, we assume 
that the relationship between pairs of experimental conditions is similar (i.e., the level of 
dependence between experimental conditions is roughly equal). This assumption is called 
the assumption of sphericity, which, trust me, is a pain in the neck to try to pronounce 
when you’re giving statistics lectures at 9 a.m. 


13.2.1. The assumption of sphericity


The assumption of sphericity can be likened to the assumption of homogeneity of variance
in between-group ANOVA. Sphericity (denoted by ε and sometimes referred to as circularity)
is a more general condition of compound symmetry. Compound symmetry
holds true when both the variances across conditions are equal (this is the same
as the homogeneity of variance assumption in between-group designs) and the
covariances between pairs of conditions are equal. So, we assume that the variation
within experimental conditions is fairly similar and that no two conditions
are any more dependent than any other two. Although compound symmetry has
been shown to be a sufficient condition for ANOVA using repeated-measures
data, it is not a necessary condition. Sphericity is a less restrictive form of compound
symmetry (in fact, much of the early research into repeated-measures
ANOVA confused compound symmetry with sphericity). Sphericity refers to the
equality of variances of the differences between treatment levels. So, if you were
to take each pair of treatment levels, and calculate the differences between each pair of
scores, then it is necessary that these differences have approximately equal variances. As
such, you need at least three conditions for sphericity to be an issue.



13.2.2. How is sphericity measured?


If we were going to check the assumption of sphericity by hand, which incidentally only a
complete lunatic would do, then we could start by calculating the differences between pairs
of scores in all combinations of the treatment levels. Once this has been done, we could
calculate the variance of these differences. Table 13.1 shows data from an experiment with
three conditions. The differences between pairs of scores are computed for each participant
and the variance for each set of differences is calculated. We saw above that sphericity
is met when these variances are roughly equal. For these data, sphericity will hold when:

Variance(A-B) ≈ Variance(A-C) ≈ Variance(B-C)

In these data there is some deviation from sphericity because the variance of the differences
between conditions A and B (15.7) is greater than the variance of the differences
between A and C (10.3) and between B and C (10.7). However, these data have local
circularity (or local sphericity) because two of the variances of differences are very similar.
Therefore, the sphericity assumption has been met for any multiple comparisons involving
these conditions (for a discussion of local circularity see Rouanet & Lépine, 1970). The
deviation from sphericity in the data in Table 13.1 does not seem too severe (all variances
are roughly equal), but can we assess whether a deviation is severe enough to warrant action?


Table 13.1 Hypothetical data to illustrate the calculation of the variance of the differences
between conditions

Group A    Group B    Group C    A-B     A-C     B-C
10         12         8          -2      2       4
15         15         12         0       3       3
25         30         20         -5      5       10
35         30         28         5       7       2
30         27         20         3       10      7

Variance:                        15.7    10.3    10.7
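
Only a complete lunatic would do this by hand, but R will do it without complaint. The sketch below reproduces the bottom row of Table 13.1; the object names (scores, pairs, diff_vars) are my own and are only there for illustration.

```r
# Scores from Table 13.1: five participants (rows), three conditions (columns)
scores <- data.frame(
  A = c(10, 15, 25, 35, 30),
  B = c(12, 15, 30, 30, 27),
  C = c(8, 12, 20, 28, 20)
)

# Every pair of conditions: one column per pair (A-B, A-C, B-C)
pairs <- combn(names(scores), 2)

# Variance of the participant-wise differences for each pair of conditions
diff_vars <- apply(pairs, 2, function(p) var(scores[[p[1]]] - scores[[p[2]]]))
names(diff_vars) <- apply(pairs, 2, paste, collapse = "-")

print(round(diff_vars, 1))
```

The three values should match the variances in the table (15.7, 10.3 and 10.7), so eyeballing them is a rough first check on sphericity.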


13.2.3. Assessing the severity of departures from sphericity


Sphericity can be assessed using a test known as Mauchly’s test, which tests the hypothesis 
that the variances of the differences between conditions are equal. Therefore, if Mauchly’s 
test statistic is significant (i.e., has a probability value less than .05) we should conclude 
that there are significant differences between the variances of differences and, therefore, 
the condition of sphericity is not met. If, however, Mauchly’s test statistic is non-significant 
(i.e., p > .05) then it is reasonable to conclude that the variances of differences are not
significantly different (i.e., they are roughly equal). So, in short, if Mauchly’s test is significant
then we must be wary of the resulting F-ratios. However, like any significance test, it is
dependent on sample size: in big samples small deviations from sphericity can be significant,
and in small samples large violations can be non-significant.
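
As a rough illustration, base R has a mauchly.test() function in the stats package that can be applied to a multivariate linear model. The sketch below runs it on the Table 13.1 scores; this is just one way of getting the test (an assumption on my part, not necessarily the route used later in the chapter), and with only five participants the result is purely illustrative.

```r
# Table 13.1 scores as a matrix: rows are participants, columns are conditions
scores <- cbind(
  A = c(10, 15, 25, 35, 30),
  B = c(12, 15, 30, 30, 27),
  C = c(8, 12, 20, 28, 20)
)

# Intercept-only multivariate linear model: one response per condition
fit <- lm(scores ~ 1)

# X = ~1 removes the overall mean, so the test is carried out on the
# within-participant contrasts (i.e., the differences between conditions)
mauchly <- mauchly.test(fit, X = ~1)
print(mauchly)
```

A p-value below .05 would suggest that the variances of differences are not equal, bearing in mind the warning above about how sample size affects any significance test.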


13.2.4. What is the effect of violating the assumption of sphericity?



Rouanet and Lepine (1970) provided a detailed account of the validity of the F-ratio 
under violations of the sphericity assumption. They argued that there are two different 
F-ratios that can be used to assess treatment comparisons, labelled F' and F", respectively. 
F' refers to an F-ratio derived from the mean squares of the comparison in question and 
the specific error term for the comparison of interest - this is the F-ratio normally used. F" 
is derived not from the specific error mean square but from the total error mean squares 
for all repeated-measures comparisons. Rouanet and Lepine (1970) showed that for F" 
to be valid, overall sphericity must hold (i.e., the whole data set must be spherical), but 
for F' to be valid, sphericity must hold for the specific comparison in question (see also 
Mendoza, Toothaker, & Crain, 1976). F' is the statistic generally used, and the effect of
violating sphericity is a loss of power (compared to when F" is used) and a test statistic 
(F-ratio) that simply cannot be compared to tabulated values of the F-distribution (see 
Oliver Twisted). 
OLIVER TWISTED

Please Sir, can I have some more ... sphericity?


‘Balls’ says Oliver, ‘are spherical, and I like balls. Maybe I’ll like sphericity
too if only you could explain it to me in more detail.’ Be careful
what you wish for, Oliver. In my youth I wrote an article called
‘A bluffer’s guide to sphericity’, which I used to cite in this book,
roughly on this page. Occasionally people ask me for it, so I thought
I might as well reproduce it in the additional material for this chapter.




Not only does sphericity create problems for the F in repeated-measures ANOVA, but 
also it causes some amusing complications for post hoc tests (Jane Superbrain Box 13.1). If 
you don’t want to worry about what these complications are then the take-home message 
is that when sphericity is violated, the Bonferroni method seems to be generally the most 
robust of the univariate techniques, especially in terms of power and control of the Type I 
error rate. When sphericity is definitely not violated, Tukey’s test can be used. 



JANE SUPERBRAIN 13.1

Sphericity and post hoc tests

The violation of sphericity has implications for multiple comparisons. Boik (1981) provided
an estimable account of the effects of non-sphericity on post hoc tests in repeated-measures
designs, and concluded that even very small departures from sphericity produce large
biases in the F-test. He recommends against using these tests for repeated-measures
contrasts. When experimental error terms are small, the power to detect relatively strong
effects can be as low as .05 (when sphericity = .80). Boik argues that the situation for
multiple comparisons cannot be improved and concludes by recommending a multivariate
analogue. Mitzel and Games (1981) found that when sphericity does not hold (ε < 1) the
pooled error term conventionally employed in pairwise comparisons resulted in
non-significant differences between two means declared significant (i.e., a lenient Type I
error rate) or undetected differences (a conservative Type I error rate). Mitzel and Games,
therefore, recommended the use of separate error terms for each comparison.

Maxwell (1980) systematically tested the power and alpha levels for five post hoc tests
under repeated-measures conditions. The tests assessed were Tukey’s wholly significant
difference (WSD) test, which uses a pooled error term; Tukey’s procedure but with a separate
error term with either n - 1 degrees of freedom (labelled SEP1) or (n - 1)(k - 1) degrees
of freedom (labelled SEP2); Bonferroni’s procedure (BON); and a multivariate approach, the
Roy-Bose simultaneous confidence interval (SCI). Maxwell (1980) tested these a priori
procedures, varying the sample size, number of levels of the repeated factor and departure
from sphericity. He found that the multivariate approach was always ‘too conservative for
practical use’ (p. 277), and this was most extreme when n (the number of participants) is
small relative to k (the number of conditions). Tukey’s test inflated the alpha rate
unacceptably with increasing departures from sphericity even when a separate error term was
used (SEP1 and SEP2). The Bonferroni method, however, was extremely robust (although
slightly conservative) and controlled alpha levels regardless of the manipulation. Therefore,
in terms of Type I error rates, the Bonferroni method was best.

In terms of test power (the Type II error rate) for a small sample (n = 8) Maxwell found
WSD to be most powerful under conditions of non-sphericity, but this advantage was severely
reduced when n = 15.

Keselman and Keselman (1988) extended Maxwell’s work within unbalanced designs. They too
used Tukey’s WSD, a modified WSD (with non-pooled error variance), Bonferroni t-statistics
and a multivariate approach, and found that when unweighted means were used (with
unbalanced designs) none of the four tests could control the Type I error rate. When
weighted means were used only the multivariate tests could limit alpha rates, although
Bonferroni t-statistics were considerably better than the two Tukey methods. In terms of
power, Keselman and Keselman (1988) concluded that ‘as the number of repeated treatment
levels increases, BON is substantially more powerful than SCI’ (p. 223).
13.2.5. What do you do if you violate sphericity?


If data violate the sphericity assumption, there are several corrections that
can be applied to produce a valid F-ratio. There are three commonly used
corrections based upon the estimates of sphericity advocated by Greenhouse
and Geisser (1959) and Huynh and Feldt (1976). Both of these estimates
give rise to a correction factor that is applied to the degrees of freedom used
to assess the observed F-ratio. The calculation of these estimates is beyond
the scope of this book (interested readers should consult Girden, 1992); we
need know only that the three estimates differ. The Greenhouse-Geisser correction
(usually denoted as ε̂) varies between 1/(k - 1), where k is the number
of repeated-measures conditions, and 1. The closer ε̂ is to 1, the more homogeneous
the variances of differences, and hence the closer the data are to being spherical.
For example, in a situation in which there are five conditions the lower limit of ε̂ will be
1/(5 - 1), or .25 (known as the lower-bound estimate of sphericity).

Huynh and Feldt (1976) reported that when the Greenhouse-Geisser estimate is greater
than .75 too many false null hypotheses fail to be rejected (i.e., the correction is too
conservative) and Collier, Baker, Mandeville, and Hayes (1967) showed that this was also true
when the sphericity estimate was as high as .90. Huynh and Feldt, therefore, proposed
their own less conservative correction (usually denoted as ε̃). However, Maxwell and
Delaney (1990) report that ε̃ overestimates sphericity. Stevens (2002) therefore recommends
taking an average of the two and adjusting df by this averaged value. Girden (1992)
recommends that when estimates of sphericity are greater than .75 the Huynh-Feldt correction
should be used, but when sphericity estimates are less than .75 or nothing is known
about sphericity at all, then the Greenhouse-Geisser correction should be used instead. We
will see how these values are used in due course.
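
For the curious, here is a minimal sketch of how a correction factor is obtained and then applied. It uses the standard textbook formula for the Greenhouse-Geisser estimate (which this chapter does not derive, so treat it as an assumption imported from elsewhere) on the Table 13.1 scores. The F-ratio of 4.0 and the helper corrected_p() are hypothetical and exist only to show the degrees of freedom being multiplied by the estimate.

```r
# Table 13.1 scores: rows are participants, columns are conditions
scores <- cbind(
  A = c(10, 15, 25, 35, 30),
  B = c(12, 15, 30, 30, 27),
  C = c(8, 12, 20, 28, 20)
)
k <- ncol(scores)

# Orthonormal contrasts among the k conditions (normalised Helmert contrasts)
contrasts <- contr.helmert(k)
contrasts <- sweep(contrasts, 2, sqrt(colSums(contrasts^2)), "/")

# Covariance matrix of the contrast scores
v <- t(contrasts) %*% cov(scores) %*% contrasts

# Greenhouse-Geisser estimate and its lower bound, 1/(k - 1)
gg_epsilon  <- sum(diag(v))^2 / ((k - 1) * sum(v^2))
lower_bound <- 1 / (k - 1)

# Hypothetical helper: both degrees of freedom are multiplied by the estimate
corrected_p <- function(f, df_model, df_resid, epsilon) {
  pf(f, epsilon * df_model, epsilon * df_resid, lower.tail = FALSE)
}

print(c(lower_bound = lower_bound, gg_epsilon = gg_epsilon))
print(corrected_p(f = 4.0, df_model = 2, df_resid = 8, epsilon = gg_epsilon))
```

Because the estimate can never exceed 1, the corrected degrees of freedom are never larger than the uncorrected ones, which is why the correction makes the test more conservative.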

Given that violations of sphericity affect the accuracy of F, a second option when you 
have data that violate sphericity is to use a test other than F. The first possibility is to use 
multivariate test statistics (multivariate analysis of variance, MANOVA), because they are 
not dependent upon the assumption of sphericity (see O’Brien & Kaiser, 1985). MANOVA
is covered in depth in Chapter 16, but we can get R to produce multivariate test statistics 
in the context of repeated-measures ANOVA. However, there may be trade-offs in power 
between these univariate and multivariate tests (see Jane Superbrain Box 13.2). A second 
possibility is to analyse the data as a multilevel model (described in detail in Chapter 19). 
This idea probably sounds a bit scary, but it is simply a regression in which we can include 
multiple observations from the same entities. If we analyse the data in this way then we 
can interpret the model coefficients without worrying about sphericity because dummy-coding
our grouping variables ensures that these coefficients only ever compare two things
(and sphericity is only an issue when comparing three or more means). Also, the model fit 
can be tested without an F-ratio, and if we’re feeling really brave we can explicitly model 
the assumed relationship between observations at different time points (this is called the 
covariance structure and is described in Chapter 19). Although we cover a basic ANOVA 
approach to be consistent with what you might well be taught, we recommend the multilevel
approach and demonstrate that as well for all of the examples.



JANE SUPERBRAIN 13.2

Power in ANOVA and MANOVA

There is a trade-off in test power between univariate and multivariate approaches (although
some authors argue that this can be overcome with suitable mastery of the techniques:
O’Brien and Kaiser, 1985). Davidson (1972) compared the power of adjusted univariate
techniques with those of Hotelling’s T² (a MANOVA test statistic) and found that the
univariate technique was relatively powerless to detect small reliable changes between
highly correlated conditions when other less correlated conditions were also present.
Mendoza, Toothaker, and Nicewander (1974) conducted a Monte Carlo study comparing
univariate and multivariate techniques under violations of compound symmetry and normality
and found that ‘as the degree of violation of compound symmetry increased, the empirical
power for the multivariate tests also increased. In contrast, the power for the univariate
tests generally decreased’ (p. 174). Maxwell and Delaney (1990) noted that the univariate
test is relatively more powerful than the multivariate test as n decreases and proposed
that ‘the multivariate approach should probably not be used if n is less than a + 10 (a is
the number of levels for repeated measures)’ (p. 602). As a rule it seems that when you
have a large violation of sphericity (ε < .7) and your sample size is greater than (a + 10)
then multivariate procedures are more powerful, but with small sample sizes or when
sphericity holds (ε > .7) the univariate approach is preferred (Stevens, 2002). It is
also worth noting that the power of MANOVA increases and decreases as a function of the
correlations between dependent variables (see Jane Superbrain Box 16.1) and so the
relationship between treatment conditions must be considered.



13.3. Theory of one-way repeated-measures ANOVA


In a repeated-measures ANOVA the effect of our experiment is shown up in the
within-participant variance (rather than in the between-group variance). Remember that in
independent ANOVA (section 10.2) the within-participant variance is our residual variance
(SS_R); it is the variance created by individual differences in performance. This variance
is not contaminated by the experimental effect, because whatever manipulation
we’ve carried out has been done on different people. However, when we carry out our
experimental manipulation on the same people, the within-participant variance will be
made up of two things: the effect of our manipulation and, as before, individual differences
in performance. So, some of the within-subject variation comes from the effects of
our experimental manipulation: we did different things in each experimental condition
to the participants, and so variation in an individual’s scores will partly be due to these
manipulations. For example, if everyone scores higher in one condition than another, it’s
reasonable to assume that this happened not by chance, but because we did something different
to the participants in one of the conditions compared to any other one. Because we
did the same thing to everyone within a particular condition, any variation that cannot be
explained by the manipulation we’ve carried out must be due to random factors outside
our control, unrelated to our experimental manipulations (we could call this ‘error’). As
in independent ANOVA, we use an F-ratio that compares the size of the variation due to
our experimental manipulations to the size of the variation due to random factors, the
only difference being how we calculate these variances. If the variance due to our
manipulations is big relative to the variation due to random factors, we get a big value of F, and
we can conclude that the observed results are unlikely to have occurred if there was no
effect in the population.

Figure 13.2 shows how the variance is partitioned in a repeated-measures ANOVA. The
important thing to note is that we have the same types of variances as in independent
ANOVA: we have a total sum of squares (SS_T), a model sum of squares (SS_M) and a residual
sum of squares (SS_R). The only difference between repeated-measures and independent
ANOVA is from where those sums of squares come: in repeated-measures ANOVA the
model and residual sums of squares are both part of the within-participant variance. Let’s
have a look at an example.

FIGURE 13.2 Partitioning variance for repeated-measures ANOVA

I’m a Celebrity, Get Me Out of Here! is a TV show in which celebrities (well, they’re 
not really celebrities as such, more like ex-celebrities), in a pitiful attempt to salvage 
their careers (or just have careers in the first place), go and live in the jungle in Australia 
for a few weeks. During the show these contestants have to do various humiliating and 
degrading tasks to win food for their camp mates. These tasks invariably involve creepy- 
crawlies in places where creepy-crawlies shouldn’t go; for example, you might be locked 
in a coffin full of rats, forced to put your head in a bowl of large spiders, or have eels 
and cockroaches poured onto you. It’s cruel, voyeuristic, gratuitous, car-crash TV, 
and I love it. As a vegetarian, a particular favourite task for me is the bushtucker trials
in which the celebrities have to eat things like live stick insects, witchetty grubs, fish
eyes and kangaroo testicles/penises. Honestly, seeing a fish eye exploding in someone’s
mouth forever scars your mental image of them. I’ve often wondered (perhaps a little too
much) which of the bushtucker foods is the most revolting. Imagine that I tested this by
getting eight celebrities and forcing them to eat four different animals (the aforementioned
stick insect, kangaroo testicle, fish eye and witchetty grub) in counterbalanced order. On
each occasion I measured the time it took the celebrity to retch, in seconds. This is a
repeated-measures design because every celebrity eats every food. The independent variable
was the type of food eaten and the dependent variable was the time taken to retch.

The data for this example are in Table 13.2. There were four foods, each eaten by eight 
different celebrities. Their times taken to retch are shown. In addition, the mean amount of 
time to retch for each celebrity is shown in the table (and the variance in the time taken to 
retch), and also the mean time to retch for each item eaten. The total variance in retching 
time will, in part, be caused by the fact that different animals are more or less palatable (the 
manipulation), and will, in part, be caused by the fact that the celebrities themselves will 
differ in their constitution (individual differences). 
Table 13.2 Data for the bushtucker example

Celebrity   Stick insect   Kangaroo testicle   Fish eye   Witchetty grub   Mean    s²
1           8              7                   1          6                5.50    9.67
2           9              5                   2          5                5.25    8.25
3           6              2                   3          8                4.75    7.58
4           5              3                   1          9                4.50    11.67
5           8              4                   5          8                6.25    4.25
6           7              5                   6          7                6.25    0.92
7           10             2                   7          2                5.25    15.58
8           12             6                   8          1                6.75    20.92
Mean        8.13           4.25                4.13       5.75




13.3.1. The total sum of squares (SS_T)


Remember from one-way independent ANOVA that SS_T is calculated using the following
equation (see equation (10.4)):

$$SS_T = s^2_{\text{grand}}(N - 1)$$

Well, in repeated-measures designs the total sum of squares is calculated in exactly the 
same way. The grand variance in the equation is simply the variance of all scores when we 
ignore the group to which they belong. So if we treated the data as one big group it would 
look as follows: 



8    7    1    6
9    5    2    5
6    2    3    8
5    3    1    9
8    4    5    8
7    5    6    7
10   2    7    2
12   6    8    1

Grand mean = 5.56
Grand variance = 8.19


The variance of these scores is 8.19 (try this on your calculator). We used 32 scores to 
generate this value, so N is 32. As such the equation becomes: 

$$
\begin{aligned}
SS_T &= s^2_{\text{grand}}(N - 1) \\
&= 8.19(32 - 1) \\
&= 253.89
\end{aligned}
$$

The degrees of freedom for this sum of squares, as with the independent ANOVA, will be
N - 1, or 31.
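
If your calculator has gone missing, the same calculation can be sketched in R. The data are typed in from Table 13.2; the object names (bush, ss_total and so on) are my own choices rather than anything used elsewhere in the chapter.

```r
# Bushtucker data from Table 13.2: rows are celebrities, columns are foods
bush <- cbind(
  stick_insect      = c(8, 9, 6, 5, 8, 7, 10, 12),
  kangaroo_testicle = c(7, 5, 2, 3, 4, 5, 2, 6),
  fish_eye          = c(1, 2, 3, 1, 5, 6, 7, 8),
  witchetty_grub    = c(6, 5, 8, 9, 8, 7, 2, 1)
)

# Treat the data as one big group, ignoring celebrity and food
all_scores     <- as.vector(bush)
n_total        <- length(all_scores)   # N = 32
grand_mean     <- mean(all_scores)
grand_variance <- var(all_scores)

# Total sum of squares and its degrees of freedom
ss_total <- grand_variance * (n_total - 1)
df_total <- n_total - 1

print(c(grand_mean = grand_mean, grand_variance = grand_variance,
        ss_total = ss_total, df_total = df_total))
```

R works with the unrounded grand variance, so it reports SS_T as 253.875 rather than the 253.89 obtained by multiplying the rounded value of 8.19 by 31. The small discrepancy is only rounding error.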


13.3.2. The within-participant sum of squares (SS_W)


The crucial difference in this design is that there is a variance component called the within- 
participant variance (this arises because we’ve manipulated our independent variable 
within each participant). This is calculated using a sum of squares. Generally speaking, 
when we calculate any sum of squares we look at the squared difference between the mean 
and individual scores. This can be expressed in terms of the variance across scores and the 
number of scores on which that variance is based. For example, when we calculated the 
residual sum of squares in independent ANOVA (SS_R) we used the following equation (look
back to equation (10.7)):

$$
\begin{aligned}
SS_R &= \sum_{i=1}^{n}(x_i - \bar{x})^2 \\
&= s^2(n - 1)
\end{aligned}
$$


This equation gave us the variance between individuals within a particular group, and so is 
an estimate of individual differences within a particular group. Therefore, to get the total 
value of individual differences we have to calculate the sum of squares within each group 
and then add them up: 

$$SS_R = s^2_{\text{group 1}}(n_1 - 1) + s^2_{\text{group 2}}(n_2 - 1) + s^2_{\text{group 3}}(n_3 - 1) + \cdots + s^2_{\text{group } n}(n_n - 1)$$

This is all well and good when we have different people in each group, but in
repeated-measures designs we’ve subjected people to more than one experimental condition, and,
therefore, we’re interested in the variation not within a group of people (as in independent
ANOVA) but within an actual person. That is, how much variability is there within an
individual? To find this out we actually use the same equation but we adapt it to look at
people rather than groups. So, if we call this sum of squares SS_W (for within-participant SS)
we could write it as:

$$SS_W = s^2_{\text{person 1}}(n_1 - 1) + s^2_{\text{person 2}}(n_2 - 1) + \cdots + s^2_{\text{person } n}(n_n - 1)$$

This equation simply means that we are looking at the variation in an individual’s scores 
and then adding these variances for all the people in the study. The ns simply represent 
the number of scores on which the variances are based (i.e., the number of experimental 
conditions, or in this case the number of foods). 

All of the variances we need are in Table 13.2, so we can calculate SS_W as:

$$
\begin{aligned}
SS_W &= s^2_{\text{celebrity 1}}(n_1 - 1) + s^2_{\text{celebrity 2}}(n_2 - 1) + \cdots + s^2_{\text{celebrity } n}(n_n - 1) \\
&= 9.67(4 - 1) + 8.25(4 - 1) + 7.58(4 - 1) + 11.67(4 - 1) + 4.25(4 - 1) \\
&\quad + 0.92(4 - 1) + 15.58(4 - 1) + 20.92(4 - 1) \\
&= 29 + 24.75 + 22.75 + 35 + 12.75 + 2.75 + 46.75 + 62.75 \\
&= 236.50
\end{aligned}
$$
The degrees of freedom for each person are n - 1 (i.e., the number of conditions minus 1).
To get the total degrees of freedom we add the dfs for all participants. So, with eight participants
(celebrities) and four conditions (i.e., n = 4), there are 3 degrees of freedom for each celebrity
and 8 × 3 = 24 degrees of freedom in total.
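
The within-participant calculation can be sketched in R by taking the variance of each celebrity's four scores (each row of the data) and weighting it by the number of conditions minus 1. As before, the object names are mine and purely illustrative.

```r
# Bushtucker data from Table 13.2: rows are celebrities, columns are foods
bush <- cbind(
  stick_insect      = c(8, 9, 6, 5, 8, 7, 10, 12),
  kangaroo_testicle = c(7, 5, 2, 3, 4, 5, 2, 6),
  fish_eye          = c(1, 2, 3, 1, 5, 6, 7, 8),
  witchetty_grub    = c(6, 5, 8, 9, 8, 7, 2, 1)
)

# Variance of each celebrity's scores across the four foods (the s² column)
person_variance <- apply(bush, 1, var)
n_conditions    <- ncol(bush)

# Within-participant sum of squares and its degrees of freedom
ss_within <- sum(person_variance * (n_conditions - 1))
df_within <- nrow(bush) * (n_conditions - 1)

print(round(person_variance, 2))
print(c(ss_within = ss_within, df_within = df_within))
```

The printed variances should agree with the final column of Table 13.2, and the sum of squares with the 236.50 calculated above.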


13.3.3. The model sum of squares (SS_M)


So far, we know that the total amount of variation within the data is 253.89 units. We
also know that 236.50 of those units are explained by the variance created by individuals’
(celebrities’) performances under different conditions. Now some of this variation is the
result of our experimental manipulation and some of this variation is simply random
fluctuation. The next step is to work out how much variance is explained by our manipulation
and how much is not. 

In independent ANOVA, we worked out how much variation could be explained by our 
experiment (the model SS) by looking at the means for each group and comparing these 
to the overall mean. So, we measured the variance resulting from the differences between 
group means and the overall mean (see equation (10.5)). We do exactly the same thing with 
a repeated-measures design. First we calculate the mean for each level of the independent 
variable (in this case the mean time to retch for each food) and compare these values to the 
overall mean of all foods. 

So, we calculate this SS in the same way as for independent ANOVA: 

1 Calculate the difference between the mean of each group and the grand mean.

2 Square each of these differences.

3 Multiply each result by the number of participants that contribute to that
mean (n_k).

4 Add the values for each group together:

$$SS_M = \sum_{n=1}^{k} n_k(\bar{x}_k - \bar{x}_{\text{grand}})^2$$

Using the means from the bushtucker data (see Table 13.2), we can calculate SS_M as follows:

$$
\begin{aligned}
SS_M &= 8(8.13 - 5.56)^2 + 8(4.25 - 5.56)^2 + 8(4.13 - 5.56)^2 + 8(5.75 - 5.56)^2 \\
&= 8(2.57)^2 + 8(-1.31)^2 + 8(-1.44)^2 + 8(0.19)^2 \\
&= 83.13
\end{aligned}
$$

For SS_M, the degrees of freedom (df_M) are again one less than the number of things used
to calculate the sum of squares. For the model sums of squares we calculated the sum of
squared errors between the four means and the grand mean. Hence, we used four things
to calculate these sums of squares. Therefore, the degrees of freedom will be 3. So, as with
independent ANOVA the model degrees of freedom are always the number of conditions
(k) minus 1:

$$df_M = k - 1 = 3$$
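
The four steps above can be sketched in R in a couple of lines, assuming the data are held in a matrix with one column per food (the object names are, again, my own).

```r
# Bushtucker data from Table 13.2: rows are celebrities, columns are foods
bush <- cbind(
  stick_insect      = c(8, 9, 6, 5, 8, 7, 10, 12),
  kangaroo_testicle = c(7, 5, 2, 3, 4, 5, 2, 6),
  fish_eye          = c(1, 2, 3, 1, 5, 6, 7, 8),
  witchetty_grub    = c(6, 5, 8, 9, 8, 7, 2, 1)
)

food_means <- colMeans(bush)   # mean time to retch for each food
grand_mean <- mean(bush)       # mean of all 32 scores
n_per_food <- nrow(bush)       # eight celebrities contribute to each mean

# Steps 1-4: difference from the grand mean, square, weight by n, add up
ss_model <- sum(n_per_food * (food_means - grand_mean)^2)
df_model <- ncol(bush) - 1

print(round(food_means, 2))
print(c(ss_model = ss_model, df_model = df_model))
```

R uses the unrounded means, so it returns 83.125, which rounds to the 83.13 obtained by hand.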






DISCOVERING STATISTICS USING R 


13.3.4. The residual sum of squares (SS_R)


We now know that there are 253.89 units of variation to be explained in our data, and that
the variation across our conditions accounts for 236.50 units. Of these 236.50 units, our
experimental manipulation can explain 83.13 units. The final sum of squares is the residual
sum of squares (SS_R), which tells us how much of the variation cannot be explained by
the model. This value is the amount of variation caused by extraneous factors outside of
experimental control. Knowing SS_W and SS_M already, the simplest way to calculate SS_R is to
subtract SS_M from SS_W (SS_R = SS_W - SS_M):

$$
\begin{aligned}
SS_R &= SS_W - SS_M \\
&= 236.50 - 83.13 \\
&= 153.37
\end{aligned}
$$

The degrees of freedom are calculated in a similar way:

$$
\begin{aligned}
df_R &= df_W - df_M \\
&= 24 - 3 \\
&= 21
\end{aligned}
$$
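
Here is the same subtraction sketched in R. The block recomputes SS_W and SS_M from the Table 13.2 data so that it stands on its own; the object names are illustrative only.

```r
# Bushtucker data from Table 13.2: rows are celebrities, columns are foods
bush <- cbind(
  stick_insect      = c(8, 9, 6, 5, 8, 7, 10, 12),
  kangaroo_testicle = c(7, 5, 2, 3, 4, 5, 2, 6),
  fish_eye          = c(1, 2, 3, 1, 5, 6, 7, 8),
  witchetty_grub    = c(6, 5, 8, 9, 8, 7, 2, 1)
)

# Within-participant and model sums of squares, as in the previous sections
ss_within <- sum(apply(bush, 1, var) * (ncol(bush) - 1))
ss_model  <- sum(nrow(bush) * (colMeans(bush) - mean(bush))^2)
df_within <- nrow(bush) * (ncol(bush) - 1)
df_model  <- ncol(bush) - 1

# Residual = whatever within-participant variation the model cannot explain
ss_resid <- ss_within - ss_model
df_resid <- df_within - df_model

print(c(ss_resid = ss_resid, df_resid = df_resid))
```

The unrounded result is 153.375, which matches the hand calculation of 153.37 to within rounding.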


13.3.5. The mean squares


SS_M tells us how much variation the model (e.g., the experimental manipulation) explains
and SS_R tells us how much variation is due to extraneous factors. However, because both
of these values are summed values the number of scores that were summed influences
them. As with independent ANOVA we eliminate this bias by calculating the average sum
of squares (known as the mean squares, MS), which is simply the sum of squares divided
by the degrees of freedom:

$$MS_M = \frac{SS_M}{df_M} = \frac{83.13}{3} = 27.71$$

$$MS_R = \frac{SS_R}{df_R} = \frac{153.37}{21} = 7.30$$

MS_M represents the average amount of variation explained by the model (e.g., the systematic
variation), whereas MS_R is a gauge of the average amount of variation explained by
extraneous variables (the unsystematic variation).


13.3.6. The F-ratio


The F-ratio is a measure of the ratio of the variation explained by the model and the
variation explained by unsystematic factors. It can be calculated by dividing the model mean
squares by the residual mean squares. You should recall that this is exactly the same as for
independent ANOVA:

$$F = \frac{MS_M}{MS_R}$$

So, as with the independent ANOVA, the F-ratio is still the ratio of systematic variation to
unsystematic variation. As such, it is the ratio of the experimental effect to the effect on
performance of unexplained factors. For the bushtucker data, the F-ratio is:

$$F = \frac{MS_M}{MS_R} = \frac{27.71}{7.30} = 3.79$$

This value is greater than 1, which indicates that the experimental manipulation had some
effect above and beyond the effect of extraneous factors. As with independent ANOVA,
this value can be compared against a critical value based on its degrees of freedom (which
are df_M and df_R, which are 3 and 21 in this case).
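
To round off the hand calculation, the sketch below computes the mean squares and the F-ratio in R and compares F with the critical value at the .05 level using qf(). The object names are my own, and the p-value takes no account of sphericity, so treat it as the uncorrected result.

```r
# Bushtucker data from Table 13.2: rows are celebrities, columns are foods
bush <- cbind(
  stick_insect      = c(8, 9, 6, 5, 8, 7, 10, 12),
  kangaroo_testicle = c(7, 5, 2, 3, 4, 5, 2, 6),
  fish_eye          = c(1, 2, 3, 1, 5, 6, 7, 8),
  witchetty_grub    = c(6, 5, 8, 9, 8, 7, 2, 1)
)

# Sums of squares and degrees of freedom from the previous sections
ss_model <- sum(nrow(bush) * (colMeans(bush) - mean(bush))^2)
ss_resid <- sum(apply(bush, 1, var) * (ncol(bush) - 1)) - ss_model
df_model <- ncol(bush) - 1
df_resid <- nrow(bush) * (ncol(bush) - 1) - df_model

# Mean squares: sums of squares divided by their degrees of freedom
ms_model <- ss_model / df_model
ms_resid <- ss_resid / df_resid

# F-ratio, the critical value at alpha = .05, and the uncorrected p-value
f_ratio    <- ms_model / ms_resid
critical_f <- qf(0.95, df_model, df_resid)
p_value    <- pf(f_ratio, df_model, df_resid, lower.tail = FALSE)

print(round(c(ms_model = ms_model, ms_resid = ms_resid, f_ratio = f_ratio,
              critical_f = critical_f), 2))
print(p_value)
```

If the F-ratio is bigger than the critical value (equivalently, if the p-value is below .05), the differences between the foods would be declared significant, provided that the sphericity assumption is tenable.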


13.3.7. The between-participant sum of squares


I mentioned that the total variation is broken down into a within-participant variation and 
a between-participant variation. We sort of forgot about the between-participant variation 
because we didn’t need it to calculate the F-ratio. However, I will just briefly mention what 
it represents. The easiest way to calculate this term is by subtraction, because we know 
from Figure 13.2 that: 


$$SS_T = SS_B + SS_W$$

Now, we have already calculated SS_T and SS_W so by rearranging the equation and replacing
the values of these terms, we get:

$$
\begin{aligned}
SS_B &= SS_T - SS_W \\
&= 253.89 - 236.50 \\
&= 17.39
\end{aligned}
$$


This term represents individual differences between cases. So, in this example, different
celebrities will have different tolerances of eating these sorts of food. This is shown by the
means for the celebrities in Table 13.2. For example, celebrity 4 (M = 4.50) was, on average,
more than 2 seconds quicker to retch than celebrity 8 (M = 6.75). Celebrity 8 had a
better constitution than celebrity 4. The between-participant sum of squares reflects these
differences between individuals. In this case only 17.39 units of variation in the times to
retch can be explained by individual differences between our celebrities.
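
As a quick check, the sketch below gets the between-participant sum of squares in R in two ways: by subtraction, as above, and directly from the celebrity means. The direct route is not used in this chapter; it is the usual sum-of-squares recipe applied to the row means, and I include it only to show that the two answers agree.

```r
# Bushtucker data from Table 13.2: rows are celebrities, columns are foods
bush <- cbind(
  stick_insect      = c(8, 9, 6, 5, 8, 7, 10, 12),
  kangaroo_testicle = c(7, 5, 2, 3, 4, 5, 2, 6),
  fish_eye          = c(1, 2, 3, 1, 5, 6, 7, 8),
  witchetty_grub    = c(6, 5, 8, 9, 8, 7, 2, 1)
)

ss_total  <- var(as.vector(bush)) * (length(bush) - 1)
ss_within <- sum(apply(bush, 1, var) * (ncol(bush) - 1))

# Route 1: subtraction, as in the text
ss_between_subtraction <- ss_total - ss_within

# Route 2: squared deviations of each celebrity's mean from the grand mean,
# weighted by the number of scores behind each mean (four foods)
person_means      <- rowMeans(bush)
ss_between_direct <- ncol(bush) * sum((person_means - mean(bush))^2)

print(c(subtraction = ss_between_subtraction, direct = ss_between_direct))
```

Both routes give 17.375 when nothing is rounded along the way; the 17.39 in the text comes from using the rounded value of SS_T.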



13.4. One-way repeated-measures designs using R


13.4.1. Packages for repeated measures designs in R


I’ve discovered four ways of doing repeated-measures designs of the sort that you might 
analyse with ANOVA; there might be more, but once you’ve found four there really isn’t 
much incentive to keep checking. These methods are: 

• Anova(): We’ve used this function in other chapters, and it is good in that it produces
sphericity tests and corrections like you might be used to seeing in other statistics
packages. However, the process involved in using the function doesn’t follow
naturally from the way I have taught ANOVA thus far, so I decided not to cover this
method. If you’re curious though then a search engine is your friend.

• lm() or aov(): These functions follow most naturally from how we have done ANOVAs in
the previous chapters. However, they don’t produce sphericity tests. Also, if you’re going
to take this linear model approach you’re better off doing it using lme(). So, I will mention
this method for continuity but mainly I use it as a stepping stone to describe lme().

• lme(): This function enables you to do a regression in which observations can be correlated
(which happens in repeated-measures designs). This form of regression is known as
a multilevel model and is covered in more detail in Chapter 19. Therefore, this method 
naturally develops what we have already learnt about the lm() function. Using lme() has 
the benefit that we can forget about sphericity, and it’s also how ‘proper’ statisticians 
would deal with repeated-measures data. The downside is that this approach is a little 
different from what you might be used to if you have ever analysed repeated-measures 
designs (i.e., you are not using ANOVA) with other packages such as SPSS or SAS. 

• ezANOVA(): In case lme() freaks you out we’ll also look at a function called ezANOVA, 
which as the name suggests enables you to do ANOVA easily (and in a way that 
closely matches other statistics packages that you might have used). 
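
To give a flavour of the aov() route mentioned in the list above, here is a minimal sketch using only base R. It assumes the usual convention of putting the repeated-measures factor inside an Error() term; the variable names (bush_wide, bush_long, retch, food) are mine, and the reshaping is done by hand here rather than with any particular package.

```r
# Bushtucker data from Table 13.2 in wide format (one column per food)
bush_wide <- data.frame(
  participant       = factor(paste0("P", 1:8)),
  stick_insect      = c(8, 9, 6, 5, 8, 7, 10, 12),
  kangaroo_testicle = c(7, 5, 2, 3, 4, 5, 2, 6),
  fish_eye          = c(1, 2, 3, 1, 5, 6, 7, 8),
  witchetty_grub    = c(6, 5, 8, 9, 8, 7, 2, 1)
)

# Long format: one row per participant-by-food combination
bush_long <- data.frame(
  participant = rep(bush_wide$participant, times = 4),
  food        = factor(rep(names(bush_wide)[-1], each = 8)),
  retch       = unlist(bush_wide[-1], use.names = FALSE)
)

# Error(participant/food) tells aov() that food varies within participants
fit <- aov(retch ~ food + Error(participant/food), data = bush_long)
print(summary(fit))
```

The within-participant stratum of the summary should show 3 and 21 degrees of freedom and an F-ratio of about 3.79, in line with the hand calculation in section 13.3. Note that it reports no sphericity test, which is exactly the limitation described above.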

If you’re using commands (which actually you have to), then you will need the packages ez
(if you’re going to use ANOVA), ggplot2 (for graphs), multcomp (for post hoc tests), nlme
(if you decide to use a multilevel model), pastecs (for descriptive statistics), reshape (for
reshaping the data) and WRS (for robust tests). If you do not have these packages installed
(some should be installed from previous chapters), you can install them by executing the
following commands:

install.packages("ez"); install.packages("ggplot2"); install.packages("multcomp");
install.packages("nlme"); install.packages("pastecs"); install.packages("reshape");
install.packages("WRS", repos = "http://R-Forge.R-project.org")

You then need to load these packages by executing these commands: 

library(ez); library(ggplot2); library(multcomp); library(nlme); 

library(pastecs); library(reshape); library(WRS) 


13 . 4 . 2 . 


General procedure for repeated-measures designs 


To conduct repeated-measures analysis you should follow this general procedure: 

1 Enter data: which turns out not to be as straightforward as you might think.

2 Explore your data: you know the routine by now - graphs, descriptive statistics and
maybe even a bit of sphericity checking if you’re not going to use lme().

3 Construct or choose contrasts: you need to decide what contrasts to do and to specify
them appropriately for all of the independent variables in your analysis.

4 Compute the ANOVA/multilevel model: you can then run the main analysis. Depending
on what you found in the previous step, you might need to run a robust test.

5 Compute contrasts or post hoc tests: having conducted the main analysis you can follow
it up with post hoc tests or look at the results of your contrasts. Again, the exact
methods you choose will depend upon what you unearth in step 2.

We will work through these steps in turn. 

13.4.3. Repeated-measures ANOVA using R Commander 


Figure 13.3 shows how to do repeated-measures ANOVA using R Commander: you can’t, 
sad faces all round. 



FIGURE 13.3 Repeated-measures ANOVA using R Commander 


13.4.4. Entering the data 


The data for the example can be found in the file Bushtucker.dat. You can load this data file 
by setting your working directory to the appropriate location and executing: 

bushData<-read.delim("Bushtucker.dat", header = TRUE) 

I have structured the data in the format that you’d be most likely to use if you had entered 
the data in another software package and followed the usual conventions. The data have 
been entered in ‘wide’ format; that is, levels of the repeated-measures variable are spread 
across different columns. 



participant  stick_insect  kangaroo_testicle  fish_eye  witchetty_grub
P1                      8                  7         1               6
P2                      9                  5         2               5
P3                      6                  2         3               8
P4                      5                  3         1               9
P5                      8                  4         5               8
P6                      7                  5         6               7
P7                     10                  2         7               2
P8                     12                  6         8               1


These data were originally entered in Excel, and, as you can see, I created a column in 
which I entered text that identifies each participant (P1, P2, etc.). The remaining four 
columns represent each participant’s time to retch after consuming each of the four food 
types. For example, participant 7 took 10 seconds to retch after the stick insect, 2 after the 
kangaroo testicle and witchetty grub and 7 after the fish eye. 

Although the format of the data follows typical conventions, to run the analysis in R we need 
the data to be in the long format. We can do this using the melt() function, which we’ve used 
many times before (e.g., Chapter 3). Remember that in this function we specify columns in the 
data that identify characteristics of the scores (such as from whom they originate) using the id 
option, and columns that identify the scores themselves using the measured option. In this case 
our scores are split over four columns (stick_insect, kangaroo_testicle, fish_eye, witchetty_grub), 
so these are our measured variables, and participant tells us from whom the scores originate, so 
this is our id variable. We can create a new dataframe (called longBush) by executing: 

longBush<-melt(bushData, id = "participant", measured = c("stick_insect", 
"kangaroo_testicle", "fish_eye", "witchetty_grub")) 

This dataframe contains three columns: the first identifies the participant, the second 
identifies the type of food, and the third contains the scores (time to retch). By default, 
these columns will be named participant, variable, and value, which are not the most helpful 
of labels. Let’s rename these columns so that we actually know what they represent by 
executing: 

names(longBush)<-c("Participant", "Animal", "Retch") 

Finally, let’s convert the Animal variable to a factor with suitable labels by executing: 

longBush$Animal<-factor(longBush$Animal, labels = c("Stick Insect", 
"Kangaroo Testicle", "Fish Eye", "Witchetty Grub")) 

The data now look like this: 4 



    Participant             Animal  Retch
1            P1       Stick Insect      8
9            P1  Kangaroo Testicle      7
17           P1           Fish Eye      1
25           P1     Witchetty Grub      6
2            P2       Stick Insect      9
10           P2  Kangaroo Testicle      5
18           P2           Fish Eye      2
26           P2     Witchetty Grub      5
3            P3       Stick Insect      6
11           P3  Kangaroo Testicle      2
19           P3           Fish Eye      3
27           P3     Witchetty Grub      8
4            P4       Stick Insect      5
12           P4  Kangaroo Testicle      3
20           P4           Fish Eye      1
28           P4     Witchetty Grub      9
5            P5       Stick Insect      8
13           P5  Kangaroo Testicle      4
21           P5           Fish Eye      5
29           P5     Witchetty Grub      8
6            P6       Stick Insect      7
14           P6  Kangaroo Testicle      5
22           P6           Fish Eye      6
30           P6     Witchetty Grub      7
7            P7       Stick Insect     10
15           P7  Kangaroo Testicle      2
23           P7           Fish Eye      7
31           P7     Witchetty Grub      2
8            P8       Stick Insect     12
16           P8  Kangaroo Testicle      6
24           P8           Fish Eye      8
32           P8     Witchetty Grub      1


4 To make it clearer that there are four observations for each person I have sorted the data by participant by 
executing: 

longBush<-longBush[order(longBush$Participant),] 

Notice that each participant (identified by the Participant variable) has four scores 
(distinguished by the variable Animal). These four scores are now represented by four different 
rows rather than four columns as they were before. 
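
If you would like to see the same wide-to-long conversion without the reshape package, here is a minimal sketch in base R. It is not the route this chapter takes: it assumes the scores are typed in by hand from the wide table above, and the names bushWide, foods, stacked and longBase are invented for the illustration.

```r
# The wide-format data, typed in from the table shown earlier
bushWide <- data.frame(
  participant = paste0("P", 1:8),
  stick_insect = c(8, 9, 6, 5, 8, 7, 10, 12),
  kangaroo_testicle = c(7, 5, 2, 3, 4, 5, 2, 6),
  fish_eye = c(1, 2, 3, 1, 5, 6, 7, 8),
  witchetty_grub = c(6, 5, 8, 9, 8, 7, 2, 1)
)

foods <- c("stick_insect", "kangaroo_testicle", "fish_eye", "witchetty_grub")

# stack() piles the four score columns on top of each other and records,
# in the column 'ind', which column each score came from
stacked <- stack(bushWide[, foods])

longBase <- data.frame(
  Participant = factor(rep(bushWide$participant, times = length(foods))),
  Animal = factor(stacked$ind, levels = foods,
                  labels = c("Stick Insect", "Kangaroo Testicle",
                             "Fish Eye", "Witchetty Grub")),
  Retch = stacked$values
)

# Sort by participant so that each person's four scores sit together
longBase <- longBase[order(longBase$Participant), ]
head(longBase, 8)
```

The result should have the same three columns and 32 rows as longBush, with one row per participant per animal.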

If we wanted to enter the data directly into R, we would first need to create the variable 
that identifies participants by using the gl() function (Chapter 3). Remember that this function 
takes the general form: 

factor<-gl(number of levels, cases in each level, total cases, labels = 
c("label1", "label2"...)) 

This function creates a factor variable called factor: you specify the number of levels or 
groups of the factor, how many cases are in each level/group, optionally the total number 
of cases (the default is to multiply the number of groups by the number of cases per group), 
and you can also use the labels option to list names for the levels. We had eight 
participants, each contributing four scores, so we can specify it as: 

Participant<-gl(8, 4, labels = c("P1", "P2", "P3", "P4", "P5", "P6", "P7", 
"P8")) 

The numbers in the function tell R that we had eight sets of four scores, and the labels 
option then specifies the names to attach to these eight sets, which correspond to their 
participant number. To create the Animal variable we want four groups, each containing one 
score. This will create four cases (4 × 1 = 4), or, put another way, it will create the codes 
for the first participant. However, we want this pattern to be repeated for the remaining 
participants; we can do this by adding a third value to the function that is the total number 
of cases (i.e., 32). By specifying the total number of cases the gl() function will repeat the 
pattern of four codes until it reaches this total number of cases: 

Animal<-gl(4, 1, 32, labels = c("Stick Insect", "Kangaroo Testicle", 
"Fish Eye", "Witchetty Grub")) 

We can add the times to retch by creating a numeric variable in the usual way: 

Retch<-c(8, 7, 1, 6, 9, 5, 2, 5, 6, 2, 3, 8, 5, 3, 1, 9, 8, 4, 5, 8, 7, 5, 
6, 7, 10, 2, 7, 2, 12, 6, 8, 1) 

Finally, we can merge these variables into a dataframe called longBush by executing: 

longBush<-data.frame(Participant, Animal, Retch) 
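
When you type data in by hand like this, it is worth checking that the codes and scores line up as intended. The sketch below is one way to do that, using only base R; the names design and wideCheck are invented for the illustration.

```r
Participant <- gl(8, 4, labels = paste0("P", 1:8))
Animal <- gl(4, 1, 32, labels = c("Stick Insect", "Kangaroo Testicle",
                                  "Fish Eye", "Witchetty Grub"))
Retch <- c(8, 7, 1, 6, 9, 5, 2, 5, 6, 2, 3, 8, 5, 3, 1, 9, 8, 4, 5, 8, 7, 5,
           6, 7, 10, 2, 7, 2, 12, 6, 8, 1)
longBush <- data.frame(Participant, Animal, Retch)

# Each participant should appear exactly once with each animal
design <- table(longBush$Participant, longBush$Animal)
print(design)

# Cross-tabulating the scores recovers the wide layout, which makes it
# easy to compare what you typed against the original table
wideCheck <- xtabs(Retch ~ Participant + Animal, data = longBush)
print(wideCheck)
```

If every cell of design is 1 and wideCheck matches the wide table from earlier, the gl() codes are doing what we wanted.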


13.4.5. Exploring the data 


As ever, we’ll look at some graphs first. Let’s start with the means across the different 
conditions. 



SELF-TEST 

Use ggplot2 to plot a bar graph (with error bars) of the time to retch with the type of 
animal eaten on the x-axis. 


The resulting plot (Figure 13.4) shows that on average celebrities were quickest to retch 
after eating a testicle or eyeball (the means are lowest). Comparatively speaking, the stick 
insect was the most palatable because it took the longest time to induce retching. 

FIGURE 13.4 Mean time to retch after eating four different animals 

We can also look at boxplots for the time taken to retch after eating the different animals. 



Figure 13.5 shows boxplots for these data. These show a similar profile to the bar chart: 
the median time to retch is highest for the stick insect and lowest for the testicle and 
eyeball. In addition, we can see that the middle part of the distribution of scores is a little more 
spread out for the fish eye and witchetty grub (the ‘boxes’ are longer) than the testicle and 
stick insect. 

We have previously used the by() function and the stat.desc() function in the pastecs 
package to get descriptive statistics for separate groups (see Chapter 5 for more detail). 
Therefore, if we wanted to explore the effects of the type of animal on retching times, we 
could do so by executing: 

by(longBush$Retch, longBush$Animal, stat.desc) 
longBush$Animal: Stick Insect 
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var 
      8.0000 
longBush$Animal: Kangaroo Testicle 

Output 13.1 

The resulting (edited) output is in Output 13.1. From this table we can see that, on 
average, the time taken to retch was longest after eating the stick insect, and shortest after 
eating a testicle or eyeball. These mean values are useful for interpreting any significant 
effects that the main analysis throws up (pun intended). 
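
If you just want the headline numbers for each animal, a minimal base R sketch (assuming the data are entered exactly as in section 13.4.4; byAnimal and descriptives are names invented here) is:

```r
Animal <- gl(4, 1, 32, labels = c("Stick Insect", "Kangaroo Testicle",
                                  "Fish Eye", "Witchetty Grub"))
Retch <- c(8, 7, 1, 6, 9, 5, 2, 5, 6, 2, 3, 8, 5, 3, 1, 9, 8, 4, 5, 8, 7, 5,
           6, 7, 10, 2, 7, 2, 12, 6, 8, 1)

# Split the scores into one vector per animal
byAnimal <- split(Retch, Animal)

# Mean, median and standard deviation of the time to retch for each animal
descriptives <- data.frame(
  mean = sapply(byAnimal, mean),
  median = sapply(byAnimal, median),
  sd = sapply(byAnimal, sd)
)
print(round(descriptives, 3))
```

This gives a small subset of what stat.desc() reports, but it is enough to confirm the pattern described above: the stick insect has the highest mean and median.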


FIGURE 13.5 Boxplots of the bushtucker data (x-axis: Type of Animal Eaten, with categories 
Stick Insect, Kangaroo Testicle, Fish Eye and Witchetty Grub) 






















13.4.6. Choosing contrasts 


It’s useful to follow up the main analysis with contrasts that break down the main effect 
(or effects) and tell us where the differences between groups lie. For one-way independent 
ANOVA, we entered codes that defined the contrasts we want to do. We can follow the 
same procedure for repeated-measures designs. As we have seen before, if we want to look 
at Type III sums of squares (see Jane Superbrain Box 11.1) then we must specify contrasts, 
and the contrasts must also be orthogonal, otherwise the resulting sums of squares will not 
be the Type III ones that we’re expecting. 

Let’s imagine that we predicted that because eyes and testicles resemble human body 
parts, celebrities would be more disgusted by eating them than witchetty grubs and stick 
insects (which are eaten whole and don’t resemble anything very human). Our first contrast 
might, therefore, compare the fish eye and kangaroo testicle (combined) to the witchetty 
grub and stick insect (combined). We need a second contrast then to separate the fish 
eye from the kangaroo testicle, and a third contrast to separate the witchetty grub from the 
stick insect. The resulting codes are in Table 13.3. 


Table 13.3 Orthogonal contrasts for the Animal variable 

Group              Contrast 1  Contrast 2  Contrast 3
Stick insect                1           0          -1
Kangaroo testicle          -1          -1           0
Fish eye                   -1           1           0
Witchetty grub              1           0           1


To set these orthogonal contrasts (see Chapter 10) we can first create variables representing 
each contrast (which is useful mainly because you can give the contrasts informative 
names), and then bind these variables together and set them as the contrast for Animal: 

PartvsWhole<-c(1, -1, -1, 1) 
TesticlevsEye<-c(0, -1, 1, 0) 
StickvsGrub<-c(-1, 0, 0, 1) 
contrasts(longBush$Animal)<-cbind(PartvsWhole, TesticlevsEye, StickvsGrub) 

The first three commands each create a variable relating to a contrast that contains the 
codes for each group from Table 13.3. The final command sets these three variables to be 
the contrasts for Animal. We can check that we have set the contrast correctly by executing 
the name of the variable and looking at the contrast attribute: 

longBush$Animal 


attr(,"contrasts") 
                  PartvsWhole TesticlevsEye StickvsGrub
Stick Insect                1             0          -1
Kangaroo Testicle          -1            -1           0
Fish Eye                   -1             1           0
Witchetty Grub              1             0           1


Remembering that positive numbers are compared with negative and a zero means that the 
group is not involved at all, we can clearly see that the first contrast compares the eye and 
testicle (combined) with the stick insect and grub (combined). The second contrast ignores 
the stick insect and witchetty grub and compares the testicle to the eye (not a sentence I’d 
ever envisaged using in a statistics textbook). The third contrast ignores the testicle and eye 
and compares the stick insect to the witchetty grub. 
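
Because Type III sums of squares rely on the contrasts being orthogonal, it does no harm to check. The sketch below is one way to do so in base R (contrastCodes and crossProducts are names invented for the illustration): the weights for each contrast should sum to zero, and the cross-multiplied weights for every pair of contrasts should also sum to zero.

```r
PartvsWhole <- c(1, -1, -1, 1)
TesticlevsEye <- c(0, -1, 1, 0)
StickvsGrub <- c(-1, 0, 0, 1)
contrastCodes <- cbind(PartvsWhole, TesticlevsEye, StickvsGrub)

# The weights within each contrast should sum to zero
print(colSums(contrastCodes))

# crossprod() multiplies every pair of contrasts together and sums the
# products; the off-diagonal entries are zero when the contrasts are orthogonal
crossProducts <- crossprod(contrastCodes)
print(crossProducts)
```

For the codes in Table 13.3 every off-diagonal entry is zero, so the three contrasts are orthogonal.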


13.4.7. Analysing repeated measures: two ways to skin a .dat 


13.4.7.1. The easier (but slightly limited) way: repeated-measures ANOVA 

To conduct an ANOVA using a repeated-measures design we can use the function ezANOVA() 
in the package ez. The advantage of this method is that it produces an output that resembles 
what you’ll be used to seeing if you have ever attempted repeated-measures ANOVA 
using a different package (such as SPSS or SAS). It will also compute sphericity estimates 
and the aforementioned corrections for sphericity. The general format of this function is: 

newModel<-ezANOVA(data = dataFrame, dv = .(outcome variable), wid = .(variable 
that identifies participants), within = .(repeated measures predictors), 
between = .(between-group predictors), detailed = FALSE, type = 2) 

This creates a model (newModel) from your dataframe (dataFrame). You can then set the 
following options: 

• dv: This is the variable containing the scores (i.e., the outcome variable). In this case 
the outcome was the time to retch, which is represented by the variable Retch. 

• wid: ezANOVA requires a variable that identifies the participants so that it can ascertain 
from which participant a given score came. In our current dataframe this is the 
variable Participant. 

• within: This is a variable or list of variables representing the independent variables 
or predictors that were manipulated as repeated measures. In the current data this 
would be the variable Animal, which represents the type of food that was eaten. 

• between: This is a variable or list of variables representing the independent variables 
or predictors that were manipulated as between-group variables. In the current data 
we don’t have a variable manipulated in this way, but if you have a mixed design (as 
in the next chapter) you will need this option. 

• detailed: This option is set to FALSE by default, but setting it to TRUE gives you a 
slightly more detailed (and in my opinion useful) output. 

• type: This option determines the type of sums of squares. If omitted it defaults to type 
= 2, which produces Type II sums of squares. If you want Type III sums of squares 
(Jane Superbrain Box 11.1) then change this option to type = 3. 

Note that some of these options take the form option =.(). Placing lists of variables within 
.() is just a convention of this function. It does not have any special significance, and does 
not have the power to turn you into a dragon. 

Based on this description, hopefully you can see that we can run the ANOVA by executing: 

bushModel<-ezANOVA(data = longBush, dv = .(Retch), wid = .(Participant), 
within = .(Animal), detailed = TRUE, type = 3) 

To see the output execute the model name: 

bushModel 

$ANOVA 
       Effect DFn DFd     SSn     SSd       F            p p<.05    ges
  (Intercept)   1   7 990.125  17.375 398.899 1.973536e-07     * 0.8529
       Animal   3  21  83.125 153.375   3.794 2.557030e-02     * 0.3274

$`Mauchly's Test for Sphericity` 
  Effect        W          p p<.05
2 Animal 0.136248 0.04684581     *

$`Sphericity Corrections` 
  Effect       GGe      p[GG] p[GG]<.05       HFe      p[HF] p[HF]<.05
  Animal 0.5328456 0.06258412           0.6657636 0.04833061         *


Output 13.2 


Output 13.2 shows the results from ezANOVA(). We’ll begin with the sphericity information. 
Mauchly’s test for sphericity (see also R’s Souls’ Tip 13.1) should be non-significant 
if we are to assume that the condition of sphericity has been met. The important column is 
the one containing the significance value (p) and in this case the value, .047, is less than the 
critical value of .05 (which is why there is an asterisk next to the p-value), so we reject the 
assumption that the variances of the differences between levels are equal. In other words, 
the assumption of sphericity has been violated, W = 0.14, p = .047. Knowing that we have 
violated this assumption a pertinent question is: how should we proceed? 

We discovered earlier that there are two corrections based upon the estimates of sphericity 
advocated by Greenhouse and Geisser (1959) and Huynh and Feldt (1976). Both of 
these estimates give rise to a correction factor that is applied to the degrees of freedom 
used to assess the observed F-ratio. The closer the Greenhouse-Geisser correction, ε, is to 
1, the more homogeneous the variances of differences, and hence the closer the data are 
to being spherical. In a situation in which there are four conditions (as with our data) the 
lower limit of ε will be 1/(4 − 1), or .33. Output 13.2 shows that the calculated value of ε 
is .533 (GGe in the output). This is closer to the lower limit of .33 than it is to the upper 
limit of 1 and it therefore represents a substantial deviation from sphericity. The output 
also contains the Huynh-Feldt estimate (HFe in the output), which is slightly closer to 1 
than the Greenhouse-Geisser estimate (worryingly, its value is .666, which could be proof 
that our data are evil). We will come back to these estimates very shortly. 


My Mauchly’s test has vanished 

Sometimes the output for Mauchly’s test is nowhere to be found. It has gone, vanished, been sucked into the 
void. ‘I must have done something wrong’, you think to yourself. You check your commands, rerun them, perhaps 
you reinstall R and try it all again. Still it will not return. Perhaps you rob a bank and buy a new computer, but 
still nothing. In despair, you turn to alcohol. Eventually a budding research career has evaporated like the alcohol 
on your breath. 

Actually, you haven’t done anything wrong, so hold off on buying the gin, just for a while. The reason for the 
missing output is that (as I mentioned in section 13.2.1) you need at least three conditions for sphericity to be 
an issue (read that section if you want to know why). Therefore, if you have a repeated-measures variable that 
has only two levels then sphericity is met. Hence, the estimates of sphericity will be 1 (perfect sphericity) and the 
resulting significance test cannot be computed. Therefore, no output is generated. Maybe, a nice touch would be 
for it to print ‘Hooray! Hooray! Sphericity has gone away!’ We can dream. 
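
To take some of the mystery out of the Greenhouse-Geisser estimate, the sketch below computes it directly from the covariance matrix of the four conditions using base R. This is the standard textbook formula rather than a look inside ezANOVA(), and the names wide, S, centre, Sc and epsilonGG are invented for the illustration.

```r
Retch <- c(8, 7, 1, 6, 9, 5, 2, 5, 6, 2, 3, 8, 5, 3, 1, 9, 8, 4, 5, 8, 7, 5,
           6, 7, 10, 2, 7, 2, 12, 6, 8, 1)

# One row per participant, one column per animal (the scores are ordered
# participant by participant, so we fill the matrix row by row)
wide <- matrix(Retch, nrow = 8, byrow = TRUE)
k <- ncol(wide)

# Covariance matrix of the four conditions
S <- cov(wide)

# Double-centre the covariance matrix (remove row and column means)
centre <- diag(k) - matrix(1 / k, k, k)
Sc <- centre %*% S %*% centre

# Greenhouse-Geisser estimate from the eigenvalues of the centred matrix
lambda <- eigen(Sc, symmetric = TRUE, only.values = TRUE)$values
epsilonGG <- sum(lambda)^2 / ((k - 1) * sum(lambda^2))

lowerLimit <- 1 / (k - 1)
print(c(lowerLimit = lowerLimit, epsilonGG = epsilonGG))
```

The estimate always lies between the lower limit, 1/(k − 1), and 1, and for these data it should agree with the GGe value in Output 13.2.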

Output 13.2 also shows the results of the ANOVA for the within-subject variable. This 
table can be read in much the same way as for one-way between-group ANOVA (see 
Chapter 10). There is a sum of squares for the repeated-measures effect of Animal, which 
tells us how much of the total variability is explained by the experimental effect. Note 
the value of 83.13, which is the model sum of squares (SS_M) that we calculated in section 
13.3.3. There is also an error term (SSd in the output), which is the amount of unexplained 
variation across the conditions of the repeated-measures variable. This is the residual sum 
of squares (SS_R) that was calculated in section 13.3.4, and note that the value is 153.38 
(which is the same value as we calculated). As I explained earlier, these sums of squares are 
converted into mean squares by dividing by the degrees of freedom. As we saw before, the 
df for the effect of Animal (DFn in the output) is simply k − 1, where k is the number of 
levels of the independent variable. The error df (DFd in the output) is (n − 1)(k − 1), where 
n is the number of participants (or in this case, the number of celebrities) and k is as before. 
The F-ratio is obtained by dividing the mean squares for the experimental effect (27.71) by 
the error mean squares (7.30). As with between-group ANOVA, this test statistic represents 
the ratio of systematic variance to unsystematic variance. The value of F = 3.79 (the same 
as we calculated earlier) is then compared against a critical value for 3 and 21 degrees of 
freedom. R displays the exact significance level for the F-ratio. The significance of F is 
.026, which is significant because it is less than the criterion value of .05: the output helpfully 
places an asterisk next to any values that are significant at .05 in the column labelled 
p<.05. We can, therefore, conclude that there was a significant difference between the four 
animals in their capacity to induce retching when eaten. However, this main test does not 
tell us which animals differed from each other. 
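
The numbers in the Animal row of Output 13.2 can be rebuilt by hand, which is a good way to convince yourself where they come from. The sketch below uses only base R and assumes the data are entered as in section 13.4.4; the names SS_W, SS_M, SS_R, F_ratio and p_value are invented for the illustration.

```r
Participant <- gl(8, 4, labels = paste0("P", 1:8))
Animal <- gl(4, 1, 32, labels = c("Stick Insect", "Kangaroo Testicle",
                                  "Fish Eye", "Witchetty Grub"))
Retch <- c(8, 7, 1, 6, 9, 5, 2, 5, 6, 2, 3, 8, 5, 3, 1, 9, 8, 4, 5, 8, 7, 5,
           6, 7, 10, 2, 7, 2, 12, 6, 8, 1)

n <- nlevels(Participant)  # number of participants
k <- nlevels(Animal)       # number of conditions

# Within-participant variability: each score against that person's own mean
SS_W <- sum((Retch - ave(Retch, Participant))^2)

# Model sum of squares: condition means against the grand mean
SS_M <- n * sum((tapply(Retch, Animal, mean) - mean(Retch))^2)

# Residual sum of squares: what is left of the within-participant variability
SS_R <- SS_W - SS_M

df_M <- k - 1
df_R <- (n - 1) * (k - 1)
F_ratio <- (SS_M / df_M) / (SS_R / df_R)
p_value <- pf(F_ratio, df_M, df_R, lower.tail = FALSE)

print(c(SS_M = SS_M, SS_R = SS_R, F = F_ratio, p = p_value))
```

These values should match SSn, SSd, F and p for Animal in Output 13.2 (before any correction for sphericity).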

Although this result seems very plausible, we have learnt that the violation of the sphericity 
assumption makes the F-test inaccurate. We know from Output 13.2 that these data 
were non-spherical and so we need to make allowances for this violation. Output 13.2 also 
contains p-values that have been corrected using the Greenhouse-Geisser and Huynh-Feldt 
corrections (Jane Superbrain Box 13.3); these are labelled p[GG] and p[HF], respectively. 
For these data the corrections result in the observed F being non-significant when using the 
Greenhouse-Geisser correction (because p = .063, which is greater than .05). However, it 
was noted earlier that this correction is quite conservative, and so can miss effects that genuinely 
exist. It is, therefore, useful to consult the Huynh-Feldt corrected p-value as well. 
Using this correction, the F-value is still significant because the probability value of .048 is 
just below the criterion value of .05. So, by this correction, we would accept the hypothesis 
that the type of animal eaten affected the time to retch. However, it was also noted earlier 
that this correction is quite liberal and so tends to accept values as significant when, in reality, 
they are not significant. This leaves us with the puzzling dilemma of whether or not to 
accept this F-statistic as significant (and also illustrates how ridiculous it is to have a fixed 
criterion like .05 against which to determine significance). 


JANE SUPERBRAIN 13.3 

Adjusting for sphericity 

The Greenhouse-Geisser and Huynh-Feldt adjusted 
p-values are calculated by making an adjustment to 
the degrees of freedom associated with the F-statistic 
(therefore, the critical value against which the obtained 
F-statistic is compared changes). The F-ratio itself remains 
unchanged. The degrees of freedom are adjusted by 
multiplying them by the estimate of sphericity shown 
in Output 13.2 (see the previous Oliver Twisted). For 
example, the Greenhouse-Geisser estimate of sphericity 
was .533. The original degrees of freedom for the model 
were 3; this value is corrected by multiplying by the estimate 
of sphericity (3 × .533 = 1.599). Likewise the error 
df was 21; this value is corrected in the same way (21 × 
.533 = 11.19). The F-ratio is then tested against a critical 
value with these new degrees of freedom (1.599, 11.19). 
The Huynh-Feldt correction is applied in the same way. 
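
The adjustment to the degrees of freedom is easy to reproduce with pf(). The sketch below plugs in the sums of squares and the sphericity estimates reported in Output 13.2; the object names are invented for the illustration.

```r
# Values taken from Output 13.2
F_ratio <- (83.125 / 3) / (153.375 / 21)  # mean squares for Animal / error
GGe <- 0.5328456                          # Greenhouse-Geisser estimate
HFe <- 0.6657636                          # Huynh-Feldt estimate

# The F-ratio stays the same; only the degrees of freedom are multiplied
# by the estimate of sphericity
p_uncorrected <- pf(F_ratio, 3, 21, lower.tail = FALSE)
p_GG <- pf(F_ratio, 3 * GGe, 21 * GGe, lower.tail = FALSE)
p_HF <- pf(F_ratio, 3 * HFe, 21 * HFe, lower.tail = FALSE)

print(round(c(uncorrected = p_uncorrected, GG = p_GG, HF = p_HF), 4))
```

The corrected values should agree with p[GG] and p[HF] in Output 13.2, one falling just above .05 and the other just below it.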

I mentioned earlier that Stevens (2002) recommends taking an average of the two estimates, 
and certainly when the two corrections give different results (as is the case here) 
this can be useful. If the two corrections give rise to the same conclusion it makes little 
difference which you choose to report (although if you accept the F-statistic as significant 
you might as well report the more conservative Greenhouse-Geisser estimate to avoid 
criticism). Although it is easy to calculate the average of the two correction factors and 
to correct the degrees of freedom accordingly, it is not so easy to then calculate an exact 
probability for those degrees of freedom. Therefore, should you ever be faced with this 
perplexing situation (and to be honest that’s fairly unlikely) I recommend taking an average 
of the two significance values to give you a rough idea of which correction is giving the 
most accurate answer. In this case, the average of the two p-values is (.063 + .048)/2 = .056. 
Therefore, we should probably go with the Greenhouse-Geisser correction and conclude 
that the F-ratio is non-significant. 

These data illustrate how important it is to use a valid critical value of F: it can potentially 
mean the difference between making a Type I error and not. However, it also highlights how 
arbitrary it is that we use a .05 level of significance. These two corrections produce significance 
values that differ by only .015 and yet they lead to completely opposite conclusions. 
The decision about ‘significance’ has, in some ways, become rather arbitrary. The F, and hence 
the size of effect, is unaffected by these corrections and so whether the p falls slightly above or 
slightly below .05 is less important than how big the effect is. We might be well advised to look 
at an effect size to see whether the effect is substantive regardless of its significance. 

In terms of post hoc tests, we can use the pairwise.t.test() function that we have used in 
previous chapters (see Chapter 10 in particular). The format of this option is exactly the 
same as before except that we need to add the option paired = TRUE to reflect the fact that 
means are dependent (so, we’re asking for paired t-tests rather than independent t-tests). 
To get post hoc tests for the current data, execute: 

pairwise.t.test(longBush$Retch, longBush$Animal, paired = TRUE, 
p.adjust.method = "bonferroni") 

Pairwise comparisons using paired t tests 

data: longBush$Retch and longBush$Animal 

                  Stick Insect Kangaroo Testicle Fish Eye
Kangaroo Testicle 0.0121
Fish Eye          0.0056       1.0000
Witchetty Grub    1.0000       1.0000            1.0000

P value adjustment method: bonferroni 

Output 13.3 

Output 13.3 shows the results of the post hoc tests. We can see that the time to retch was 
significantly longer after eating a stick insect compared to a kangaroo testicle (p = .012) 
and a fish eye (p = .006) but not compared to a witchetty grub. The time to retch after 
eating a kangaroo testicle was not significantly different than after eating a fish eyeball or 
witchetty grub (both ps > .05). Finally, the time to retch was not significantly different after 
eating a fish eyeball compared to a witchetty grub (p > .05). 
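
To see what the Bonferroni adjustment is doing, the sketch below reproduces a single cell of Output 13.3 (stick insect vs. kangaroo testicle) by hand in base R. The object names are invented for the illustration, and the data are assumed to be entered as in section 13.4.4.

```r
Animal <- gl(4, 1, 32, labels = c("Stick Insect", "Kangaroo Testicle",
                                  "Fish Eye", "Witchetty Grub"))
Retch <- c(8, 7, 1, 6, 9, 5, 2, 5, 6, 2, 3, 8, 5, 3, 1, 9, 8, 4, 5, 8, 7, 5,
           6, 7, 10, 2, 7, 2, 12, 6, 8, 1)

# Scores for the two conditions, in the same participant order
stick <- Retch[Animal == "Stick Insect"]
testicle <- Retch[Animal == "Kangaroo Testicle"]

# An ordinary paired t-test for this one comparison
rawP <- t.test(stick, testicle, paired = TRUE)$p.value

# Bonferroni: multiply by the number of pairwise comparisons (capped at 1)
nComparisons <- choose(nlevels(Animal), 2)  # 6 pairs from 4 animals
bonferroniP <- min(1, rawP * nComparisons)

print(c(raw = rawP, bonferroni = bonferroniP))
```

With four animals there are six possible comparisons, so every p-value in Output 13.3 is six times its uncorrected value (or 1, whichever is smaller).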

13.4.7.2. The slightly more complicated way: the multilevel approach 

The most complicated thing about the slightly more complicated way is trying to explain 
it; it is not actually that hard to do. The method we will use is known as a multilevel linear 
model, and the whole of Chapter 19 is dedicated to explaining these models. Therefore, 
I’m going to gloss over some of the details and refer you to that chapter if you want to get 
a better understanding of what we’re doing. In short, a multilevel model is simply a regression 
or linear model that considers dependency in the data. We learnt in Chapter 7 that 
one of the assumptions of regression was that residuals (errors) needed to be independent. 
If they’re not, demons will rise from statistics hell and waggle their reproductive organs at 
us. Repeated-measures designs, as we have seen, have dependent data, therefore dependent 
residuals; if we try to analyse them with an ordinary regression, we need to prepare 
ourselves to run screaming from the demons’ .... 

A multilevel model is an extension of regression that handles dependent data by explicitly 
modelling the dependency. It is, therefore, very well suited to repeated-measures 
experimental designs. One advantage of this approach is that we can continue to think 
about the analysis as a linear model; we just use a different function, lme(), rather than 
aov(), which will allow us to model the fact that some scores come from the same entities 
and are, therefore, correlated. 

We saw in Chapter 10 that we can write a one-way ANOVA as a linear model in R as: 

newModel<-aov(outcome ~ predictor, data = dataFrame) 

In the current example we’re trying to predict the time taken to retch (Retch) from the 
type of animal eaten (Animal), and our dataframe is called longBush. Therefore, we could 
write our model as: 

bushModel<-aov(Retch ~ Animal, data = longBush) 

However, this model takes no account of the fact that the predictor (Animal) is made up 
of data from the same people (and thus dependent). As it stands, we would violate an 
assumption of our linear model. We need to factor this dependency into the model. 5 We 
can do this using lme(), which has an option random that enables us to specify that there is 
variability in participants’ propensities to retch within the variable Animal. 

The general format of lme() is as follows: 

newModel <-lme(outcome ~ predictor(s), random = random effects, data = 
dataFrame, method = "ML") 

The key thing to focus on is that this command is basically exactly the same as if we had 
used the lm() and aov() functions. We simply specify our outcome and predictor variables. 
However, there are two additional options. The first is to specify a method. The default 
is something known as restricted maximum-likelihood estimation (REML), but - for various 
reasons that we’ll get into in Chapter 19 - it’s preferable to use maximum likelihood 
(ML), so always set the method to be “ML” when doing repeated-measures ANOVA. The 
second option is random =, which allows us to specify any random effects. Again, we’ll get 
into what random effects are in Chapter 19, because it’s quite complicated. However, for 
now I’ll just say that in the current context a random effect is an effect that can vary across 
different entities. For example, if we want to model the fact that people’s overall threshold 
to retch will vary, we can write this as random = ~1|Participant/Animal. All this means 
is that if you look at the variable Participant within the variable Animal (that’s what the 
‘Participant/Animal’ bit means), then overall levels (that’s represented by 1) of the outcome 
(time to retch) vary. By including this term, we’re telling the model that data with the same 
value of Participant within different levels of Animal are dependent (i.e., from the same 
person). I know, it’s a bit of a head crusher. 


5 We can do this using aov() by adding an error term to the model that is based on within-participant variability 
across different animals, Error(Participant/Animal): 

bushModel<-aov(Retch ~ Animal + Error(Participant/Animal), data = longBush) 

However, the resulting model would still be assessed using an F-ratio which means that we need to worry about 
sphericity, which is slightly irritating because aov() won’t throw out sphericity-corrected estimates, or indeed the 
estimates themselves. For this reason, I favour using lme() and forgetting that sphericity even exists. 

Whether any of that made sense or not, trust me that we can specify our model as 

bushModel<-lme(Retch ~ Animal, random = ~11 Participant/Animal, data = long- 
Bush, method = "ML") 

Notice that we have defined the model in exactly the same was as for aov(), we have simply 
added in a term that lets the model know that the variable Animal is made up of the same 
participants repeated multiple times across the variable Animal ( random = ~1 \ Participant/ 
Animal ). If we execute this model and ask for a summary, we will get a set of parameters 
that relate to the contrasts that we set earlier for Animal. If we want to test whether Animal 
had an overall effect, then we need to compare the model that we have just created to 
one in which the predictor is absent. To do this, we create another model, but rather than 
include Animal as a predictor, we include only the intercept (which we denote with ‘1’). As 
such we create the baseline model as follows: 

baseline<-lme(Retch ~ 1, random = ~1|Participant/Animal, data = longBush, 
method = "ML") 

Notice that this command is exactly the same as before except that the model is ‘Retch ~ 1’ 
rather than ‘Retch ~ Animal’. 6 By comparing these models we can see whether adding the 
variable Animal as a predictor significantly improves the model (in other words, by using 
group means to predict the speed of retching, does the model fit the data better than when 
we don’t include this predictor?). To compare the models (see section 7.8.4.2) execute: 

anova(baseline, bushModel) 

Output 13.4 shows the comparison of the baseline model and the model that includes 
Animal as a predictor (bushModel). The degrees of freedom change from 4 for the baseline 
model to 7 for bushModel, which is a difference of 3. This is because Animal has been 
coded with three contrasts, which means that three parameters (one for each contrast) have 
been added to the model. The AIC and BIC tell us about the fit of the model (smaller values 
mean a better fit). The fact that these values are smaller in the final model than the baseline 
tells us that the fit of the model has got better. The likelihood ratio (L.Ratio in the output) 
tells us whether this improvement in fit is significant, and because the p-value of .0054 is 
less than .05 it is. Therefore, Animal is a significant predictor of Retch. We can conclude, 
then, that the type of animal consumed had a significant effect on the time taken to retch, 
χ²(3) = 12.69, p = .005. 

          Model df      AIC      BIC    logLik   Test  L.Ratio p-value
baseline      1  4 165.0875 170.9504 -78.54373
bushModel     2  7 158.3949 168.6551 -72.19747 1 vs 2 12.69253  0.0054

Output 13.4 
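If you want to see where the likelihood ratio comes from, it can be rebuilt by hand from the log-likelihoods. The sketch below copies the logLik and df values from the model comparison output (nothing is refitted), uses only base R, and assumes that the BIC is based on all 32 observations; the object and function names are mine, purely for illustration.

```r
# Values copied from the model comparison output (not refitted here)
logLik_baseline <- -78.54373; df_baseline <- 4
logLik_bush     <- -72.19747; df_bush     <- 7
n_obs <- 32   # 8 participants x 4 animals (assumed basis for BIC)

# Likelihood ratio: twice the difference in log-likelihoods
L_ratio <- 2 * (logLik_bush - logLik_baseline)
df_diff <- df_bush - df_baseline

# p-value from the chi-square distribution with df_diff degrees of freedom
p_value <- pchisq(L_ratio, df = df_diff, lower.tail = FALSE)

# AIC and BIC are simple functions of the same quantities
aic <- function(ll, k) -2 * ll + 2 * k
bic <- function(ll, k, n) -2 * ll + k * log(n)

round(c(L.Ratio = L_ratio, df = df_diff, p = p_value), 4)
round(c(AIC_baseline  = aic(logLik_baseline, df_baseline),
        AIC_bushModel = aic(logLik_bush, df_bush),
        BIC_baseline  = bic(logLik_baseline, df_baseline, n_obs),
        BIC_bushModel = bic(logLik_bush, df_bush, n_obs)), 4)
```

The point of doing it this way is only to show that the test is nothing more mysterious than a chi-square test on the change in log-likelihood; in practice anova() does the arithmetic for you.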


6 In actual fact when we write ‘Retch ~ Animal’ the model that we get is ‘Retch ~ 1 + Animal’. The ‘1’ is the 
intercept and R includes it automatically (which is why we don’t have to explicitly mention the ‘1’ when we start 
including predictors in the model). You can see, then, that the baseline and final models differ only in the inclusion 
of Animal as a predictor; therefore, if the final model is a significantly better fit of the data than the baseline then 
this finding tells us that Animal is a significant predictor of Retch. 
We can further explore the model by executing: 

summary(bushModel) 

Output 13.5 shows the parameter estimates for the model. Most important, these include 
the parameters for the three contrasts that we set. First, when we compare whole animals 
(stick insect and witchetty grub combined) to animal parts (testicle and eye) retching times 
were significantly different, b = 1.38, t(21) = 3.15, p = .005. From the descriptive statistics 
(Output 13.1) it looks as though people retched more quickly after eating parts of animals 
((4.25 + 4.125)/2 = 4.188) than whole animals ((8.125 + 5.75)/2 = 6.938). The second 
contrast tells us that there was no significant difference in the time to retch after eating a 
kangaroo testicle and a fish eye, b = -0.063, t(21) = -0.101, p = .920. The final contrast 
tells us that there was a trend for retching times to be shorter after eating a witchetty grub 
(M = 5.75) than a stick insect (M = 8.125), b = -1.188, t(21) = -1.924, p = .068. 

 Formula: ~1 | Animal %in% Participant
        (Intercept)   Residual
StdDev:    2.309935 0.01176165

Fixed effects: Retch ~ Animal
                      Value Std.Error DF   t-value p-value
(Intercept)          5.5625 0.4365423 21 12.742178  0.0000
AnimalPartvsWhole    1.3750 0.4365423 21  3.149752  0.0048
AnimalTesticlevsEye -0.0625 0.6173641 21 -0.101237  0.9203
AnimalStickvsGrub   -1.1875 0.6173641 21 -1.923500  0.0681


Output 13.5 
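It can help to see how the parameter estimates relate to the group means. The sketch below retypes the retch times by hand and assumes that the three contrasts were coded with +1 and -1 weights (the coding itself was set earlier in the chapter, so this is an assumption); under that coding each b is half the difference between the two sets of means being compared. The standard errors and degrees of freedom are copied from the summary output rather than estimated.

```r
# Retch times for the eight celebrities (typed in by hand)
stick     <- c(8, 9, 6, 5, 8, 7, 10, 12)
testicle  <- c(7, 5, 2, 3, 4, 5, 2, 6)
eye       <- c(1, 2, 3, 1, 5, 6, 7, 8)
witchetty <- c(6, 5, 8, 9, 8, 7, 2, 1)

m <- c(stick = mean(stick), testicle = mean(testicle),
       eye = mean(eye), witchetty = mean(witchetty))

# Assumed +1/-1 contrast codes: each b is half the difference between
# the means of the two sets of conditions being compared
b <- c(
  Intercept     = mean(m),
  PartvsWhole   = (mean(m[c("stick", "witchetty")]) -
                   mean(m[c("testicle", "eye")])) / 2,
  TesticlevsEye = unname(m["eye"] - m["testicle"]) / 2,
  StickvsGrub   = unname(m["witchetty"] - m["stick"]) / 2
)

# t = b / SE; standard errors and df copied from the summary output
se      <- c(0.4365423, 0.4365423, 0.6173641, 0.6173641)
t_value <- b / se
p_value <- 2 * pt(-abs(t_value), df = 21)

round(cbind(b, t_value, p_value), 4)
```

The estimates, t-values and p-values that this produces line up with the fixed-effects table, which is a useful sanity check that you have understood what each contrast is comparing.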


Although the contrasts tell us everything we need to know, if there had not been a 
logical set of contrasts to do we might have done post hoc tests. We can, of course, use 
the pairwise.t.test() function as explained in the previous section. However, by doing the 
analysis in the slightly more complicated way, we also have the option to use the glht() 
function that we have used in previous chapters (see Chapter 10 in particular). The format 
of this option is exactly the same as when we have used this function before; therefore, to 
get post hoc tests for the current data, execute: 

postHocs<-glht(bushModel, linfct = mcp(Animal = "Tukey")) 

summary(postHocs) 

confint(postHocs) 


Linear Hypotheses:
                              Estimate Std. Error z value Pr(>|z|)
Testicle - Stick Insect == 0    -3.875      1.155  -3.355  0.00444
Fish Eye - Stick Insect == 0    -4.000      1.155  -3.463  0.00319
Witchetty - Stick Insect == 0   -2.375      1.155  -2.056  0.16759
Fish Eye - Testicle == 0        -0.125      1.155  -0.108  0.99955
Witchetty - Testicle == 0        1.500      1.155   1.299  0.56371
Witchetty - Fish Eye == 0        1.625      1.155   1.407  0.49492

Simultaneous Confidence Intervals

Multiple Comparisons of Means: Tukey Contrasts

Linear Hypotheses:
                                        Estimate     lwr     upr
Kangaroo Testicle - Stick Insect == 0    -3.8750 -6.8401 -0.9099
Fish Eye - Stick Insect == 0             -4.0000 -6.9651 -1.0349
Witchetty Grub - Stick Insect == 0       -2.3750 -5.3401  0.5901
Fish Eye - Kangaroo Testicle == 0        -0.1250 -3.0901  2.8401
Witchetty Grub - Kangaroo Testicle == 0   1.5000 -1.4651  4.4651
Witchetty Grub - Fish Eye == 0            1.6250 -1.3401  4.5901


Output 13.6 

Output 13.6 shows the results of the post hoc tests. We can see that the time to retch was 
significantly longer after eating a stick insect compared to a kangaroo testicle (p = .004) 
and a fish eye (p = .003) but not compared to a witchetty grub. The time to retch after 
eating a kangaroo testicle was not significantly different to after eating a fish eyeball or 
witchetty grub (both ps > .05). Finally, the time to retch was not significantly different after 
eating a fish eyeball compared to a witchetty grub (p > .05). 



CRAMMING SAM’S TIPS 


Repeated-measures ANOVA 


• The one-way repeated-measures ANOVA compares several means, when those means have come from the same 
participants; for example, if you measured people’s statistical ability each month over a year-long course. 

• There are several ways to do repeated-measures ANOVA. One is a conventional ANOVA approach using the ezANOVA() 
function; the other is to use a multilevel linear model using the lme() function. 

• In repeated-measures ANOVA there is an additional assumption: sphericity. This assumption needs to be considered only 
when you have three or more repeated-measures conditions. If you use the ezANOVA() function then test for sphericity using 
Mauchly’s test. If the p-value is less than .05 then the assumption is violated. If the significance of Mauchly’s test is greater 
than .05 then the assumption of sphericity has been met. 

• If the assumption of sphericity has been met then use the p-value for the main ANOVA. If the assumption was violated then 
read the p-value corrected using either the Greenhouse-Geisser (p[GG]) or Huynh-Feldt (p[HF]) estimate of sphericity (read 
this chapter to find out the relative merits of the two procedures). If the p-value is less than .05 then the means of the groups 
are significantly different. 

• If you use lme() then you can forget about sphericity. 

• For contrasts and post hoc tests, again look to the p-values to discover if your comparisons are significant (they will be if the 
significance value is less than .05). 


13.4.8. Robust one-way repeated-measures ANOVA 


As with the other ANOVAs we have encountered, Wilcox (2005) describes robust procedures 
for conducting one-way repeated-measures ANOVA. To access these we need to again 
load the WRS package (see section 5.8.4.). There are four functions that we will look at: 

• rmanova(): This performs one-way repeated-measures ANOVA on trimmed means. 

• rmmcp(): This performs post hoc tests for one-way repeated-measures design based 
on trimmed means. 

• rmanovab(): This performs one-way repeated-measures ANOVA using a bootstrap 
procedure. 

• pairdepb(): This performs post hoc tests for the above function. 

These functions need the data to be in wide format rather than long (see Chapter 3). 
However, the data were originally in this format so we can simply reuse these (remember 
they are stored in an object called bushData). The data look like this: 
participant stick_insect kangaroo_testicle fish_eye witchetty_grub
         P1            8                 7        1              6
         P2            9                 5        2              5
         P3            6                 2        3              8
         P4            5                 3        1              9
         P5            8                 4        5              8
         P6            7                 5        6              7
         P7           10                 2        7              2
         P8           12                 6        8              1


We want only the scores so we need to get rid of the participant variable. The participant 
variable is in the first column, so we could create a new dataframe (bushData2) that 
excludes this first column by executing: 

bushData2<-bushData[, -c(1)] 

This command takes the bushData object and retains all of the rows (hence no command 
before the comma) but drops column 1 by specifying -c(1); the minus sign means ‘delete’ 
in this context. The new dataframe now contains only the scores: 

bushData2 


stick_insect kangaroo_testicle fish_eye witchetty_grub
           8                 7        1              6
           9                 5        2              5
           6                 2        3              8
           5                 3        1              9
           8                 4        5              8
           7                 5        6              7
          10                 2        7              2
          12                 6        8              1
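If you don’t have the data file to hand, the same two dataframes can be rebuilt in a few lines. This is only a sketch: the scores are typed in by hand rather than read from the file, and bushData2_byName is a name I have made up to show an alternative that drops the column by name instead of by position.

```r
# Wide-format data: one row per participant, one column per animal
bushData <- data.frame(
  participant       = paste0("P", 1:8),
  stick_insect      = c(8, 9, 6, 5, 8, 7, 10, 12),
  kangaroo_testicle = c(7, 5, 2, 3, 4, 5, 2, 6),
  fish_eye          = c(1, 2, 3, 1, 5, 6, 7, 8),
  witchetty_grub    = c(6, 5, 8, 9, 8, 7, 2, 1)
)

# Drop the first column by position, as in the text ...
bushData2 <- bushData[, -c(1)]

# ... or by name, which still works if the column order ever changes
bushData2_byName <- bushData[, names(bushData) != "participant"]

identical(bushData2, bushData2_byName)
colMeans(bushData2)
```

Dropping by name is a little more typing but is harder to get wrong if someone rearranges the columns of the original file.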


The function rmanova() takes the general form: 

rmanova(data, tr = .2) 

As with other functions we’ve encountered, the level of trimming is by default 20% (tr = 
.2), but can be changed by including the tr = option. Also, the default alpha level is .05. 
Assuming we are happy with the default level of trimming, we need only specify the 
dataframe (bushData2); therefore, we can do one-way repeated-measures ANOVA based on 
trimmed means by executing: 

rmanova(bushData2) 

The function rmanovab() has the format: 

rmanovab(data, tr = .2, alpha = .05, nboot = 599) 

The main differences are an option to control the number of bootstrap samples (nboot), 
and an option to change the level of significance (the default of .05 is fine though). I would 
normally use 2000 bootstrap samples, so if we wanted to change this option, but leave the 
default level of trim (20%) and alpha (.05) then we can run the analysis for the current 
data by executing: 

rmanovab(bushData2, nboot = 2000) 

The output of both of these commands is shown in Output 13.7. For rmanova() (left- 
hand side of Output 13.7) we are given a test statistic, F, for the effect of animal ($test), the 
degrees of freedom ($df), the p-value ($siglevel), and the trimmed group means ($tmeans). Given that 
the significance level (.1002) is greater than .05, we can say that there were no significant 
differences in retch times after eating different animals, F(2.31, 11.55) = 2.75, p = .100. 
(Note that I have reported the test statistic, its degrees of freedom and the p-value, which 
you can find in the output.) 
rmanova()                                rmanovab()

[1] "The number of groups                [1] "The number of groups
to be compared is"                       to be compared is"
[1] 4                                    [1] 4

$test                                    $teststat
[1] 2.752794                             [1] 2.752794

$df                                      $crit
[1]  2.309193 11.545964                  [1] 4.841391

$siglevel
[1] 0.1002

$tmeans
[1] 8.000000 4.166667 4.000000 6.000000

$ehat
[1] 0.5873188

$etil
[1] 0.7697309


Output 13.7 
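The $tmeans values are just 20% trimmed means, which base R can compute without the WRS package. The sketch below retypes the scores and reproduces them with mean(x, trim = 0.2); the final line converts the reported F and degrees of freedom into a tail probability on the assumption (mine, not something stated in the output) that $siglevel is an ordinary F-distribution p-value.

```r
bushData2 <- data.frame(
  stick_insect      = c(8, 9, 6, 5, 8, 7, 10, 12),
  kangaroo_testicle = c(7, 5, 2, 3, 4, 5, 2, 6),
  fish_eye          = c(1, 2, 3, 1, 5, 6, 7, 8),
  witchetty_grub    = c(6, 5, 8, 9, 8, 7, 2, 1)
)

# 20% trimmed mean of each column
tmeans <- sapply(bushData2, mean, trim = 0.2)
round(tmeans, 6)

# The same thing by hand for one column: with 8 scores, 20% trimming
# drops floor(8 * 0.2) = 1 score from each end of the sorted data
sorted  <- sort(bushData2$kangaroo_testicle)
by_hand <- mean(sorted[2:7])

# Tail probability for the reported F with its (non-integer) df;
# assumes $siglevel is an ordinary F-distribution p-value
p_from_F <- pf(2.752794, df1 = 2.309193, df2 = 11.545964, lower.tail = FALSE)
p_from_F
```

Notice how trimming pulls the stick insect mean down only slightly (8.125 to 8.00) but lifts the witchetty grub mean from 5.75 to 6.00, because the two very low scores in that condition lose some of their influence.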

The output of rmanovab() (right-hand side of Output 13.7) tells us much the same 
things, but we get only a test statistic ($teststat) and the critical value for this statistic at a 
.05 level of significance ($crit). If the test statistic is significant at p < .05 (or whatever alpha 
you specified if you didn’t use the default value) then the test statistic should be greater 
than the critical value. In this case, the test statistic (2.75) is less than the critical value 
(4.84) indicating no significant differences in retch times after eating different animals, F = 
2.75, F_crit = 4.84, p > .05. Both of these robust methods yield non-significant results (unlike 
the multilevel models). 

The post hoc tests for each analysis are conducted using the same command structure. 
Assuming you leave the default options, to run post hoc tests based on a 20% trimmed 
mean, execute: 7 

rmmcp(bushData2) 

To conduct post hoc tests based on trimmed means and a bootstrap: 
pairdepb(bushData2, nboot = 2000) 

Output 13.8 shows the post hoc tests based on trimmed means (rmmcp). If the value of 
p.value is less than the critical value (p.crit) and the confidence interval does not cross zero 
then the comparison is significant. The columns labelled Group tell you which groups are 
being compared (the numbers relate to the columns in the dataframe). 

• [1,] tests the difference between the stick insect and kangaroo testicle. This contrast is 
not significant because p.value (.014) is greater than p.crit (.010) and the confidence 
interval crosses zero. 


7 Obviously if you changed the level of trim for the main analysis you would need to do the same here. For 
example, for 15% trimmed means: 

rmanova(bushData2, tr = .15) 
rmmcp(bushData2, tr = .15) 
• [2,] tests the difference between the stick insect and fish eye. This difference is not 
significant because p.value (.012) is greater than p.crit (.009) and the confidence 
interval crosses zero. 

• [3,] tests the difference between the stick insect and witchetty grub. This difference is 
not significant because p.value (.441) is greater than p.crit (.017) and the confidence 
interval crosses zero. 

• [4,] tests the difference between the kangaroo testicle and fish eye. This difference 
is not significant because p.value (1) is greater than p.crit (.05) and the confidence 
interval crosses zero. 

• [5,] tests the difference between the kangaroo testicle and witchetty grub. This 
difference is not significant because p.value (.344) is greater than p.crit (.013) and the 
confidence interval crosses zero. 

• [6,] tests the difference between the fish eye and witchetty grub. This difference is 
not significant because p.value (.460) is greater than p.crit (.025) and the confidence 
interval crosses zero. 

We could report that there was no significant difference between the time to retch after 
eating a stick insect compared to a kangaroo testicle, Ψ̂ = 3.67 (-0.48, 7.82), p > .05, fish 
eye Ψ̂ = 4.00 (-0.36, 8.36), p > .05, or witchetty grub Ψ̂ = 2.00 (-8.10, 12.10), p > .05; or 
a kangaroo testicle compared to a fish eye Ψ̂ = 0 (-5.39, 5.39), p > .05, or witchetty grub 
Ψ̂ = -1.83 (-9.23, 5.57), p > .05; or a fish eye compared to a witchetty grub Ψ̂ = -2.00 
(-12.55, 8.55), p > .05. Note that in each case I have reported psihat and its confidence 
interval. 
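To get a feel for what psihat is, the sketch below computes the 20% trimmed mean of the difference scores for every pair of conditions in base R. The results match the psihat column of the rmmcp() output, which suggests (but does not prove) that this is what the function reports by default. The p.value and p.crit vectors are copied from that output so that the decision rule can be applied in one line; the scores are retyped by hand.

```r
bushData2 <- data.frame(
  stick_insect      = c(8, 9, 6, 5, 8, 7, 10, 12),
  kangaroo_testicle = c(7, 5, 2, 3, 4, 5, 2, 6),
  fish_eye          = c(1, 2, 3, 1, 5, 6, 7, 8),
  witchetty_grub    = c(6, 5, 8, 9, 8, 7, 2, 1)
)

# All pairs of columns, in the order 1-2, 1-3, 1-4, 2-3, 2-4, 3-4
pairs <- combn(ncol(bushData2), 2)

# 20% trimmed mean of the difference scores for each pair
psihat <- apply(pairs, 2, function(p) {
  mean(bushData2[[p[1]]] - bushData2[[p[2]]], trim = 0.2)
})

# Decision rule, using values copied from the rmmcp() output
p.value <- c(0.01359625, 0.01172054, 0.44148206, 1, 0.34371248, 0.46001407)
p.crit  <- c(0.01020, 0.00851, 0.01690, 0.05000, 0.01270, 0.02500)
significant <- p.value < p.crit

data.frame(group1 = pairs[1, ], group2 = pairs[2, ],
           psihat = round(psihat, 3), significant)
sum(significant)   # number of significant comparisons
```

The last line plays the same role as $num.sig in the output: none of the six comparisons has a p-value below its critical value.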


$test
     Group Group       test    p.value  p.crit        se
[1,]     1     2  3.7282016 0.01359625 0.01020 0.9834947
[2,]     1     3  3.8733436 0.01172054 0.00851 1.0326995
[3,]     1     4  0.8355727 0.44148206 0.01690 2.3935678
[4,]     2     3  0.0000000 1.00000000 0.05000 1.2769904
[5,]     2     4 -1.0454201 0.34371248 0.01270 1.7536809
[6,]     3     4 -0.8000000 0.46001407 0.02500 2.5000000

$psihat
     Group Group    psihat    ci.lower  ci.upper
[1,]     1     2  3.666667  -0.4830017  7.816335
[2,]     1     3  4.000000  -0.3572784  8.357278
[3,]     1     4  2.000000  -8.0992023 12.099202
[4,]     2     3  0.000000  -5.3880170  5.388017
[5,]     2     4 -1.833333  -9.2326553  5.565989
[6,]     3     4 -2.000000 -12.5482728  8.548273

$con
     [,1]
[1,]    0

$num.sig
[1] 0

Output 13.8 





Output 13.9 shows the post hoc tests based on trimmed means and a bootstrap (pairdepb). 
The interpretation of these results is similar to that for the trimmed means. If the value of 
test is greater than the critical value ($crit) and the confidence interval does not cross zero 
then the contrast is significant. Therefore, we’re comparing each value of test against 4.98; 
as you can see, all values of test are smaller than this value and their confidence intervals 
cross zero, so we can conclude that none of the groups differ significantly. 

We could again report that (note that the values and confidence intervals for psihat 
have changed): there was no significant difference between the time to retch after eating 
a stick insect compared to a kangaroo testicle, Ψ̂ = 3.83 (-0.70, 8.37), p > .05, fish eye 
Ψ̂ = 4.00 (-1.15, 9.15), p > .05, or witchetty grub Ψ̂ = 2.00 (-7.78, 11.78), p > .05; or a 
kangaroo testicle compared to a fish eye Ψ̂ = 0.17 (-7.27, 7.61), p > .05, or witchetty grub 
Ψ̂ = -1.83 (-9.76, 6.09), p > .05; or a fish eye compared to a witchetty grub Ψ̂ = -2.00 
(-12.90, 8.90), p > .05. 
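Why have the values of psihat changed? A little arithmetic suggests an answer: the pairdepb() values equal the difference between the two conditions’ trimmed means (rather than the trimmed mean of the difference scores), and each confidence interval equals psihat plus or minus the critical value times the standard error. The sketch below checks this in base R; the se and crit values are copied from the output (the bootstrap itself is not rerun), and this reading of the output is my inference rather than something the function documents here.

```r
bushData2 <- data.frame(
  stick_insect      = c(8, 9, 6, 5, 8, 7, 10, 12),
  kangaroo_testicle = c(7, 5, 2, 3, 4, 5, 2, 6),
  fish_eye          = c(1, 2, 3, 1, 5, 6, 7, 8),
  witchetty_grub    = c(6, 5, 8, 9, 8, 7, 2, 1)
)

# Difference between the 20% trimmed means of each pair of conditions
tmeans <- sapply(bushData2, mean, trim = 0.2)
pairs  <- combn(ncol(bushData2), 2)
psihat <- unname(tmeans[pairs[1, ]] - tmeans[pairs[2, ]])

# Standard errors and bootstrap critical value copied from the output
se   <- c(0.9105859, 1.0327956, 1.9621417, 1.4930394, 1.5903354, 2.1870833)
crit <- 4.982729

test     <- psihat / se
ci.lower <- psihat - crit * se
ci.upper <- psihat + crit * se

round(data.frame(group1 = pairs[1, ], group2 = pairs[2, ],
                 test, psihat, ci.lower, ci.upper), 4)
any(abs(test) > crit)   # TRUE would indicate a significant comparison
```

Because the critical value comes from a bootstrap, your own value of $crit (and therefore the confidence intervals) will differ slightly each time you run pairdepb().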

[1] "Taking bootstrap samples. Please wait."

$test
     Group Group       test        se
[1,]     1     2  4.2097438 0.9105859
[2,]     1     3  3.8729833 1.0327956
[3,]     1     4  1.0192944 1.9621417
[4,]     2     3  0.1116291 1.4930394
[5,]     2     4 -1.1527967 1.5903354
[6,]     3     4 -0.9144599 2.1870833

$psihat
     Group Group     psihat    ci.lower  ci.upper
[1,]     1     2  3.8333333  -0.7038692  8.370536
[2,]     1     3  4.0000000  -1.1461402  9.146140
[3,]     1     4  2.0000000  -7.7768199 11.776820
[4,]     2     3  0.1666667  -7.2727438  7.606077
[5,]     2     4 -1.8333333  -9.7575433  6.090877
[6,]     3     4 -2.0000000 -12.8976429  8.897643

$crit
[1] 4.982729

Output 13.9 



13.5. Effect sizes for repeated-measures designs 


As with independent ANOVA, the best measure of the overall effect size is omega squared 
(ω²). However, just to make life even more complicated than it already is, the equations 
we’ve previously used for omega squared can’t be used for repeated-measures data. If you 
do use the same equation on repeated-measures data it will slightly overestimate the effect 
size. For the sake of simplicity some people do use the same equation for one-way 
independent and repeated-measures ANOVAs (and I’m guilty of this in another book), but if 
you want to hit simplicity in the face with Stingy the particularly poison-ridden jellyfish, 
and embrace complexity like a particularly hot date, then the equation is (hang onto your 
hats): 



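\[
\omega^2 = \frac{\left[\dfrac{k-1}{nk}\,(\mathrm{MS_M} - \mathrm{MS_R})\right]}{\mathrm{MS_R} + \dfrac{\mathrm{MS_B} - \mathrm{MS_R}}{k} + \left[\dfrac{k-1}{nk}\,(\mathrm{MS_M} - \mathrm{MS_R})\right]} \tag{13.1}
\]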
I know what you’re thinking, and it’s something along the lines of ‘are you having a 
laugh?’. Well, no, I’m not, but really the equation isn’t too bad if you break it down. 
First, there are some mean squares that we’ve come across before (and calculated before). 
There’s the mean square for the model (MS_M) and the residual mean square (MS_R), both 
of which we calculated earlier in the chapter. There’s also k, the number of conditions in 
the experiment, which for these data would be 4 (there were four animals), and there’s 
n, the number of people who took part (in this case, the number of celebrities, 8). The 
main problem is that we have all of these values from calculating everything by hand, but 
why would you ever calculate an ANOVA by hand except if you were writing a competing 
textbook? 
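Happily, R will do the hand calculation for you. The sketch below assumes that the equation has the form ω² = [((k − 1)/nk)(MS_M − MS_R)] / [MS_R + (MS_B − MS_R)/k + ((k − 1)/nk)(MS_M − MS_R)], and that MS_B is the mean square between participants; both are my reading of equation (13.1), so treat them as assumptions. The scores are retyped by hand and the mean squares are built from the usual sums of squares in base R.

```r
scores <- cbind(
  stick_insect      = c(8, 9, 6, 5, 8, 7, 10, 12),
  kangaroo_testicle = c(7, 5, 2, 3, 4, 5, 2, 6),
  fish_eye          = c(1, 2, 3, 1, 5, 6, 7, 8),
  witchetty_grub    = c(6, 5, 8, 9, 8, 7, 2, 1)
)
n <- nrow(scores)   # number of participants
k <- ncol(scores)   # number of conditions
grand <- mean(scores)

# Sums of squares for a one-way repeated-measures design
SS_T <- sum((scores - grand)^2)                  # total
SS_B <- k * sum((rowMeans(scores) - grand)^2)    # between participants
SS_M <- n * sum((colMeans(scores) - grand)^2)    # model (animal)
SS_R <- SS_T - SS_B - SS_M                       # residual

MS_B <- SS_B / (n - 1)
MS_M <- SS_M / (k - 1)
MS_R <- SS_R / ((n - 1) * (k - 1))

# Omega squared, following the assumed form of equation (13.1)
effect   <- ((k - 1) / (n * k)) * (MS_M - MS_R)
omega_sq <- effect / (MS_R + (MS_B - MS_R) / k + effect)

round(c(MS_M = MS_M, MS_R = MS_R, MS_B = MS_B, omega_sq = omega_sq), 3)
```

Writing the numerator once as effect and reusing it in the denominator makes it easier to see that omega squared is just the model’s share of the total estimated variance.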

A practical solution is to use generalized eta-squared (Bakeman, 2005) because it is 
produced by ezANOVA() in the column labelled ges. We can see from Output 13.2 that the 
value for Animal is .3274. 

I’ve mentioned at various other points that it’s actually more useful to have effect size 
measures for focused comparisons anyway (rather than the main ANOVA), and so a slightly 
easier approach to calculating effect sizes is to calculate them for the contrasts we did (see 
Output 13.5). We can use the equation that we’ve seen before to convert the t-values to r: 

\[ r = \sqrt{\frac{t^2}{t^2 + df}} \]

Remember that in section 10.7 we wrote a function to compute this called rcontrast() (which you 
should be able to use if you have the package associated with this book, DSUR, loaded; see 
section 3.4.5). We can use this function to calculate r for the contrasts we did by executing 
these commands (the values of t and df come from Output 13.5): 

rcontrast(3.149752, 21) 
rcontrast(-0.101237, 21) 
rcontrast(-1.923500, 21) 

The resulting values of r are 


[1] "r = 0.566434937677424" 

[1] "r = 0.0220863356562026" 

[1] "r = 0.387030341310243" 

which show that the difference between body parts and whole animals was a large effect 
(r = .57), between the stick insect and witchetty grub a medium effect (r = .39), but between 
the testicle and eyeball a very small effect (r = .02). 
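If you don’t have the DSUR package loaded, the conversion is a one-liner. The function below, r_from_t(), is a hypothetical stand-in that I have written for illustration (it is not the rcontrast() function itself); it assumes the usual conversion r = sqrt(t² / (t² + df)) and takes the t-values and degrees of freedom from the contrasts reported above.

```r
# Hypothetical stand-in for rcontrast(): converts a t-statistic to r
r_from_t <- function(t, df) {
  sqrt(t^2 / (t^2 + df))
}

# t-values for the three contrasts, each with 21 degrees of freedom
t_values <- c(PartvsWhole   =  3.149752,
              TesticlevsEye = -0.101237,
              StickvsGrub   = -1.923500)

r <- r_from_t(t_values, df = 21)
round(r, 2)
```

Because t is squared, the sign of the contrast is lost: r tells you how big the effect is, and you need the group means to see its direction.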



13.6. Reporting one-way repeated-measures designs 


What we report when we conduct repeated-measures ANOVA depends on how we do it. If 
you have used a traditional ANOVA approach (e.g., using ezANOVA) then you report the 
same details as with an independent ANOVA. The only additional thing we should concern 
ourselves with is reporting the corrected degrees of freedom if sphericity was violated. 
Personally, I’m also keen on reporting the results of sphericity tests as well. As with the 
independent ANOVA, the degrees of freedom used to assess the F-ratio are the degrees of 
freedom for the effect of the model (df_M = 1.60) and the degrees of freedom for the 
residuals of the model (df_R = 11.19). Remember that in this example we corrected both using the 
Greenhouse-Geisser estimates of sphericity, which is why the degrees of freedom are as 
they are. Therefore, we could report the main finding as: 

✓ Mauchly’s test indicated that the assumption of sphericity had been violated, χ²(5) = 
11.41, p < .05, therefore Greenhouse-Geisser corrected tests are reported (ε = .53). 
The results show that the time to retch was not significantly affected by the type of 
animal eaten, F(1.60, 11.19) = 3.79, p > .05, η² = .327. 

Alternatively, we could report the Huynh-Feldt corrected values: 

✓ Mauchly’s test indicated that the assumption of sphericity had been violated, χ²(5) = 
11.41, p < .05, therefore degrees of freedom were corrected using Huynh-Feldt 
estimates of sphericity (ε = .67). The results show that the time to retch was significantly 
affected by the type of animal eaten, F(2, 13.98) = 3.79, p < .05, η² = .327. 
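The two write-ups reach opposite conclusions from the same F-ratio, and the only thing that differs is the degrees of freedom. The sketch below uses base R’s pf() and pchisq() with the rounded values from the reports above, so the p-values it gives are approximate; the object names are illustrative.

```r
F_ratio <- 3.79

# Same F-ratio, evaluated against the two sets of corrected df
p_gg <- pf(F_ratio, df1 = 1.60, df2 = 11.19, lower.tail = FALSE)  # Greenhouse-Geisser
p_hf <- pf(F_ratio, df1 = 2.00, df2 = 13.98, lower.tail = FALSE)  # Huynh-Feldt

# Mauchly's test: chi-square of 11.41 on 5 degrees of freedom
p_mauchly <- pchisq(11.41, df = 5, lower.tail = FALSE)

round(c(mauchly = p_mauchly, GG = p_gg, HF = p_hf), 3)
c(mauchly = p_mauchly, GG = p_gg, HF = p_hf) < .05
```

Both corrected p-values sit very close to .05, which is a good reminder that ‘significant’ versus ‘not significant’ can hinge on the choice of correction.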

If you have done a multilevel model then you would write your results differently (you 
could also put the results in a table as in section 19.8): 

✓ The type of animal consumed had a significant effect on the time taken to retch, χ²(3) 
= 12.69, p = .005. Orthogonal contrasts revealed that retching times were significantly 
quicker for animal parts (testicle and eye) compared to whole animals (stick 
insect and witchetty grub), b = 1.38, t(21) = 3.15, p = .005; there was no significant 
difference in the time to retch after eating a kangaroo testicle and a fish eye, b = 
-0.063, t(21) = -0.101, p = .920, or between eating a witchetty grub or a stick insect, 
b = -1.188, t(21) = -1.924, p = .068. 



Labcoat Leni’s Real Research 13.1 


Who’s afraid of the big bad wolf? 


Field, A. P. (2006). Journal of Abnormal Psychology, 115(4), 742-752. 


I'm going to let my ego get the better of me and talk about some of my own research. When I'm not scaring my 
students with statistics, I scare small children with Australian marsupials. There is a good reason for doing this, 
which is to try to discover how children develop fears (which will help us to prevent them). Most of my research 
looks at the effect of giving children information about animals or situations that are novel to them (rather like 
a parent, teacher or TV show would do). In one particular study (Field, 2006), I used three novel animals (the 
quoll, quokka and cuscus) and children were told negative things about one of the animals, positive things about 
another, and were given no information about the third (our control). I then asked the children to place their hands 
in three wooden boxes, each of which they believed contained one of the aforementioned animals. My hypothesis 
was that they would take longer to place their hand in the box containing the animal about which they had heard 
negative information. 

The data from this part of the study are in the file Field(2006).dat. Labcoat Leni wants you to carry out a 
one-way repeated-measures ANOVA on the times taken for children to place their hands in the three boxes (negative 
information, positive information, no information). First, draw an error bar graph of the means, then do some 
normality tests on the data, then do a log transformation on the scores, and do the ANOVA on these log-transformed 
scores (if you read the paper, you’ll notice that I found that the data were not normal, so I log-transformed them 
before doing the ANOVA). Do children take longer to put their hands in a box that they believe contains 
an animal about which they have been told nasty things? 

Answers are in the additional material on the companion website (or look at page 748 in the original 
article). 
13.7. Factorial repeated-measures designs 

We have seen already that simple between-group designs can be extended to incorporate a 
second (or third) independent variable. It is equally easy to incorporate a second, third or 
even fourth independent variable into a repeated-measures analysis. 

There is evidence from advertising research that attitudes towards stimuli can be changed 
using positive imagery (e.g., Stuart, Shimp, & Engle, 1987). As part of an initiative to stop 
binge drinking in teenagers, the government funded some scientists to look at whether 
negative imagery could be used to make teenagers’ attitudes towards alcohol more negative. 
The scientists designed a study to address this issue by comparing the effects of negative 
imagery against positive and neutral imagery for different types of drinks. Table 13.4 
illustrates the experimental design and contains the data for this example (each row represents 
a single participant). 

Participants viewed a total of nine mock adverts over three sessions. In one session, they 
saw three adverts: (1) a brand of beer (Brain Death) presented with a negative image (a 
dead body with the slogan ‘drinking Brain Death makes your liver explode’); (2) a brand 
of wine (Dangleberry) presented in the context of a positive image (a sexy naked man or 
woman - depending on the participant’s preference - and the slogan ‘drinking Dangleberry 
wine makes you irresistible’); and (3) a brand of water (Puritan) presented alongside a 
neutral image (a person watching television accompanied by the slogan ‘drinking Puritan water 
makes you behave completely normally’). In a second session (a week later), the participants 
saw the same three brands, but this time Brain Death was accompanied by the positive 
imagery, Dangleberry by the neutral image and Puritan by the negative. In a third session, the 
participants saw Brain Death accompanied by the neutral image, Dangleberry by the 
negative image and Puritan by the positive. After each advert participants were asked to rate the 
drinks on a scale ranging from -100 (dislike very much) through 0 (neutral) to 100 (like very 
much). The order of adverts was randomized, as was the order in which people participated 
in the three sessions. This design is quite complex. There are two independent variables: 
the type of drink (beer, wine or water) and the type of imagery used (positive, negative or 
neutral). These two variables completely cross over, producing nine experimental conditions. 


Table 13.4 Data from Attitude.dat 

Drink            Beer               Wine              Water
Imagery    +ve   -ve  Neut    +ve   -ve  Neut    +ve   -ve  Neut
Male         1     6     5     38    -5     4     10   -14    -2
            43    30     8     20   -12     4      9   -10   -13
            15    15    12     20   -15     6      6   -16     1
            40    30    19     28    -4     0     20   -10     2
             8    12     8     11    -2     6     27     5    -5
            17    17    15     17    -6     6      9    -6   -13
            30    21    21     15    -2    16     19   -20     3
            34    23    28     27    -7     7     12   -12     2
            34    20    26     24   -10    12     12    -9     4
            26    27    27     23   -15    14     21    -6     0
Female       1   -19   -10     28   -13    13     33    -2     9
             7   -18     6     26   -16    19     23   -17     5
            22    -8     4     34   -23    14     21   -19     0
            30    -6     3     32   -22    21     17   -11     4
            40    -6     0     24    -9    19     15   -10     2
            15    -9     4     29   -18     7     13   -17     8
            20   -17     9     30   -17    12     16    -4    10
             9   -12    -5     24   -15    18     17    -4     8
            14   -11     7     34   -14    20     19    -1    12
            15    -6    13     23   -15    15     29    -1    10
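If ‘completely cross over’ feels abstract, the design can be laid out in a couple of lines of base R. The sketch below uses expand.grid() to list every drink and imagery combination, and then writes out the three sessions as described in the text to confirm that each combination is shown exactly once; the object names are illustrative.

```r
drink   <- c("Beer", "Wine", "Water")
imagery <- c("Positive", "Negative", "Neutral")

# Crossing the two variables gives the nine experimental conditions
conditions <- expand.grid(imagery = imagery, drink = drink)
nrow(conditions)

# The three sessions as described: each drink appears once per session
sessions <- data.frame(
  session = rep(1:3, each = 3),
  drink   = rep(drink, times = 3),
  imagery = c("Negative", "Positive", "Neutral",    # session 1
              "Positive", "Neutral",  "Negative",   # session 2
              "Neutral",  "Negative", "Positive")   # session 3
)

# Every drink x imagery combination occurs exactly once across sessions
table(sessions$drink, sessions$imagery)
```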



13.7.1. Entering the data 


The data for the example can be found in the file Attitude.dat. You can load this data file 
by setting your working directory to the appropriate location and executing: 


attitudeData<-read.delim("Attitude.dat", header = TRUE) 


I have again structured the data in the format that you’d be most likely to use if you had 
entered the data in another software package and followed the usual conventions. The 
data have been entered in ‘wide’ format; that is, levels of the repeated-measures variable are 
spread across different columns. 

In this experiment there are nine experimental conditions and so the data have been 
entered in nine columns (so the format is identical to Table 13.4): 


beerpos      Beer  + Sexy Person
beerneg      Beer  + Corpse
beerneut     Beer  + Person in Armchair
winepos      Wine  + Sexy Person
wineneg      Wine  + Corpse
wineneut     Wine  + Person in Armchair
waterpos     Water + Sexy Person
waterneg     Water + Corpse
waterneut    Water + Person in Armchair


There is also a column to indicate to which person each row of data belongs (called 
participant). 

As with the previous example, although the format of the data follows typical conventions, 
because of the way R handles repeated-measures designs we need the data to be in 
the long format. We can again do this using the melt() function. We specify columns in the 
data that identify characteristics of the scores (such as, from whom they originate) using 
the id option, and columns that identify the scores themselves using the measured option. 
In this case our scores are split over nine columns (beerpos, beerneg, beerneut, winepos, 
wineneg, wineneut, waterpos, waterneg, waterneut), so these are our measured variables, 
and participant tells us from whom the scores originate so this is our id variable. We can 
create a new dataframe (called longAttitude) by executing: 

longAttitude<-melt(attitudeData, id = "participant", measured = c("beerpos", 
"beerneg", "beerneut", "winepos", "wineneg", "wineneut", "waterpos", "waterneg", 
"waterneut")) 
This dataframe contains three columns: the first identifies the participant, the second 
identifies the name of the column from which the data originate, and the third contains the 
attitude scores. By default, these columns will be named participant, variable, and value, 
which are not the most helpful of labels. Let’s rename these columns so that we actually 
know what they represent by executing: 

names(longAttitude)<-c("participant", "groups", "attitude") 
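If you are curious about what this restructuring involves, the same shape can be produced with base R’s stack() function. This is an alternative sketch, not the approach used in this chapter: it uses only the first two rows of Table 13.4 to keep things short, and the labels P1 and P2, along with the object names wide, scoreCols, stacked and long, are assumptions made for illustration.

```r
# First two rows of Table 13.4 (participant labels assumed)
wide <- data.frame(
  participant = c("P1", "P2"),
  beerpos  = c(1, 43),  beerneg  = c(6, 30),    beerneut  = c(5, 8),
  winepos  = c(38, 20), wineneg  = c(-5, -12),  wineneut  = c(4, 4),
  waterpos = c(10, 9),  waterneg = c(-14, -10), waterneut = c(-2, -13)
)

# Columns holding scores (everything except the identifier)
scoreCols <- setdiff(names(wide), "participant")

# stack() puts the scores in 'values' and the source column name in 'ind'
stacked <- stack(wide[, scoreCols])

# Assemble the long dataframe with the column names used in the text
long <- data.frame(
  participant = rep(wide$participant, times = length(scoreCols)),
  groups      = stacked$ind,
  attitude    = stacked$values
)

head(long)
```

Each participant now contributes nine rows, one per original column, which is exactly the structure that the model functions expect.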

The variable groups is a mixture of our two predictor variables (imagery and type of 
drink). Note, for example, that the first 60 rows are scores for the drink beer and within 
these 60 rows, 20 are positive imagery, 20 negative imagery and 20 neutral imagery. We 
therefore need to create two variables that dissociate the type of imagery from the type of 
drink; these two variables will be the two predictors in our model. First, let’s create a 
variable called drink, which specifies whether beer, wine or water was in the advert. We can 
do this using the gl() function that we used earlier in the chapter. Execute this command: 

longAttitude$drink<-gl(3, 60, labels = c("Beer", "Wine", "Water")) 

This creates a variable drink in the dataframe longAttitude. The numbers in the function
tell R that we had three sets of 60 scores, which correspond to the type of drink. Essentially,
this will create 60 rows with the label Beer, then 60 labelled Wine, then 60 labelled Water.

We also need a variable that tells us the type of imagery that was used. To do this we 
want three groups that each contain 20 scores. This will create 60 cases (3 x 20 = 60), or, 
put another way, it will create the codes for the first level (beer) of the drink variable. We 
want this pattern to be repeated for the remaining 2 levels of drink (i.e., wine and water). 
We can do this by adding a third value to the function that is the total number of cases (i.e., 
180). By specifying the total number of cases the gl() function will repeat the pattern of 
codes until it reaches this total number of cases:

longAttitude$imagery<-gl(3, 20, 180, labels = c("Positive", "Negative", "Neutral"))

The data now look like this (edited and ordered by participant):

    participant    groups attitude drink  imagery
1            P1   beerpos        1  Beer Positive
21           P1   beerneg        6  Beer Negative
41           P1  beerneut        5  Beer  Neutral
61           P1   winepos       38  Wine Positive
81           P1   wineneg       -5  Wine Negative
101          P1  wineneut        4  Wine  Neutral
121          P1  waterpos       10 Water Positive
141          P1  waterneg      -14 Water Negative
161          P1 waterneut       -2 Water  Neutral
10          P10   beerpos       26  Beer Positive
30          P10   beerneg       27  Beer Negative
50          P10  beerneut       27  Beer  Neutral
70          P10   winepos       23  Wine Positive
90          P10   wineneg      -15  Wine Negative
110         P10  wineneut       14  Wine  Neutral
130         P10  waterpos       21 Water Positive
150         P10  waterneg       -6 Water Negative
170         P10 waterneut        0 Water  Neutral

Notice that each participant (identified by the participant variable) has nine scores
(distinguished by the variables drink and imagery). In the reformatted data the nine scores
within each participant are now represented by nine different rows rather than nine
columns as they were before.
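
As a quick check on how gl() lays out its codes, the two calls used above can be run on their own, outside the dataframe. This is only a sketch: drink and imagery here are free-standing vectors created for illustration rather than columns of longAttitude.

```r
# gl(n, k, length) creates n levels, each repeated k times in a row,
# and recycles that pattern until 'length' values have been produced.
drink <- gl(3, 60, labels = c("Beer", "Wine", "Water"))
imagery <- gl(3, 20, 180, labels = c("Positive", "Negative", "Neutral"))

# Every drink should be paired with every type of imagery exactly 20 times.
table(drink, imagery)
```

If the codes are laid out as intended, every cell of the resulting table contains 20, matching the 20 participants who rated each combination.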

SELF-TEST

Using what you learnt earlier in the chapter and the
commands that we have just used to create drink
and imagery, can you work out how to enter the
data into R directly?


13.7.2. Exploring the data


As ever, we’ll look at some graphs first. To save space we’ll look just at the boxplots at this 
stage. 



SELF-TEST

Use ggplot2 to plot boxplots of the attitude scores
for each type of drink (x-axis) after adverts using
different imagery (different plots).


The resulting plot (Figure 13.6) shows that the median scores are highest (in general) after 
positive imagery, and lowest after negative. Scores are most spread out for beer (the box 
and whiskers are longest). 


FIGURE 13.6 Boxplots of the attitude data
[Three panels, one per type of imagery (Positive, Negative, Neutral); x-axis: Type of Drink (Beer, Wine, Water)]


We have previously used the by() function and the stat.desc() function in the pastecs 
package to get descriptive statistics for separate groups (see Chapter 5 for more detail). We 
also saw in the previous chapter that if we want to create descriptives for a combination 
of variables we can simply list all of the variables in the list() function; therefore, to get 
descriptive statistics for the combined levels of drink and imagery we execute: 

by(longAttitude$attitude, list(longAttitude$drink, longAttitude$imagery), 
stat.desc, basic = FALSE) 

The resulting (edited) output is in Output 13.10. From this table we can see that the
variability among scores was greatest when beer was used as a product (compare the standard
deviations of the beer variables against the others). Also, when a corpse image was used
(negative imagery), the ratings given to the products were negative (as expected) for wine
and water but not for beer (so for some reason, negative imagery didn’t seem to work when
beer was used as a stimulus). The values in this table will help us later to interpret the main
effects of the analysis.

: Beer
: Positive
 median    mean SE.mean CI.mean.0.95     var std.dev coef.var
 18.500  21.050   2.909        6.088 169.208  13.008    0.618

: Wine
: Positive
 median    mean SE.mean CI.mean.0.95     var std.dev coef.var
 25.000  25.350   1.507        3.153  45.397   6.738    0.266

: Water
: Positive
 median    mean SE.mean CI.mean.0.95     var std.dev coef.var
 17.000  17.400   1.582        3.311  50.042   7.074    0.407

: Beer
: Negative
 median    mean SE.mean CI.mean.0.95     var std.dev coef.var
   0.00    4.45    3.87         8.10  299.42   17.30     3.89

: Wine
: Negative
 median    mean SE.mean CI.mean.0.95     var std.dev coef.var
-13.500 -12.000   1.382        2.893  38.211   6.181   -0.515

: Water
: Negative
 median    mean SE.mean CI.mean.0.95     var std.dev coef.var
-10.000  -9.200   1.521        3.184  46.274   6.802   -0.739

: Beer
: Neutral
 median    mean SE.mean CI.mean.0.95     var std.dev coef.var
   8.00   10.00    2.30         4.82  106.00   10.30     1.03

: Wine
: Neutral
 median    mean SE.mean CI.mean.0.95     var std.dev coef.var
 12.500  11.650   1.396        2.922  38.976   6.243    0.536

: Water
: Neutral
 median    mean SE.mean CI.mean.0.95     var std.dev coef.var
   2.50    2.35    1.53         3.20   46.77    6.84     2.91

Output 13.10 
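
The columns of Output 13.10 are related to one another, and it can help to see how. The sketch below recomputes the standard error, the confidence-interval half-width and the coefficient of variation for the Beer/Positive cell from the reported mean and standard deviation. It assumes 20 participants per cell and that CI.mean.0.95 is the half-width of a t-based 95% confidence interval; the object names are chosen for illustration.

```r
# Values for the Beer/Positive cell, copied from Output 13.10.
n <- 20              # participants contributing to each cell
cell_mean <- 21.05
cell_sd <- 13.008

se <- cell_sd / sqrt(n)                  # standard error of the mean
ci_half <- qt(0.975, df = n - 1) * se    # half-width of the 95% confidence interval
cv <- cell_sd / cell_mean                # coefficient of variation

round(c(SE.mean = se, CI.mean.0.95 = ci_half, coef.var = cv), 3)
```

The rounded results (2.909, 6.088 and 0.618) agree with the corresponding entries in the output.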

13.7.3. Setting contrasts


As with one-way repeated-measures designs, we need to set contrasts before proceeding to 
the main analysis. Just to remind you, these contrasts are important because (1) they help 
us to break down any main effects or interactions into more interpretable effects, and (2) 
if you use Type III sums of squares then you must first set (orthogonal) contrasts for these 
to be computed correctly. In this example, we need to set contrasts for both drink and 
imagery. 

For drink, we have two alcoholic drinks (beer and wine) and one non-alcoholic (water). 
The government is interested in preventing binge drinking, so water is an obvious control 
group. Therefore, our first contrast should compare the alcoholic drinks (beer and wine) 
to water (the control). We need a second contrast then to separate the beer and wine. 
Therefore, contrast 1 answers the question ‘are the effects different for alcoholic and
non-alcoholic drinks?’ and contrast 2 answers the question ‘are the effects different for different
types of alcoholic drink?’ The resulting codes are in Table 13.5.


Table 13.5 Orthogonal contrasts for the drink variable

Group    Contrast 1    Contrast 2
Beer          1            -1
Wine          1             1
Water        -2             0


To set these orthogonal contrasts (see Chapter 10) we can first create variables representing
each contrast (which is useful because you can give the contrasts informative names),
and then bind these variables together and set them as the contrast for drink:

AlcoholvsWater<-c(1, 1, -2)

BeervsWine<-c(-1, 1, 0)

contrasts(longAttitude$drink)<-cbind(AlcoholvsWater, BeervsWine)

The first two commands each create a variable relating to a contrast that contains the codes
for each group from Table 13.5. The final command sets these variables to be the contrasts
for drink.
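
A simple way to confirm that a pair of contrasts is orthogonal is to check that each set of codes sums to zero and that the products of the codes also sum to zero. The following sketch does this for the codes in Table 13.5.

```r
AlcoholvsWater <- c(1, 1, -2)
BeervsWine <- c(-1, 1, 0)

# The codes within each contrast should sum to zero.
sum(AlcoholvsWater)
sum(BeervsWine)

# The products of the codes across the two contrasts should also sum to zero.
sum(AlcoholvsWater * BeervsWine)
```

All three sums are zero, so the two contrasts are orthogonal. The same check can be applied to any other set of contrast codes.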

For imagery, the government is interested in preventing binge drinking and so their main 
hypothesis is about whether negative imagery is effective compared to other forms usually 
used in advertising (i.e., positive or neutral). Therefore, our first contrast should compare 
negative imagery to other forms (positive and neutral combined). We need a second contrast
then to separate the positive and neutral imagery. Therefore, contrast 1 answers the
question ‘are the effects different for negative imagery compared to other forms?’ and 
contrast 2 answers the question ‘are the effects different for positive and neutral imagery?’ 
The resulting codes are in Table 13.6. 


Table 13.6 Orthogonal contrasts for the imagery variable

Group       Contrast 1    Contrast 2
Positive         1            -1
Negative        -2             0
Neutral          1             1

We can set these contrasts in the same way as for drink. We first create variables
representing each contrast, and then bind these variables together and set them as the contrast
for imagery:

NegativevsOther<-c(1, -2, 1)

PositivevsNeutral<-c(-1, 0, 1)

contrasts(longAttitude$imagery)<-cbind(NegativevsOther, PositivevsNeutral)

The first two commands each create a variable relating to a contrast that contains the
codes for each group from Table 13.6. The final command sets these variables to be the
contrasts for imagery.

We can check that we have set the contrasts correctly by executing the name of the
variable and looking at the contrast attribute:

longAttitude$drink

attr(,"contrasts")
      AlcoholvsWater BeervsWine
Beer               1         -1
Wine               1          1
Water             -2          0
Levels: Beer Wine Water

longAttitude$imagery

attr(,"contrasts")
         NegativevsOther PositivevsNeutral
Positive               1                -1
Negative              -2                 0
Neutral                1                 1
Levels: Positive Negative Neutral

Remembering that positive numbers are compared with negative and a zero means that 
the group is not involved at all, we can see that for drink, contrast 1 compares water to 
the other drinks and contrast 2 compares wine and beer; for imagery, contrast 1 compares 
negative imagery to other forms, and contrast 2 compares positive and neutral imagery. 


13.7.4. Factorial repeated-measures ANOVA


As with one-way repeated-measures ANOVA, we can do a fairly easy analysis using the
ezANOVA() function. Earlier in the chapter we saw that all we need to do is to specify our
repeated-measures predictor within the option labelled within = .(). In the previous example,
we specified a single variable within the brackets, but when we have several predictors we
can simply list the predictors separated by commas. The rest of the function is similar to the
previous example. For these data, therefore, we would execute these commands:

attitudeModel<-ezANOVA(data = longAttitude, dv = .(attitude), wid = .(participant),
within = .(imagery, drink), type = 3, detailed = TRUE)
attitudeModel

This creates a model (attitudeModel) from our dataframe (longAttitude). Note that we
have set the variable attitude as the outcome (dv = .(attitude)), we have told the function
that participants can be identified by the variable participant (wid = .(participant)), the
predictors are drink and imagery (within = .(imagery, drink)), that we want Type III sums
of squares (type = 3), and that we want to see the sums of squares in the output (detailed
= TRUE). Executing the second command (attitudeModel) simply prints the model to the
console.
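
For readers who want to see the structure of a two-way repeated-measures model without installing the ez package, the sketch below fits a comparable model with aov() from base R. This is a substitute technique rather than the ezANOVA() call described above: it uses simulated scores (simData is a hypothetical stand-in for longAttitude), and aov() does not report Mauchly's test or the sphericity corrections.

```r
set.seed(13)  # arbitrary seed so that the simulated scores are reproducible

# Simulated stand-in for longAttitude: 20 participants x 3 drinks x 3 imagery types.
simData <- expand.grid(
  participant = factor(paste0("P", 1:20)),
  drink = factor(c("Beer", "Wine", "Water")),
  imagery = factor(c("Positive", "Negative", "Neutral"))
)
simData$attitude <- rnorm(nrow(simData), mean = 10, sd = 10)

# Both predictors vary within participants, so both (and their interaction)
# are placed inside the Error() term, nested within participant.
simModel <- aov(attitude ~ drink * imagery + Error(participant / (drink * imagery)),
                data = simData)
summary(simModel)
```

Because the scores are random, the F-ratios themselves are meaningless here; the point is that each effect is tested against its own error term, with the same degrees of freedom as in the design described in the text (2 and 38 for each main effect, 4 and 76 for the interaction).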

$ANOVA
         Effect DFn DFd   SSn  SSd      F        p p<.05   ges
1   (Intercept)   1  19 11218 1920 111.01 2.26e-09     * 0.413
2         drink   2  38  2092 7786   5.11 1.09e-02     * 0.116
3       imagery   2  38 21629 3353 122.56 2.68e-17     * 0.575
4 drink:imagery   4  76  2624 2907  17.15 4.59e-10     * 0.141

$`Mauchly's Test for Sphericity`
         Effect     W        p p<.05
2         drink 0.267 6.95e-06     *
3       imagery 0.662 2.45e-02     *
4 drink:imagery 0.595 4.36e-01

$`Sphericity Corrections`
         Effect   GGe    p[GG] p[GG]<.05   HFe    p[HF] p[HF]<.05
2         drink 0.577 2.98e-02         * 0.591 2.88e-02         *
3       imagery 0.747 1.76e-13         * 0.797 3.14e-14         *
4 drink:imagery 0.798 1.90e-08         * 0.979 6.81e-10         *

Output 13.11

Output 13.11 shows the results from ezANOVA(). Mauchly’s sphericity test (see section
13.2.3) is computed for each of the three effects in the model (two main effects and one
interaction). The significance values of these tests indicate that both the main effects of
drink (p < .001) and imagery (p = .025) have violated this assumption and so the F-values
should be corrected (see Jane Superbrain Box 13.3). For the interaction the assumption of
sphericity is met (because p = .436) and so we need not correct the F-ratio for this effect.

Output 13.11 also shows the results of the ANOVA (and whether each effect is significant
after correcting the F-values for sphericity). Looking at the significance values, it is
clear that there is a significant effect of the type of drink used as a stimulus (sphericity
was violated, so for this effect we look at p[GG] or p[HF], both of which are significant
because they are less than .05), a significant main effect of the type of imagery used (again
sphericity was violated, so we look at p[GG] or p[HF], both of which are significant), and
a significant interaction between these two variables (sphericity was not violated, so we
can look at p in the table labelled $ANOVA). I will examine each of these effects in turn.


13.7.4.1. The effect of drink


Output 13.11 told us that the effect of the type of drink used in the advert was significant. 
For this effect we looked at one of the corrected significance values because sphericity was 
violated (see above). Both of the corrected values were significant and so we may as well 
report the conservative Greenhouse-Geisser corrected values of the degrees of freedom. 
This effect tells us that if we ignore the type of imagery that was used, participants still 
rated some types of drink significantly differently. 
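
To make the Greenhouse-Geisser correction concrete, the sketch below applies it to the main effect of drink using the rounded values printed in Output 13.11. The object names are chosen for illustration, and because the inputs are rounded the result will only approximately match the p[GG] value in the output.

```r
# Values for the main effect of drink, copied from Output 13.11.
F_drink <- 5.11
df_effect <- 2
df_error <- 38
gg_epsilon <- 0.577

# The correction multiplies both degrees of freedom by the epsilon estimate.
df_effect_gg <- gg_epsilon * df_effect   # 1.154
df_error_gg <- gg_epsilon * df_error     # 21.926

# The same F-ratio is then evaluated against the corrected degrees of freedom.
p_gg <- pf(F_drink, df_effect_gg, df_error_gg, lower.tail = FALSE)
c(df_effect_gg, df_error_gg, p_gg)
```

The F-ratio itself is unchanged by the correction; only the degrees of freedom, and therefore the p-value, are adjusted. The effect would be reported with the corrected degrees of freedom, approximately F(1.15, 21.93) = 5.11.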

To interpret this effect we should plot the means and look at some descriptive statistics. 




SELF-TEST

Using ggplot2 and stat.desc, plot an error bar graph
and get the means for the main effect of drink.

FIGURE 13.7

Output 13.12 shows the means for the main effect of drink (see footnote 8). Figure 13.7 uses this
information to display the means for each condition. It is clear from this graph that beer and
wine were rated higher than water (with beer being rated most highly). To see the nature
of this effect we could look at the post hoc tests (see below) and the contrasts. However,
because we have a significant interaction effect it does not make sense (despite the fact that
Type III sums of squares make you think that it does) to interpret this main effect because
it is superseded by the interaction with imagery.

longAttitude$drink: Beer
 median    mean SE.mean CI.mean.0.95     var std.dev coef.var
  12.50   11.83    1.97         3.95  233.46   15.28     1.29

longAttitude$drink: Wine
 median    mean SE.mean CI.mean.0.95     var std.dev coef.var
  12.00    8.33    2.17         4.33  281.51   16.78     2.01

longAttitude$drink: Water
 median    mean SE.mean CI.mean.0.95     var std.dev coef.var
   3.50    3.52    1.67         3.34  166.69   12.91     3.67


Output 13.12 


8 These means are obtained by taking the average of the means in Output 13.10 for a given
condition. For example, the mean for the beer condition (ignoring imagery) is

$$\bar{X}_{\text{Beer}} = \frac{\bar{X}_{\text{Beer + Sexy}} + \bar{X}_{\text{Beer + Corpse}} + \bar{X}_{\text{Beer + Neutral}}}{3} = \frac{21.05 + 4.45 + 10.00}{3} = 11.83$$


13.7.4.2. The effect of imagery

Output 13.11 also indicates that the effect of the type of imagery used in the advert had a
significant influence on participants’ ratings of the stimuli. Again, we must look at one of
the corrected significance values because sphericity was violated (see above). Both of the
corrected values are highly significant and so we can again report the Greenhouse-Geisser
corrected values of the degrees of freedom. This effect tells us that if we ignore the type
of drink that was used, participants’ ratings of those drinks were different according to the
type of imagery that was used.

To interpret this effect we should plot the means and look at some descriptive statistics. 




SELF-TEST

Using ggplot2 and stat.desc, plot an error bar graph
and get the means for the main effect of imagery.


Output 13.13 shows the means for the main effect of imagery. Figure 13.8 uses this 
information to illustrate the means for each condition. It is clear from this graph that 
positive imagery resulted in very positive ratings (compared to the neutral imagery) and 
negative imagery resulted in negative ratings (especially compared to the effect of neutral 
imagery). Remember that because we have a significant interaction effect it does not make 
sense to interpret this main effect because it is superseded by the interaction with drink. 

longAttitude$imagery: Positive
 median    mean SE.mean CI.mean.0.95     var std.dev coef.var
 20.500  21.267   1.265        2.531  95.962   9.796    0.461

longAttitude$imagery: Negative
 median    mean SE.mean CI.mean.0.95     var std.dev coef.var
  -9.00   -5.58    1.71         3.43  176.15   13.27    -2.38

longAttitude$imagery: Neutral
 median    mean SE.mean CI.mean.0.95     var std.dev coef.var
   7.00    8.00    1.14         2.29   78.44    8.86     1.11



Output 13.13 


FIGURE 13.8 Error bar graph of the main effect of imagery
[x-axis: Type of Imagery (Positive, Negative, Neutral)]











13.7.4.3. The interaction effect (drink x imagery)


Output 13.11 indicated that imagery interacted in some way with the type of drink used
as a stimulus. From the output we should report that there was a significant interaction
between the type of drink used and imagery associated with it, F(4, 76) = 17.15, p < .001.
This effect tells us that the type of imagery used had a different effect depending on which
type of drink it was presented alongside. We can use the means in Output 13.10 to determine
the nature of this interaction.



SELF-TEST

Using ggplot2, plot a line graph with error bars of the
means for the drink x imagery interaction.



FIGURE 13.9 Interaction graph for the attitude data. The three lines represent the
type of imagery: positive imagery (light blue), negative imagery (black) and neutral
imagery (dark blue)

Figure 13.9 shows the interaction graph, and we are looking for non-parallel lines.
The graph shows that the pattern of responding across drinks was similar when positive
and neutral imagery were used. That is, ratings were positive for beer, they were slightly
higher for wine and then they went down slightly for water. The fact that the line
representing positive imagery is higher than the neutral line indicates that positive imagery gave
rise to higher ratings than neutral imagery across all drinks. The bottom line (representing
negative imagery) shows a different effect: ratings were lower for wine and water but not
for beer. Therefore, negative imagery had the desired effect on attitudes towards wine and
water, but for some reason attitudes towards beer remained fairly neutral. Therefore, the
interaction is likely to reflect the fact that negative imagery has a different effect than both
positive and neutral imagery (because it decreases ratings rather than increasing them).
This interaction is completely in line with the experimental predictions. To verify the
interpretation of the interaction effect, we can look at some post hoc tests or the contrasts
(we’ll get to those later).

To get post hoc tests for the interaction term we need to use a variable that combines
imagery and drink into a single coding variable. Fortunately we already have such a
variable in the data set; it’s called groups and it was created when we converted our original
data set from wide format to long. We can use this variable in the pairwise.t.test() function
to get comparisons between all nine groups that the interaction term encompasses. The
format of this function is exactly the same as we used earlier in the chapter, and we execute:

pairwise.t.test(longAttitude$attitude, longAttitude$groups, paired = TRUE, 
p.adjust.method = "bonferroni") 

Output 13.14 shows the results of the post hoc tests. We can see that for beer there are
significant differences between positive imagery and both negative
(p = .002) and neutral (p = .020), but not between negative and neutral (p = 1.00); for wine,
there are significant differences between positive imagery and both negative (p < .001) and
neutral (p < .001), and between negative and neutral (p < .001); and for water, there are
significant differences between positive imagery and both negative (p < .001) and neutral
(p < .001), and between negative and neutral (p < .001). These findings support our earlier
conclusion that beer is unusual in that negative imagery does not appear to reduce attitudes
compared to neutral imagery.

Pairwise comparisons using paired t tests
data: longAttitude$attitude and longAttitude$groups

          beerpos beerneg beerneut winepos wineneg wineneut waterpos waterneg
beerneg   0.00217 -       -        -       -       -        -        -
beerneut  0.01982 1.00000 -        -       -       -        -        -
winepos   1.00000 0.01105 0.00310  -       -       -        -        -
wineneg   5.6e-08 0.00265 2.0e-07  1.9e-10 -       -        -        -
wineneut  0.39905 1.00000 1.00000  2.2e-05 2.3e-07 -        -        -
waterpos  1.00000 0.47584 1.00000  0.07300 1.3e-09 0.10547  -        -
waterneg  2.9e-06 0.18860 0.00010  3.2e-10 1.00000 1.1e-07  4.9e-11  -
waterneut 0.00212 1.00000 0.74838  4.3e-10 0.00041 8.1e-05  9.0e-07  0.00068

P value adjustment method: bonferroni

Output 13.14
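
The many values of 1.00000 in Output 13.14 are a consequence of the Bonferroni adjustment, which multiplies each p-value by the number of comparisons and caps the result at 1. The sketch below illustrates this with the number of comparisons implied by nine groups; the unadjusted p-values are hypothetical and are used only for illustration.

```r
# Nine groups give choose(9, 2) pairwise comparisons.
n_comparisons <- choose(9, 2)

# Hypothetical unadjusted p-values, for illustration only.
p_raw <- c(0.00001, 0.001, 0.02, 0.5)

# Bonferroni: multiply each p-value by the number of comparisons, capping at 1.
p_manual <- pmin(1, p_raw * n_comparisons)
p_builtin <- p.adjust(p_raw, method = "bonferroni", n = n_comparisons)

n_comparisons
all.equal(p_manual, p_builtin)
```

With 36 comparisons, any unadjusted p-value above about .028 (that is, 1/36) is reported as 1 after adjustment, which is why so many entries in the table take that value.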


13.7.5. Factorial repeated-measures designs as a GLM


Earlier in the chapter we used lme(), which looks at repeated-measures data in a linear 
model. I have already outlined various advantages to this method. It is a simple matter to 
extend what we have already learnt to a situation that includes more than one predictor 
or independent variable. All we do is extend the model to include each predictor and any 
interactions. We saw earlier that if we want to look at individual effects then we should 
build up the model from a baseline that includes no predictors other than the intercept. 

baseline<-lme(attitude ~ 1, random = ~1|participant/drink/imagery, data =
longAttitude, method = "ML")

Compare this model with the one that we used for the one-way repeated-measures design
earlier in the chapter. We have specified the model as the outcome predicted only from the
intercept (attitude ~ 1), specified the relevant dataframe (data = longAttitude), and asked
to use maximum likelihood to estimate the model (method = "ML"). The main thing that
has changed is the random part of the model, which is slightly more complex than before
to reflect the fact that there are now two predictors. The random part of the model (random
= ~1|participant/drink/imagery) simply tells R that the variables drink and imagery are
nested within the variable participant (in other words, scores for levels of these variables
can be found within each participant). Execute the above command to create the baseline
model.

If we want to see the overall effect of each predictor then we need to add them one at 
a time. To add drink to the model we could just change the model from attitude ~ 1 to 
attitude ~ drink. In other words, execute: 

drinkModel<-lme(attitude ~ drink, random = ~1|participant/drink/imagery, data
= longAttitude, method = "ML")

However, it is quicker to use the update() function (see R’s Souls’ Tip 7.2):

drinkModel<-update(baseline, .~. + drink)

This command takes the model called baseline (which we have already created), and the
.~. means keep the outcome and predictors the same as the baseline model (the dots mean
‘keep the same’, so the fact that we put dots on both sides of the ~ means that we want
to keep both the outcome and predictors the same as in the baseline model). The ‘+ drink’
means ‘add drink as a predictor’. Therefore, ‘.~. + drink’ can be interpreted as ‘keep
the same outcomes and predictors as the baseline model but add drink as a predictor’.
Executing this command creates a model called drinkModel that includes only drink as a
predictor.
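
The behaviour of update() is easy to inspect on a small example. The sketch below uses lm() on simulated data rather than lme(), purely so that it runs with base R alone; the dataframe toy and the model names are hypothetical, and update() treats the formula in the same way for both kinds of model.

```r
set.seed(7)  # arbitrary seed for the simulated data

# Hypothetical data with an outcome (y) and two predictors (x1 and x2).
toy <- data.frame(y = rnorm(30), x1 = rnorm(30), x2 = rnorm(30))

baseModel <- lm(y ~ 1, data = toy)          # intercept only
x1Model <- update(baseModel, . ~ . + x1)    # same outcome and predictors, plus x1
x2Model <- update(x1Model, . ~ . + x2)      # same again, plus x2

formula(baseModel)
formula(x1Model)
formula(x2Model)
```

Printing the three formulae shows the model growing by one predictor at each step, which is the pattern used to build up the models in this section.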

In a similar way we can add imagery to the model as a predictor:

imageryModel<-update(drinkModel, .~. + imagery)

This command takes the model called drinkModel (which we have just created); as before,
the .~. means keep the outcome and predictors the same as in drinkModel and the ‘+
imagery’ adds imagery as a predictor. Therefore, ‘.~. + imagery’ can be interpreted as
‘keep the same outcomes and predictors as drinkModel but add imagery as a predictor’.
Executing this command creates a model called imageryModel that includes both drink and
imagery as predictors.

Finally, we can add the interaction term by executing:

attitudeModel<-update(imageryModel, .~. + drink:imagery)

This command takes the model called imageryModel (which we have just created); as
before, the .~. means keep the outcome and predictors the same as in imageryModel, and
the ‘+ drink:imagery’ adds the drink x imagery interaction as a predictor. Executing this
command, therefore, creates a model called attitudeModel that includes the main effects of
drink and imagery as well as their interaction as predictors.

To compare these models we can list them in the order in which we want them compared
in the anova() function (see 7.8.4.2):

anova(baseline, drinkModel, imageryModel, attitudeModel)

Executing the above command produces Output 13.15, which first compares the effect
of drink to the baseline (i.e., no predictors). By adding drink as a predictor we increase
the degrees of freedom by 2 (the two contrasts that we used to code this variable) and
significantly improve the model. In other words, the type of drink had a significant effect
on attitudes, χ²(2) = 9.1, p = .010. Next, we see the effect of adding the main effect of
imagery into the model (compared to the previous model that contained only the effect of
drink). Again the degrees of freedom are increased by 2 (the two contrasts used to code
this variable) and the fit of the model is significantly improved; the type of imagery used
in the advert had a significant effect on attitudes, χ²(2) = 151.9, p < .001. The final model
(which includes both main effects and the interaction between them) is then compared
to the previous model (that includes only the two main effects). The interaction term is
made up of four contrasts (the number of contrasts for each variable in the interaction
multiplied) and significantly improves the model fit; therefore, attitudes were significantly
affected by the combined effect of the type of drink and type of imagery, χ²(4) = 42.0,
p < .001. These results confirm the overall effects that we looked at with ezANOVA() in the
previous section, and you should look back at that section to remind yourself of how we
interpreted these effects.
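
The p-values attached to the likelihood ratios can be recovered from the chi-square distribution, using the change in degrees of freedom between successive models. The sketch below does this with the rounded likelihood ratios from Output 13.15; because the inputs are rounded, the first p-value comes out at roughly .0106 rather than the .0104 printed in the output.

```r
# Likelihood ratios and changes in degrees of freedom, copied from Output 13.15.
l_ratio <- c(9.1, 151.9, 42.0)   # drink, imagery, drink:imagery
df_change <- c(2, 2, 4)

# Upper-tail probability of the chi-square distribution for each comparison.
p_values <- pchisq(l_ratio, df = df_change, lower.tail = FALSE)
signif(p_values, 3)
```

The second and third p-values are far below .0001, which is why the output reports them simply as <.0001.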


              Model df  AIC  BIC logLik   Test L.Ratio p-value
baseline          1  5 1504 1520   -747
drinkModel        2  7 1498 1521   -742 1 vs 2     9.1  0.0104
imageryModel      3  9 1351 1379   -666 2 vs 3   151.9  <.0001
attitudeModel     4 13 1317 1358   -645 3 vs 4    42.0  <.0001

Output 13.15


We can further explore the model by executing:

summary(attitudeModel)


Output 13.16 shows the parameter estimates for the model (I’ve edited some of the names
to save space). Most important, these include the parameters for the contrasts that we set
for each variable. First, we get the two contrasts for drink, which show a significant effect
on attitudes when comparing alcoholic drinks to water, b = 2.19, t(38) = 3.18, p = .003,
but not when comparing beer with wine, b = -1.75, t(38) = -1.47, p = .150. Next, we get
the two contrasts for imagery, which show a significant effect on attitudes when comparing
negative imagery to other types, b = 6.74, t(114) = 17.26, p < .001, and when comparing
positive to neutral imagery, b = -6.63, t(114) = -9.81, p < .001. The next four effects are
the contrasts for the interaction term and we’ll look at these in turn.
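
The four interaction parameters arise because R codes an interaction by multiplying the contrast codes of the two variables involved. The sketch below shows this using model.matrix() on a small grid containing one row per cell of the design; the dataframe cells is constructed for illustration and stands in for longAttitude.

```r
# One row per cell of the 3 x 3 design; levels are kept in the order listed.
cells <- expand.grid(drink = c("Beer", "Wine", "Water"),
                     imagery = c("Positive", "Negative", "Neutral"))

# The same contrast codes as in Tables 13.5 and 13.6.
contrasts(cells$drink) <- cbind(AlcoholvsWater = c(1, 1, -2),
                                BeervsWine = c(-1, 1, 0))
contrasts(cells$imagery) <- cbind(NegativevsOther = c(1, -2, 1),
                                  PositivevsNeutral = c(-1, 0, 1))

# Design matrix: intercept + 2 drink + 2 imagery + 4 interaction columns.
mm <- model.matrix(~ drink * imagery, data = cells)
colnames(mm)

# Each interaction column is the product of the two main-effect columns.
all(mm[, "drinkBeervsWine:imageryNegativevsOther"] ==
      mm[, "drinkBeervsWine"] * mm[, "imageryNegativevsOther"])
```

The interaction columns appear in the same order as the four interaction parameters in the model summary, so each parameter can be read as ‘does this imagery contrast differ across this drink contrast?’.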


Linear mixed-effects model fit by maximum likelihood
 Data: longAttitude
   AIC  BIC logLik
  1317 1358   -645

Fixed effects: attitude ~ drink + imagery + drink:imagery
                                Value Std.Error  DF t-value p-value
(Intercept)                      7.89     0.973 114    8.12  0.0000
AlcoholvsWater                   2.19     0.688  38    3.18  0.0029
BeervsWine                      -1.75     1.191  38   -1.47  0.1500
NegativevsOther                  6.74     0.391 114   17.26  0.0000
PositivevsNeutral               -6.63     0.676 114   -9.81  0.0000
AlcoholvsWater:NegativevsOther   0.19     0.276 114    0.69  0.4922
BeervsWine:NegativevsOther       3.24     0.478 114    6.77  0.0000
AlcoholvsWater:PositivevsNeut    0.45     0.478 114    0.93  0.3533
BeervsWine:PositivevsNeut       -0.66     0.828 114   -0.80  0.4256


Output 13.16 


13.7.5.1. Alcohol vs. water, negative vs. other imagery

The first interaction term looks at the effect of alcoholic drinks (i.e., wine and beer
combined) relative to water when comparing negative imagery with other types of imagery (i.e.,
positive and neutral combined). This contrast is non-significant. This result tells us that the
decreased liking found when negative imagery is used (compared to other forms) is the same
for both alcoholic drinks and water. The top left panel of Figure 13.10 shows the means
being compared. The gap between the lines, which represents the effect of negative imagery
compared to other forms, is roughly the same for alcoholic drinks and water. This finding
indicates that the effect of negative imagery (compared to other forms) in lowering attitudes
is comparable in alcoholic and non-alcoholic drinks, b = 0.19, t(114) = 0.69, p = .492.

13.7.5.2. Beer vs. wine, negative vs. other imagery

The second interaction term looks at whether the effect of negative imagery compared
to other types of imagery (i.e., positive and neutral combined) is comparable in beer and
wine. This contrast is significant. This result tells us that the decreased liking found when
negative imagery is used (compared to other forms) is different in beer and wine. The top
right panel of Figure 13.10 shows the means being compared. The gap between the lines,
which represents the effect of negative imagery compared to other forms, is much bigger
for wine than it is for beer. This finding suggests that the effect of negative imagery
(compared to other forms) in lowering attitudes to beer was significantly smaller than for wine,
b = 3.24, t(114) = 6.77, p < .001.


FIGURE 13.10 Graphs to show the four contrasts in the drink x imagery interaction
[Four panels, one per contrast; x-axis: Type of Drink; separate lines for the types of imagery being compared]

13.7.5.3. Alcohol vs. water, positive vs. neutral imagery

The third interaction term looks at whether the effect of positive imagery (compared 
to neutral) is comparable in alcoholic drinks (i.e., wine and beer combined) relative to 
water. This contrast is non-significant. This result tells us that the increased liking found 
when positive imagery is used (compared to neutral) is similar for both alcoholic drinks 
and water. The bottom left panel of Figure 13.10 shows the means being compared. The 
distance between the lines, which represents the effect of positive imagery compared to 
neutral, is roughly the same for alcoholic drinks as it is for water. This finding suggests that positive 
imagery has a similar effect in increasing attitudes (compared to neutral imagery) in both 
alcoholic and non-alcoholic drinks, b = 0.45, t(114) = 0.93, p = .353. 

13.7.5.4. Beer vs. wine, positive vs. neutral imagery

The final interaction term looks at whether the effect of positive imagery compared to 
neutral is comparable in beer and wine. This contrast is not significant. This result tells us 
that the increased liking found when positive imagery is used (compared to neutral) is comparable 
in beer and wine. The bottom right panel of Figure 13.10 shows the means being 
compared. Note that the distance between the lines (i.e., the effect of positive imagery 
compared to neutral) is roughly the same for beer as it is for wine. In summary, the effect of 
positive imagery (compared to neutral) in increasing attitudes to beer was not significantly 
different to that for wine, b = -0.66, t(114) = -0.80, p = .426. 

13.7.5.5. Limitations of these contrasts



These contrasts, by their nature, tell us nothing about the differences between water 
and beer and wine separately, or the effect of negative imagery compared to, say, neutral 
imagery alone. If you need more comparisons, you could run post hoc tests (as explained 
earlier in the chapter). 

Although it may seem tiresome to spend so long interpreting an analysis so thoroughly, 
you are well advised to take such a systematic approach if you want to truly understand the 
effects that you obtain. Interpreting interaction terms is complex, and I can think of a few 
well-respected researchers who still struggle with them, so don’t feel disheartened if you 
find them hard. Try to be thorough, and break each effect down as much as possible using 
contrasts, and hopefully you will find enlightenment. 


CRAMMING SAM’S TIPS 


Two-way repeated-measures ANOVA 


• Two-way repeated-measures ANOVA compares several means when there are two independent variables, and the same 
participants have been used in all experimental conditions. 

• We recommend treating your data as a multilevel model (i.e., repeated-measures regression). However, if you don’t, then 
test the assumption of sphericity when you have three or more repeated-measures conditions. Test for sphericity using 
Mauchly’s test. If the p-value is less than .05 then the assumption is violated. If the significance of Mauchly’s test is greater 
than .05 then the assumption of sphericity has been met. You should test this assumption for all effects (in a two-way ANOVA 
this means you test it for the effect of both variables and the interaction term). 

• In a two-way ANOVA you will have three effects: a main effect of each variable and the interaction between the two. For each 
effect, if the assumption of sphericity has been met then use the p-value for the main ANOVA. If the assumption was violated 
then read the p-value corrected using either the Greenhouse-Geisser (p[GG]) or Huynh-Feldt (p[HF]) estimate of sphericity 
(read this chapter to find out the relative merits of the two procedures). If the p-value is less than .05 then the means of the 
groups are significantly different. 

• Break down the main effects and interaction terms using contrasts and post hoc tests; again look to the p-values to discover 
if your comparisons are significant (they will be if the significance value is less than .05). 


13.7.6. Robust factorial repeated-measures ANOVA


At the time of writing there aren’t any robust functions (that I can find) that deal with factorial 
repeated-measures designs. This is a blessing for me (I don’t have to work out how to 
use them and then write about them) but a curse for you (if you happen to have misbehaving 
data). 


13.8. Effect sizes for factorial repeated-measures designs


Calculating omega squared for one-way repeated-measures ANOVA was hair-raising 
enough, so we’ll definitely leave it alone for factorial designs (you can read this as ‘I don’t 
know how to do it’). However, ezANOVA produces generalized eta-squared (Output 
13.11); in this case the values were .575 for the main effect of imagery, .116 for the main 
effect of drink and .141 for the interaction term. This shows a relatively strong effect of 
imagery, but fairly modest effects of drink and the interaction. 

As I keep saying, effect sizes are really more useful when they describe a focused effect, 
so I’d advise calculating effect sizes for your contrasts when you’ve got a factorial design 
(and any main effects that compare only two groups). We can use the rcontrast() function 
that we used for one-way ANOVA: simply input each value of t and its associated degrees 
of freedom for each of the eight effects in Output 13.16 (remember that you have to load 
the DSUR package first). For drink there were two contrasts: alcohol vs. water 



> rcontrast(3.18, 38) 

[1] "r = 0.458457001137587"


and beer vs. wine 


> rcontrast(-1.47, 38) 

[1] "r = 0.231961343984559" 

For imagery we contrasted negative vs. other 

> rcontrast(17.26, 114) 

[1] "r = 0.850434536664415" 

and positive vs. neutral 

> rcontrast(-9.81, 114) 

[1] "r = 0.676574089263451" 

Finally, we had four interaction contrasts: alcohol vs. water with negative vs. other imagery 

> rcontrast(0.69 , 114) 

[1] "r = 0.0644898962213597" 


beer vs. wine with negative vs. other imagery 

> rcontrast(6.77 , 114) 

[1] "r = 0.535495195928629" 

alcohol vs. water with positive vs. neutral imagery 

> rcontrast(0.93 , 114) 

[1] "r = 0.0867739323982253" 


and beer vs. wine with positive vs. neutral imagery 



> rcontrast(-0.80, 114) 

[1] "r = 0.0747174253416562" 

As such, the effect in the interaction that was significant (beer vs. wine, negative vs. other 
imagery) yields a fairly large effect size. The remaining effects in the interaction, which 
were not significant, yield fairly small effect sizes (all under .1). 
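
If you don’t have the DSUR package to hand, the conversion from t to r is simple enough to do yourself. The sketch below is not the rcontrast() function itself: r_from_t() is a made-up helper name that applies the usual conversion r = sqrt(t² / (t² + df)), which reproduces the values printed above.

```r
# Hypothetical helper (not DSUR's rcontrast()): converts a t statistic and
# its degrees of freedom into the effect size r = sqrt(t^2 / (t^2 + df)).
r_from_t <- function(t, df) sqrt(t^2 / (t^2 + df))

# Contrasts for the main effect of drink (df = 38)
drink_r <- r_from_t(c(alcohol_vs_water = 3.18, beer_vs_wine = -1.47), df = 38)

# Contrasts for imagery and for the interaction (df = 114)
imagery_r <- r_from_t(c(negative_vs_other = 17.26, positive_vs_neutral = -9.81), df = 114)
interaction_r <- r_from_t(c(0.69, 6.77, 0.93, -0.80), df = 114)

print(round(drink_r, 3))
print(round(imagery_r, 3))
print(round(interaction_r, 3))
```

Because t is squared, the sign of the contrast is lost: r is always reported as a positive value, so you need to look at the means to know the direction of the effect.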


13.9. Reporting the results from factorial repeated-measures designs


As we saw before, how you report repeated-measures ANOVA depends on how you do it. 
If you have used a traditional ANOVA approach (e.g., using ezANOVA) then report it as 
you would any factorial ANOVA: remember that you’ve got three effects to report, and 
these effects might have different degrees of freedom. For the main effects of drink and 
imagery, the assumption of sphericity was violated so we’d have to report the Greenhouse- 
Geisser corrected degrees of freedom. We can, therefore, begin by reporting the violation 
of sphericity: 

• Mauchly’s test indicated that the assumption of sphericity had been violated for the 
main effects of drink, W = 0.267, p < .001, ε = .58, and imagery, W = 0.662, p < .05, 
ε = .75. Therefore degrees of freedom were corrected using Greenhouse-Geisser 
estimates of sphericity. 
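
To see where corrected degrees of freedom come from, here is a minimal sketch of the arithmetic: each degree of freedom is multiplied by the sphericity estimate. The function name gg_corrected_df() is made up for illustration, and the sample size of 20 is inferred from the error degrees of freedom of 38 used for the drink contrasts. Because the sketch uses the estimates rounded to two decimal places (.58 and .75), its results only approximately match the values in the F-ratios reported below, which are based on the unrounded estimates.

```r
# Hypothetical helper: multiply both degrees of freedom by the
# Greenhouse-Geisser estimate of sphericity (epsilon).
gg_corrected_df <- function(df_effect, df_error, epsilon) {
  c(effect = epsilon * df_effect, error = epsilon * df_error)
}

k <- 3   # levels of each repeated-measures variable
n <- 20  # participants (assumption: inferred from the error df of 38)

df_effect <- k - 1             # 2
df_error <- (k - 1) * (n - 1)  # 38

drink_df <- gg_corrected_df(df_effect, df_error, epsilon = 0.58)
imagery_df <- gg_corrected_df(df_effect, df_error, epsilon = 0.75)

print(round(drink_df, 2))
print(round(imagery_df, 2))
```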

We can then report the three effects from this analysis as follows: 

• All effects are reported as significant at p < .05. There was a significant main effect of 
the type of drink on ratings of the drink, F(1.15, 21.93) = 5.11. 

• There was also a significant main effect of the type of imagery on ratings of the 
drinks, F(1.50, 28.40) = 122.57. 

• There was a significant interaction effect between the type of drink and the type of 
imagery used, F(4, 76) = 17.16. This indicates that imagery had different effects on 
people’s ratings, depending on which type of drink was used. Bonferroni post hoc 
tests revealed that for beer there were significant differences between positive imagery 
and both negative (p = .002) and neutral (p = .020), but not between negative 
and neutral (p = 1.00); for wine, there were significant differences between positive 
imagery and both negative (p < .001) and neutral (p < .001), and between negative and 
neutral (p < .001); and for water, there were significant differences between positive 
imagery and both negative (p < .001) and neutral (p < .001), and between negative 
and neutral (p < .001). These findings suggest that beer is unusual in that negative 
imagery does not appear to reduce attitudes compared to neutral imagery. 

If you have done a multilevel model then you would write your results differently (you 
could also put the results in a table as in section 19.8): 

• The type of drink had a significant effect on attitudes, χ²(2) = 9.1, p = .010, as did 
the type of imagery used in the advert, χ²(2) = 151.9, p < .001. Most important, the 
drink × imagery interaction was significant, χ²(4) = 42.0, p < .001. Contrasts revealed 
that (1) the effect of negative imagery (compared to other forms) in lowering attitudes 
is comparable in alcoholic and non-alcoholic drinks, b = 0.19, t(114) = 0.69, 
p = .492; (2) the effect of negative imagery (compared to other forms) in lowering 
attitudes to beer was significantly smaller than for wine, b = 3.24, t(114) = 6.77, 
p < .001; (3) positive imagery has a similar effect in increasing attitudes (compared to 
neutral imagery) in both alcoholic and non-alcoholic drinks, b = 0.45, t(114) = 0.93, 
p = .353; and (4) the effect of positive imagery (compared to neutral) in increasing 
attitudes to beer was not significantly different from that for wine, b = -0.66, t(114) 
= -0.80, p = .426. 
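
If you want to check reported p-values against their test statistics, base R will do it. This is only a sketch: p_from_t() and p_from_chisq() are made-up helper names wrapping pt() and pchisq(), and because the statistics in the write-up are rounded, the recomputed p-values can differ from the reported ones in the third decimal place.

```r
# Two-tailed p-value for a t statistic with the given degrees of freedom.
p_from_t <- function(t, df) 2 * pt(-abs(t), df)

# Upper-tail p-value for a likelihood-ratio chi-square statistic.
p_from_chisq <- function(chisq, df) pchisq(chisq, df, lower.tail = FALSE)

# The four interaction contrasts, each with 114 degrees of freedom
contrast_p <- p_from_t(c(0.69, 6.77, 0.93, -0.80), df = 114)

# Drink, imagery and the drink x imagery interaction
model_p <- p_from_chisq(c(9.1, 151.9, 42.0), df = c(2, 2, 4))

print(round(contrast_p, 3))
print(signif(model_p, 2))
```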



What have I discovered about statistics?


This chapter has helped us to walk through the murky swamp of repeated-measures 
designs. We discovered that it was infested with rabid leg-eating crocodiles. The first 
thing we learnt was that with repeated-measures designs there is yet another assumption 
to worry about: sphericity. Having recovered from this shock revelation, we were fortunate 
to discover that this assumption, if violated, can be easily remedied. It can also be 
remedied by doing a multilevel model; not so easy, but as rewarding as a cocktail on a hot 
beach. We then moved on to look at the theory of repeated-measures ANOVA for one 
independent variable. Although not essential by any stretch of the imagination, this was a 
useful exercise to demonstrate that basically it’s exactly the same as when we have an independent 
design (well, there are a few subtle differences, but I was trying to emphasize the 
similarities). We then worked through an example using R, before tackling the particularly 
foul-tempered, starving hungry, and mad as Stabby the mercury-sniffing hatter, piranha 
fish of omega squared. That’s a road I kind of regretted going down after I’d started, but, 
stubborn as ever, I persevered. This led us ungracefully on to factorial repeated-measures 
designs and, specifically, the situation where we have two independent variables. We 
learnt that, as with other factorial designs, we have to worry about interaction terms. But, 
we also discovered some useful ways to break these terms down using contrasts. I kept 
going on about multilevel models a lot as well, which is a topic to which we shall return, 
but not until I’ve taken a three-month trip to the aforementioned hot beach. 

By 16 I had started my first ‘serious’ band. We actually stayed together for about 
7 years (with the same line-up, and we’re still friends now) before Mark (drummer) 
moved to Oxford, I moved to Brighton to do my Ph.D., and rehearsing became a mammoth 
feat of organization. We had a track on a CD, some radio play and transformed 
from a thrash metal band to a blend of Fugazi, Nirvana and metal. I never split my trousers 
during a gig again (although I did once split my head open). Why didn’t we make 
it? Well, Mark was an astonishingly good drummer so it wasn’t his fault, the other Mark 
was an extremely good bassist too (of the three of us he is the one that has always been 
in a band since we split up), so the weak link was me. This was especially unfortunate 
given that I had three roles in the band (guitar, singing, songs) - my poor band mates 
never stood a chance. I stopped playing music for quite a few years after we split. I still 
wrote songs (for personal consumption) but the three of us were such close friends that 
I couldn’t bear the thought of playing with other people. At least not for a few years ... 


R packages used in this chapter 


compute.es 
ez 
ggplot2 
multcomp 
nlme 
pastecs 
reshape 
WRS 


R functions used in this chapter 


Anova() 
aov() 
by() 
cast() 
contrasts() 
ezANOVA() 
ggplot() 
gl() 
glht() 
lm() 
lme() 
melt() 
pairdepb() 
rmanova() 
rmanovab() 
rmmcp() 
stat.desc() 
summary() 
update() 


Key terms that I’ve discovered 


Compound symmetry 
Greenhouse-Geisser correction 
Huynh-Feldt correction 
Lower bound 


Mauchly’s test 
Repeated-measures ANOVA 
Sphericity 


Smart Alex’s tasks 



• Task 1: Students often worry about the consistency of marking between lecturers. 
Lecturers obtain reputations for being ‘hard’ or ‘light’ markers (or to use the students’ 
terminology, ‘evil manifestations from Beelzebub’s bowels’ and ‘nice people’), 
but there is often little to substantiate these reputations. A group of students put 
the idea to the test by submitting the same essays to four different lecturers. The 
mark given by each lecturer was recorded for each of the eight essays. This design 
is repeated measures because every lecturer marked every essay. The independent 
variable was the lecturer who marked the report and the dependent variable was the 
percentage mark given. The data are in the file TutorMarks.dat. Conduct a one-way 
ANOVA on these data by hand. 

• Task 2: Repeat the analysis above using R and interpret the results. 

• Task 3: Imagine I wanted to look at the effect alcohol has on the roving eye. The ‘roving 
eye’ effect is the propensity of people in relationships to ‘eye up’ members of the 
opposite sex. I took 20 men and fitted them with incredibly sophisticated glasses that 
could track their eye movements and record both the movement and the object being 
observed (this is the point at which it should be apparent that I’m making it up as I 
go along). Over four different nights I plied these poor souls with 1, 2, 3 or 4 pints of 
strong lager in a nightclub. Each night I measured how many different women they 
eyed up (a woman was categorized as having been eyed up if the man’s eye moved 
from her head to her toe and back up again). To validate this measure we also collected 
the amount of dribble on the man’s chin while looking at a woman. The data 
are in the file RovingEye.dat. Analyse them with a one-way ANOVA. 

• Task 4: In the previous chapter we came across the beer-goggles effect, a severe perceptual 
distortion after imbibing alcohol that makes previously unattractive people 
suddenly become the hottest thing since Spicy Gonzalez’s extra-hot Tabasco-marinated 
chillies. One minute you’re standing in a zoo admiring the orang-utans, and 2 pints 
later you’re wondering why someone would put the adorable Zoe Field in a cage. 
Imagine we followed up the fabricated example from the previous chapter to look at 
whether the beer-goggles effect is made worse by the fact that it usually occurs in clubs 
that have dim lighting. We took a sample of 26 men (because the effect is stronger 
in men) and gave them various doses of alcohol over four different weeks (0 pints, 
2 pints, 4 pints and 6 pints of lager). This is our first independent variable. Each week 
(and, therefore, in each state of drunkenness) participants were asked to select a mate 
in a normal club (that had dim lighting) and then select a second mate in a specially 
designed club that had bright lighting. As such, the second independent variable was 
whether the club had dim or bright lighting. The outcome measure was the attractiveness 
of each mate as assessed by a panel of independent judges. To recap, all participants 
took part in all levels of the alcohol consumption variable, and selected mates 
in both brightly and dimly lit clubs. The data are in the file BeerGogglesLighting.dat. 
Analyse them with a two-way repeated-measures ANOVA. 

Answers can be found on the companion website. 



Further reading 


Field, A. P. (1998). A bluffer’s guide to sphericity. Newsletter of the Mathematical, Statistical and 
Computing Section of the British Psychological Society, 6(1), 13-22. (Available in the additional 
material for this chapter.) 

Howell, D. C. (2006). Statistical methods for psychology (6th ed.). Belmont, CA: Duxbury. (Or you 
might prefer his Fundamental Statistics for the Behavioral Sciences, also in its 6th edition, 2007.) 

Rosenthal, R., Rosnow, R. L., & Rubin, D. B. (2000). Contrasts and effect sizes in behavioural research: 
A correlational approach. Cambridge: Cambridge University Press. (This is quite advanced but 
really cannot be bettered for contrasts and effect size estimation.) 


Interesting real research 


Field, A. P. (2006). The behavioral inhibition system and the verbal information pathway to children’s 
fears. Journal of Abnormal Psychology, 115(4), 742-752. 






Mixed designs (GLM 5) 



FIGURE 14.1 My 18th birthday cake 



14.1. What will this chapter tell me?


Most teenagers are anxious and depressed, but I probably had more than my fair share. 
The parasitic leech that was the all-boys grammar school that I attended had feasted on my 
social skills, leaving in its wake a terrified husk. Although I had no real problem with playing 
my guitar and shouting in front of people, speaking to them was another matter entirely. 
In the band I felt at ease, in the real world I did not. Your 18th birthday is a time of great 
joy, where (in the UK at any rate) you cast aside the shackles of childhood and embrace the 
exciting new world of adult life. Your birthday cake might symbolize this happy transition 
by reflecting one of your great passions. Mine had a picture on it of a longhaired person 
who looked somewhat like me, slitting his wrists. That pretty much sums it up. Still, you 
can’t lock yourself in your bedroom with your Iron Maiden albums for ever, and soon 
enough I tried to integrate with society. Between the ages of 16 and 18 this pretty much 
involved getting drunk. I quickly discovered that getting drunk made it much easier to speak 
to people, and getting really drunk made you unconscious and then the problem of speaking 
to people went away entirely. This situation was exacerbated by the sudden presence of girls 
in my social circle. I hadn’t seen a girl since Clair Sparks; they were particularly problematic 
because not only had you to talk to them, but what you said had to be really impressive 
because then they might become your girlfriend. Also, in 1990, girls didn’t like to talk about 
Iron Maiden - they probably still don’t. Speed dating¹ didn’t exist back then, but if it had it 
would have been a sick and twisted manifestation of hell on earth for me. The idea of having 
a highly pressured social situation where you have to think of something witty and amusing 
to say or be thrown to the baying vultures of eternal loneliness would have had me injecting 
pure alcohol into my eyeballs; at least that way I could be in a coma and unable to see 
the disappointment on the faces of those forced to spend 3 minutes in my company. That’s 
what this chapter is all about: speed dating, oh, and mixed ANOVA too, but if I mention 
that you’ll move swiftly onto the next chapter when the bell rings. 


14.2. Mixed designs


If you thought that the previous chapter was bad, well, I’m about to throw 
an added complication into the mix. We can combine repeated-measures and 
independent designs, and this chapter looks at this situation. As if this wasn’t 
bad enough, I’m also going to use it as an excuse to show you a design with 
three independent variables (at this point you should imagine me leaning 
back in my chair, cross-eyed, dribbling and laughing maniacally). A mixture 
of between-group and repeated-measures variables is called a mixed design. It 
should be obvious that you need at least two independent variables for this 
type of design to be possible, but you can have more complex scenarios too 
(e.g., two between-group and one repeated-measures, one between-group and 
two repeated-measures, or even two of each). R allows you to test almost any 
design you might want to, and of virtually any degree of complexity. However, 
interaction terms are difficult enough to interpret with only two variables, so 
imagine how difficult they are if you include four. The best advice I can offer is to stick to 
three or fewer independent variables if you want to be able to interpret your interaction 
terms,² and certainly don’t exceed four unless you want to give yourself a migraine. 

1 In case speed dating goes out of fashion and no one knows what I’m going on about, the basic idea is that lots 
of men and women turn up to a venue (or just men or just women if it’s a gay night), one-half of the group sit 
individually at small tables and the remainder choose a table, get 3 minutes to impress the other person at the 
table with their tales of heteroscedastic data, then a bell rings and they get up and move to the next table. Having 
worked around all of the tables, the end of the evening is spent either stalking the person whom you fancied or 
avoiding the hideous mutant who was going on about hetero... something or other. 

2 Fans of irony will enjoy the four-way ANOVAs that I conducted in Field and Davey (1999) and Field and Moore 
(2005), to name but two examples. 

This chapter will go through an example of a mixed ANOVA. There won’t be any theory 
because really and truly you’ve probably had enough ANOVA theory by now to have 
a good idea of what’s going on (you can read this as ‘it’s too complex for me and I’m 
going to cover up my own incompetence by pretending you don’t need to know about 
it’). Essentially though, as we have seen, any ANOVA is a linear model, so when we have 
three independent variables or predictors we simply add this third predictor into the 
linear model, give it a b and remember to also include any interactions involving the new 
predictor. 
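
To make ‘any interactions involving the new predictor’ concrete, the sketch below asks R to expand a full factorial formula into its terms. It uses the three predictors of this chapter’s example (gender, looks and personality) purely as labels in a formula; no data are needed, and this is an illustration of the bookkeeping rather than the analysis itself.

```r
# Expand a full three-way factorial formula into its individual terms.
# The names are just labels in a formula; no dataset is loaded.
three_way <- terms(~ gender * looks * personality)
model_terms <- attr(three_way, "term.labels")
print(model_terms)

# Compare with the two-way model to see what the third predictor adds:
# one main effect, two two-way interactions and one three-way interaction.
two_way_terms <- attr(terms(~ looks * personality), "term.labels")
new_terms <- setdiff(model_terms, two_way_terms)
print(new_terms)
```

So going from two predictors to three takes the model from three effects to seven, which is one reason why interpretation gets hard so quickly.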

This chapter is spent looking at an example using R and then interpreting the output. In 
the process you’ll hopefully develop your understanding of interactions and how to break 
them down using contrasts. 


14.3. What do men and women look for in a partner?


The example we’re going to use in this chapter stays with the dating theme. It seems that lots 
of magazines (or perhaps it’s just my wife’s copies of Marie Claire, which I don’t read - honestly) 
go on all the time about how men and women want different things from relationships. 
The big question seems to be: are looks or personality more important? Imagine you wanted 
to put this to the test. You devised a cunning plan whereby you’d set up a speed-dating night. 
Little did the people who came along know that you’d got some of your friends to act as the 
dates. Each date varied in their attractiveness (attractive, average or ugly) and their charisma 
(charismatic, average and dull). By combining these characteristics you get nine different combinations 
and each combination was represented by one of your stooge dates. As such, your 
stooge dates were made up of nine different people. Three were extremely attractive people 
but differed in their personality: one had tons of charisma,³ one had some charisma and the 
other was as dull as this book. Another three people were of average attractiveness, and again 
differed in their personality: one was highly charismatic, one had average charisma and the 
third was a dullard. The final three were, with no offence intended to pigs, pig-ugly, and again 
one was charismatic, one had some charisma and the final poor soul was mind-numbingly 
tedious. Obviously you had two sets of stooge dates: one set was male and the other female, 
so that your participants could match up with dates of the appropriate sex. 

The participants themselves were not these nine stooges, but 10 men and 10 women who 
came to the speed-dating event that you had set up. Over the course of the evening they speed- 
dated all nine stooges of the gender that they’d normally date. After their 3-minute date, they 
rated how much they’d like to have a proper date with the person as a percentage (100% = ‘I’d 
pay large sums of money for their phone number’, 0% = ‘I’d pay a large sum of money for a 
plane ticket to get me as far away from them as possible’). As such, each participant rated nine 
different people who varied in their attractiveness and personality. So, there are two repeated- 
measures variables: looks (with three levels because the person could be attractive, average or 
ugly) and personality (again with three levels because the person could have lots of charisma, 
have some charisma or be a dullard). The people giving the ratings could be male or female, 
so we should also include the gender of the person making the ratings (male or female), and 
this, of course, will be a between-group variable. The data are in Table 14.1. 
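
As a quick way of keeping the design straight in your head, the sketch below lays out the nine repeated-measures conditions and counts how many ratings the study produces. The level labels are paraphrased from the description above, and the object names are made up for illustration.

```r
# The two repeated-measures variables, each with three levels
looks <- c("Attractive", "Average", "Ugly")
personality <- c("High charisma", "Some charisma", "Dullard")

# Every combination of looks and personality: one stooge date per row
conditions <- expand.grid(looks = looks, personality = personality)
print(conditions)

# The between-group variable: 10 men and 10 women gave the ratings
n_per_gender <- 10
genders <- c("Male", "Female")
n_participants <- n_per_gender * length(genders)

# Each participant rates all nine dates
n_ratings <- nrow(conditions) * n_participants
print(n_ratings)
```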


14.4. Entering and exploring your data


14.4.1. Packages for mixed designs in R


You can analyse a mixed design using any of the four methods that I outlined in the previous 
chapter for repeated-measures designs. As with the previous chapter, we’re 
going to stick with ezANOVA() for those of you who want to adopt an ANOVA approach 
to the data, and lme() for those of you who want to use a multilevel model (which we 
recommend). 

3 The highly attractive people with tons of charisma were, of course, taken to a remote cliff top and shot after the 
experiment because life is hard enough without having people like that floating around making you feel inadequate. 

Table 14.1 Data from LooksOrPersonality.dat (Att = attractive, Av = average, Ug = ugly) 

Looks     High Charisma     Some Charisma           Dullard
Gender    Att   Av   Ug     Att   Av   Ug     Att   Av   Ug
Male       86   84   67      88   69   50      97   48   47
Male       91   83   53      83   74   48      86   50   46
Male       89   88   48      99   70   48      90   45   48
Male       89   69   58      86   77   40      87   47   53
Male       80   81   57      88   71   50      82   50   45
Male       80   84   51      96   63   42      92   48   43
Male       89   85   61      87   79   44      86   50   45
Male      100   94   56      86   71   54      84   54   47
Male       90   74   54      92   71   58      78   38   45
Male       89   86   63      80   73   49      91   48   39
Female     89   91   93      88   65   54      55   48   52
Female     84   90   85      95   70   60      50   44   45
Female     99  100   89      80   79   53      51   48   44
Female     86   89   83      86   74   58      52   48   47
Female     89   87   80      83   74   43      58   50   48
Female     80   81   79      86   59   47      51   47   40
Female     82   92   85      81   66   47      50   45   47
Female     97   69   87      95   72   51      45   48   46
Female     95   92   90      98   64   53      54   53   45
Female     95   93   96      79   66   46      52   39   47

As with fully repeated-measures designs, you have to use commands because R 
Commander doesn’t have an interface for repeated-measures designs. You will need the 
packages ez (if you’re going to use ANOVA), ggplot2 (for graphs), nlme (if you use a multilevel 
model), pastecs (for descriptive statistics), reshape (for reshaping the data) and WRS 
(for robust tests). If you do not have these packages installed (some should be installed 
from previous chapters), you can install them by executing the following commands: 

install.packages("ez"); install.packages("ggplot2"); install.packages("nlme"); 
install.packages("pastecs"); install.packages("reshape"); 
install.packages("WRS", repos="http://R-Forge.R-project.org") 

You then need to load these packages by executing these commands: 

library(ez); library(ggplot2); library(nlme); library(pastecs); 
library(reshape); library(WRS) 


14.4.2. General procedure for mixed designs


To analyse research with a mixed design you should follow this general procedure: 

1 Enter data: which is about as awkward as it was for repeated-measures designs. 

2 Explore your data: as with repeated-measures designs, look at graphs, descriptive 
statistics and check sphericity if you’re using ANOVA (boo, hiss) rather than a multilevel 
model (hooray!). 

3 Construct or choose contrasts: you need to decide what contrasts to do and to specify 
them appropriately for all of the independent variables in your analysis. 

4 Compute the main model: you can then run the main analysis. Depending on what 
you found in the previous step, you might need to run a robust version of the test. 

5 Compute contrasts or post hoc tests: having conducted the main analysis, you can 
follow it up with post hoc tests or look at the results of your contrasts. 

We will work through these steps in turn. 


14.4.3. Entering the data



The data for the example can be found in the file LooksOrPersonality.dat. You can load 
this data file by setting your working directory to the appropriate location and executing: 

dateData<-read.delim("LooksOrPersonality.dat", header = TRUE) 

I have again structured the data in the format that you’d be most likely to use if you had 
entered the data in another software package and followed the usual conventions. The 
data have been entered in ‘wide’ format; that is, levels of the repeated-measures variable are 
spread across different columns. 

In this experiment there are nine experimental conditions and so the data have been 
entered in nine columns (so the format is identical to Table 14.1): 


att_high    Attractive + High Charisma 
av_high     Average Looks + High Charisma 
ug_high     Ugly + High Charisma 
att_some    Attractive + Some Charisma 
av_some     Average Looks + Some Charisma 
ug_some     Ugly + Some Charisma 
att_none    Attractive + Dullard 
av_none     Average Looks + Dullard 
ug_none     Ugly + Dullard 


There is also a column to indicate to which person each row of data belongs (called participant) 
and a column to indicate their gender (gender). 

As with the example in the previous chapter, although the format of the data follows 
typical conventions, because of the way R handles repeated-measures designs we need the 
data to be in the long format. We can again use the melt() function. We specify columns in 
the data that identify characteristics of the scores (such as from whom they originate and 
characteristics of that person) using the id option, and columns that identify the scores 
themselves using the measured option. In this case our scores are split over nine columns 
(att_high, av_high, ug_high, att_some, av_some, ug_some, att_none, av_none, ug_none), so 
these are our measured variables. We have two variables that remain constant for each of 
the nine scores: the participant from which the score comes and their gender. These are our 
id variables. We can create a new dataframe (called speedData) by executing: 

speedData<-melt(dateData, id = c("participant","gender"), measured = c("att_high", 
"av_high", "ug_high", "att_some", "av_some", "ug_some", "att_none", 
"av_none", "ug_none")) 

This dataframe contains four columns: the first identifies the participant, the second 
identifies their gender, the third identifies the name of the column from which the data 
originate, and the fourth contains the rating of the date. By default, these columns will be 
named participant, gender, variable, and value. The latter two labels are not helpful, so 
we’ll rename these columns so that we know what they represent by executing: 

names(speedData)<-c("participant", "gender", "groups", "dateRating") 
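To see what the reshaping and renaming steps do without loading any package, here is a minimal sketch on a made-up dataframe. The names toyWide and toyLong and all of the values are hypothetical, and base R's stack() is used in place of melt(), so this only mimics the wide-to-long step described above rather than reproducing it.

```r
# Hypothetical wide-format data: two participants, two of the nine score columns
toyWide <- data.frame(
  participant = c("P01", "P02"),
  gender      = c("Male", "Female"),
  att_high    = c(86, 91),
  av_high     = c(84, 83),
  stringsAsFactors = FALSE
)

scoreCols <- c("att_high", "av_high")

# stack() places the score columns one under the other and records,
# in 'ind', the name of the column each score came from
stacked <- stack(toyWide[, scoreCols])

# Repeat the id variables once per score column and give the
# columns the same informative names used in the text
toyLong <- data.frame(
  participant = rep(toyWide$participant, times = length(scoreCols)),
  gender      = rep(toyWide$gender, times = length(scoreCols)),
  groups      = stacked$ind,
  dateRating  = stacked$values
)

toyLong
```

Each participant now occupies one row per condition, which is the layout that the rest of the analysis relies on.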

The variable groups is a mixture of our two predictor variables (looks and personality). 
Note, for example, that the first 60 rows are scores for the high-charisma dates and, within 
these 60 rows, the first 20 are the scores for the attractive stooges, the next 20 are the 
scores for the average-looking stooges, and the final 20 are the scores for the ugly stooges.
We therefore need to create two variables that dissociate the attractiveness of the stooge
from their charisma level; these two variables will be the two predictors in our model.

First, let’s create a variable called personality, which specifies whether the date being 
rated had high, average or low charisma. We can do this using the gl() function that we 
have used in previous chapters. Execute this command: 

speedData$personality<-gl(3, 60, labels = c("Charismatic", "Average", 
"Dullard")) 

This creates a variable personality in the dataframe speedData. The numbers in the
function tell R that we want to create three sets of 60 scores; the labels option then
specifies the names to attach to these three sets, which correspond to the levels of charisma.
Essentially, this will create 60 rows with the label Charismatic, then 60 labelled Average,
then 60 labelled Dullard.

We also need a variable (called looks) that tells us how attractive the date was. To do this
we want three groups that each contain 20 scores. This will create 60 cases (3 × 20 = 60),
or, put another way, it will create the codes for the first level (Charismatic) of the
personality variable. We want this pattern to be repeated for the remaining two levels of
personality (i.e., Average and Dullard). We can do this by adding a third value to the function
that is the total number of cases (i.e., 180). By specifying the total number of cases, the gl()
function will repeat the pattern of codes until it reaches this total number of cases:

speedData$looks<-gl(3, 20, 180, labels = c("Attractive", "Average", "Ugly")) 
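A quick way to convince yourself that the two gl() calls line up as intended is to build the two factors on their own and cross-tabulate them. This is a small sketch that uses free-standing vectors rather than columns of speedData, so it can be run without the data file.

```r
# Three blocks of 60 codes: rows 1-60, 61-120, 121-180
personality <- gl(3, 60, labels = c("Charismatic", "Average", "Dullard"))

# Three blocks of 20 codes, with the pattern repeated until 180 codes exist
looks <- gl(3, 20, 180, labels = c("Attractive", "Average", "Ugly"))

# Every combination of personality and looks should occur 20 times
table(personality, looks)

# Row 61 starts the second personality block and restarts the looks pattern
personality[61]
looks[61]
```

If the third argument were omitted from the second call, looks would contain only 60 codes and could not be added to a 180-row dataframe.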

The data now look like this (edited and ordered by participant): 



    participant gender   groups dateRating personality      looks
1           P01   Male att_high         86 Charismatic Attractive
21          P01   Male  av_high         84 Charismatic    Average
41          P01   Male  ug_high         67 Charismatic       Ugly
61          P01   Male att_some         88     Average Attractive
81          P01   Male  av_some         69     Average    Average
101         P01   Male  ug_some         50     Average       Ugly
121         P01   Male att_none         97     Dullard Attractive
141         P01   Male  av_none         48     Dullard    Average
161         P01   Male  ug_none         47     Dullard       Ugly
20          P20 Female att_high         95 Charismatic Attractive
40          P20 Female  av_high         93 Charismatic    Average
60          P20 Female  ug_high         96 Charismatic       Ugly
80          P20 Female att_some         79     Average Attractive
100         P20 Female  av_some         66     Average    Average
120         P20 Female  ug_some         46     Average       Ugly
140         P20 Female att_none         52     Dullard Attractive
160         P20 Female  av_none         39     Dullard    Average
180         P20 Female  ug_none         47     Dullard       Ugly


Notice that each participant (identified by the participant variable) has a value indicating 
their gender and then nine scores (distinguished by the variables looks and personality). In 
the reformatted data the nine scores within each participant are now represented by nine 
different rows rather than nine columns as they were before. 



SELF-TEST 

Using what you learnt earlier in the chapter and the
commands that we have just used to create looks
and personality, can you work out how to enter the
data into R directly?


14.4.4. Exploring the data


As ever, we’ll look at some graphs first. To save space we’ll look just at the boxplots at this 
stage. 



SELF-TEST 

Use ggplot2 to plot boxplots of the rating of the
dates according to their level of attractiveness
(x-axis), and level of charisma (different colours) for
men and women (different plots).


The resulting plot (Figure 14.2) shows that the pattern of scores for average-looking 
dates was quite similar for males and females (their ratings decreased as the dates varied 
from charismatic to dullards). For attractive dates, males and females gave similar ratings 
except for when the date was dull (when males’ ratings remained high but females’ ratings 
dropped). For ugly dates, again the ratings were similar for men and women, except that 
women rated the charismatic dates higher than males did. 
FIGURE 14.2 Boxplots of the attitude data (x-axis: Attractiveness, with levels Attractive,
Average and Ugly; colours: Charisma, with levels Charismatic, Average and Dullard; separate
panels for male and female raters)


We have previously used the by() function and the stat.desc() function in the pastecs
package to get descriptive statistics for separate groups (see Chapter 5 for more detail). We
also saw in the previous chapter that if we want to create descriptives for a combination
of variables we can simply list all of the variables in the list() function; therefore, to get
descriptive statistics for the combined levels of gender, looks and personality we execute:

by(speedData$dateRating, list(speedData$looks , speedData$personality, 
speedData$gender), stat.desc, basic = FALSE) 

The resulting (edited) output is in Output 14.1. The output contains descriptive statistics 
(means, standard deviations, etc.) for each of the nine conditions split according to whether 
participants were male or female. These descriptive statistics are interesting because they 
show us the pattern of means across all experimental conditions (so we use these means to 
produce the graphs of the three-way interaction). 

: Attractive
: Charismatic
: Male
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      89.000       88.300        1.802        4.075       32.456        5.697        0.065

: Average
: Charismatic
: Male
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      84.000       82.800        2.215        5.011      49.0667        7.005        0.085

: Ugly
: Charismatic
: Male
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      56.500       56.800        1.812        4.100       32.844        5.731        0.101

: Attractive
: Average
: Male
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      87.500       88.500        1.815        4.106       32.944        5.740        0.065

: Average
: Average
: Male
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      71.000       71.800        1.397        3.160      19.5111        4.417        0.061

: Ugly
: Average
: Male
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      48.500       48.300        1.700        3.846       28.900        5.376        0.111

: Attractive
: Dullard
: Male
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      86.500       87.300        1.720        3.890       29.567        5.438        0.062

: Average
: Dullard
: Male
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      48.000       47.800        1.323        2.994       17.511        4.185        0.088

: Ugly
: Dullard
: Male
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      45.500       45.800        1.133        2.564       12.844        3.584        0.078

: Attractive
: Charismatic
: Female
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      89.000       89.600        2.099        4.748       44.044        6.637        0.074

: Average
: Charismatic
: Female
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      90.500       88.400        2.634        5.958       69.378        8.329        0.094

: Ugly
: Charismatic
: Female
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      86.000       86.700        1.720        3.890       29.567        5.438        0.063

: Attractive
: Average
: Female
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      86.000       87.100        2.152        4.869       46.322         6.81        0.078

: Average
: Average
: Female
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      68.000       68.900        1.882        4.258       35.433        5.953        0.086

: Ugly
: Average
: Female
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      52.000       51.200        1.724        3.901       29.733        5.453        0.107

: Attractive
: Dullard
: Female
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      51.500       51.800        1.093        2.474       11.956        3.458        0.067

: Average
: Dullard
: Female
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      48.000       47.000        1.183        2.677       14.000        3.742        0.080

: Ugly
: Dullard
: Female
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      46.500       46.100        0.971        2.197        9.433         3.07        0.067


Output 14.1 
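The columns of Output 14.1 are related to one another by simple formulas, and it can help to rebuild one row by hand. The sketch below does this for the first block (attractive, charismatic dates rated by men), starting from the reported mean and variance. It assumes there were 10 raters in each gender group; that figure is not stated in this section, but it is consistent with the reported values.

```r
# Reported values for Attractive / Charismatic / Male in Output 14.1
cellMean <- 88.3
cellVar  <- 32.456
n        <- 10   # assumption: 10 male raters

std.dev  <- sqrt(cellVar)                   # standard deviation
SE.mean  <- std.dev / sqrt(n)               # standard error of the mean
CI.half  <- qt(0.975, df = n - 1) * SE.mean # half-width of the 95% CI
coef.var <- std.dev / cellMean              # coefficient of variation

# These round to 5.697, 1.802, 4.075 and 0.065, the values in Output 14.1
round(c(std.dev = std.dev, SE.mean = SE.mean,
        CI.mean.0.95 = CI.half, coef.var = coef.var), 3)
```

Note that CI.mean.0.95 is the half-width of the interval, so the 95% confidence interval for this cell runs from the mean minus this value to the mean plus it.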


14.5. Mixed ANOVA


As with repeated-measures ANOVA, we can do a fairly easy analysis using the ezANOVA()
function. If you want to approach the analysis in this way and you plan to use Type III sums
of squares (see Jane Superbrain Box 11.1) then you have to set an orthogonal contrast for
your predictor variables, otherwise you might think you’re getting Type III sums of squares,
but actually you won’t be. Let’s do this first.

For both personality and looks we could consider the lowest categories (Dullard
and Ugly) as useful control conditions. Therefore, for personality, we could create
contrasts that compare some charisma (i.e., average and charismatic) to being
a dullard, and then compare the charismatic with the average date (Table 14.2).
Likewise, for looks, we could create a contrast that compares some attractiveness
(i.e., average and attractive combined) to being ugly, and then a second contrast that
compares the attractive with the average date (Table 14.3). These contrasts would be
orthogonal (the rationale is basically the same as for the contrasts that we encountered
in Chapter 10).

To set these orthogonal contrasts (see Chapter 10) we can first create variables
representing each contrast (which is useful mainly because you can give the contrasts
informative names), and then bind these variables together and set them as the contrast for
personality:

SomevsNone<-c(1, 1, -2)
HivsAv<-c(1, -1, 0)
contrasts(speedData$personality)<-cbind(SomevsNone, HivsAv)

The first two commands each create a variable relating to a contrast that contains the codes 
for each group from Table 14.2. The final command sets these variables to be the contrasts 
for personality. 
Table 14.2 Orthogonal contrasts for the personality variable

Group          Contrast 1    Contrast 2
Charismatic         1            -1
Average             1             1
Dullard            -2             0

Table 14.3 Orthogonal contrasts for the looks variable

Group          Contrast 1    Contrast 2
Attractive          1            -1
Average             1             1
Ugly               -2             0


We can set the contrasts for looks in the same way as for personality: we first create 
variables representing each contrast, and then bind these variables together and set them 
as the contrast for looks: 

AttractivevsUgly<-c(1, 1, -2)
AttractvsAv<-c(1, -1, 0)
contrasts(speedData$looks)<-cbind(AttractivevsUgly, AttractvsAv)

The first two commands each create a variable relating to a contrast that contains the
codes for each group from Table 14.3. The final command sets these variables to be the
contrasts for looks.
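Because the Type III sums of squares depend on the contrasts being orthogonal, it is worth checking the codes before running the model. The sketch below applies the usual two conditions to the codes from the commands above: the weights of each contrast sum to zero, and the products of the weights of any two contrasts sum to zero. The helper isOrthogonal() is a hypothetical function written for this illustration; it is not part of R or of any package.

```r
personalityContrasts <- cbind(SomevsNone = c(1, 1, -2), HivsAv = c(1, -1, 0))
looksContrasts <- cbind(AttractivevsUgly = c(1, 1, -2), AttractvsAv = c(1, -1, 0))

# Hypothetical helper: TRUE when every column sums to zero and every
# pair of columns has a zero sum of cross-products
isOrthogonal <- function(codes) {
  crossProducts <- crossprod(codes)
  all(colSums(codes) == 0) &&
    all(crossProducts[upper.tri(crossProducts)] == 0)
}

colSums(personalityContrasts)    # weights within each contrast
crossprod(personalityContrasts)  # off-diagonal entries are the cross-products

isOrthogonal(personalityContrasts)
isOrthogonal(looksContrasts)

# For comparison, dummy codes against a baseline fail the first condition
isOrthogonal(cbind(c(1, 0, 0), c(0, 0, 1)))
```

The last line shows why 0/1 baseline coding would not be suitable here: its weights do not sum to zero within each contrast.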

Having set the contrasts, we can use the ezANOVA() function in much the same way as
for a repeated-measures design. The only difference is that we can add our between-group
variable (gender) by using the between = .() option. Have a quick look back to section
13.4.7.1 to remind yourself of the format of the function, and how the between = .() option
fits into it. We can run the analysis by executing the following command:

speedModel<-ezANOVA(data = speedData, dv = .(dateRating), wid =
.(participant), between = .(gender), within = .(looks, personality), type
= 3, detailed = TRUE)
speedModel

Executing these commands creates a model called speedModel; within the ezANOVA()
function we have the following commands:

• data = speedData: This tells ezANOVA that the data are in the dataframe called 
speedData. 

• dv = .(dateRating): This tells ezANOVA that the outcome variable is dateRating. 

• wid = .(participant): This tells ezANOVA that participants can be identified using the 
variable participant. 

• between = .(gender): This tells ezANOVA that gender was measured using different 
entities/participants (i.e., it is a between-group variable). 

• within = .(looks, personality): This tells ezANOVA that looks and personality were 
measured using the same entities/participants (i.e., they are repeated-measures 
variables). 
• type = 3: This tells ezANOVA that we want Type III sums of squares.

• detailed = TRUE: This tells ezANOVA to produce a detailed output (i.e., one that
includes sums of squares).

Executing the second command ( speedModel ) simply prints the model to the console. 
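If you want to see how the between and within options translate into the structure of the analysis, one rough cross-check is base R's aov() with an Error() term. This is a different function from ezANOVA(), it gives no sphericity tests or corrections, and the sketch below runs on simulated ratings (simData is made up purely for illustration, with 10 raters per gender assumed), so only the degrees of freedom are meaningful, not the F-ratios.

```r
set.seed(42)

# Simulated long-format data with the same layout as speedData:
# 20 participants x 3 levels of looks x 3 levels of personality
simData <- expand.grid(
  looks       = factor(c("Attractive", "Average", "Ugly")),
  personality = factor(c("Charismatic", "Average", "Dullard")),
  participant = factor(sprintf("P%02d", 1:20))
)

# gender is constant within a participant (a between-group variable)
simData$gender <- factor(ifelse(as.integer(simData$participant) <= 10,
                                "Male", "Female"))

# Random ratings: there are no real effects in these data
simData$dateRating <- rnorm(nrow(simData), mean = 70, sd = 10)

# looks and personality vary within participant, so they go in Error()
simModel <- aov(dateRating ~ gender * looks * personality +
                  Error(participant/(looks * personality)),
                data = simData)
summary(simModel)

# Error degrees of freedom for each stratum
simModel[["participant"]]$df.residual
simModel[["participant:looks"]]$df.residual
simModel[["participant:looks:personality"]]$df.residual
```

The error degrees of freedom for each stratum correspond to the DFd column that ezANOVA reports for the effects tested in that stratum.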


$ANOVA
                    Effect DFn DFd      SSn  SSd        F        p p<.05      ges
               (Intercept)   1  18  846249.  760 2.00e+04 7.01e-29     * 9.94e-01
                    gender   1  18      0.2  760 4.74e-03 9.46e-01       4.07e-05
                     looks   2  36  20779.6  883 4.24e+02 9.59e-26     * 8.09e-01
              gender:looks   2  36   3944.1  883 8.04e+01 5.23e-14     * 4.45e-01
               personality   2  36  23233.6 1274 3.28e+02 7.69e-24     * 8.26e-01
        gender:personality   2  36   4420.1 1274 6.24e+01 1.97e-12     * 4.74e-01
         looks:personality   4  72   4055.3 1993 3.66e+01 1.10e-16     * 4.52e-01
  gender:looks:personality   4  72   2669.7 1993 2.41e+01 1.11e-12     * 3.52e-01

$`Mauchly's Test for Sphericity`
                    Effect     W     p p<.05
3                    looks 0.960 0.708
4             gender:looks 0.960 0.708
5              personality 0.929 0.536
6       gender:personality 0.929 0.536
7        looks:personality 0.613 0.534
8 gender:looks:personality 0.613 0.534

$`Sphericity Corrections`
                    Effect   GGe    p[GG] p[GG]<.05   HFe    p[HF] p[HF]<.05
                     looks 0.962 7.62e-25         * 1.074 9.59e-26         *
              gender:looks 0.962 1.49e-13         * 1.074 5.23e-14         *
               personality 0.934 2.06e-22         * 1.038 7.69e-24         *
        gender:personality 0.934 9.44e-12         * 1.038 1.97e-12         *
         looks:personality 0.799 9.00e-14         * 0.992 1.43e-16         *
  gender:looks:personality 0.799 1.47e-10         * 0.992 1.34e-12         *

Output 14.2 



SELF-TEST 

Output 14.2 shows the results of Mauchly’s
sphericity test. Based on what you have already
learnt, was sphericity violated?


Output 14.2 shows the results of Mauchly’s sphericity test for each of the three
repeated-measures effects in the model and their interaction with gender. None of the effects violate
the assumption of sphericity because all of the values in the column labelled p are above .05
(there are also no asterisks in the column labelled p < .05 for Mauchly’s test); therefore,
we can assume sphericity when we look at our F-statistics.

Output 14.2 shows the summary table (labelled $ANOVA) of the effects in the ANOVA. 
There are the three main effects of our predictor variables, but also three interaction effects 
involving two variables and another interaction that includes all three variables. 

Again, we need to look at the column labelled p and if the values in this column are less
than .05 for a particular effect then it is statistically significant. Working down from the
top of the table we find a non-significant effect of gender, which means that if we ignore
the attractiveness and charisma levels of the date, male and female participants did not
differ in the ratings they gave. There is a significant effect of looks, which means that if
we ignore whether the date was charismatic, and whether the rating was from a man or a
woman, then the attractiveness of a person significantly affected the ratings they received.
The looks × gender interaction is also significant, which means that although the ratings
were affected by whether the date was attractive, average or ugly, the way in which ratings
were affected by attractiveness was different in male and female raters.

Next, we find a significant effect of personality, which means that if we ignore whether
the date was attractive, and whether the rating was from a man or a woman, then the
charisma of a person significantly affected the ratings they received. The personality × gender
interaction is also significant, indicating that this effect of charisma differed in male and
female raters.

There is a significant interaction between looks and personality, which means that if we
ignore the gender of the rater, the profile of ratings across different levels of attractiveness
was different for highly charismatic dates, charismatic dates and dullards. (It is equally
true to say this the opposite way around: the profile of ratings across different levels of
charisma was different for attractive, average and ugly dates.) Just to add to the mounting
confusion, the looks × personality × gender interaction is also significant, meaning
that the looks × personality interaction was significantly different in male and female
participants.

This is all a lot to take in, so we’ll look at each of these effects in turn in subsequent
sections. First, though, we will look at how to analyse the data as a multilevel model.




SELF-TEST 

What is the difference between a main effect and an
interaction?



CRAMMING SAM’S TIPS 


Mixed ANOVA 


• Mixed designs are where you compare several means when there are two or more independent variables, and at least one of 
them has been measured using the same participants and at least one other has been measured using different participants. 

• You can analyse these designs using a traditional ANOVA framework, or as a multilevel model. 

• If you plan to look at Type III sums of squares then you must set an orthogonal contrast for all predictors before constructing 
the model. 

• If you use an ANOVA approach, then test the assumption of sphericity for the repeated-measures variable(s) when they have 
three or more conditions using Mauchly’s test. If the value in the column labelled p is less than .05 then the assumption is 
violated. You should test this assumption for all effects (if there are two or more repeated-measures variables, this means 
you test the assumption for all variables and the corresponding interaction terms). 

• For each effect in the ANOVA, if the assumption of sphericity has been met then use the p-value for the main ANOVA. If
the assumption was violated then read the p-value corrected using either the Greenhouse-Geisser (p[GG]) or Huynh-Feldt
(p[HF]) estimate of sphericity. If the p-value is less than .05 then the means of the groups are significantly different.

• Look at the means, or better still draw graphs, to help you interpret the contrasts. 
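To make the idea of a corrected p-value concrete, the sketch below applies the standard form of the Greenhouse-Geisser correction, in which both degrees of freedom are multiplied by the sphericity estimate before the F-ratio is evaluated. It uses the rounded values that Output 14.2 reports for the looks × personality interaction (F = 36.6 on 4 and 72 degrees of freedom, GGe = 0.799). Because the inputs are rounded, the results will only approximate the values printed by ezANOVA.

```r
Fvalue <- 36.6   # F for looks:personality (rounded, from Output 14.2)
DFn    <- 4      # numerator degrees of freedom
DFd    <- 72     # denominator degrees of freedom
GGe    <- 0.799  # Greenhouse-Geisser estimate of sphericity

# p-value assuming sphericity
pUncorrected <- pf(Fvalue, DFn, DFd, lower.tail = FALSE)

# Corrected p-value: the same F, evaluated on shrunken degrees of freedom
pGG <- pf(Fvalue, DFn * GGe, DFd * GGe, lower.tail = FALSE)

c(correctedDFn = DFn * GGe, correctedDFd = DFd * GGe)
c(p = pUncorrected, pGG = pGG)
```

The correction always makes the test more conservative: the smaller the estimate of sphericity, the fewer degrees of freedom remain and the larger the corrected p-value becomes.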
14.6. Mixed designs as a GLM


I outlined various advantages to analysing repeated-measures data with a multilevel model
in the previous chapter. We can extend the repeated-measures designs from the previous
chapter by simply adding the between-group variable (and any interaction terms) as an
additional predictor. Remember that when we use a multilevel model, the repeated
measures are specified in the random part of the model, so adding in a between-group predictor
does not affect this part of the model (because it is not a repeated measure). We can
literally just set up the model as we would for a repeated-measures design, but then add the
between-group predictor.


14.6.1. Setting contrasts


Before we build the model, we need to set contrasts. You might find this weird because
we already set some contrasts for use with ezANOVA. Those contrasts were set simply so
that we could get Type III sums of squares, which constrained us to use orthogonal
contrasts. If we use a multilevel model, however, we don’t have to worry
about orthogonal contrasts because we don’t need to concern ourselves with types of sums
of squares in the same way that we do for ANOVA. Therefore, I’m going to use different
contrasts to highlight how, despite the steeper learning curve, multilevel models are worth
using for repeated-measures data because they offer a much more flexible framework for
analysing your data.

If we look at the first variable, looks, there were three conditions: attractive, average
and ugly. In many ways it makes sense to compare the attractive and ugly conditions to
the average, because the average person represents the norm. This would not make sense
as an orthogonal contrast because it would mean grouping the attractive and ugly groups
together in a contrast and these groups are polar opposites (i.e., we might expect their
ratings to cancel out because presumably attractive dates receive higher ratings than ugly
dates). However, we can set up this contrast as a non-orthogonal contrast. Table 14.4 shows
how this would be done. The key is that the baseline category is coded as 0 for all contrasts
(that’s how R knows it is the baseline). So, we give our baseline group (average
attractiveness) a value of 0 in both contrasts. Then for one of the contrasts we assign a 1 to
attractive and in the other we assign a 1 to ugly. Contrast 1 compares the attractive condition to
the average condition (we can tell this because the attractive group is assigned a 1 for this
contrast), and contrast 2 compares the ugly condition to the average condition (we can tell
this because the ugly group is assigned a 1 for this contrast).

To set these contrasts (see Chapter 10) we can first create variables representing each 
contrast (which is useful in giving the contrasts informative names), and then bind these 
variables together and set them as the contrast for looks: 

AttractivevsAv<-c(1, 0, 0)
UglyvsAv<-c(0, 0, 1)
contrasts(speedData$looks)<-cbind(AttractivevsAv, UglyvsAv)

The first two commands each create a variable relating to a contrast that contains the codes 
for each group from Table 14.4. The final command sets these variables to be the contrasts 
for looks. 
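To see what these baseline contrasts estimate, it helps to fit a very simple model to numbers whose group means are obvious. The sketch below uses a made-up dataframe (toy, with invented ratings) and an ordinary linear model via lm() rather than a multilevel model, so it illustrates only how the codes are interpreted, not the analysis of the speed-dating data.

```r
# Hypothetical ratings with group means of 88 (Attractive),
# 70 (Average) and 50 (Ugly)
toy <- data.frame(
  looks  = gl(3, 4, labels = c("Attractive", "Average", "Ugly")),
  rating = c(88, 90, 86, 88,  70, 72, 68, 70,  50, 48, 52, 50)
)

# Same codes as in the text: Average is 0 in both contrasts
AttractivevsAv <- c(1, 0, 0)
UglyvsAv <- c(0, 0, 1)
contrasts(toy$looks) <- cbind(AttractivevsAv, UglyvsAv)

toyFit <- lm(rating ~ looks, data = toy)

# Intercept = mean of the baseline (70);
# looksAttractivevsAv = 88 - 70 = 18; looksUglyvsAv = 50 - 70 = -20
coef(toyFit)
```

Each coefficient is therefore a difference from the average-looking group, which is exactly the comparison that the contrast was designed to make.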

Now, let’s think about the second predictor. The personality variable also has a category
that represents the norm, and that is the dates with the average amount of charisma. Again
we could use this condition as a control against which to compare our two extremes (lots
of charisma and none whatsoever). Therefore, we could again set the contrast codes such
that charismatic is compared to average charisma in contrast 1 and dullard is compared to
average charisma in contrast 2. The codes are in Table 14.5 and you should note that we
have used the same codes that we did for looks.

Table 14.4 Non-orthogonal contrasts for the looks variable

Group          Contrast 1    Contrast 2
Attractive          1             0
Average             0             0
Ugly                0             1

Table 14.5 Non-orthogonal contrasts for the personality variable

Group          Contrast 1    Contrast 2
Charismatic         1             0
Average             0             0
Dullard             0             1

To set these contrasts we can first create variables representing each contrast and set 
them as the contrast for personality: 

HighvsAv<-c(1, 0, 0)
DullvsAv<-c(0, 0, 1)
contrasts(speedData$personality)<-cbind(HighvsAv, DullvsAv)

The first two commands each create a variable relating to a contrast that contains the codes 
for each group from Table 14.5. The final command sets these variables to be the contrasts 
for personality. 

We also have a third variable, gender. We don’t need to set an explicit contrast for this
variable because it has only two levels (male or female); therefore, the default contrast will
be fine. (With only two groups any contrast we set can only compare these two groups,
therefore setting a contrast is pointless.) If your third variable had more than two levels
then you should also set a contrast for this variable that tests the hypotheses that you have.

We can check that we have set the contrast correctly by executing the name of the
variable and looking at the contrast attribute:

speedData$looks

attr(,"contrasts")
           AttractivevsAv UglyvsAv
Attractive              1        0
Average                 0        0
Ugly                    0        1
Levels: Attractive Average Ugly

speedData$personality

attr(,"contrasts")
            HighvsAv DullvsAv
Charismatic        1        0
Average            0        0
Dullard            0        1
Levels: Charismatic Average Dullard

As you can see the codes match those in Tables 14.4 and 14.5. 
14.6.2. Building the model


We saw in the previous chapter that if we want to look at the overall main effects and 
interactions then we should build up the model one predictor at a time from a baseline that 
includes no predictors other than the intercept. We can specify the baseline model as we 
did in the previous chapter: 

baseline<-lme(dateRating ~ 1, random = ~1|participant/looks/personality,
data = speedData, method = "ML")

Compare this model with the one that we used for a factorial repeated-measures design in
the previous chapter. Apart from the variables we’re using, it is exactly the same: we have
specified the model as the outcome predicted only from the intercept (dateRating ~ 1),
specified the relevant dataframe (data = speedData), and asked to use maximum likelihood
to estimate the model (method = "ML"). The random part of the model reflects the fact that
there are two repeated-measures predictors: random = ~1|participant/looks/personality
tells R that the variables looks and personality are nested within the variable participant
(in other words, scores for levels of these variables can be found within each participant).
Execute the above command to create the baseline model.

To see the overall effect of each main effect and interaction we need to add them to the
model one at a time. To add looks to the model we could just change the model from
dateRating ~ 1 to dateRating ~ looks. In other words, execute:

looksM<-lme(dateRating ~ looks, random = ~1|participant/looks/personality,
data = speedData, method = "ML")

However, it is quicker to use the update() function (see R’s Souls’ Tip 7.2):

looksM<-update(baseline, .~. + looks)

This command takes the model called baseline (which we have already created), and the
.~. means keep the outcome and predictors the same as the baseline model (the dots mean ‘keep
the same’, so the fact that we put dots on both sides of the ~ means that we want to keep
both the outcome and predictors the same as in the baseline model). The ‘+ looks’ means
‘add looks as a predictor’. Therefore, ‘.~. + looks’ can be interpreted as ‘keep the same
outcomes and predictors as the baseline model but add looks as a predictor’. Executing this
command creates a model called looksM that includes only looks as a predictor.
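The way update() rewrites a formula is easiest to see on a small model. The sketch below uses lm() on a made-up dataframe (toy and its values are hypothetical) instead of lme() on the speed-dating data, because the behaviour of the dots is the same and the example then needs nothing beyond base R.

```r
# Hypothetical data: two ratings at each level of looks
toy <- data.frame(
  rating = c(88, 90, 70, 72, 50, 48),
  looks  = gl(3, 2, labels = c("Attractive", "Average", "Ugly"))
)

# Intercept-only model, analogous to the baseline model in the text
baselineToy <- lm(rating ~ 1, data = toy)

# '.~.' keeps the existing outcome and predictors; '+ looks' adds a predictor
looksToy <- update(baselineToy, .~. + looks)

formula(baselineToy)   # rating ~ 1
formula(looksToy)      # rating ~ looks
```

Everything else about the original call (the data and any other options) is carried over unchanged, which is why update() is less error-prone than retyping the full command for each model.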

In a similar way we can add personality to the model as a predictor. 

personalityM<-update(looksM, .~. + personality)

This command takes the model called looksM (which we have just created); as before, the
.~. means keep the outcome and predictors the same as in looksM, and the ‘+ personality’
adds personality as a predictor. Therefore, ‘.~. + personality’ can be interpreted as ‘keep
the same outcomes and predictors as looksM but add personality as a predictor’. Executing
this command creates a model called personalityM that includes both looks and personality
as predictors.

We can add gender to the model in exactly the same way:

genderM<-update(personalityM, .~. + gender)

This command takes the previous model (personalityM), keeps all of the same predictors
and outcomes (.~.) and adds gender (+ gender). Therefore, executing this command creates
a model called genderM that includes looks, personality and gender as predictors.

We also need to add in the interactions between pairs of variables (the two-way
interactions). There are three of these interactions made up from all of the combinations of
the three main effects: looks × personality, looks × gender, and personality × gender.
Remember that in R an interaction is written using a colon, so the looks × personality
interaction can be specified as looks:personality.

We can add these interactions one at a time using the update command. Each time we 
create a new model that contains all of the terms from the previous model but adds in an 
interaction: 

looks_gender<-update(genderM, .~. + looks:gender)
personality_gender<-update(looks_gender, .~. + personality:gender)
looks_personality<-update(personality_gender, .~. + looks:personality)

Note that the model called looks_gender is created by taking the genderM model (which
contains all of the main effects) and adding the looks × gender interaction to it. Similarly,
the personality_gender model is created by taking the looks_gender model and adding the
personality × gender interaction to it. Hopefully you get the gist.

We also need to include the interaction of all three variables, which would be written in
R as looks:personality:gender. Again, we do this with the update command. We take the
model with all of the main effects and two-way interactions (looks_personality) and add in
the three-way interaction:

speedDateModel<-update(looks_personality, .~. + looks:personality:gender)

Executing this command creates a model called speedDateModel, which contains all main 
effects and interactions. This is the final model. 
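One way to keep track of what the final model contains is to count the fixed-effect parameters that each term contributes. The sketch below builds the design matrix for the full set of main effects and interactions on a made-up grid with one row per combination of the three variables (design is hypothetical and holds no ratings). R's default dummy coding is used here; the number of columns per term is the same whichever contrasts are set.

```r
# One row for each of the 3 x 3 x 2 combinations of the predictors
design <- expand.grid(
  looks       = gl(3, 1, labels = c("Attractive", "Average", "Ugly")),
  personality = gl(3, 1, labels = c("Charismatic", "Average", "Dullard")),
  gender      = gl(2, 1, labels = c("Male", "Female"))
)

# looks * personality * gender expands to all main effects and interactions
fullFormula <- ~ looks * personality * gender
X <- model.matrix(fullFormula, data = design)

# 'assign' records which term each column belongs to (0 = intercept)
termLabels <- c("(Intercept)", attr(terms(fullFormula), "term.labels"))
columnsPerTerm <- table(factor(attr(X, "assign"),
                               levels = seq_along(termLabels) - 1,
                               labels = termLabels))
columnsPerTerm
ncol(X)
```

Each main effect contributes one column per contrast (2 for looks, 2 for personality, 1 for gender), and each interaction contributes the product of the columns of the terms it combines, giving 18 fixed-effect parameters in total including the intercept.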

To compare these models we can list them in the order in which we want them compared 
in the anova() function: 

anova(baseline, looksM, personalityM, genderM, looks_gender,
personality_gender, looks_personality, speedDateModel)

Executing the above command produces Output 14.3, which first compares the effect of
looks to the baseline (i.e., no predictors). By adding looks as a predictor we increase the
degrees of freedom by 2 (the two contrasts that we used to code this variable) and
significantly improve the model. In other words, the attractiveness of the date had a significant
effect on ratings, χ²(2) = 68.30, p < .0001. Next, we see the effect of adding the main effect
of personality into the model (compared to the previous model that contained only the
effect of looks). Again the degrees of freedom are increased by 2 (the two contrasts used to
code this variable) and the fit of the model is significantly improved; the personality of the
date had a significant effect on attitudes, χ²(2) = 138.76, p < .0001. The next model tells
us whether adding gender improved the fit of the model; it did not, indicating that gender
did not have a significant overall effect on ratings, χ²(1) = 0.002, p = .966. This effect adds
only 1 degree of freedom because it was coded with a single contrast.



                   Model df      AIC      BIC    logLik   Test   L.Ratio p-value
baseline               1  5 1575.766 1591.730 -782.8829
looksM                 2  7 1511.468 1533.819 -748.7343 1 vs 2  68.29719  <.0001
personalityM           3  9 1376.704 1405.441 -679.3520 2 vs 3 138.76442  <.0001
genderM                4 10 1378.702 1410.632 -679.3511 3 vs 4   0.00180  0.9662
looks_gender           5 12 1343.161 1381.477 -659.5808 4 vs 5  39.54079  <.0001
personality_gender     6 14 1289.198 1333.899 -630.5988 5 vs 6  57.96394  <.0001
looks_personality      7 18 1220.057 1277.530 -592.0283 6 vs 7  77.14102  <.0001
speedDateModel         8 22 1148.462 1218.707 -552.2309 7 vs 8  79.59473  <.0001


Output 14.3 
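The L.Ratio and p-value columns of Output 14.3 can be reproduced by hand, which is a useful check on what anova() is doing. The sketch below is not part of the original analysis: it simply copies the log-likelihoods and parameter counts from the output (the vector names logLik_values and model_df are made up for illustration), doubles the change in log-likelihood between successive models, and refers that value to a chi-square distribution with degrees of freedom equal to the number of parameters added.

```r
# Log-likelihoods and parameter counts copied from Output 14.3
logLik_values <- c(baseline = -782.8829, looksM = -748.7343,
                   personalityM = -679.3520, genderM = -679.3511,
                   looks_gender = -659.5808, personality_gender = -630.5988,
                   looks_personality = -592.0283, speedDateModel = -552.2309)
model_df <- c(5, 7, 9, 10, 12, 14, 18, 22)

# Likelihood ratio: twice the change in log-likelihood between successive models
lr <- 2 * diff(logLik_values)

# Degrees of freedom for each test: the number of parameters added
df_change <- diff(model_df)

# Upper-tail chi-square probability for each comparison
p <- pchisq(lr, df = df_change, lower.tail = FALSE)

lr_table <- data.frame(L.Ratio = round(lr, 3), df = df_change, p = signif(p, 3))
print(lr_table)
```

Each row of the printed table is labelled by the more complex model in the comparison, so the looksM row corresponds to the "1 vs 2" test in Output 14.3.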


The next model (looks_gender) shows that the looks × gender interaction is also
significant, χ²(2) = 39.54, p < .0001. This interaction adds 2 degrees of freedom (because
looks is coded with two contrasts and gender only one, so we get 2 × 1 = 2 df). This
significant interaction means that although the ratings were affected by whether the date was
attractive, average or ugly, the way in which ratings were affected by attractiveness was
different in male and female raters.

The next model (personality_gender) shows that the personality × gender interaction is
also significant, χ²(2) = 57.96, p < .0001, indicating that the effect of charisma differed in
male and female raters. This interaction adds 2 degrees of freedom (because personality is
coded with two contrasts and gender only one, so we get 2 × 1 = 2 df).

The next model (looks_personality) tells us that there is a significant interaction between
looks and personality, χ²(4) = 77.14, p < .0001. This interaction adds 4 degrees of freedom
(because it is made up of two variables each coded with two contrasts, so we get 2 × 2 = 4 df).
This interaction term means that if we ignore the gender of the rater, the profile of ratings
across different levels of attractiveness was different for highly charismatic dates, charismatic
dates and dullards. (It is equally true to say this the opposite way around: the profile of
ratings across different levels of charisma was different for attractive, average and ugly dates.)

The final model (speedDateModel) shows that the looks × personality × gender
interaction is also significant, χ²(4) = 79.59, p < .0001, meaning that the looks × personality
interaction was significantly different in male and female participants. This interaction adds 4
degrees of freedom (because personality is coded with two contrasts, looks is also coded
with two contrasts, and gender with only one, so we get 2 × 2 × 1 = 4 df).
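The degrees-of-freedom bookkeeping in the last few paragraphs follows one rule: a term adds as many parameters as the product of the numbers of contrasts of the variables it involves. As a rough sketch (the helper term_df and the list model_terms are hypothetical names, and the baseline value of 5 is simply read from Output 14.3), the df column of that output can be rebuilt like this:

```r
# Number of contrasts used to code each predictor (from the text)
n_contrasts <- c(looks = 2, personality = 2, gender = 1)

# df added by a term = product of the contrast counts of its variables
term_df <- function(variables) prod(n_contrasts[variables])

# Terms in the order in which the models were built
model_terms <- list(
  looksM             = "looks",
  personalityM       = "personality",
  genderM            = "gender",
  looks_gender       = c("looks", "gender"),
  personality_gender = c("personality", "gender"),
  looks_personality  = c("looks", "personality"),
  speedDateModel     = c("looks", "personality", "gender")
)

added_df <- sapply(model_terms, term_df)

# The baseline model has 5 parameters according to Output 14.3
baseline_df <- 5
cumulative_df <- baseline_df + cumsum(added_df)
print(rbind(added = added_df, total = cumulative_df))
```

The totals should agree with the df column of Output 14.3, ending at 22 for the full model.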

These results confirm the findings from the ANOVA and just demonstrate really that 
there are two ways to skin this data analysis cat. The end results are the same. However, the 
multilevel model approach has the advantages that (1) we don’t need to concern ourselves 
with sphericity, and (2) we can now break down these very complicated effects by looking 
at the model parameters (which reflect the contrasts that we used to code the predictor 
variables). We can see the model parameters by executing: 

summary(speedDateModel) 

Output 14.4 shows the parameter estimates for the model (I’ve edited some of the names 
to save space and put some spaces in the table to try to group related contrasts together). 

Linear mixed-effects model fit by maximum likelihood
Fixed effects: dateRating ~ looks + personality + gender + looks:personality +
               looks:gender + personality:gender + looks:personality:gender

                                Value Std.Error  DF  t-value p-value
(Intercept)                      68.9  1.740866 108 39.57800  0.0000

AttractivevsAv                   18.2  2.400632  36  7.58134  0.0000
UglyvsAv                        -17.7  2.400632  36 -7.37306  0.0000

HighvsAv                         19.5  2.400632 108  8.12286  0.0000
DullvsAv                        -21.9  2.400632 108 -9.12260  0.0000

gender                            2.9  2.461957  18  1.17792  0.2542

AttractivevsAv:gender            -1.5  3.395006  36 -0.44183  0.6613
UglyvsAv:gender                  -5.8  3.395006  36 -1.70839  0.0962

HighvsAv:gender                  -8.5  3.395006 108 -2.50368  0.0138
DullvsAv:gender                  -2.1  3.395006 108 -0.61856  0.5375

AttractivevsAv:HighvsAv         -17.0  3.395006 108 -5.00736  0.0000
UglyvsAv:HighvsAv                16.0  3.395006 108  4.71280  0.0000
AttractivevsAv:DullvsAv         -13.4  3.395006 108 -3.94697  0.0001
UglyvsAv:DullvsAv                16.8  3.395006 108  4.94845  0.0000

AttractivevsAv:HighvsAv:gender    5.8  4.801263 108  1.20802  0.2297
UglyvsAv:HighvsAv:gender        -18.5  4.801263 108 -3.85315  0.0002
AttractivevsAv:DullvsAv:gender   36.2  4.801263 108  7.53968  0.0000
UglyvsAv:DullvsAv:gender          4.7  4.801263 108  0.97891  0.3298

Output 14.4

FIGURE 14.3

Error bar graph of the main effect of gender


14.6.3. The main effect of gender


We saw in Output 14.3 that gender did not have a significant overall effect on ratings of the
dates, χ²(1) = 0.002, p = .966. This effect tells us that if we ignore all other variables, male
participants’ ratings were basically the same as those of female participants.



SELF-TEST

Using ggplot2 and stat.desc(), plot an error bar graph
and get the means for the main effect of gender.


Output 14.5 is a table of means for the main effect of gender with the associated
standard errors. This information is plotted in Figure 14.3. It is clear from this graph that
men’s and women’s ratings were generally the same when we ignore the other predictors.
However, remember that because there are significant interactions involving this main
effect we shouldn’t really interpret it (because the higher-order interactions supersede it).
speedData$gender: Male

      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      71.000       68.600        1.961        3.896      346.018       18.602        0.271


speedData$gender: Female

      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      67.500       68.533        2.036        4.046      373.218       19.319        0.282

Output 14.5 
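It can help to see how the columns of these stat.desc() tables relate to one another. The sketch below rebuilds SE.mean, CI.mean.0.95 and coef.var from the mean and standard deviation in Output 14.5. It assumes that each gender contributes n = 90 ratings (10 raters each rating 9 dates); that figure is inferred from the degrees of freedom in Output 14.4 rather than stated in this section, so treat it as an assumption. The function name describe_from_summary is hypothetical.

```r
# Assumed number of ratings per gender (10 raters x 9 dates); not stated in this section
n <- 90

describe_from_summary <- function(mean_value, sd_value, n, level = 0.95) {
  se <- sd_value / sqrt(n)                         # standard error of the mean
  ci <- qt(0.5 + level / 2, df = n - 1) * se       # half-width of the confidence interval
  cv <- sd_value / mean_value                      # coefficient of variation
  c(SE.mean = se, CI.mean = ci, coef.var = cv)
}

# Means and standard deviations copied from Output 14.5
male   <- describe_from_summary(68.600, 18.602, n)
female <- describe_from_summary(68.533, 19.319, n)

print(round(rbind(Male = male, Female = female), 3))
```

Under that assumption the values agree with Output 14.5 to rounding error, which also shows that CI.mean.0.95 is the half-width of the interval, not the interval itself: the 95% confidence interval for the men is roughly 68.6 ± 3.9.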


14.6.4. The main effect of looks



SELF-TEST

Based on the previous section and what you have
learned in previous chapters, can you interpret the
main effect of looks?


We came across the main effect of looks in Output 14.3. Now we’re going to have a look
at what this effect means. We can report that the attractiveness of the date had a significant
effect on ratings, χ²(2) = 68.30, p < .0001. This effect tells us that if we ignore all other
variables, ratings were different for attractive, average and unattractive dates.



SELF-TEST

Using ggplot2 and stat.desc(), plot an error bar graph
and get the means for the main effect of looks.



speedData$looks: Attractive

      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
       86.00        82.10         1.90         3.81       217.52        14.75         0.18

speedData$looks: Average

      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      70.000       67.783        2.181        4.364      285.359       16.893        0.249

speedData$looks: Ugly

      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      50.000       55.817        1.957        3.917      229.881       15.162        0.272


Output 14.6 

Output 14.6 is a table of means for the main effect of looks with the associated standard 
errors. To make things easier, this information is plotted in Figure 14.4. You can see that 
as attractiveness falls, the mean rating falls too. So this main effect seems to reflect that 
the raters were more likely to express a greater interest in going out with attractive people 
than average or ugly people. However, we really need to look at some contrasts to find out 
exactly what’s going on. 
FIGURE 14.4

Error bar graph of the main effect of looks (x-axis: Attractiveness, with levels
Attractive, Average and Ugly)


Output 14.4 shows the contrasts that we requested. The first contrast that we set
(AttractivevsAv) shows that attractive dates were rated significantly higher than average
dates, b = 18.2, t(36) = 7.58, p < .001. The second contrast (UglyvsAv) shows that
average dates were rated significantly higher than ugly ones, b = −17.7, t(36) = −7.37, p < .001.
Remember that because there are significant interactions involving looks, we shouldn’t
really interpret the main effect (because the higher-order interactions supersede it).
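The t-values reported for these contrasts are just the estimates divided by their standard errors, and the p-values come from a t-distribution with the degrees of freedom shown in the DF column. The following sketch, using values copied from Output 14.4 (the object names are illustrative only), reproduces the two looks contrasts:

```r
# Estimates, standard error and degrees of freedom copied from Output 14.4
b  <- c(AttractivevsAv = 18.2, UglyvsAv = -17.7)
se <- 2.400632
df <- 36

# t-statistic: estimate divided by its standard error
t_value <- b / se

# Two-tailed p-value from the t-distribution
p_value <- 2 * pt(abs(t_value), df = df, lower.tail = FALSE)

print(data.frame(b = b, t = round(t_value, 2), p = signif(p_value, 2)))
```

The t-values come out at about 7.58 and −7.37, matching the output; both p-values are far below .001. Because the estimates in Output 14.4 are printed to one decimal place, small discrepancies in later decimals are to be expected.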


14.6.5. The main effect of personality


The main effect of personality is in Output 14.3. We can report that there was a significant
main effect of charisma, χ²(2) = 138.76, p < .0001. This effect tells us that if we ignore
all other variables, ratings were different for highly charismatic, averagely charismatic and
dullard people.




SELF-TEST

Using ggplot2 and stat.desc(), plot an error bar
graph and get the means for the main effect of
personality.


Output 14.7 and Figure 14.5 show that as charisma declines, the mean rating falls too. 
So this main effect seems to reflect that the raters were more likely to express a greater 
interest in going out with charismatic people than average people or dullards. 
speedData$personality: Charismatic

      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      86.000       82.100        1.704        3.409      174.193       13.198        0.161


speedData$personality: Average

      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
       71.00        69.30         2.15         4.30       276.96        16.64         0.24


speedData$personality: Dullard

      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      48.000       54.300        2.000        4.002      240.010       15.492        0.285

Output 14.7 


FIGURE 14.5

Error bar graph of the main effect of personality (x-axis: Charisma, with levels
Charismatic, Average and Dullard)


Output 14.4 shows the contrasts that we requested. The first contrast that we set
(HighvsAv) shows that highly charismatic dates were rated significantly higher than dates
with average charisma, b = 19.5, t(108) = 8.12, p < .001. The second contrast (DullvsAv)
shows that dates with average charisma were rated significantly higher than dullards, b =
−21.9, t(108) = −9.12, p < .001. Remember that because there are significant interactions
involving personality, we shouldn’t really interpret the main effect (because the
higher-order interactions supersede it).


14.6.6. The interaction between gender and looks


Output 14.3 indicated that gender interacted in some way with the attractiveness of the
date. We can report that there was a significant interaction between the attractiveness
of the date and the gender of the participant, χ²(2) = 39.54, p < .0001. This effect tells
us that the profile of ratings across dates of different attractiveness was different for
men and women.




SELF-TEST

Using ggplot2 and stat.desc(), plot a line graph and
get the means for the looks × gender interaction.


: Attractive
: Male

      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      88.000       88.033        0.996        2.037       29.757        5.455        0.062

: Average
: Male

      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      71.000       67.467        2.873        5.876      247.637       15.736        0.233

: Ugly
: Male

      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      48.500       50.300        1.239        2.535       46.079        6.788        0.135

: Attractive
: Female

      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      82.500       76.167        3.366        6.885      339.937       18.437        0.242

: Average
: Female

      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      67.500       68.100        3.330        6.811      332.714       18.240        0.268

: Ugly
: Female

      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      52.500       61.333        3.458        7.072      358.644       18.938        0.309


Output 14.8 

The means and interaction graph (Output 14.8 and Figure 14.6) show the meaning
of this result. The graph shows the average male ratings of dates of different
attractiveness ignoring how charismatic the date was (blue line). The women’s scores are shown
as a black line. The graph clearly shows that male and female ratings are very similar for
average-looking dates, but men give higher ratings than women for attractive dates (i.e.,
they’re really keen to go out with these people), whereas women express more interest in going
out with ugly people than men do. In general, this interaction seems to suggest that men’s
interest in dating a person is more influenced by their looks than women’s is. Although
both males’ and females’ interest decreases as attractiveness decreases, this decrease is more
pronounced for men. This interaction can be clarified using the contrasts in Output 14.4.
However, we wouldn’t normally interpret this interaction because the significant
higher-order three-way interaction supersedes it.
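One way to put numbers on "more pronounced for men" is to compute a difference of differences from the cell means in Output 14.8: how much each gender's mean rating changes relative to average-looking dates, and then how much that change differs between men and women. The sketch below does this with the means copied from the output; the object names are illustrative only.

```r
# Cell means copied from Output 14.8 (rows: looks, columns: gender of rater)
cell_means <- matrix(
  c(88.033, 67.467, 50.300,    # Male:   Attractive, Average, Ugly
    76.167, 68.100, 61.333),   # Female: Attractive, Average, Ugly
  nrow = 3,
  dimnames = list(looks = c("Attractive", "Average", "Ugly"),
                  gender = c("Male", "Female"))
)

# Change in mean rating relative to average-looking dates, within each gender
change_from_average <- sweep(cell_means, 2, cell_means["Average", ])

# Difference of differences: male change minus female change
looks_by_gender <- change_from_average[, "Male"] - change_from_average[, "Female"]

print(round(change_from_average, 2))
print(round(looks_by_gender[c("Attractive", "Ugly")], 2))
```

The two values are about 12.5 and −10.4. They differ from the b-values for AttractivevsAv:gender and UglyvsAv:gender in Output 14.4 because those estimates come from a model that also contains the three-way interaction; they do agree with the estimates from the looks_personality model shown in Jane Superbrain Box 14.1, which omits that term.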
FIGURE 14.6

Graph of the interaction between looks and gender (x-axis: Attractiveness, with levels
Attractive, Average and Ugly)


14.6.6.1. Looks × gender interaction 1:
attractive vs. average, male vs. female

The first contrast for the looks × gender interaction term (AttractivevsAv:gender) compares
male and female ratings of attractive relative to average-looking dates. This contrast is
not significant, b = −1.5, t(36) = −0.44, p = .661. This result tells us that the increased
interest in attractive dates compared to average-looking dates found for men is not
significantly more than for women. So, in Figure 14.6 the slope of the blue line (men) between
the attractive dates and average dates is not significantly steeper than the black line (women). We can
conclude that the preferences for attractive dates, compared to average-looking dates, are
similar for males and females.


14.6.6.2. Looks × gender interaction 2:
ugly vs. average, male vs. female

The second contrast (UglyvsAv:gender) compares male and female ratings of ugly relative
to average-looking dates. This contrast is not significant, b = −5.8, t(36) = −1.71, p = .096,
which suggests that the decreased interest in ugly dates compared to average-looking dates
found for male raters is not significantly different from that for female raters. In Figure 14.6 the
slope of the blue line (men) between the ugly dates and average dates is not significantly steeper than
the black line (women). We can conclude that the preferences for average-looking dates,
compared to ugly dates, are similar for males and females.
14.6.7. The interaction between gender and personality


Gender interacted with how charismatic the date was (Output 14.3). We can report that
there was a significant interaction between the charisma of the date and the gender
of the participant, χ²(2) = 57.96, p < .0001. This effect suggests that the profile of ratings
across dates of different levels of charisma was different for men and women.




SELF-TEST

Using ggplot2 and stat.desc(), plot a line graph
and get the means for the personality × gender
interaction.


: Charismatic
: Male

      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
       82.00        75.97         2.77         5.67       230.72        15.19         0.20

: Average
: Male

      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      71.000       69.533        3.197        6.538      306.533       17.508        0.252

: Dullard
: Male

      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
       49.00        60.30         3.63         7.43       396.36        19.91         0.33

: Charismatic
: Female

      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
     89.0000      88.2333       1.2361       2.5282      45.8402       6.7705       0.0767

: Average
: Female

      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      68.000       69.067        2.926        5.984      256.823       16.026        0.232

: Dullard
: Female

      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
     48.0000      48.3000       0.7629       1.5602      17.4586       4.1784       0.0865


Output 14.9 


The means tell us the meaning of this interaction (see Figure 14.7 and Output 14.9). The
graph shows the average male ratings of dates of different levels of charisma, ignoring how
attractive they were (blue line). The women’s scores are shown as a black line. The graph
shows almost the reverse of the pattern for the attractiveness data; again male and female
ratings are very similar for dates with average amounts of charisma, but this time men show
more interest in dates who are dullards than women do, and women show slightly more
interest in very charismatic dates than men do. In general, this interaction seems to
suggest that women’s interest in dating a person is more influenced by their charisma than
men’s is. Although both males’ and females’ interest decreases as charisma decreases, this
decrease is more pronounced for females. This interaction can be clarified using the
contrasts in Output 14.4. However, we wouldn’t normally interpret this interaction because
the significant higher-order three-way interaction supersedes it.

FIGURE 14.7

Graph of the interaction between charisma and gender (x-axis: Charisma, with levels
Charismatic, Average and Dullard)
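The claim that the decline is steeper for women can again be checked against the cell means, this time from Output 14.9. This is a minimal sketch (the helper gender_gap is a hypothetical name) that computes, for each level of charisma, the male change from average charisma minus the female change:

```r
# Cell means copied from Output 14.9
male   <- c(Charismatic = 75.97,   Average = 69.533, Dullard = 60.30)
female <- c(Charismatic = 88.2333, Average = 69.067, Dullard = 48.30)

# Male change from average charisma minus female change from average charisma
gender_gap <- function(level) {
  (male[[level]] - male[["Average"]]) - (female[[level]] - female[["Average"]])
}

personality_by_gender <- c(HighvsAv = gender_gap("Charismatic"),
                           DullvsAv = gender_gap("Dullard"))
print(round(personality_by_gender, 2))
```

The values are roughly −12.7 and 11.5: moving from average to high charisma raises women's ratings by about 13 points more than men's, and moving from average charisma to dullard lowers women's ratings by about 12 points more than men's. As before, these raw differences are not the same as the b-values in Output 14.4, which are estimated in the presence of the three-way interaction.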


14.6.7.1. Personality × gender interaction 1: high vs. some
charisma, male vs. female

The first contrast for this interaction term (HighvsAv:gender) looks at high charisma
compared to average charisma, comparing male and female scores. This contrast is significant,
b = −8.5, t(108) = −2.50, p = .014. This result tells us that the increased interest in highly
charismatic dates compared to averagely charismatic dates found for women is significantly
more than for men. So, in Figure 14.7 the slope of the black line (women) between the
charismatic dates and dates with average charisma is steeper than the equivalent blue line
(men). We can conclude that the preferences for very charismatic dates, compared to
averagely charismatic dates, are significantly greater for females than males.


14.6.7.2. Personality × gender interaction 2:
dullard vs. some charisma, male vs. female

The second contrast for this interaction term (DullvsAv:gender) looks at differences in male
and female ratings of dullards compared to dates with average charisma. This contrast is
not significant, b = −2.1, t(108) = −0.62, p = .538. This result tells us that the decreased
interest in dull dates compared to averagely charismatic dates found for women is not
significantly more than for men. So, in Figure 14.7 the slope of the black line (females)
between dates with some charisma and dullard dates is not significantly steeper than the
corresponding blue line (males). We can conclude that the preferences for dates with some
charisma over dullards are similar for females and males.


14.6.8. The interaction between looks and personality


Output 14.3 indicated that the attractiveness of the date interacted in some way with how
charismatic the date was. We can report that there was a significant interaction between the
attractiveness of the date and the charisma of the date, χ²(4) = 77.14, p < .0001. This effect
tells us that the profile of ratings across dates of different levels of charisma was different
for attractive, average and ugly dates.




SELF-TEST

Using ggplot2 and stat.desc(), plot a line graph
and get the means for the looks × personality
interaction.


The means tell us the meaning of this interaction (see Output 14.10 and Figure 14.8).
The graph shows the average ratings of dates of different levels of attractiveness when the
date also had high levels of charisma (black), some charisma (light blue) and no charisma
(blue). Look first at the difference between attractive and average-looking dates. The interest
in highly charismatic dates doesn’t change (the line is more or less flat between these two
points), but for dates with some charisma or no charisma interest levels decline. So, if you
have lots of charisma you can get away with being average-looking and people will still want
to date you. Now, if we look at the difference between average-looking and ugly dates, a
different pattern is observed. For dates with no charisma (blue) there is little difference between
ugly and average people (so if you’re a dullard you have to be really attractive before people
want to date you). However, for those with charisma, there is a decline in interest if you’re
ugly (so if you’re ugly, having charisma won’t help you much). This interaction is very
complex, but we can break it down using the contrasts in Output 14.4. However, we wouldn’t
normally interpret this interaction because the significant higher-order three-way interaction
supersedes it.

: Attractive
: Charismatic

      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
     89.0000      88.9500       1.3543       2.8345      36.6816       6.0565       0.0681

: Average
: Charismatic

      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
     86.5000      85.6000       1.7938       3.7546      64.3579       8.0223       0.0937

: Ugly
: Charismatic

      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      73.000       71.750        3.639        7.616      264.829       16.274        0.227

: Attractive
: Average

      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
     86.5000      87.8000       1.3795       2.8874      38.0632       6.1695       0.0703

: Average
: Average

      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
     71.0000      70.3500       1.1883       2.4871      28.2395       5.3141       0.0755

: Ugly
: Average

      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
       49.50        49.75         1.22         2.56        29.99         5.48         0.11

: Attractive
: Dullard

      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      68.000       69.550        4.191        8.772      351.313       18.743        0.269

: Average
: Dullard

      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
      48.000       47.400        0.869        1.818       15.095        3.885        0.082

: Ugly
: Dullard

      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var
     46.0000      45.9500       0.7272       1.5220      10.5763       3.2521       0.0708


Output 14.10

FIGURE 14.8

Graph of the interaction between looks and personality (x-axis: Attractive, Average
and Ugly dates; separate lines for Charismatic, Average Charisma and Dullard)
14.6.8.1. Looks × personality interaction 1: attractive vs.
average, high charisma vs. some charisma

The first contrast for this interaction term (AttractivevsAv:HighvsAv) investigates ratings of
attractive compared to average-looking dates, when comparing charismatic dates to those
with average charisma. This is like asking: is the difference between high charisma and
average charisma the same for attractive people and average-looking people? The best way
to understand what this contrast is testing is to extract the relevant bit of the interaction
graph, which I have done in Figure 14.9. If you look at this you can see that the interest
(as indicated by high ratings) in attractive dates was the same regardless of whether they
had high or average charisma. However, for average-looking dates, there was more interest
when that person had high charisma rather than average. The contrast is highly significant,
b = −17.0, t(108) = −5.01, p < .001, and tells us that as dates become less attractive there is
a greater decline in interest when charisma is average compared to when charisma is high.


14.6.8.2. Looks × personality interaction 2: ugly vs. average,
high charisma vs. some charisma

The second contrast for this interaction term (UglyvsAv:HighvsAv) investigates ratings of
ugly compared to average-looking dates when comparing charismatic to average-charisma
dates. This is like asking: is the difference between high charisma and average charisma the
same for ugly people and average-looking people? I have again extracted the relevant bit of
the interaction graph (Figure 14.10). You can see that the interest (as indicated by high
ratings) decreases from average-looking dates to ugly ones in both high- and some-charisma
dates; however, this fall is slightly greater in the average-charisma dates (the light blue line
is slightly steeper). The contrast is significant, b = 16.0, t(108) = 4.71, p < .001, and tells
us that as dates become less attractive there is a greater decline in interest when charisma
is average compared to when charisma is high.


FIGURE 14.9

Graph displaying looks × personality interaction 1: attractive vs. average, high
charisma vs. some charisma


FIGURE 14.10

Graph displaying looks × personality interaction 2: ugly vs. average, high charisma vs.
some charisma
14.6.8.3. Looks × personality interaction 3: attractive vs.
average, dullard vs. some charisma

The third contrast for this interaction term (AttractivevsAv:DullvsAv) investigates ratings
of attractive compared to average-looking dates, when comparing dullards to dates with
average charisma. This is like asking: is the difference between no charisma and average
charisma the same for attractive people and average-looking people? Again, the best way to
understand what this contrast is testing is to extract the relevant bit of the interaction graph
(see Figure 14.11). If you look at this you can see that the interest (as indicated by high
ratings) in attractive dates was higher when they had some charisma than when they were
a dullard. The same is also true for average-looking dates. In fact the two lines are fairly
parallel. The contrast, however, is significant, b = −13.4, t(108) = −3.95, p < .001, and
tells us that as dates become less attractive the decline in interest is different depending on
whether charisma is average or low. This significant contrast seems at odds with what the
graph shows (as was the case for the previous contrast); if this contradiction is bothering
you then read Jane Superbrain Box 14.1.

FIGURE 14.11

Graph displaying looks × personality interaction 3: attractive vs. average, dullard vs.
some charisma


JANE SUPERBRAIN 14.1

Contrasts that are significant when the graphs don’t seem to show an interaction

The parameters in the model will depend on what else is included in the model. I have
said many times in this chapter (and others) that you should not interpret main effects
and interactions when a higher-order interaction is significant. These data are a good
illustration of why. Contrast 3 of the looks × personality interaction had a graph that
showed parallel lines, which we usually associate with non-significant interactions, yet
the contrast was significant. This is because of the influence of the higher-order looks ×
personality × gender interaction. When we built up the model we created a model that
contained everything apart from the three-way interaction (this is the model called
looks_personality). Let’s have a look at the parameter estimates for this model by using
the summary() function:

summary(looks_personality)

                              Value Std.Error  DF   t-value p-value
(Intercept)                70.46667  1.884462 112  37.39353  0.0000
AttractivevsAv             11.20000  2.467340  36   4.53930  0.0001
UglyvsAv                  -15.40000  2.467340  36  -6.24154  0.0000
HighvsAv                   21.61667  2.467340 112   8.76112  0.0000
DullvsAv                  -28.71667  2.467340 112 -11.63872  0.0000
genderMale                 -0.23333  2.252362  18  -0.10359  0.9186
AttractivevsAv:gender      12.50000  2.467340  36   5.06619  0.0000
UglyvsAv:gender           -10.40000  2.467340  36  -4.21507  0.0002
HighvsAv:gender           -12.73333  2.467340 112  -5.16075  0.0000
DullvsAv:gender            11.53333  2.467340 112   4.67440  0.0000
AttractivevsAv:HighvsAv   -14.10000  3.021861 112  -4.66600  0.0000
UglyvsAv:HighvsAv           6.75000  3.021861 112   2.23372  0.0275
AttractivevsAv:DullvsAv     4.70000  3.021861 112   1.55533  0.1227
UglyvsAv:DullvsAv          19.15000  3.021861 112   6.33715  0.0000

The AttractivevsAv:DullvsAv contrast is contrast 3 of the looks × personality interaction,
which, from the graph, looked non-significant. This contrast looks at the effect of
attractive dates compared to average ones in dull dates relative to those with average
charisma. When the three-way interaction was included this contrast was highly
significant, but in its absence the contrast reflects the non-significant result that we’d
expect from the graph. This example highlights the reason why you should interpret
the highest-order significant effect and not worry about interpreting lower-order effects
in the model.
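The four looks × personality estimates in the looks_personality model can be recovered directly from the cell means in Output 14.10, which makes it clearer what each contrast measures. The sketch below is only an illustration of that arithmetic, not the model-fitting procedure itself; the helper interaction_contrast is a hypothetical name. Each contrast is a difference of differences relative to the average-looking, average-charisma cell.

```r
# Cell means copied from Output 14.10 (rows: looks, columns: personality)
cell_means <- matrix(
  c(88.95, 85.60, 71.75,    # Charismatic: Attractive, Average, Ugly
    87.80, 70.35, 49.75,    # Average:     Attractive, Average, Ugly
    69.55, 47.40, 45.95),   # Dullard:     Attractive, Average, Ugly
  nrow = 3,
  dimnames = list(looks = c("Attractive", "Average", "Ugly"),
                  personality = c("Charismatic", "Average", "Dullard"))
)

# (effect of looks at the given charisma level) minus
# (effect of looks at average charisma)
interaction_contrast <- function(look, pers) {
  (cell_means[look, pers] - cell_means["Average", pers]) -
    (cell_means[look, "Average"] - cell_means["Average", "Average"])
}

estimates <- c(
  "AttractivevsAv:HighvsAv" = interaction_contrast("Attractive", "Charismatic"),
  "UglyvsAv:HighvsAv"       = interaction_contrast("Ugly", "Charismatic"),
  "AttractivevsAv:DullvsAv" = interaction_contrast("Attractive", "Dullard"),
  "UglyvsAv:DullvsAv"       = interaction_contrast("Ugly", "Dullard")
)
print(round(estimates, 2))
```

The results (−14.10, 6.75, 4.70 and 19.15) match the Value column for the looks_personality model above. In particular, contrast 3 comes out at only 4.70, which is why the lines in Figure 14.11 look roughly parallel; the much larger value of −13.4 in Output 14.4 arises only once the three-way interaction is in the model.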
14.6.8.4. Looks × personality interaction 4:
ugly vs. average, dullard vs. some charisma

The final contrast for this interaction term (UglyvsAv:DullvsAv) investigates ratings of ugly
compared to average-looking dates, when comparing dullards to dates with average charisma.
This is like asking: is the difference between no charisma and some charisma the same for ugly
people and average-looking people? Figure 14.12 shows the relevant bits of the interaction
graph; you can see that the interest (as indicated by high ratings) in average-looking dates
was higher when they had some charisma than when they were a dullard, but for ugly dates
the ratings were roughly the same regardless of the level of charisma. This contrast is highly
significant, b = 16.8, t(108) = 4.95, p < .001, and tells us that as dates become less attractive
the decline in interest in dates with a bit of charisma is significantly greater than for dullards.
FIGURE 14.12

Graph displaying looks × personality interaction 4: ugly vs. average, dullard vs. some
charisma
14.6.9. The interaction between looks, personality and gender


The three-way interaction tells us whether the looks × personality interaction described
above is the same for men and women (i.e., whether the combined effect of attractiveness
of the date and their level of charisma is the same for male participants as for female
participants). Output 14.3 tells us that there is a significant three-way looks × personality × gender
interaction, χ²(4) = 79.59, p < .0001. This is the highest-order effect that is significant, and
consequently, we would ordinarily focus on interpreting this effect and not all the
lower-order ones (which I have interpreted only for illustrative purposes).


SELF-TEST

Using ggplot2 and stat.desc(), plot a line graph
and get the means for the looks × personality ×
gender interaction.
How do I interpret a three-way interaction?

The nature of this interaction is revealed in Figure 14.13, which shows the
looks × personality interaction for men and women separately (the means
on which this graph is based appear in Output 14.1). The male graph shows
that when dates are attractive, men will express a high interest regardless of
charisma levels (the different coloured data points overlap). At the opposite
end of the attractiveness scale, when a date is ugly, men will express very
little interest (ratings are all low), regardless of the date’s charisma. The
only time charisma makes any difference to a man is if the date is
average-looking, in which case high charisma boosts interest, being a dullard reduces
interest, and having a bit of charisma leaves things somewhere in between. The take-home
message is that men are superficial cretins who are more interested in physical attributes.

The picture for women is very different. If someone has high levels of charisma then it 
doesn’t really matter what they look like, women will express an interest in them (the black 
line is relatively flat). At the other extreme, if the date is a dullard, then they will express no 
interest in them, regardless of how attractive they are (the dark blue line is relatively flat). 
The only time attractiveness makes a difference is when someone has an average amount of 
charisma, in which case being attractive boosts interest, and being ugly reduces it. Put another 
way, women prioritize charisma over physical appearance. Again, we can look at some contrasts 
to further break this interaction down (Output 14.4). These contrasts are similar to 
those for the looks x personality interaction, but they now also take into account the effect of 
gender as well. 


FIGURE 14.13 Graphs showing the looks by charisma interaction for men and women. Lines 
represent high charisma (black), some charisma (light blue) and no charisma (dark blue) 

[Graph: x-axis Attractiveness (Attractive, Average, Ugly) in each panel; legend Charisma 
(Charismatic, Average Charisma, Dullard)] 


14.6.9.1. Looks x personality x gender interaction 1: 
attractive vs. average, high charisma vs. 
some charisma, male vs. female 

The first contrast for this interaction term compares ratings for attractive dates to average-looking 
dates, when high charisma is compared to average charisma in males compared to 
females, b = 5.8, t(108) = 1.21, p = .230. The interaction graph in Figure 14.14 shows that 
interest (as indicated by high ratings) in attractive dates was the same regardless of whether 
they had high or average charisma. However, for average-looking dates, there was more 
interest when that person had high charisma rather than some charisma. Most important, 
this pattern of results is the same in males and females, and this is reflected in the 
non-significance of this contrast. 

FIGURE 14.14 Graph displaying looks x personality x gender interaction 1: attractive vs. 
average, high charisma vs. some charisma, males vs. females 


14.6.9.2. Looks x personality x gender interaction 2: ugly vs. 
average, high charisma vs. some charisma, males vs. females 

The second contrast for this interaction term compares interest in ugly compared to average-looking 
dates, when high charisma is compared to average charisma, in men compared to 
women. The interaction graph in Figure 14.15 shows that the patterns are different for men 
and women. This is reflected by the fact that the contrast is significant, b = -18.5, t(108) 
= -3.85, p < .001. To unpick this we need to look at the graph. First, let’s look at the men. 
For men, as attractiveness goes down, so does interest when the date has high charisma and 
when they have average charisma. In fact the lines are parallel. So, regardless of charisma, 
there is a similar reduction in interest as attractiveness declines. For women the picture is 
quite different. When charisma is high, there is no decline in interest as attractiveness falls 
(the black line is flat); however, when charisma is average, the attractiveness of the date does 
matter and interest is lower in an ugly date than in an average-looking date. Another way to 
look at it is that for dates with average charisma, the reduction in interest as attractiveness 
goes down is about the same in men and women (the light blue lines have the same slope). 
However, for dates who have high charisma, the decrease in interest if these dates are ugly 
rather than average looking is much more dramatic in men than women (the black line is 
much steeper for men than it is for women). This is what the significant contrast tells us. 


FIGURE 14.15 Graph displaying looks x personality x gender interaction 2: ugly vs. average, 
high charisma vs. some charisma, males vs. females 


14.6.9.3. Looks x personality x gender interaction 3: 
attractive vs. average, dullard vs. some charisma, 
male vs. female 


The third contrast for this interaction term compares interest in attractive compared to 
average-looking dates, when dullards are compared to average charisma, in men compared 
to women. The interaction graph in Figure 14.16 shows that the patterns are different 
for men and women. This is reflected by the fact that the contrast is significant, b = 36.2, 
t(108) = 7.54, p < .001. To unpick this effect we need to look at the graph. First, if we look 
at average-looking dates, for both men and women more interest is expressed when the 
date has average charisma than when they are a dullard (and the distance between the lines 
is about the same). So the difference doesn’t appear to be here. If we now look at attractive 
dates, we see that men are equally interested in their dates regardless of their charisma, but 
women are much less interested in an attractive person if they are a dullard. Put another 
way, for attractive dates, the distance between the lines is much smaller for men than it is 
for women. Another way to look at it is that for dates with average charisma, the reduction 
in interest as attractiveness goes down is about the same in men and women (the black lines 
have the same slope). However, for dates who are dullards, the decrease in interest if these 
dates are average-looking rather than attractive is much more dramatic in men than women 
(the light blue line is much steeper for men than it is for women). 

FIGURE 14.16 Graph displaying looks x personality x gender interaction 3: attractive vs. 
average, dullard vs. some charisma, males vs. females 


14.6.9.4. Looks x personality x gender interaction 4: ugly vs. 
average, dullard vs. some charisma, male vs. female 

The final contrast for this interaction term compares interest in ugly compared to average-looking 
dates, when comparing dullards to average charisma, in men compared to women. 
The interaction graph in Figure 14.17 shows that interest (as indicated by high ratings) in 
ugly dates was the same regardless of whether they had average charisma or were a dullard. 
However, for average-looking dates, there was more interest when that person had 
some charisma rather than if they were a dullard. Most important, this pattern of results is 
similar in males and females, and this is reflected in the non-significance of this contrast, b 
= 4.7, t(108) = 0.98, p = .330. 

FIGURE 14.17 Graph displaying looks x personality x gender interaction 4: ugly vs. average, 
dullard vs. some charisma, males vs. females 


14.6.10. Conclusions 


These contrasts tell us nothing about the differences between the attractive and ugly conditions, 
or the high-charisma and dullard conditions, because these were never compared. We 
could rerun the analysis and specify our contrasts differently to get these effects. However, 
what is clear from our data is that differences exist between men and women in terms of 
how they’re affected by the looks and personality of potential dates. Men appear to be 
enthusiastic about dating anyone who is attractive, regardless of how awful their personality. 
Women are almost completely the opposite: they are enthusiastic about dating anyone 
with a lot of charisma, regardless of how they look (and are unenthusiastic about dating 
people without charisma regardless of how attractive they look). The only consistency 
between men and women is when there is some charisma (but not lots), in which case 
for both genders the attractiveness influences how enthusiastic they are about dating the 
person. 

What should be even clearer from this chapter is that when more than two independent 
variables are used in a model, it yields complex interaction effects that require a great deal 
of concentration to interpret (imagine interpreting a four-way interaction). Therefore, it 
is essential to take a systematic approach to interpretation, and plotting graphs is a particularly 
useful way to proceed. It is also advisable to think carefully about the appropriate 
contrasts to use to answer the questions you have about your data. It is these contrasts that 
will help you to interpret interactions, so make sure you select sensible ones. 



CRAMMING SAM’S TIPS 


Multilevel models 


• The multilevel model approach is a more flexible approach to analysing mixed designs. You can also forget about sphericity. 

• It makes it easy to include contrasts to break apart interaction effects. Set appropriate contrasts for all predictors before you 
begin. 

• Build the model up one predictor at a time so that you can test the overall effect of each predictor. 

• If you build models up hierarchically, you can compare them using the anova() function. If each model contains only one 
additional predictor then by comparing models you can see the effect of each predictor as it is added to the model. 

• When you have a model with all predictors and interactions included, you can look at the model parameters to see the contrasts 
that you have set. These will help you to break down any interaction effects. If a contrast has a value of p less than .05 
we consider it significant. 

• Begin by interpreting the highest-order effect (i.e., the significant interaction that contains the most predictors). You should 
not interpret any lower order effects contained within that interaction. For example, if the a x b interaction is significant then 
don’t interpret the main effects of a or b; similarly if the a x b x c interaction is significant then don’t interpret the a x b, a x 
c, or b x c interactions or the main effects of a, b or c. 
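
The model-building advice in these tips can be sketched on a toy data set. This is only an illustration under stated assumptions: the dataframe toy, its simulated ratings and the model names are all made up for the example (they are not the chapter's data), and a simple random intercept for participant is used to keep things short. The lme() function comes from the nlme package.

```r
library(nlme)  # provides lme(); assumed to be installed (it ships with standard R)

set.seed(42)
# Hypothetical toy data: 10 participants each rate 3 x 3 types of date
toy <- expand.grid(
  participant = factor(1:10),
  looks = c("Attractive", "Average", "Ugly"),
  personality = c("Charismatic", "Average", "Dullard")
)
participantEffect <- rnorm(10, sd = 5)
toy$rating <- 60 + 15 * (toy$looks == "Attractive") - 15 * (toy$looks == "Ugly") +
  participantEffect[as.integer(toy$participant)] + rnorm(nrow(toy), sd = 5)

# Build the model up one predictor at a time (maximum likelihood so models are comparable)
baseline     <- lme(rating ~ 1, random = ~ 1 | participant, data = toy, method = "ML")
looksM       <- update(baseline, . ~ . + looks)
personalityM <- update(looksM, . ~ . + personality)
interactionM <- update(personalityM, . ~ . + looks:personality)

# Each row after the first tests the predictor added at that step
comparison <- anova(baseline, looksM, personalityM, interactionM)
print(comparison)
```

Because each model adds exactly one term to the previous one, each likelihood ratio in the printed table can be read as the overall effect of that term.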


14.7. Calculating effect sizes 



I keep emphasizing the fact that effect sizes are really more useful when they summarize a 
focused effect. This also gives me a useful excuse to circumvent the complexities of omega 
squared in mixed designs (it’s the road to madness, I assure you). Therefore, just calculate 
effect sizes for your contrasts when you’ve got a factorial design (and any main effects 
that compare only two groups).⁴ Output 14.4 shows the values for several contrasts, all 
of which have a t-value and associated degrees of freedom. We can compute approximate 
effect sizes in the same way that we did for repeated-measures designs, using: 

$$ r = \sqrt{\frac{t^2}{t^2 + df}} $$ 

⁴ Of course if you have used ezANOVA() then you could report generalized eta squared for your effects (ges in 
Output 14.2); however, I question how useful this kind of effect size is for effects with more than two groups, 
and for interaction terms. 

Remember that in section 10.7 we wrote a function to compute this called rcontrast() 
(which you should be able to use if you have the package associated with this book, DSUR, 
loaded; see section 3.4.5). We can get the effect sizes simply by executing: 

rcontrast(t, df) 

in which t is the value of t for the effect that you want to quantify and df is its associated degrees 
of freedom. We should really only quantify the highest-order interaction because other effects 
in Output 14.4 are not interesting, given that the three-way interaction is significant. 

Therefore, we can get the effect sizes by executing rcontrast() for each of the four contrasts 
for the three-way interaction: 

rcontrast(-1.20802, 108) 

[1] "r = 0.115464310595437" 

rcontrast(3.85315, 108) 

[1] "r = 0.347643452246021" 

rcontrast(-7.53968, 108) 

[1] "r = 0.587236020509728" 

rcontrast(-0.97891, 108) 

[1] "r = 0.0937805285056477" 

In other words, we get: 




$$ r_{\text{Attractive vs. Average, High vs. Average, Male vs. Female}} = .12 $$ 

$$ r_{\text{Ugly vs. Average, High vs. Average, Male vs. Female}} = .35 $$ 

$$ r_{\text{Attractive vs. Average, Dull vs. Average, Male vs. Female}} = .59 $$ 

$$ r_{\text{Ugly vs. Average, Dull vs. Average, Male vs. Female}} = .09 $$ 


The two effects that were significant (attractive vs. average, dullard vs. some, male vs. 
female and ugly vs. average, high vs. some, male vs. female) yielded fairly substantial effect 
sizes. The two effects that were not significant yielded fairly small effect sizes. 
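
If you don't have the DSUR package to hand, the same numbers can be reproduced with a couple of lines of base R. This is a minimal sketch: contrastR() is a hypothetical stand-in written for this example (it is not the book's rcontrast() function), and it returns r as a number rather than printing a text string. It uses the t-values and degrees of freedom passed to rcontrast() above; the sign of t makes no difference because t is squared.

```r
# Hypothetical helper (not the DSUR function): converts a t-statistic into r
contrastR <- function(t, df) {
  sqrt(t^2 / (t^2 + df))
}

# t-values for the four three-way interaction contrasts, each with df = 108
tValues <- c(-1.20802, 3.85315, -7.53968, -0.97891)
effectSizes <- contrastR(tValues, df = 108)

print(round(effectSizes, 2))
```

Passing all four t-values in one vector works because the arithmetic is applied to each element in turn, so there is no need for four separate calls.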


14.8. Reporting the results of mixed ANOVA 


As you’ve probably gathered, when you have more than two independent variables there’s 
a hell of a lot of information that people tend to report. They report all of the main effects, 
all of the interactions and any contrasts they may have done. This can take up a lot of space 
and one good tip is: reserve the detail for the effects that actually matter (e.g., main effects 
and lower-order interactions should not be interpreted if you’ve got significant higher-order 
interactions that include those variables). I’m a big fan of giving brief explanations of 
results in the results section to really get the message across about what a particular effect 
is telling us, and so I tend to not just report results, but offer some interpretation as well. 
Having said that, some journal editors are big fans of telling me my results sections are too 
long. So, you should probably ignore everything I say. 

If you’ve taken the ANOVA approach, then, you could report something like this 
(although not as a list!): 

• All effects are reported as significant at p < .05. There were significant main effects 
of the attractiveness of the date, F(2, 36) = 423.73, and the amount of charisma the 
date possessed, F(2, 36) = 328.25, on interest expressed by the participant. However, 
the ratings from male and female participants were, in general, the same, F(1, 18) < 
1, r = .02. 

• There were significant interaction effects of the attractiveness of the date and the 
gender of the participant, F(2, 36) = 80.43, the level of charisma of the date and the 
gender of the participant, F(2, 36) = 62.45, and the level of charisma of the date and 
the attractiveness of the date, F(4, 72) = 36.63. 

• Most important, the looks x personality x gender interaction was significant, F(4, 72) 
= 24.12. This indicates that the looks x personality interaction described previously 
was different in male and female participants. 

If you have used a multilevel model then you’d report something like this: 

• There were significant main effects of the attractiveness of the date, χ²(2) = 68.30, 
p < .0001, and the amount of charisma the date possessed, χ²(2) = 138.76, p < .0001, 
on interest expressed by the participant. However, the ratings from male and female 
participants were, in general, the same, χ²(1) = 0.002, p = .966. 

• There were significant interaction effects of the attractiveness of the date and the 
gender of the participant, χ²(2) = 39.54, p < .0001, the level of charisma of the date 
and the gender of the participant, χ²(2) = 57.96, p < .0001, and the level of charisma 
of the date and the attractiveness of the date, χ²(4) = 77.14, p < .0001. 

• Most important, the looks x personality x gender interaction was significant, χ²(4) = 
79.59, p < .0001. This indicates that the looks x personality interaction described previously 
was different in male and female participants. Contrasts were used to break 
down this interaction; these contrasts compared males’ and females’ scores at each level 
of charisma compared to the middle category of ‘average charisma’ across each level 
of attractiveness compared to the category of average attractiveness. The first contrast 
revealed a non-significant difference between male and female responses when 
comparing attractive dates to average-looking dates when the date had high charisma 
compared to some charisma, b = 5.8, t(108) = 1.21, p = .230, r = .12, and tells us 
that for both males and females, as dates become less attractive there is a greater 
decline in interest when charisma is average compared to when it is high. The second 
contrast looked for differences between males and females when comparing ugly 
dates to average-looking dates when the date had high charisma compared to average 
charisma. This contrast was significant, b = -18.5, t(108) = -3.85, p < .001, r = .35, 
and tells us that for dates with average charisma, the reduction in interest as attractiveness 
goes down is about the same in men and women, but for dates who have high 
charisma, the decrease in interest if these dates are ugly rather than average-looking is 
much more dramatic in men than women. The third contrast investigated differences 
between males and females when comparing attractive dates to average-looking dates 
when the date was a dullard compared to when they had average charisma. This contrast 
was significant, b = 36.2, t(108) = 7.54, p < .001, r = .59, and tells us that for dates 
with average charisma, the reduction in interest as attractiveness goes down is about 
the same in men and women, but for dates who are dullards, the decrease in interest 
if these dates are average-looking rather than attractive is much more dramatic in 
men than women. The final contrast looked for differences between men and women 
when comparing ugly dates to average-looking dates when the date was a dullard 
compared to when they had average charisma. This contrast was not significant, b = 
4.7, t(108) = 0.98, p = .330, r = .09, and tells us that for both men and women, as 
dates become less attractive the decline in interest in dates with average charisma is 
greater than for dullards. 

Labcoat Leni’s Real Research 14.1 


Keep the faith(ful)? 


Schützwohl, A. (2008). Personality and Individual Differences, 44, 633-644. 


People can be jealous. People can be especially jealous when they think that their partner is being unfaithful. An 
evolutionary view of jealousy suggests that men and women have evolved distinctive types of jealousy because 
male and female reproductive success is threatened by different types of infidelity. Specifically, a woman’s sexual 
infidelity deprives her mate of a reproductive opportunity and in some cases burdens him with years investing in 
a child that is not his. Conversely, a man’s sexual infidelity does not burden his mate with unrelated children, but 
may divert his resources from his mate’s progeny. This diversion of resources is signalled by emotional attachment 
to another female. Consequently, men’s jealousy mechanism should have evolved to prevent a mate’s 
sexual infidelity, whereas in women it has evolved to prevent emotional infidelity. If this is the case then men and 
women should divert their attentional resources towards different cues to infidelity: women should be ‘on the lookout’ 
for emotional infidelity, whereas men should be watching out for sexual infidelity. 

Achim Schützwohl put this theory to the test in a unique study in which men and women saw sentences 
presented on a computer screen (Schützwohl, 2008). On each trial, participants saw a target sentence that was 
always emotionally neutral (e.g., ‘The gas station is at the other side of the street’). However, the trick was that 
before each of these targets, a distractor sentence was presented that could also be affectively neutral, or could 
indicate sexual infidelity (e.g., ‘Your partner suddenly has difficulty becoming sexually aroused when he and you 
want to have sex’) or emotional infidelity (e.g., ‘Your partner doesn’t say “I love you” to you anymore’). The idea 
was that if these distractor sentences grabbed a person’s attention then (1) they would remember them, and (2) 
they would not remember the target sentence that came afterwards (because their attentional resources were still 
focused on the distractor). These effects should show up only in people currently in a relationship. The outcome 
was the number of sentences that a participant could remember (out of 6), and the predictors were whether the 
person had a partner or not (Relationship), whether the trial used a neutral distractor, an emotional infidelity 
distractor or a sexual infidelity distractor, and whether the sentence was a distractor or the target following the 
distractor. Schützwohl analysed men and women’s data separately (presumably to avoid having to interpret a 
hideous four-way interaction). The predictions are that women should remember more emotional infidelity sentences 
(distractors) but fewer of the targets that followed those sentences (target). For men, the same effect 
should be found but for sexual infidelity sentences. 

The data from this study are in the file Schützwohl(2008).dat. Labcoat Leni wants you to carry out 
two three-way mixed ANOVAs (one for men and the other for women) to test these hypotheses. Answers 
are in the additional material on the companion website (or look at pages 638-642 in the original article). 



14.9. Robust analysis for mixed designs 

If I had £1 (or $1, €1 or whatever currency you fancy) for every time someone had told me 
with 100% confidence that there was no ‘non-parametric’ equivalent of mixed ANOVA, 
then I’d have a nice shiny new drum kit. Contrary to this popular assertion, there are robust 
methods that can be used (see section 5.8.4) based on trimmed means and M-estimators 
that are described in Rand Wilcox’s book (Wilcox, 2005). Wilcox also makes available 
functions to do these tests in R. To access these tests we need to load the WRS package (see 
section 5.8.4.). There are four functions that we will look at: 

• tsplit(): This performs a two-way mixed ANOVA on trimmed means. 

• sppba(): This computes the main effect of factor A of a two-way mixed design using 
an M-estimator and bootstrap. 

• sppbb(): This computes the main effect of factor B of a two-way mixed design using 
an M-estimator and bootstrap. 

• sppbi(): This computes the A x B interaction of a two-way mixed design using an 
M-estimator and bootstrap. 



There is not a function for analysing a three-way mixed design like the main example in the 
chapter, so we’ll use a different example. 

My wife has a theory that she has received fewer friend requests from random men on 
Facebook since she changed her profile picture to a photo of us both. Like the geeky boffin 
couple we are, we decided to think about ways you could test her theory scientifically. 
We could systematically manipulate how people present themselves on social networking 
sites, and measure how many friend requests they get from people they don’t know. In our 
fertile imaginations, we took 40 women who had profiles on a social networking website; 
17 of them had a relationship status of ‘in a relationship’ and the remaining 23 had their status as 
‘single’. We asked them not to change this status and this acted as a between-group 
variable (relationship_status). We believed that people would get fewer requests from 
strangers if they were in a relationship. Over a 6-week period we asked these women to set 
their profile picture to a photo of them on their own (alone) and to count how many friend 
requests they got from men they didn’t know, then to switch it to a photo of them with a 
man (couple) and again record their friend requests from random men. Each profile picture 
was up for 3 weeks, and the order in which women displayed the two types of picture was 
randomized. This is a mixed design with relationship status as the between-group variable, 
type of profile picture as the repeated-measures variable, and the number of friend requests 
from strange men the outcome. 

The data are in the file ProfilePicture.dat. Set your working directory to the location of 
this file and load the data into a dataframe by executing: 


pictureData<-read.delim("ProfilePicture.dat", header = TRUE) 


The data are currently in this format (I’ve edited out some cases): 


   case relationship_status couple alone
1     1   In a Relationship      4     4
2     2   In a Relationship      4     6
3     3   In a Relationship      4     7
4     4   In a Relationship      3     5
36   36              Single      5    10
37   37              Single      4     8
38   38              Single      6     9
39   39              Single      7    10
40   40              Single      3     5

The variables in each column are described above. 




SELF-TEST 

Using ggplot2 and stat.desc, plot a line graph and get the means for the 
relationship_status x profile picture interaction. 

The first problem we have is that the robust functions need the data to be in wide format 
rather than long (see Chapter 3). Figure 14.18 shows the existing data format and how we 
need it to look (wide). Essentially we want levels of our two factors to be represented in 
different columns. Our repeated measure (type of picture) is already spread across differ¬ 
ent columns (couple and alone), but relationship status is differentiated by different rows 
of data (rows 1-17 are those in a relationship whereas 18-40 are single). Therefore, we 
need to take the rows representing people who are single and shift them into two columns 
alongside the columns currently labelled couple and alone. 

FIGURE 14.18 Restructuring the data for robust mixed ANOVA 

We can do this restructuring using the melt() and cast() functions from the reshape package. 
To get the restructuring to work, we need to add a variable to our dataframe that 
identifies the rows in the wide format. Notice in Figure 14.18 that the data are made up of 
four chunks that represent the combinations of the type of picture and relationship_status, 
and each chunk contains several rows. We want to move the chunks that are currently 
stacked on top of each other so that they are beside each other (Figure 14.18). To do this, 
R needs to know what row a particular score will end up in when we move each block of 
scores from the stacks into the columns. The easiest approach is simply to create a variable 
(called row) that identifies within each chunk the row number of a given score. In other 
words, it will be a value telling us whether the score is the first, second, third, etc. score 
within the chunk. At the moment the chunks are stacked on top of each other, so we want 
a variable that is the sequence of numbers 1 to 17 for the first chunk and 1 to 23 for the 
second (because the relationship status groups contain 17 and 23 people, respectively). We 
can add this variable to the dataframe by executing: 

pictureData$row<-c(1:17, 1:23) 

This command creates a variable row in the dataframe pictureData, that is, the numbers 
1 to 17 followed by the numbers 1 to 23. The structure of the data will be the same as 
before, it’s just that we have a new variable called row that identifies the scores within each 
relationship status group. 
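
Typing the group sizes by hand works here, but it relies on you knowing that there are 17 and 23 cases and that they are stored in that order. As a hedged alternative, base R's ave() function can number the cases within each group for you. The sketch below uses a made-up miniature dataframe (miniData, with 3 and 4 cases rather than 17 and 23) purely for illustration; it is not the ProfilePicture.dat file.

```r
# Hypothetical miniature stand-in for pictureData (3 + 4 cases, invented scores)
miniData <- data.frame(
  case = 1:7,
  relationship_status = c(rep("In a Relationship", 3), rep("Single", 4)),
  couple = c(4, 4, 3, 5, 4, 6, 3),
  alone = c(4, 6, 5, 10, 8, 9, 5),
  stringsAsFactors = FALSE
)

# Number the cases within each relationship_status group: 1, 2, 3 then 1, 2, 3, 4
miniData$row <- ave(miniData$case, miniData$relationship_status, FUN = seq_along)

print(miniData)
```

The result is the same kind of row variable as before, but it is derived from the grouping variable rather than from numbers typed in by hand.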

Next we need to make it molten so that we can cast the data into the wide format. To do 
this we use the melt() function (see section 3.9.4). Remember that in this function we differentiate 
variables that identify attributes of the scores (in this case, case, relationship_status 
and row all tell us about a given score, for example, that it was the third score in the 
‘single’ group) from the scores or measured variables themselves (in this case the columns 
labelled couple and alone both contain scores). Attributes are specified with the id option, 
and scores with the measured option. Therefore, we can create a molten dataframe called 
profileMelt by executing: 

profileMelt<-melt(pictureData, id = c("case", "row", "relationship_status"), 
measured = c("couple", "alone")) 

The data now look like this (I have edited out many cases to save space): 



   case row relationship_status variable value
1     1   1   In a Relationship   couple     4
2     2   2   In a Relationship   couple     4
18   18   1              Single   couple     6
19   19   2              Single   couple     3
41    1   1   In a Relationship    alone     4
42    2   2   In a Relationship    alone     6
79   39  22              Single    alone    10
80   40  23              Single    alone     5


The variable that differentiates whether the profile picture was the person alone, or 
the person alongside a man, has been labelled variable and the variable that contains the 
number of friend requests is called value. These labels are not that informative, so let’s 
rename them as profile_picture and friend_requests using the names() function. 

names(profileMelt)<-c("case", "row", "relationship_status", "profile_picture", 
"friend_requests") 

Executing this command takes the dataframe profileMelt and assigns the names in c() to 
each column. As such, our variables all now have names that relate to what they represent. 

Finally, we want to cast our data into the wide format using cast(). To do this we use a 
formula in the form: variables specifying the rows ~ variables specifying the columns. In 
this case, row tells us in which row to place a score, and we want the relationship_status 
and profile_picture variables split across different columns, so we’d use the formula: row 
~ relationship_status + profile_picture. Therefore, we can make a wide dataframe called 
profileData by executing: 

profileData<-cast(profileMelt, row ~ relationship_status + profile_picture, 
value = "friend_requests") 

Note that we have applied this command to the molten data set (profileMelt). The value = 
"friend_requests" option explicitly tells the function in which column to find the outcome variable. (The 
function will work without this option because it will take an educated guess at which column 
contains the scores, but it’s good practice to specify the outcome variable in the function.) 

The result is that the data have been transformed to the wide format. However, because 
we added the variable row to the dataframe, our new dataframe also contains this variable, 
and for the analysis we don’t want it. We can remove this variable by executing: 

profileData$row<-NULL 


If you look at the dataframe you’ll see a lovely wide format set of data (I have abbreviated 
‘in a relationship’ to ‘IAR’): 


profileData 

IAR_With Man  IAR_Alone  Single_With Man  Single_Alone
           4          4                6             8
           4          6                3             8
           4          7                4             9
           3          5                4             9
           4          3                3            10
           2          5                3            11
           4          6                4             7
           3          4                4             6
           3          7                2             8
           3          5                6             5
           3          8                4             9
           4          7                4             6
           3          6                3             5
           1          4                2             8
           3          6                4            11
           4          6                5             8
           4          7                3             7
          NA         NA                2             5
          NA         NA                5            10
          NA         NA                4             8
          NA         NA                6             9
          NA         NA                7            10
          NA         NA                3             5


Note that because the ‘in a relationship’ group contained fewer cases (17 rather than 23) 
there are NAs in the data set. These won’t affect the functions for robust analyses. 
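
If the reshape package is not available, the same side-by-side layout can be built with base R alone. What follows is a rough sketch under stated assumptions rather than a replacement for the melt() and cast() route: miniData is a made-up miniature dataframe (3 and 4 cases, invented scores), and miniWide and wideCols are names chosen for this example. The shorter group is padded with NA, just as in the listing above.

```r
# Hypothetical miniature stand-in for pictureData (3 + 4 cases, invented scores)
miniData <- data.frame(
  case = 1:7,
  relationship_status = c(rep("In a Relationship", 3), rep("Single", 4)),
  couple = c(4, 4, 3, 5, 4, 6, 3),
  alone = c(4, 6, 5, 10, 8, 9, 5),
  stringsAsFactors = FALSE
)

statuses <- unique(miniData$relationship_status)
longest <- max(table(miniData$relationship_status))  # size of the largest group

# One column per status x picture combination, with status varying slowest
wideCols <- list()
for (status in statuses) {
  for (picture in c("couple", "alone")) {
    scores <- miniData[miniData$relationship_status == status, picture]
    length(scores) <- longest  # pads the shorter group with NA
    wideCols[[paste(status, picture, sep = "_")]] <- scores
  }
}
miniWide <- data.frame(wideCols, check.names = FALSE)

print(miniWide)
```

The nesting of the two loops is what fixes the column order: relationship status is the outer loop, so it plays the role of factor A, and type of picture is the inner loop, so it plays the role of factor B.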

It’s important to note the order of the columns because this affects how we specify the 
robust analysis. In this case, the hierarchy of the independent variables is relationship_status 
followed by profile_picture. In other words, we have taken the four groups of scores and first 
divided them into in a relationship and single, then within these groups we have subdivided 
according to the type of profile picture that was used. We would say that relationship_status is 
factor A and profile_picture factor B (Figure 14.18). As such, the order of the columns reflects 
a 2 x 2 design (two levels of relationship status divided up into two levels of profile picture). 
The function tsplit() takes the general form: 

tsplit(levels of factor A, levels of factor B, data, tr = .2) 

As with other functions we’ve encountered, the level of trimming is by default 20% (tr 
= .2), but can be changed by including the tr = option. Assuming we are happy with the 
default level of trimming, we need only specify the dataframe (profileData) and the levels 
of factor A (two in this case as explained above) and factor B (two in this case). Therefore, 
we can do a robust two-way factorial ANOVA based on trimmed means by executing: 

tsplit(2, 2, profileData) 
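To get a feel for what 20% trimming does before running tsplit(), the sketch below computes an ordinary mean and a 20% trimmed mean for one column using only base R. It is an illustration, not part of the robust analysis itself: the vector iarAlone is a hypothetical name, typed in by hand from the IAR_Alone column shown above, and mean() with trim = 0.2 is assumed here to mirror the idea behind the tr = .2 setting.

```r
# Scores from the IAR_Alone column (17 cases followed by 6 missing values)
iarAlone <- c(4, 6, 7, 5, 3, 5, 6, 4, 7, 5, 8, 7, 6, 4, 6, 6, 7, rep(NA, 6))

# Ordinary mean, ignoring the NAs
plainMean <- mean(iarAlone, na.rm = TRUE)

# 20% trimmed mean: the NAs are removed first, then the lowest and highest
# 20% of the observed scores are dropped (3 from each end when n = 17)
trimmedMean <- mean(iarAlone, trim = 0.2, na.rm = TRUE)

print(c(plain = plainMean, trimmed = trimmedMean))
```

Because the missing values are removed before the trimming is applied, the NAs that pad out the shorter columns do not distort the trimmed mean.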

The functions sppba(), sppbb() and sppbi() all have the same format: 

sppba(levels of factor A, levels of factor B, data, est = mom, nboot = 2000) 

The main differences are an option to control the number of bootstrap samples (nboot), 
and an option est = to control the M-estimator that you want to use. You can use est = 
median (to use the median) or est = mom (to use a method based on identifying and removing 
outliers). In smaller samples you might find that est = mom throws up an error message, 
in which case switch to est = median. The default number of bootstrap samples is 599; let’s 
increase that to 2000 and run the analysis by executing: 5 

sppba(2, 2, profileData, est = mom, nboot = 2000) 

sppbb(2, 2, profileData, est = mom, nboot = 2000) 

sppbi(2, 2, profileData, est = mom, nboot = 2000) 


tsplit() 

$Qa 
[1] 10.78843 

$Qa.siglevel 
            [,1] 
[1,] 0.002795259 

$Qb 
[1] 92.12093 

$Qb.siglevel 
             [,1] 
[1,] 2.618876e-10 

$Qab 
[1] 8.167141 

$Qab.siglevel 
            [,1] 
[1,] 0.008003836 


sppba(), sppbb(), sppbi() 

sppba 

$p.value 
[1] 0.001 

$psihat 
[1] -1.464194 

$con 
     [,1] 
[1,]    1 
[2,]   -1 

sppbb 

$p.value 
[1] 0.0004997501 

$center 
[1] -3.171429 

sppbi 

$p.value 
[1] 0.015 

$psihat 
[1] 1.4375 

$con 
     [,1] 
[1,]    1 
[2,]   -1 

Output 14.11 


5 If you want to compare medians then execute: 

sppba(2, 2, profileData, est = median, nboot = 2000) 
FIGURE 14.19 Graph showing the mean number of friend requests on a social networking site from weird men as a function of a woman’s relationship status and whether their profile picture shows them alone or with a man (x-axis: Relationship Status, with levels In a Relationship and Single) 


The output of these commands is shown in Output 14.11. For tsplit() (left-hand side 
of Output 14.11) we are given a test statistic for factor A ($Qa), factor B ($Qb) and their 
interaction ($Qab) as well as the corresponding p-values ($Qa.siglevel, $Qb.siglevel and 
$Qab.siglevel, respectively). Remember that factor A was relationship status and factor 
B the profile picture used; therefore, we could conclude that there were significant main 
effects of relationship status, Q = 10.79, p = .003, and type of profile picture, Q = 92.12, 
p < .001, and a significant relationship status × type of profile picture interaction, Q = 8.17, 
p = .008. 

The sppba(), sppbb() and sppbi() outputs (right-hand side of Output 14.11) tell us much 
the same things, and in each case we get a test statistic ($psihat, labelled $center in the 
sppbb() output) and an associated p-value ($p.value). There were significant main effects of 
relationship status, ψ̂ = −1.46, p = .001, and type of profile picture, ψ̂ = −3.17, p < .001, 
and a significant relationship status × type of profile picture interaction, ψ̂ = 1.44, p = .015. 

These results are shown in Figure 14.19. The main effect of profile picture reflects 
the fact that more friend requests generally are made when the picture shows the 
woman on her own (the blue line is higher than the black); the main effect of relationship 
status reflects the fact that for both lines the number of requests is higher when the 
person’s status is ‘single’ than when it says ‘in a relationship’. The significant interaction 
seems to reflect the fact that the blue line is steeper than the black. In other words, the 
increase in friend requests obtained when your relationship status is ‘single’ (compared 
to ‘in a relationship’) is greater when your profile picture shows you alone. Basically, in 
terms of attracting friend requests from strange men you’ve never met, your best bet 
is to say you’re single and put a picture of you alone on your profile. The weirdos will 
come in droves. 
What have I discovered about statistics? 


Three-way ANOVA is a confusing nut to crack. I’ve probably done hundreds of three- 
way ANOVAs in my life and still I kept getting confused throughout writing this chapter 
(and so if you’re confused after reading it it’s not your fault, it’s mine). Hopefully, what 
you should have discovered is that the general linear model is flexible enough that you 
can mix and match independent variables that are measured using the same or different 
participants. In addition, we’ve looked at how ANOVA is also flexible enough to 
go beyond merely including two independent variables. Hopefully, you’ve also started 
to realize why there are good reasons to limit the number of independent variables that 
you include (for the sake of interpretation). 

Of course, far more interesting than that is that you’ve discovered that men are 
superficial creatures who value looks over charisma, and that women are prepared 
to date the hunchback of Notre Dame provided he has sufficient charisma. This is 
why as a 16-18-year-old my life was so complicated, because where on earth do you 
discover your hidden charisma? Luckily for me, some girls find alcoholics appealing. 
The girl I was particularly keen on at 16 was, as it turned out, keen on me too. 
I refused to believe this for at least a month. All of our friends were getting bored 
of us declaring our undying love for each other to them but then not speaking to 
each other; they eventually intervened. There was a party one evening and all of her 
friends had spent hours convincing me to ask her on a date, guaranteeing me that she 
would say ‘yes’. I had psyched myself up, I was going to do it: I was actually going 
to ask a girl out on a date. My whole life had been leading up to this moment and I 
must not do anything to ruin it. By the time she arrived my nerves had got the better 
of me and she had to step over my paralytic corpse to get into the house. Later 
on, my friend Paul Spreckley (see Figure 9.1) physically carried the girl in question 
from another room and put her next to me and then said something to the effect of 
‘Andy, I’m going to sit here until you ask her out’. He had a long wait but eventually, 
miraculously, the words came out of my mouth. Like the undying love of many a 
16-year-old, our love died about 2 years later. 


R packages used in this chapter 


ez 

ggplot2 

multcomp 

nlme 


pastecs 

reshape 

WRS 
R functions used in this chapter 


by() 

c() 

cast() 

cbind() 

contrasts() 

ezANOVA() 

ggplot() 

gl() 

list() 

lme() 

melt() 

names() 

rcontrast() 

sppba() 

sppbb() 

sppbi() 

stat.desc() 

summary() 

tsplit() 

update() 


Key terms that I’ve discovered 

Mixed ANOVA 

Mixed design 


Smart Alex’s tasks 


• Task 1: I am going to extend the example from the previous chapter (advertising and 
different imagery) by adding a between-group variable into the design. 6 To recap, 
participants viewed a total of nine mock adverts over three sessions. In these adverts 
there were three products (a brand of beer, a brand of wine, and a brand of water). 
These could be presented alongside positive, negative or neutral imagery. Over the 
three sessions and nine adverts, each type of product was paired with each type of 
imagery (read the previous chapter if you need more detail). After each advert 
participants rated the drinks on a scale ranging from −100 (dislike very much) through 
0 (neutral) to 100 (like very much). The design, thus far, has two independent 
variables: the type of drink (beer, wine or water) and the type of imagery used (positive, 
negative or neutral). I also took note of each person’s gender. It occurred to me 
that men and women might respond differently to the products (because, in keeping 
with stereotypes, men might mostly drink lager whereas women might drink wine). 
Therefore, I wanted to analyse the data taking this additional variable into account. 
Now, gender is a between-group variable because a participant can be only male or 
female: they cannot participate as a male and then change into a female and 
participate again! The data are the same as in the previous chapter (Table 13.4) and can be 
found in the file MixedAttitude.dat. Run a mixed ANOVA on these data. 



• Task 2: Text messaging is very popular among mobile phone owners, to the point 
that books have been published on how to write in text speak (BTW, hope u no wat 
I mean by txt spk). One concern is that children may use this form of communication 
so much that it will hinder their ability to learn correct written English. One 


6 Previously the example contained two repeated-measures variables (drink type and imagery type), but now it 
will include three variables (two repeated measures and one between-group). 
concerned researcher conducted an experiment in which one group of children was 
encouraged to send text messages on their mobile phones over a six-month period. 
A second group was forbidden from sending text messages for the same period. To 
ensure that kids in this latter group didn’t use their phones, this group was given 
armbands that administered painful shocks in the presence of microwaves (like those 
emitted from phones). There were 50 different participants: 25 were encouraged to 
send text messages, and 25 were forbidden. The outcome was a score on a grammatical 
test (as a percentage) that was measured both before and after the experiment. 
The first independent variable was, therefore, text message use (text messagers 
versus controls) and the second independent variable was the time at which grammatical 
ability was assessed (before or after the experiment). The data are in the file 
TextMessages.dat. 

• Task 3: A researcher was interested in the effects on people’s mental health of 
participating in Big Brother (see Chapter 1 if you don’t know what Big Brother is). The 
researcher hypothesized that contestants start off with personality disorders that are 
exacerbated by being forced to live with people as attention seeking as them. To test this 
hypothesis, she gave eight contestants a questionnaire measuring personality disorders 
before they entered the house, and again when they left the house. A second 
group of eight people acted as a waiting list control. These people were short-listed to 
go into the house, but never actually made it. They too were given the questionnaire 
at the same points in time as the contestants. The data are in BigBrother.dat. Conduct 
a mixed ANOVA on the data. 



• Task 4: In this chapter we did a robust analysis on some data about how people’s 
profile pictures on social networking sites affect their friend requests. Reanalyse these 
data using a non-robust analysis. The data are in the file ProfilePicture.dat. 

Answers can be found on the companion website. Some more detailed comments about 
task 2 can be found in Field and Hole (2003). 


Further reading 


Field, A. P. (1998). A bluffer’s guide to sphericity. Newsletter of the Mathematical, Statistical and 
Computing Section of the British Psychological Society, 6(1), 13-22. (Available in the additional 
material on the companion website.) 

Howell, D. C. (2006). Statistical methods for psychology (6th ed.). Belmont, CA: Duxbury. (Or you 
might prefer his Fundamental Statistics for the Behavioral Sciences, also in its 6th edition, 2007.) 


Interesting real research 


Schützwohl, A. (2008). The disengagement of attentive resources from task-irrelevant cues to sexual 
and emotional infidelity. Personality and Individual Differences, 44, 633-644. 





Non-parametric tests 





FIGURE 15.1 In my office during my Ph.D., probably preparing some teaching - I had quite long hair back then because it hadn't started falling out at that point 


15.1. What will this chapter tell me? 


After my psychology degree (at City University, London) I went to the University of Sussex 
to do my Ph.D. (also in psychology) and, like many people, I had to teach to survive. Much 
to my dread, I was allocated to teach second-year undergraduate statistics. This was 
possibly the worst combination of events that I could ever imagine. I was still very shy at the 
time, and I didn’t have a clue about statistics. Standing in front of a room full of strangers 
and trying to teach them ANOVA was only marginally more appealing than dislocating 
my knees and running a marathon - with broken glass in my trainers (sneakers). I 
obsessively prepared for my first session so that it would go well; I created handouts, I invented 
examples, I rehearsed what I would say. I went in terrified but at least knowing that if 
preparation was any predictor of success then I would be OK. About half way through the 
first session as I was mumbling on to a room of bored students, one of them rose 
majestically from her seat. She walked slowly towards me, and I’m convinced that she was 
surrounded by an aura of bright white light and dry ice. Surely she had been chosen by her 
peers to impart a message of gratitude for the hours of preparation I had done and the skill 
with which I was unclouding their brains of the mysteries of ANOVA. She stopped beside 
me. We stood inches apart and my eyes raced around the floor looking for the reassurance 
of my shoelaces: ‘No one in this room has a rabbit 1 clue what you’re going on about’, 
she spat before storming out. Scales have not been invented yet to measure how much I 
wished I’d run the dislocated-knees marathon that morning and then taken the day off. I 
was absolutely mortified. To this day I have intrusive thoughts about groups of students in 
my lectures walking zombie-like towards the front of the lecture theatre chanting ‘No one 
knows what you’re going on about’ before devouring my brain in a rabid feeding frenzy. 
The point is that sometimes our lives, like data, go horribly, horribly wrong. This chapter 
is about data that are as wrong as dressing a cat in a pink tutu. 


15.2. When to use non-parametric tests 


We’ve seen in the last few chapters how we can use various techniques to look for 
differences between means. However, all of these tests rely on parametric assumptions (see 
Chapter 5). Data are often unfriendly and don’t always turn up in nice 
normally distributed packages! Just to add insult to injury, it’s not always 
possible to correct for problems with the distribution of a data set - so, 
what do we do in these cases? The answer is that we can use special kinds of 
statistical procedures known as non-parametric tests. 2 Non-parametric tests 
are sometimes known as assumption-free tests because they make fewer 
assumptions about the type of data on which they can be used. 3 Most of 
these tests work on the principle of ranking the data: that is, finding the 
lowest score and giving it a rank of 1, then finding the next highest score 
and giving it a rank of 2, and so on. This process results in high scores 
being represented by large ranks, and low scores being represented by small ranks. The 
analysis is then carried out on the ranks rather than the actual data. This process is an 
ingenious way around the problem of using data that break the parametric assumptions. 
Some people believe that non-parametric tests have less power than their parametric 
counterparts, but as we will see in Jane Superbrain Box 15.2 below this is not always 
true. In this chapter we’ll look at four of the most common non-parametric 
procedures: the Wilcoxon rank-sum test (which is also known as the Mann-Whitney test), 
the Wilcoxon signed-rank test, Friedman’s test and the Kruskal-Wallis test. For each of 
these we’ll discover how to carry out the analysis in R and how to interpret and report 
the results. 
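The ranking step is easy to try out in R with the base function rank(). The sketch below uses a small made-up vector (the name scores and its values are purely illustrative) to show that the lowest score gets a rank of 1 and the highest score gets the largest rank.

```r
# Made-up scores, for illustration only
scores <- c(28, 5, 17, 39, 9)

# rank() gives 1 to the lowest score, 2 to the next lowest, and so on
ranks <- rank(scores)

# Show each score next to its rank
print(data.frame(scores, ranks))
```

Any analysis carried out on ranks would then use the ranks column rather than the original scores.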



1 She didn’t say ‘rabbit’, but she did say a word that describes what rabbits do a lot; it begins with an ‘f ’ and the 
publishers think that it will offend you. 

2 Having said which, with the advent of the kinds of robust procedures we have used throughout this book, I’m 
not sure for how much longer people will use these tests. 

3 Non-parametric tests sometimes get referred to as distribution-free tests, with an explanation that they make no 
assumptions about the distribution of the data. Technically, this isn’t true: they do make distributional assumptions 
(e.g., the ones in this chapter all assume a continuous distribution), but they are less restrictive ones than their 
parametric counterparts. 
15.3. Packages used in this chapter 


Most of the tests used in this chapter are in the stats package, which is installed and loaded 
automatically. However, we will need the packages clinfun (for the Jonckheere test), pastecs 
(for descriptive statistics), pgirmess (for post hoc tests), ggplot2 (for graphs), and Rcmdr (R 
Commander) if you’re going to use that rather than commands (see section 3.6). If you 
don’t have these packages installed you’ll need to install them by executing: 

install.packages("clinfun"); install.packages("ggplot2"); 
install.packages("pastecs"); install.packages("pgirmess") 

Then you need to load the packages by executing these commands: 

library(clinfun); library(ggplot2); library(pastecs); library(pgirmess) 


15.4. Comparing two independent conditions: the Wilcoxon rank-sum test 


When you want to test differences between two conditions and different participants have 
been used in each condition then you have two choices: the Mann-Whitney test (Mann & 
Whitney, 1947) and Wilcoxon’s rank-sum test (Wilcoxon, 1945; Figure 15.2). These 
tests are the non-parametric equivalent of the independent t-test. In fact both tests are 
equivalent, and there’s another, more famous, Wilcoxon test, so it gets extremely confusing 
for most of us. R does the Wilcoxon rank-sum test, but if you read about the Mann- 
Whitney test, it’s the same. (I’d prefer it if R did the Mann-Whitney test, that way we’d 
only have one Wilcoxon test to worry about. But that’s not the way it is, so we’ll have to 
get used to it.) 

For example, a neurologist might collect data to investigate the depressant effects of 
certain recreational drugs. She tested 20 clubbers in all: 10 were given an ecstasy tablet to 
take on a Saturday night and 10 were allowed to drink only alcohol. Levels of depression 
were measured using the Beck Depression Inventory (BDI) the day after and midweek. The 
data are in Table 15.1 and in the file Drug.dat. 


15.4.1. Theory of the Wilcoxon rank-sum test 


The logic behind the Wilcoxon rank-sum test is incredibly elegant. First, let’s imagine a 
scenario in which there is no difference in depression levels between ecstasy and alcohol 
users. If we were to rank the data ignoring the group to which a person belonged from 
lowest to highest (i.e., give the lowest score a rank of 1 and the next lowest a rank of 2, 
etc.), then what should we find? Well, if there’s no difference between the groups then we 
would expect to find a similar number of high and low ranks in each group; specifically, 
if we added up the ranks, then we’d expect the summed total of ranks in each group to be 
about the same. Now think about what would happen if there was a difference between the 
groups. Let’s imagine that the ecstasy group is more depressed than the alcohol group. If 
we ranked the scores as before, then we would expect the higher ranks to be in the ecstasy 
group and the lower ranks to be in the alcohol group. Again, if we summed the ranks in 
each group, we’d expect the sum of ranks to be higher in the ecstasy group than in the 
alcohol group. 
FIGURE 15.2 Frank Wilcoxon 



Table 15.1 Data for drug experiment 




The Wilcoxon rank-sum test works on this principle. Let’s have a look at 
how ranking works in practice. Figure 15.3 shows the ranking process for 
both the Wednesday and Sunday data. To begin with, let’s use our data for 
Wednesday, because it’s more straightforward. First, just arrange the scores 
in ascending order, attach a label to remind you which group they came from 
(I’ve used A for alcohol and E for ecstasy), then assign potential ranks starting 
with 1 for the lowest score and going up to the number of scores you 
have. The reason why I’ve called these ‘potential’ ranks is that sometimes the 










FIGURE 15.3 Ranking the depression scores for Wednesday (panel title: Wednesday Data) 

same score occurs more than once in a data set (e.g., in these data a score of 6 occurs twice, 
and a score of 35 occurs three times). These are called tied ranks and these values need 
to be given the same rank, so all we do is assign a rank that is the average of the potential 
ranks for those scores. So, with our two scores of 6, because they would’ve been ranked 3 
and 4, we take an average of these values (3.5) and use this value as a rank for both 
occurrences of the score. Likewise, with the three scores of 35, we have potential ranks of 16, 
17 and 18; we actually use the average of these three ranks, (16 + 17 + 18)/3 = 17. When 
we’ve ranked the data, we add up all of the ranks for the two groups. So, add the ranks 
for the scores that came from the alcohol group (you should find the sum is 59) and then 
add the ranks for the scores that came from the ecstasy group (this value should be 151). 
We’re almost at the answer. 

For each of these values, we need to correct for the number of people in the group, by 
subtracting the mean rank of the group for a group of that many people (because otherwise 
larger groups would have larger ranks). The mean rank is the mean of the numbers from 
1 to 10: 

$$\text{mean rank} = 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 = 55$$

There’s a slightly easier formula, especially if you have a lot of numbers: 

$$\text{mean rank} = \frac{N(N + 1)}{2} = \frac{10 \times 11}{2} = 55$$


We therefore calculate two potential values for W, one for each group: 

$$W = \text{sum of ranks} - \text{mean rank}$$
$$W_1 = 59 - 55 = 4$$
$$W_2 = 151 - 55 = 96$$

Typically, we take the smallest of these values to be our test statistic, therefore the test 
statistic for the Wednesday data is W = 4. However, which of the two values of W is 
reported by R depends on which way around you input variables into the function, which 
is a little confusing, but don’t worry about it - it makes no difference to the significance. 
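The hand calculation above can be checked in R. This is a minimal sketch rather than a description of how wilcox.test() is implemented: it types in the Wednesday scores from Drug.dat (the first 10 values are the ecstasy group, the last 10 the alcohol group), and the object names group, ranks, rankSums and W are just labels chosen here.

```r
# Wednesday BDI scores: first 10 = ecstasy group, last 10 = alcohol group
wedsBDI <- c(28, 35, 35, 24, 39, 32, 27, 29, 36, 35,
             5, 6, 30, 8, 9, 7, 6, 17, 3, 10)
group <- gl(2, 10, labels = c("Ecstasy", "Alcohol"))

# Rank all 20 scores together; tied scores get the average of their potential ranks
ranks <- rank(wedsBDI)

# Sum of ranks within each group
rankSums <- tapply(ranks, group, sum)

# Subtract N(N + 1)/2 for a group of N = 10 people
n <- 10
W <- rankSums - n * (n + 1) / 2

print(rankSums)
print(W)
```

The rank sums come out as 151 (ecstasy) and 59 (alcohol), giving values of W of 96 and 4, which agrees with the hand calculation.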



SELF-TEST 

Based on what you have just learnt, try ranking the 
Sunday data. (The answers are in Figure 15.3 - there are 
lots of tied ranks and the data are generally horrible.) 


You should find that when you’ve ranked the data, and added the ranks for the two 
groups, the sum of ranks for the alcohol group is 90.5 and for the ecstasy group it is 119.5. 
These are the sums of the ranks. We take the smaller of the two (90.5) and subtract 55, 
thereby obtaining a value of W of 35.5. 

Having computed the test statistic, R then calculates the associated p-value, which can 
be done in two ways. First, there is the exact approach, which is the best one to take. The 









exact approach uses a Monte Carlo method to obtain the significance level. 4 This basically 
involves creating lots of data sets that match the sample, but instead of putting people into 
the correct groups, it puts them into a random group. Because the people were assigned to 
a group randomly, we know that the null hypothesis is true - so it calculates the value for 
W, based on these data in which the null hypothesis is true. Let’s think about this - if the 
null hypothesis is true, and the results of this analysis look like your analysis, well, that’s 
not so good for your hypothesis. However, R doesn’t just put the people into a random 
group and analyse them once, it then repeats it, and looks at the results again ... and again 
... and again. It does this thousands of times and looks at how often the difference that 
appears in the data when the null hypothesis is true is as large as the difference in your data. 
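The reshuffling idea described above can be sketched in a few lines of R. This illustrates the logic only, under stated assumptions: it is not a claim about how wilcox.test() computes its p-value internally, the helper rankSumW() and the other object names are hypothetical, and 5000 reshuffles is an arbitrary choice.

```r
set.seed(42)  # so the random reshuffles are reproducible

# Wednesday BDI scores: first 10 = ecstasy group, last 10 = alcohol group
wedsBDI <- c(28, 35, 35, 24, 39, 32, 27, 29, 36, 35,
             5, 6, 30, 8, 9, 7, 6, 17, 3, 10)
drug <- gl(2, 10, labels = c("Ecstasy", "Alcohol"))

# Hypothetical helper: W for the first group = its sum of ranks minus n1(n1 + 1)/2
rankSumW <- function(scores, groups) {
  r <- rank(scores)
  inFirst <- groups == levels(groups)[1]
  n1 <- sum(inFirst)
  sum(r[inFirst]) - n1 * (n1 + 1) / 2
}

observedW <- rankSumW(wedsBDI, drug)

# Shuffle the group labels many times; each shuffle is a data set in which
# the null hypothesis is true by construction
nSim <- 5000
simW <- replicate(nSim, rankSumW(wedsBDI, sample(drug)))

# With 10 people per group, W is expected to be 10 * 10 / 2 = 50 under the null.
# The p-value is the proportion of shuffles at least as far from 50 as the real data.
expectedW <- 10 * 10 / 2
pValue <- mean(abs(simW - expectedW) >= abs(observedW - expectedW))

print(observedW)
print(pValue)
```

Very few reshuffled data sets produce a value of W as extreme as the one observed, which is what a small p-value means here.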

This method is great, because we don’t need to make any assumptions about the 
distribution, 5 but it’s not so great because it takes a long time; and as the sample size 
increases, the length of time it takes increases more and more. If your sample is big enough 
you might actually die before you get an answer. In addition, if you have ties in the data, 
you cannot use the exact method. 

With large sample sizes, you are better off using a normal approximation to calculate 
the p-value. The normal approximation doesn’t assume that the data are normal. Instead 
it assumes that the sampling distribution of the W statistic is normal, which means that 
a standard error can be computed that is used to calculate a z and hence a p-value. The 
default in R is to use a normal approximation if either group contains 50 or more scores; and if you 
have ties, you have to use a normal approximation whether you like it or not. 

If you use a normal approximation to calculate the p-value, you also have the option to use 
a continuity correction. 6 The reason for the continuity correction is that we’re using a normal 
distribution, which is smooth, but a person can change in rank only by 1 (or 0.5, if there are 
ties), which is not smooth. Therefore, the p-value using the normal approximation is a little 
too small; the continuity correction attempts to rectify this problem but can make your 
p-value a little too high instead. The difference that the correction makes is pretty small - there 
is no consensus on the best thing to do. If you don’t specify, R will include the correction. 
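To make the normal approximation and the continuity correction concrete, here is a rough sketch for the Wednesday data (W = 4, with 10 people per group). It assumes the usual formulas for the mean and standard error of W when there are no ties, so it ignores the adjustment for ties that R applies and will not reproduce R's p-value exactly; the object names are illustrative.

```r
W  <- 4    # test statistic for the Wednesday data
n1 <- 10   # number of people in each group
n2 <- 10

# Mean and standard error of W when the null hypothesis is true (no tie correction)
meanW <- n1 * n2 / 2
seW   <- sqrt(n1 * n2 * (n1 + n2 + 1) / 12)

# Without continuity correction
z <- (W - meanW) / seW
pUncorrected <- 2 * pnorm(-abs(z))

# With continuity correction: move W half a unit towards its expected value
zCorrected <- (abs(W - meanW) - 0.5) / seW
pCorrected <- 2 * pnorm(-zCorrected)

print(c(z = z, pUncorrected = pUncorrected, pCorrected = pCorrected))
```

The corrected p-value is slightly larger than the uncorrected one, which is the behaviour described above.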


15.4.2. Inputting data and provisional analysis 



SELF-TEST 

See whether you can use what you have learnt about 
data entry to enter the data in Table 15.1 into R. 


4 If you’re wondering why it’s called the Monte Carlo method, it’s because back in the late nineteenth century 
when Karl Pearson was trying to simulate data he didn’t have a computer to do it for him. So he used to toss 
coins. A lot. That is, until a friend suggested that roulette wheels, if unbiased, were excellent random number 
generators. Rather than trying to persuade the Royal Society to fund trips to Monte Carlo casinos to collect data 
from their roulette wheels, he purchased copies of Le Monaco, a weekly Paris periodical that published exactly 
the data that he required, at the cost of 1 franc (Pearson, 1894; Plackett, 1983). When simulated data are used 
to test a statistical method, or to estimate a statistic, it is known as the Monte Carlo method even though we use 
computers now and not roulette wheels. 

5 Actually it does make an assumption, but it’s a good one: it assumes that the distribution in your sample looks 
exactly like the distribution in your sample. This assumption is, of course, true. It also explains why it has to do 
this analysis every time you run the test, because it’s different for every sample (unlike tests like the t-test, where 
it is assumed that the distribution is normal). 

6 If you’re reading this book out of order, the continuity correction is the same as the Yates correction that you 
came across in Chapter 18. Or if you’re reading it in order, that you will come across in Chapter 18. 
When the data are collected using different participants in each group, we need to input 
the data using a factor variable. So, the data editor will have three columns of data. The 
first column is a coding variable (called something like drug), which, in this case, will have 
only two levels (ecstasy group and alcohol group). We can create this variable using the gl() 
function (see section 3.5.4.3), by executing: 

drug<-gl(2, 10, labels = c("Ecstasy", "Alcohol")) 

This command creates a variable called drug, which contains two blocks of 10 rows of 
data: the first block will be labelled Ecstasy and the second block Alcohol. 

The second column will have values for the dependent variable (BDI) measured the day after 
(call this variable sundayBDI) and the third will have the midweek scores on the same 
questionnaire (call this variable wedsBDI). We can, therefore, create these variables by executing: 

sundayBDI<-c(15, 35, 16, 18, 19, 17, 27, 16, 13, 20, 16, 15, 20, 15, 16, 13, 
14, 19, 18, 18) 

wedsBDI<-c(28, 35, 35, 24, 39, 32, 27, 29, 36, 35, 5, 6, 30, 8, 9, 7, 6, 17, 
3, 10) 

Finally, we can tie these variables together in a dataframe called drugData by executing: 

drugData<-data.frame(drug, sundayBDI, wedsBDI) 

If you don’t want to do that, you’ll find the data in the file called Drug.dat, which you 
can load by executing: 

drugData<-read.delim("Drug.dat", header = TRUE) 

First, we would run some exploratory analyses on the data and because we’re going to 
be looking for group differences we need to run these exploratory analyses for each group. 



SELF-TEST 

Carry out some analyses to test for normality and 
homogeneity of variance in these data (see sections 
5.6 and 5.7). 


The results of these exploratory analyses are shown in Outputs 15.1 and 15.2. Output 
15.1 shows that for the Sunday data the distribution for ecstasy, W = 0.81, p < .05, appears 
to be non-normal whereas the alcohol data, W = 0.96, ns, are normal; we can tell this by whether 
the significance of the Shapiro-Wilk test is less than .05 (and, therefore, significant) or 
greater than .05 (and, therefore, non-significant, ns). For the Wednesday data, although the 
data for ecstasy are normal, W = 0.94, ns, the data for alcohol appear to be significantly 
non-normal, W = 0.75, p < .01. This finding would alert us to the fact that the sampling 
distribution might also be non-normal for the Sunday and Wednesday data and that a 
non-parametric test should be used. 

Output 15.2 shows the results of Levene’s test. For the Sunday data, F(1, 18) = 3.64, 
ns, and for Wednesday, F(1, 18) = 0.51, ns, the variances are not significantly different, 
indicating that the assumption of homogeneity has been met. 
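One possible way to run the normality checks by group is sketched below using only base R. It assumes the dataframe drugData has been created as described earlier (it is rebuilt here so that the example runs on its own), and the object names sundayTests and wedsTests are just labels chosen for this sketch. The Shapiro-Wilk results should correspond to the normtest.W and normtest.p rows of Output 15.1. Levene’s test is left out because it needs an add-on package.

```r
# Rebuild the data so the example is self-contained
drug <- gl(2, 10, labels = c("Ecstasy", "Alcohol"))
sundayBDI <- c(15, 35, 16, 18, 19, 17, 27, 16, 13, 20,
               16, 15, 20, 15, 16, 13, 14, 19, 18, 18)
wedsBDI <- c(28, 35, 35, 24, 39, 32, 27, 29, 36, 35,
             5, 6, 30, 8, 9, 7, 6, 17, 3, 10)
drugData <- data.frame(drug, sundayBDI, wedsBDI)

# Shapiro-Wilk test for each day, run separately within each drug group
sundayTests <- by(drugData$sundayBDI, drugData$drug, shapiro.test)
wedsTests   <- by(drugData$wedsBDI, drugData$drug, shapiro.test)

print(sundayTests)
print(wedsTests)
```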

drug: Ecstasy 

               Sunday_BDI  Wednesday_BDI 
median        17.50000000     33.5000000 
mean          19.60000000     32.0000000 
SE.mean        2.08806130      1.5129074 
CI.mean.0.95   4.72352283      3.4224344 
var           43.60000000     22.8888889 
std.dev        6.60302961      4.7842334 
coef.var       0.33688927      0.1495073 
skewness       1.23571300     -0.2191665 
skew.2SE       0.89929826     -0.1594999 
kurtosis       0.26030385     -1.4810114 
kurt.2SE       0.09754697     -0.5549982 
normtest.W     0.81064005      0.9411414 
normtest.p     0.01952069      0.5657834 


drug: Alcohol 

               Sunday_BDI  Wednesday_BDI 
median        16.00000000    7.500000000 
mean          16.40000000   10.100000000 
SE.mean        0.71802197    2.514181996 
CI.mean.0.95   1.62427855    5.687474812 
var            5.15555556   63.211111111 
std.dev        2.27058485    7.950541561 
coef.var       0.13845030    0.787182333 
skewness       0.11686189    1.500374383 
skew.2SE       0.08504701    1.091907319 
kurtosis      -1.49015904    1.079109997 
kurt.2SE      -0.55842624    0.404388605 
normtest.W     0.95946594    0.753466710 
normtest.p     0.77976592    0.003933045 

Output 15.1 

Levene's Test for Homogeneity of Variance (center = "mean") 
      Df F value  Pr(>F) 
group  1  3.6436 0.07236 
      18 

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Levene's Test for Homogeneity of Variance (center = "mean") 
      Df F value Pr(>F) 
group  1  0.5081 0.4851 
      18 

Output 15.2 

15.4.3. Running the analysis using R Commander 


As always, import the data using Data=>Import data=>from text file, clipboard, or URL... 
(see section 3.7.3), click on OK and choose the file Drug.dat. 

To run the Wilcoxon test on independent samples, select Statistics=>Nonparametric 
tests=> Two-sample Wilcoxon test... to activate the dialog box in Figure 15.4. In the box 
on the left, labelled Groups (pick one), select the variable that defines the groups that 
you want to compare; this variable must be a factor with two levels. In our case we want 
to select the variable drug. On the right, in the list labelled Response Variable (pick one), 
choose the outcome variable on which you want to compare groups. In this case, we’ll pick 
sundayBDI first. 

FIGURE 15.4 The non-parametric tests menu in R Commander and the dialog box for the Wilcoxon test for independent samples 

The Wilcoxon test offers two main ways to calculate a p-value. The first option 
is the default, which depends on the sample size and the presence of ties. If each group 
contains fewer than 50 scores, then the default is to do an exact test, as long as there are no ties. 
If a group contains 50 or more scores, the default is to use a normal approximation with 
continuity correction. You can override this default if you like, but remember that if you 
have ties the exact test won’t work, and if you have a large sample the exact test may not 
finish before your funeral. 

You should probably leave the default option of a two-sided test as it is (although if you 
have predicted a direction of the effect you could choose to test whether the difference 
will be bigger (Difference > 0) or smaller (Difference < 0) than zero). When you have 
selected your variables, click on OK to run the analysis. The output will be discussed in 
due course. 


15.4.4. Running the analysis using R 


The function for the Wilcoxon test is called wilcox.test() and works in a very similar way to 
the t.test() function (see section 9.5.2). That is, there are two different ways that you can 
use this function and it depends on whether your group data are in a single column or if 
they are in two different columns. 

If you have the data for different groups stored in a single column, then the wilcox.test() 
function is used like the lm() function (in other words, like a regression): 

newModel<-wilcox.test(outcome ~ predictor, data = dataFrame, paired = FALSE/TRUE) 

in which: 

• newModel is an object created that contains information about the model. We can get 
summary statistics for this model by executing the name of the model. 

• outcome is a variable that contains the scores for the outcome measure (in this case 
sundayBDI or wedsBDI). 

• predictor is a variable that tells us to which group a score belongs (in this case drug). 

• dataFrame is the name of the dataframe containing the aforementioned variables. 

• paired = FALSE determines whether or not you want to do the Wilcoxon test 
on matched (in which case include paired = TRUE) or independent samples (in 
which case exclude the option because this is the default or include paired = 
FALSE). 

However, if you have the data for different groups stored in two columns, then the 
wilcox.test() function takes this form: 

newModel<-wilcox.test(scores group 1, scores group 2, paired = FALSE/TRUE) 

in which the options are the same as before except that: 

• scores group 1 is a variable that contains the scores for the first group. 

• scores group 2 is a variable that contains the scores for the second group. 

In both forms of the function, there are additional options that can be specified (but do 
not need to be). These are: 

• alternative = c("two.sided"/"less"/"greater"): This option determines whether you’re 
doing a two-tailed test, which is the default and happens if we don’t include this 
option. If you want to do a one-tailed test then you need to include the option 
alternative = "less" (if you predict that the difference between groups will be less than 
zero) or alternative = "greater" (if you predict that the difference between groups will 
be greater than zero). 

• mu = 0: A difference between groups of zero is the default null hypothesis, but can 
be changed. For example, including mu = 5 would test the null hypothesis that the 
difference between groups is 5. 

• exact: By default the function tries to do an exact test when the sample is small and 
there are no ties (as if you had included exact = TRUE). You can switch this 
option off by including exact = FALSE. 

• correct: By default the function does a continuity correction (correct = TRUE) whenever 
it uses the normal approximation, but if you don’t want one include correct = FALSE. 

• conf.level = 0.95: This determines the level of the confidence interval, which is 
reported only if you also include conf.int = TRUE. By default it is 0.95 (for 95% 
confidence intervals), but if you want to use a different value, say 99%, you could 
include conf.level = 0.99. 

• na.action: If you have complete data (as we have here) you can exclude this option, 
but if you have missing values (i.e., NAs in the dataframe) then it can be useful to 
use na.action = na.exclude, which will exclude all cases with missing values (see R’s 
Souls’ Tip 7.1). 

Therefore, to compute a basic Wilcoxon test for our Sunday data we could execute: 

sunModel<-wilcox.test(sundayBDI ~ drug, data = drugData) 
sunModel 

For the Wednesday data we need only change the name of the outcome variable: 

wedModel<-wilcox.test(wedsBDI ~ drug, data = drugData) 
wedModel 

These commands create models called sunModel and wedModel that predict Sunday and 
Wednesday depression levels from group membership (drug). We execute the name of 
the model to see the output (see footnote 7). Having left all of the default options as they 
are, R will calculate the p-value using the exact approach if each group has fewer than 50 
scores and there are no ties, or the normal approximation if a group has 50 or more scores 
or if there are any ties. When it uses the normal approximation it will also apply a continuity 
correction. To use a normal approximation rather than exact p, and to get rid of the 
continuity correction, we can include exact = FALSE and correct = FALSE 
respectively: 

sunModel<-wilcox.test(sundayBDI ~ drug, data = drugData, exact = FALSE, 
correct= FALSE) 

wedModel<-wilcox.test(wedsBDI ~ drug, data = drugData, exact = FALSE, 
correct= FALSE) 
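
To see what these options do without needing drugData, here is a minimal sketch using a small made-up data set. The dataframe toyData and its variables score and group are hypothetical and are not part of the book's data. There are no ties and only five scores per group, so the default call should give an exact p-value, whereas the second call forces the normal approximation without a continuity correction.

```r
# Hypothetical toy data (not the book's drugData): 5 scores per group, no ties
toyData <- data.frame(
  score = c(12, 15, 11, 18, 14, 22, 25, 19, 27, 21),
  group = factor(rep(c("A", "B"), each = 5))
)

# Default options: small sample and no ties, so an exact p-value is computed
exactModel <- wilcox.test(score ~ group, data = toyData)

# Normal approximation, without the continuity correction
approxModel <- wilcox.test(score ~ group, data = toyData,
                           exact = FALSE, correct = FALSE)

# Compare the two p-values; W itself is the same in both models
c(W = unname(exactModel$statistic),
  exact = exactModel$p.value,
  approx = approxModel$p.value)
```

The two p-values differ only because of how they are calculated; the test statistic does not depend on these options.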


15.4.5. Output from the Wilcoxon rank-sum test 


The output from the Wilcoxon tests is shown in Output 15.3 (Sunday) and Output 15.4 
(Wednesday). For the BDI score on Sunday, you will find the p-value is 0.286 with the 
continuity correction (the default — if you rerun the test without this correction, you’ll 
find the p-value is 0.269). We could say that the type of drug did not significantly affect 
depression levels the day after, W = 35.5, p = .286. 

Wilcoxon rank sum test with continuity correction 

data: sundayBDI by drug 

W = 35.5, p-value = 0.2861 

alternative hypothesis: true location shift is not equal to 0 

Output 15.3 

Wilcoxon rank sum test with continuity correction 

data: wedsBDI by drug 

W = 4, p-value = 0.000569 

alternative hypothesis: true location shift is not equal to 0 

Output 15.4 

For the Wednesday data, however, the type of drug did significantly affect depression 
levels, W = 4, p < .001. Note that because we left the default of an exact test, 
R gives us a warning message that it cannot do this, because of the ties. 


15.4.6. Calculating an effect size 


As we’ve seen throughout this book, it’s important to report effect sizes so that people have 
a standardized measure of the size of the effect you observed, which they can compare to 
other studies. R doesn’t calculate an effect size for us, but we can calculate approximate 
effect sizes fairly easily. First, we take the p-value. Recall that R used a normal approximation 
to calculate the p-value; it did this via calculating a z for the data. It doesn’t report, or 
store, the z-value, but we can recover it from the p-value using the qnorm() function. We 

7 We could run the commands without creating a model to get the output in a single command, but having the 
models is useful later for computing effect sizes. 

JANE SUPERBRAIN 15.1 

Doing it from scratch 

The Wilcoxon rank-sum test is a good example of something 
you could easily program yourself, if there wasn’t 
already a function in R. In addition, you can get some 
useful information, and learn a little about R. Here’s the 
code to do the Wilcoxon test: 

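# Sunday BDI scores for each group 
g1 <- drugData$sundayBDI[drugData$drug == "Alcohol"] 
g2 <- drugData$sundayBDI[drugData$drug == "Ecstasy"] 

# Group sizes 
n1 <- length(g1); n2 <- length(g2) 

# Rank all scores together, then split the ranks back into groups 
w <- rank(c(g1, g2)) 
r1 <- w[1:n1]; r2 <- w[(n1+1):(n1+n2)] 

# Sum of ranks per group, corrected for group size 
w1 <- sum(r1); w2 <- sum(r2) 
wilc1 <- w1-n1*(n1+1)/2; wilc2 <- w2-n2*(n2+1)/2 
wilc <- min(wilc1, wilc2) 
wilc 

# Mean rank per group 
m1 <- mean(r1); m2 <- mean(r2) 
m1; m2 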

First, we create g1 and g2. These are the BDI scores 
on Sunday for the alcohol group (g1) and ecstasy group 
(g2). We count the number of people in each group, 
using the length() function, and call these values n1 
and n2. 

Then we put g1 and g2 back together, using c(g1, 
g2), into one long variable, which we convert to ranks, 
with the rank() function. 

We get the ranks out again, and put these into r1 and 
r2. The ranks for group 1 are elements 1 to the 
number of people in group 1 (10) of w. The ranks for group 2 
are the elements from the number in group 1, plus 1 (11), to the 
number in both groups (20). We find the sums of these ranks, with 
the sum() function, and correct for the number of people; 
we call these wilc1 and wilc2. The Wilcoxon W is the 
smaller of these two. In addition, we calculate the mean 
rank for each group, which can be a useful descriptive 
statistic. These are given as m1 and m2. 
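
As a rough check on the logic in this box, the sketch below repeats the calculation on two short made-up vectors (the scores are hypothetical) and compares the result with wilcox.test(). One thing worth noticing is that the W reported by wilcox.test() is the corrected rank sum for the first group you give it, which is not necessarily the smaller of the two values.

```r
# Hypothetical scores for two small groups (no ties)
g1 <- c(12, 15, 11, 18, 14)
g2 <- c(22, 13, 19, 27, 21)
n1 <- length(g1); n2 <- length(g2)

# Rank everything together, then split the ranks by group
w <- rank(c(g1, g2))
r1 <- w[1:n1]; r2 <- w[(n1 + 1):(n1 + n2)]

# Rank sums corrected for group size
wilc1 <- sum(r1) - n1 * (n1 + 1) / 2
wilc2 <- sum(r2) - n2 * (n2 + 1) / 2

# The built-in function reports the value for the first group (g1)
builtIn <- unname(wilcox.test(g1, g2)$statistic)
c(wilc1 = wilc1, wilc2 = wilc2, builtIn = builtIn)
```

The two corrected rank sums always add up to n1 multiplied by n2, so knowing one of them (and the group sizes) tells you the other.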


can then convert the z-value into an effect size estimate. The equation to convert a z-score 
into the effect size estimate, r, is as follows (from Rosenthal, 1991, p. 19): 

$$ r = \frac{z}{\sqrt{N}} $$ 

in which z is the z-score and N is the size of the study (i.e., the number of total observations) 
on which z is based. 

We can write ourselves a function (or access the function directly from our DSUR 
package - see section 3.4.5) to get the effect size from the models we created earlier. The 
function looks like this: 

rFromWilcox<-function(wilcoxModel, N){ 
	z<- qnorm(wilcoxModel$p.value/2) 
	r<- z/ sqrt(N) 
	cat(wilcoxModel$data.name, "Effect Size, r = ", r) 
} 

Executing these commands creates a function called rFromWilcox(), which takes a model 
computed using wilcox.test() and the total sample size (N) as input. The first command 
within the function calculates the value of z using the qnorm() function. The p-value for 
a wilcox.test() model is stored in an object with the name p.value, so we can refer to it 
directly by appending $p.value to the name of the model. Therefore, the command takes 
the p-value associated with the model entered into the function, divides it by 2 so that 
we’re looking at only one end of the normal distribution, and then applies qnorm to it 
(which gives us the z associated with that value of p). The second command computes r 
using the equation above by dividing z (which we’ve just computed) by the square root 
of N (which we know because it is entered into the function). The final command prints 
to the console the object data.name from the original model (this tells us what the model 
represents) then a text string that tells us what the output shows, and then the value of r 
computed in the previous command. 

For the current example, we could apply this function by executing: 

rFromWilcox(sunModel, 20) 
rFromWilcox(wedModel, 20) 

In both cases we enter the model name and the total sample size. The resulting output 
shows us that the r values are -0.25 for Sunday and -0.78 for Wednesday: 

sundayBDI by drug Effect Size, r = -0.2470529 

wedsBDI by drug Effect Size, r = -0.7790076 

This represents a small to medium effect for the Sunday data (it is below the .3 criterion 
for a medium effect size) and a huge effect for the Wednesday data (the effect size is well 
above the .5 threshold for a large effect). The Sunday data show how a moderately large 
effect size can still be non-significant in a small sample. 
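
The arithmetic inside rFromWilcox() can be checked by hand. The sketch below assumes the uncorrected Sunday p-value of .269 quoted in section 15.4.5 and N = 20; the object names are only for illustration. Because qnorm() is applied to p/2, which is never more than .5, the z (and therefore r) that comes out is never positive, so it is the size of r that carries the information and not its sign.

```r
# Two-tailed p-value quoted in the text for the Sunday data (no continuity correction)
p <- 0.269
N <- 20               # total number of observations

z <- qnorm(p / 2)     # z cutting off p/2 in the lower tail of the normal distribution
r <- z / sqrt(N)      # Rosenthal's conversion from z to r

c(z = z, r = r)

# Going back the other way recovers the p-value we started with
2 * pnorm(z)
```

The value of r should be close to the -0.25 reported above; any small discrepancy comes from rounding the p-value to three decimal places.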


15.4.7. Writing the results 


For the Wilcoxon rank-sum test, we need to report only the test statistic (which is denoted 
by W) and its significance. Of course, we really ought to include the effect size as well. So, 
we could report something like: 

Depression levels in ecstasy users (Mdn = 17.50) did not differ significantly from 
alcohol users (Mdn = 16.00) the day after the drugs were taken, W = 35.5, p = .286, 
r = -.25. However, by Wednesday, ecstasy users (Mdn = 33.50) were significantly 
more depressed than alcohol users (Mdn = 7.50), W = 4, p < .001, r = -.78. 

Note that I’ve reported the median for each condition - this statistic is more appropriate 
than the mean for non-parametric tests. 



CRAMMING SAM’S TIPS 


Some important terms 


• The Wilcoxon rank-sum test compares two conditions when different participants take part in each condition and the resulting 
data violate any assumption of the independent t-test. 

• Look at the p-value. If the value is less than .05 then the two groups are significantly different. 

• You might want to calculate the mean rank. 

• Report the W-statistic and the significance value. Also report the medians and their corresponding ranges (or draw a 
boxplot). 

• You should calculate the effect size and report this too. 

JANE SUPERBRAIN 15.2 

Non-parametric tests and statistical power 

Ranking the data is a useful way around the distributional 
assumptions of parametric tests, but there is a 
price to pay: by ranking the data we lose some information 
about the magnitude of differences between scores. 
Consequently, non-parametric tests can be less powerful 
than their parametric counterparts. Statistical power 
(section 2.6.5) refers to the ability of a test to find an 
effect that genuinely exists. So, by saying that non-parametric 
tests are less powerful, we mean that if there is a 
genuine effect in our data then a parametric test is more 
likely to detect it than a non-parametric one. However, 
this statement is true only if the assumptions of the parametric 
test are met. So, if we use a parametric test and 
a non-parametric test on the same data, and those data 
meet the appropriate assumptions, then the parametric 
test will have greater power to detect the effect than the 
non-parametric test. 

The problem is that to define the power of a test we 
need to be sure that it controls the Type I error rate (the 
number of times a test will find a significant effect when 
in reality there is no effect to find - see section 2.6.2). We 
saw in Chapter 2 that this error rate is normally set at 5%. 
We know that when the sampling distribution is normally 
distributed then the Type I error rate of tests based on 
this distribution is indeed 5%, and so we can work out 
the power. However, when data are not normal the Type I 
error rate of tests based on this distribution won’t be 5% 
(in fact we don't know what it is for sure as it will depend 
on the shape of the distribution) and so we have no way 
of calculating power (because power is linked to the Type 
I error rate - see section 2.6.5). So, although you often 
hear (in the first edition of my SPSS book, for example!) 
of non-parametric tests having an increased chance of 
a Type II error (i.e., more chance of accepting that there 
is no difference between groups when, in reality, a difference 
exists), this is true only if the sampling distribution is 
normally distributed. 
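
The claim about power when assumptions are met can be explored with a small simulation. The sketch below is only an illustration under assumptions that I have chosen: normally distributed scores, 20 people per group, a true difference of one standard deviation, and 500 simulated experiments. The names nSim, shift and oneRun are hypothetical. For each simulated experiment it records whether a t-test and a Wilcoxon rank-sum test were significant at .05; the proportion of significant results estimates the power of each test.

```r
set.seed(42)          # make the simulation reproducible

nSim  <- 500          # number of simulated experiments (assumed)
n     <- 20           # participants per group (assumed)
shift <- 1            # true difference between groups, in SD units (assumed)

# One simulated experiment: is each test significant at the .05 level?
oneRun <- function() {
  x <- rnorm(n)
  y <- rnorm(n, mean = shift)
  c(t = t.test(x, y)$p.value < .05,
    wilcoxon = wilcox.test(x, y)$p.value < .05)
}

# Proportion of significant results = estimated power of each test
power <- rowMeans(replicate(nSim, oneRun()))
power
```

Replacing rnorm() with a skewed distribution such as rexp() is one way to see how the comparison can change when the normality assumption is not met.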


15.5. Comparing two related conditions: the Wilcoxon signed-rank test 


The Wilcoxon signed-rank test (Wilcoxon, 1945), not to be confused with the rank-sum 
test in the previous section, is used in situations in which there are two sets of scores to 
compare, but these scores come from the same participants. As such, think of it as the 
non-parametric equivalent of the dependent t-test. 

Imagine the experimenter in the previous section was now interested in the change in 
depression levels, within people, for each of the two drugs. We now want to compare 
the BDI scores on Sunday to those on Wednesday. When testing the differences between 
related scores, we assume normality of the differences (see Chapter 9). Let’s first test this 
assumption. 



SELF-TEST 

Compute the change in BDI scores from Sunday to 
Wednesday and then compute normality tests for 
this change score separately for the alcohol and 
ecstasy groups. 

Output 15.5 shows the results of a descriptive analysis. For the alcohol group we have a 
non-normal distribution, W = 0.83, p < .05, in the change scores, but for the ecstasy group 
the difference scores are approximately normal, W = 0.91, p = .273. Therefore, we need 
to use a non-parametric test for the alcohol group (although we’ll do one for the ecstasy 
group too, just to get some practice). 

drugData$drug: Alcohol 
    median       mean    SE.mean    CI.mean        var    std.dev   coef.var 
    -7.500     -6.300      2.098      4.746     44.011      6.634     -1.053 
  skewness   skew.2SE   kurtosis   kurt.2SE normtest.W normtest.p 
     1.239      0.902      0.987      0.370      0.828      0.032 

drugData$drug: Ecstasy 
    median       mean    SE.mean    CI.mean        var    std.dev   coef.var 
    14.000     12.400      2.531      5.724     64.044      8.002      0.645 
  skewness   skew.2SE   kurtosis   kurt.2SE normtest.W normtest.p 
    -0.414     -0.301     -1.369     -0.513      0.909      0.273 

Output 15.5 
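
The normality check on the change scores can be sketched with base R alone. The example below uses randomly generated toy data (toyData and BDIchange are hypothetical names, and the scores are not the book's), computes the change score as Wednesday minus Sunday, and then runs shapiro.test() separately for each drug using by(). The Shapiro-Wilk statistic it returns corresponds to the normtest.W and normtest.p values shown in the output above.

```r
set.seed(1)  # reproducible toy data; these are NOT the book's scores

toyData <- data.frame(
  drug      = factor(rep(c("Alcohol", "Ecstasy"), each = 10)),
  sundayBDI = round(rnorm(20, mean = 18, sd = 4)),
  wedsBDI   = round(rnorm(20, mean = 20, sd = 8))
)

# Change in BDI from Sunday to Wednesday
toyData$BDIchange <- toyData$wedsBDI - toyData$sundayBDI

# Shapiro-Wilk test on the change scores, separately for each drug
normTests <- by(toyData$BDIchange, toyData$drug, shapiro.test)
normTests
```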


15.5.1. Theory of the Wilcoxon signed-rank test 



The Wilcoxon signed-rank test works in a fairly similar way to the dependent t-test 
(Chapter 9) in that it is based on the differences between scores in the two conditions 
you’re comparing. Once these differences have been calculated they are ranked (just like in 
section 15.4.1) but the sign of the difference (positive or negative) is assigned to the rank. 
If we use the same data as before, we can compare depression scores on Sunday to those 
on Wednesday for the two drugs separately. 

Table 15.2 shows the ranking for these data. Remember that we’re ranking the two 
drugs separately. First, we calculate the difference between Sunday and Wednesday (that’s 
just Sunday’s score subtracted from Wednesday’s). If the difference is zero (i.e., the scores 
are the same on Sunday and Wednesday) then we exclude these data from the ranking. 
We make a note of the sign of the difference (positive or negative) and then rank the differences 
(starting with the smallest) ignoring whether they are positive or negative. The 
ranking is the same as in section 15.4.1, and we deal with tied scores in exactly the same 
way. Finally, we collect together the ranks that came from a positive difference between 
the conditions, and add them up to get the sum of positive ranks (T+). We also add up the 
ranks that came from negative differences between the conditions to get the sum of negative 
ranks (T−). So, for ecstasy, T+ = 36 and T− = 0 (in fact there were no negative ranks), 
and for alcohol, T+ = 8 and T− = 47. The test statistic, T, is the smaller of the two values, 
and so is 0 for ecstasy and 8 for alcohol. 
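
The ranking procedure just described can be sketched in a few lines of R. The scores below are made up for illustration (six hypothetical people, one of whom does not change), and the names tPlus and tMinus are my own.

```r
# Hypothetical scores for six people measured twice
sunday    <- c(10, 14,  9, 20, 16, 12)
wednesday <- c(13, 14, 15, 18, 21, 11)

d <- wednesday - sunday   # differences: 3, 0, 6, -2, 5, -1
d <- d[d != 0]            # exclude differences of zero

# Rank the differences ignoring their sign (ties would get average ranks)
r <- rank(abs(d))

tPlus  <- sum(r[d > 0])   # sum of ranks from positive differences
tMinus <- sum(r[d < 0])   # sum of ranks from negative differences

c(tPlus = tPlus, tMinus = tMinus, T = min(tPlus, tMinus))
```

Note that the two sums always add up to n(n + 1)/2, where n is the number of non-zero differences (here 5, giving 15).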

To calculate the significance of the test statistic (T), we again look at the mean ($\bar{T}$) and 
standard error ($SE_{\bar{T}}$), which, like the rank-sum test in the previous section, are functions 
of the sample size, n (because we used the same participants, there is only one sample size): 

$$ \bar{T} = \frac{n(n + 1)}{4} $$ 

$$ SE_{\bar{T}} = \sqrt{\frac{n(n + 1)(2n + 1)}{24}} $$ 

Table 15.2 Ranking data in the Wilcoxon signed-rank test 



In both groups, n is simply 10 (because that’s how many participants were used). 
However, remember that for our ecstasy group we excluded two people because they had 
differences of zero, therefore the sample size we use is 8, not 10. This gives us: 


$$ \bar{T}_{\text{Ecstasy}} = \frac{8(8 + 1)}{4} = 18 $$ 

$$ SE_{\bar{T}_{\text{Ecstasy}}} = \sqrt{\frac{8(8 + 1)(16 + 1)}{24}} = 7.14 $$ 


For the alcohol group there were no exclusions so we get: 

$$ \bar{T}_{\text{Alcohol}} = \frac{10(10 + 1)}{4} = 27.50 $$ 

$$ SE_{\bar{T}_{\text{Alcohol}}} = \sqrt{\frac{10(10 + 1)(20 + 1)}{24}} = 9.81 $$ 

As before, if we know the test statistic, the mean of test statistics and the standard error, 
then we can easily convert the test statistic to a z-score using the equation that we came 
across way back in Chapter 1: 

$$ z = \frac{X - \bar{X}}{s} = \frac{T - \bar{T}}{SE_{\bar{T}}} $$ 

If we calculate this value for the ecstasy and alcohol depression scores we get: 

$$ z_{\text{Ecstasy}} = \frac{T - \bar{T}}{SE_{\bar{T}}} = \frac{0 - 18}{7.14} = -2.52 $$ 

$$ z_{\text{Alcohol}} = \frac{T - \bar{T}}{SE_{\bar{T}}} = \frac{8 - 27.5}{9.81} = -1.99 $$ 

If these values are bigger than 1.96 (ignoring the minus sign) then the test is significant at 
p < .05. So, it looks as though there is a significant difference between depression scores 
on Wednesday and Sunday for both ecstasy and alcohol. 
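
These hand calculations are easy to reproduce in R. The sketch below wraps the mean, standard error and z formulas from this section in a small helper; signedRankZ() is a hypothetical name of my own, not a function from any package. It is then applied to the values from the text (T = 0 with n = 8 for ecstasy, T = 8 with n = 10 for alcohol).

```r
# Hypothetical helper: z-score for a signed-rank statistic
# tStat = the test statistic T, n = number of non-zero differences
signedRankZ <- function(tStat, n) {
  meanT <- n * (n + 1) / 4                         # mean of T
  seT   <- sqrt(n * (n + 1) * (2 * n + 1) / 24)    # standard error of T
  (tStat - meanT) / seT
}

zEcstasy <- signedRankZ(tStat = 0, n = 8)
zAlcohol <- signedRankZ(tStat = 8, n = 10)
round(c(ecstasy = zEcstasy, alcohol = zAlcohol), 2)

# Approximate two-tailed p-values from the normal distribution
2 * pnorm(-abs(c(ecstasy = zEcstasy, alcohol = zAlcohol)))
```

The p-values obtained this way will not match those from wilcox.test() exactly, because this simple formula makes no adjustment for tied ranks.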


15.5.2. Running the analysis with R Commander 


To do the same analysis using R Commander we can use the same dataframe as before, but 
because we want to look at the change for each drug separately, we need to use the subset 
command and ask R to split the file by the variable drug. This process ensures that any 
subsequent analysis is done for the ecstasy group and the alcohol group separately. To do 
this in R Commander, select Data=>Active data set=>Subset active data set... to open the 
dialog box shown in Figure 15.5. We want to keep all of the variables, so we leave the box 
at the top checked. 

For the Subset expression, remember that you need to use a double equals sign for ‘is 
equal to’ in R, so we write drug == "Alcohol". Finally, we give the data set a new name; 
we’ll call it alcoholData. Click on OK to create the dataframe. Repeat the process to 
create a dataframe called ecstasyData that contains only the data from the ecstasy group. 

Now we have the data prepared, we can use R Commander to run the Wilcoxon 
signed-rank test. Make sure you have the alcoholData data set as the active data set in 
R Commander, and select Statistics => Nonparametric tests => Paired-samples Wilcoxon 
test... to open the dialog box in Figure 15.6. 


FIGURE 15.5 R Commander menu and dialog box for subset 

FIGURE 15.6 Dialog box for the Wilcoxon signed-rank test 


Pick the two variables that you would like to compare: in our case, there are only two 
to select from, so choose one in the left-hand box, and the other in the right-hand box (it 
doesn’t matter which way around you do it). 

You can choose a p-value calculation method. As with the rank-sum test (section 15.4.3), 
you can leave this as the default, in which case R will use the exact method to calculate 
the p-value if your sample size is less than 50 and there are no ties, and the normal 
approximation otherwise. You can choose the exact method, which is often better than the 
normal approximation, but cannot be used if you have ties in the data, and can be slow if 
your sample size is large. If you choose the normal approximation method, you can do this 
with or without the continuity correction. If you choose the default, the continuity correction 
will be used. We will select the normal approximation (we have ties, so we cannot use the 
exact method anyway) and we will choose not to use the continuity correction. 

You can leave the default option of a two-sided test as it is (although if you have predicted 
a direction of the effect you could choose to test whether the difference will 
be bigger (Difference > 0) or smaller (Difference < 0) than zero). When you have selected 
your variables, click on OK to run the analysis. The output will be discussed very soon. 


15.5.3. Running the analysis using R 


We want to run our analysis on the alcohol and ecstasy groups separately; therefore, our 
first job is to split the dataframe into two. 



SELF-TEST 

Use the subset() function to create separate 
dataframes for the different drugs called alcoholData 
and ecstasyData. 


If you completed the self-test then you should now have two dataframes called 
alcoholData and ecstasyData. We can again use the wilcox.test() function, but this time because 
our data are stored in different columns (wedsBDI and sundayBDI) we need to enter the 
names of the two variables we want to compare rather than a formula, and we need to 
include the option paired = TRUE to tell R that the data are paired. (If we don’t include 
this option R will do a Wilcoxon rank-sum test.) For these examples, we’re also going 
to include the option correct = FALSE, because we do not want a continuity correction. 
Therefore, to run the analysis for the alcohol group execute: 

alcoholModel<-wilcox.test(alcoholData$wedsBDI, alcoholData$sundayBDI, paired = 
TRUE, correct = FALSE) 
alcoholModel 

and for the ecstasy group: 

ecstasyModel<-wilcox.test(ecstasyData$wedsBDI, ecstasyData$sundayBDI, paired = 
TRUE, correct = FALSE) 
ecstasyModel 

In both cases we create a model (alcoholModel and ecstasyModel) based on a Wilcoxon test 
between the sundayBDI and wedsBDI variables. 
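
Because the two variables are entered separately, it is natural to wonder whether the order matters. The sketch below uses made-up paired scores (hypothetical data, with no ties and no zero differences, so R can do an exact test) and runs the test both ways round. The statistic V changes, because it is the sum of the positive ranks, but the p-value does not.

```r
# Hypothetical paired scores for seven people
sunday    <- c(10, 14,  9, 20, 16, 12,  8)
wednesday <- c(13, 19, 15, 18, 23, 11, 18)

# Wednesday minus Sunday
wedFirst <- wilcox.test(wednesday, sunday, paired = TRUE)

# Sunday minus Wednesday: the signs of all the differences flip
sunFirst <- wilcox.test(sunday, wednesday, paired = TRUE)

c(V.wedFirst = unname(wedFirst$statistic),
  V.sunFirst = unname(sunFirst$statistic))
c(p.wedFirst = wedFirst$p.value, p.sunFirst = sunFirst$p.value)
```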


15.5.4. Wilcoxon signed-rank test output 



Output 15.6 shows the output for the alcohol group. You will get a warning when 
the function runs, telling you that it couldn’t do an exact test, because there are 
ties. We didn’t ask for an exact test, but when the sample size is less than 50 R tries 
to do one anyway. It reports the value of T+ (which it calls V; see footnote 8) and that 
this value is significant at p = .047. Therefore, we should conclude (based on the medians) 
that when taking alcohol there was a significant decline in depression (as measured by 
the BDI) from the morning after to midweek (p = .047). 

Output 15.7 shows the results for the ecstasy group. We should conclude that 
when taking ecstasy there was a significant increase in depression (as measured by 
the BDI) from the morning after to midweek (p = .012). 

Wilcoxon signed rank test 
data: alcoholData$wedsBDI and alcoholData$sundayBDI 

V = 8, p-value = 0.04657 

alternative hypothesis: true location shift is not equal to 0 

Output 15.6 

Wilcoxon signed rank test 

data: ecstasyData$wedsBDI and ecstasyData$sundayBDI 

V = 36, p-value = 0.01151 

alternative hypothesis: true location shift is not equal to 0 

Output 15.7 

From the results of the two different groups, we can see that there is an opposite effect 
when alcohol is taken to that when ecstasy is taken. Alcohol makes you slightly depressed 
the morning after, but this depression has dropped by midweek. Ecstasy also causes some 
depression the morning after consumption; however, this depression increases towards the 


8 The order in which you input variables into the function will affect the value of V because it is T+ and whether 
ranks are positive or negative depends on which way around you subtract scores. In Table 15.2 we subtracted 
Sunday BDI scores from those on the Wednesday and T+ = 36 and T− = 0 for the ecstasy group, but if we had 
subtracted Wednesday scores from Sunday scores these would have been the opposite way around (T+ = 0 and T− = 
36). So the order that you put variables into the functions affects the value of T+ (and, therefore, V). You don’t 
need to worry about this little quirk: the p-value will be the same whichever way around you specify the variables. 

middle of the week. Of course, to see the true effect of the morning after we would have 
had to take measures of depression before the drugs were administered. This opposite 
effect between groups of people is known as an interaction (i.e., you get one effect under 
certain circumstances and a different effect under other circumstances) and we came across 
this in Chapters 12-14. 


15.5.5. Calculating an effect size 


The effect size can be calculated in the same way as for the Wilcoxon rank-sum test (see the 
equation in section 15.4.6); therefore, we can reuse the rFromWilcox() function. In both 
the alcohol and ecstasy groups we had 20 observations (although we only used 10 people 
and tested them twice, it is the number of observations, not the number of people, that is 
important here). Therefore, we can get the effect sizes by inputting the model names and 
the number of observations into the function: 

rFromWilcox(alcoholModel, 20) 
rFromWilcox(ecstasyModel, 20) 

The resulting output is: 

alcoholData$wedsBDI and alcoholData$sundayBDI Effect Size, r = -0.4450246 

ecstasyData$wedsBDI and ecstasyData$sundayBDI Effect Size, r = -0.5649883 

For the alcohol group we find a medium to large change in depression when alcohol 
is taken, r = -.45, which is between Cohen’s criteria of .3 and .5 for a medium and large 
effect, respectively. For the ecstasy group, r = -.56, which represents a large change in 
levels of depression when ecstasy is taken (it is above Cohen’s benchmark of .5). 


15.5.6. Writing the results 


For the Wilcoxon test, we need only report the significance of the test and preferably an 
effect size. So, we could report something like: 

For ecstasy users, depression levels were significantly higher on Wednesday (Mdn = 
33.50) than on Sunday (Mdn = 17.50), p = .012, r = -.56. However, for alcohol users 
the opposite was true: depression levels were significantly lower on Wednesday (Mdn 
= 7.50) than on Sunday (Mdn = 16.00), p = .047, r = -.45. 



CRAMMING SAM’S TIPS 


The Wilcoxon signed-rank test 


• The Wilcoxon signed-rank test compares two conditions when the same participants take part in each condition and the 
resulting data violate an assumption of the dependent t-test. 

• Look at the p-value. If the value is less than .05 then the two groups are significantly different. 

• Report the significance value of the test and an effect size if possible. Also report the medians and their corresponding 
ranges (or draw a boxplot). 

Labcoat Leni’s Real Research 15.1 

Having a quail of a time? 


Matthews, R. C., et al. (2007). Psychological Science, 18(9), 758-762. 


We encountered some research in Chapter 2 in which we discovered that you can influence aspects of male 
quail sperm production through ‘conditioning’. The basic idea is that the male is granted access to a female for 
copulation in a certain chamber (e.g., one that is coloured green) but gains no access to a female in a different 
context (e.g., a chamber with a tilted floor). The male, therefore, learns that when he is in the green chamber 
his luck is in, but if the floor is tilted then frustration awaits. For other males the chambers will be reversed (i.e., 
they get sex only when in the chamber with the tilted floor). The human equivalent (well, sort of) would be if you 
always managed to pull in the Pussycat Club but never in the Honey Club (see footnote 9). During the test phase, males get to 
mate in both chambers. The question is: after the males have learnt that they will get a mating opportunity in a 
certain context, do they produce more sperm or better-quality sperm when mating in that context compared to the 
control context? (That is, are you more of a stud in the Pussycat Club? OK, I’m going to stop this analogy now.) 

Mike Domjan and his colleagues predicted that if conditioning evolved because it increases reproductive fitness 
then males who mated in the context that had previously signalled a mating opportunity would fertilize a significantly 
greater number of eggs than quails that mated in their control context (Matthews, Domjan, Ramsey, & 
Crews, 2007). They put this hypothesis to the test in an experiment that is utter genius. After training, they allowed 
14 females to copulate with two males (counterbalanced): one male copulated with the female in the chamber 
that had previously signalled a reproductive opportunity (Signalled), whereas the second male copulated with 
the same female but in the chamber that had not previously signalled a mating opportunity (Control). Eggs were 
collected from the females for 10 days after the mating and a genetic analysis was used to determine the father 
of any fertilized eggs. 

The data from this study are in the file Matthews et al. (2007).dat. Labcoat Leni wants you to carry out a 

a Wilcoxon signed-rank test to see whether more eggs were fertilized by males mating in their signalled 
context compared to males in their control context. 

Answers are in the additional material on the companion website (or look at page 760 in the original 
article). 


15.6. Differences between several independent groups: the Kruskal-Wallis test


In Chapter 10 we discovered a technique called one-way independent ANOVA that could
be used to test for differences between several independent groups. I mentioned several
times in that chapter that the F-statistic can be robust to violations of its assumptions
(section 10.3). We also saw that there are measures that can be taken when you have
heterogeneity of variance (Jane Superbrain Box 10.2). However, there is another alternative: the
one-way independent ANOVA has a non-parametric counterpart called the Kruskal-Wallis
test (Kruskal & Wallis, 1952). If you have data that have violated an assumption then
this test can be a useful way around the problem. If you’d like to know a bit more about
William Kruskal (Figure 15.7) then there is a lovely biography by Fienberg, Stigler, and
Tanur (2007).

FIGURE 15.7 William Kruskal

I read a story in a newspaper claiming that scientists had discovered that the chemical
genistein, which occurs naturally in soya, was linked to lowered sperm counts in Western
males. In fact, when you read the actual study, it had been conducted on rats, it found no
link to lowered sperm counts, but there was evidence of abnormal sexual development
in male rats (probably because this chemical acts like oestrogen). The journalist naturally
interpreted this as a clear link to apparently declining sperm counts in Western males
(never trust what you read in the newspapers). Anyway, as a vegetarian who eats lots of
soya products and probably would like to have kids one day, I might want to test this idea
in humans rather than rats. I took 80 males and split them into four groups that varied in
the number of soya meals they ate per week over a year-long period. The first group was
a control group and had no soya meals at all per week (i.e., none in the whole year); the
second group had one soya meal per week (that’s 52 over the year); the third group had
four soya meals per week (that’s 208 over the year); and the final group had seven soya
meals a week (that’s 364 over the year). At the end of the year, all of the participants were
sent away to produce some sperm that I could count (when I say ‘I’, I mean someone else
in a laboratory as far away from me as humanly possible). 10

9 These are both clubs in Brighton that I’ve never been to because I don’t like that sort of thing.


10 In case any medics are reading this chapter, these data are made up and, because I have absolutely no idea what
a typical sperm count is, they’re probably ridiculous. I apologize and you can laugh at my ignorance.


15.6.1. Theory of the Kruskal-Wallis test


The theory for the Kruskal-Wallis test is very similar to that of the Mann-Whitney (and
Wilcoxon rank-sum) test, so before reading on look back at section 15.4.1. Like the
Wilcoxon rank-sum test, the Kruskal-Wallis test is based on ranked data. So, to begin with,
you simply order the scores from lowest to highest, ignoring the group to which the score
belongs, and then assign the lowest score a rank of 1, the next highest a rank of 2 and so
on (see section 15.4.1 for more detail). When you’ve ranked the data you collect the scores
back into their groups and simply add up the ranks for each group. The sum of ranks for
each group is denoted by R_i (where i is used to denote the particular group). Table 15.3
shows the raw data for this example along with the ranks.

Table 15.3 Data for the soya example with ranks

             No Soya            1 Soya Meal        4 Soya Meals       7 Soya Meals
                  Sperm              Sperm              Sperm              Sperm
             (millions)  Rank   (millions)  Rank   (millions)  Rank   (millions)  Rank
                   0.35     4         0.33     3         0.40     6         0.31     1
                   0.58     9         0.36     5         0.60    10         0.32     2
                   0.88    17         0.63    11         0.96    19         0.56     7
                   0.92    18         0.64    12         1.20    21         0.57     8
                   1.22    22         0.77    14         1.31    24         0.71    13
                   1.51    30         1.53    32         1.35    27         0.81    15
                   1.52    31         1.62    34         1.68    35         0.87    16
                   1.57    33         1.71    36         1.83    37         1.18    20
                   2.43    41         1.94    38         2.10    40         1.25    23
                   2.79    46         2.48    42         2.93    48         1.33    25
                   3.40    55         2.71    44         2.96    49         1.34    26
                   4.52    59         4.12    57         3.00    50         1.49    28
                   4.72    60         5.65    61         3.09    52         1.50    29
                   6.90    65         6.76    64         3.36    54         2.09    39
                   7.58    68         7.08    66         4.34    58         2.70    43
                   7.78    69         7.26    67         5.81    62         2.75    45
                   9.62    72         7.92    70         5.94    63         2.83    47
                  10.05    73         8.04    71        10.16    74         3.07    51
                  10.32    75        12.10    77        10.98    76         3.28    53
                  21.08    80        18.47    79        18.21    78         4.11    56
Total (R_i)               927                883                883                547



SELF-TEST

Have a go at ranking the data and see if you get the
same results as me.


Once the sum of ranks has been calculated for each group, the test statistic, H, is
calculated as:

$$ H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1) $$   (15.1)
In this equation, R_i is the sum of ranks for each group, N is the total sample size (in this
case 80) and n_i is the sample size of a particular group (in this case we have equal sample
sizes and they are all 20). Therefore, all we really need to do for each group is square the
sum of ranks and divide this value by the sample size for that group. We then add up these
values. That deals with the middle part of the equation; the rest of it involves calculating
various values based on the total sample size. For these data we get:


$$
\begin{aligned}
H &= \frac{12}{80(81)}\left(\frac{927^2}{20} + \frac{883^2}{20} + \frac{883^2}{20} + \frac{547^2}{20}\right) - 3(81) \\
  &= \frac{12}{6480}(42966.45 + 38984.45 + 38984.45 + 14960.45) - 243 \\
  &= \frac{12}{6480}(135895.8) - 243 \\
  &= 251.66 - 243 \\
  &= 8.659
\end{aligned}
$$

This test statistic has a special kind of distribution known as the chi-square distribution (see
Chapter 18) and for this distribution there is one value for the degrees of freedom, which
is one less than the number of groups (k - 1), in this case 3.
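The arithmetic above is easy to check in R. What follows is only a minimal sketch: the rank sums are taken from Table 15.3, while the object names (rankSums, groupN and so on) are made up for illustration and are not used elsewhere in the chapter. The last step uses pchisq() to get the probability of a chi-square value at least as large as H with k - 1 degrees of freedom.

```r
# Rank sums (R_i) from Table 15.3 and the group sizes (n_i)
rankSums <- c(927, 883, 883, 547)
groupN   <- c(20, 20, 20, 20)
N        <- sum(groupN)               # total sample size (80)

# Equation (15.1): H = 12 / (N(N + 1)) * sum(R_i^2 / n_i) - 3(N + 1)
H <- (12 / (N * (N + 1))) * sum(rankSums^2 / groupN) - 3 * (N + 1)

# H is referred to a chi-square distribution with k - 1 degrees of freedom
df <- length(rankSums) - 1
p  <- pchisq(H, df = df, lower.tail = FALSE)

round(H, 3)   # test statistic
round(p, 3)   # upper-tail probability
```

Working from the unrounded fraction 12/6480 rather than a rounded coefficient is what keeps the hand calculation and the computed value in agreement.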



15.6.2. Inputting data and provisional analysis


SELF-TEST

See whether you can enter the data in Table 15.3 into
R (you don't need to enter the ranks). Then conduct
some exploratory analyses on the data (see sections
5.6 and 5.7).



When the data are collected using different participants in each group, we input the data 
using a coding variable. So, the data editor will have two columns of data. The first column 
is a factor (called something like Soya), which, in this case, will have four levels. We can 
create this variable using the gl() function by executing: 

Soya<-gl(4, 20, labels = c("No Soya", "1 Soya Meal", "4 Soya Meals", "7 Soya Meals"))


This command creates a variable called Soya, which contains four blocks of 20 rows of 
data; the first block will be labelled No Soya, the second block 1 Soya Meal, and so on. The 
second variable will have values for the dependent variable (sperm count) measured at the 
end of the year (call this variable Sperm) - see the online materials for a fuller description. 
Finally, we can tie these variables together in a dataframe called soyaData by executing: 

soyaData<-data.frame(Sperm, Soya) 
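For readers who want to see the whole data-entry step in one place, here is one way it might look. Treat it as a sketch: the scores are typed in from Table 15.3 (so they carry only the two decimal places printed there), and rankTotals is an illustrative name of my own. The final lines rank all 80 scores together and add the ranks up within each group, which is a quick way to check your typing against the totals in the table.

```r
# Sperm counts (millions) typed in group by group from Table 15.3
Sperm <- c(
  # No Soya
  0.35, 0.58, 0.88, 0.92, 1.22, 1.51, 1.52, 1.57, 2.43, 2.79,
  3.40, 4.52, 4.72, 6.90, 7.58, 7.78, 9.62, 10.05, 10.32, 21.08,
  # 1 Soya Meal
  0.33, 0.36, 0.63, 0.64, 0.77, 1.53, 1.62, 1.71, 1.94, 2.48,
  2.71, 4.12, 5.65, 6.76, 7.08, 7.26, 7.92, 8.04, 12.10, 18.47,
  # 4 Soya Meals
  0.40, 0.60, 0.96, 1.20, 1.31, 1.35, 1.68, 1.83, 2.10, 2.93,
  2.96, 3.00, 3.09, 3.36, 4.34, 5.81, 5.94, 10.16, 10.98, 18.21,
  # 7 Soya Meals
  0.31, 0.32, 0.56, 0.57, 0.71, 0.81, 0.87, 1.18, 1.25, 1.33,
  1.34, 1.49, 1.50, 2.09, 2.70, 2.75, 2.83, 3.07, 3.28, 4.11
)

# Four blocks of 20 rows, in the same order as the scores above
Soya <- gl(4, 20, labels = c("No Soya", "1 Soya Meal", "4 Soya Meals", "7 Soya Meals"))
soyaData <- data.frame(Sperm, Soya)

# Rank every score regardless of group, then total the ranks per group
rankTotals <- tapply(rank(soyaData$Sperm), soyaData$Soya, sum)
rankTotals
```

If the totals do not come out as 927, 883, 883 and 547, a score has probably been mistyped or placed in the wrong block.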

The data can also be found in the file Soya.dat. If you prefer, load this file into a
dataframe called soyaData by executing:

soyaData<-read.delim("Soya.dat", header = TRUE)

Then the variable Soya, which contains text, will be imported as a factor. This is fine,
except that whereas when we created this factor ourselves we could specify the order of
the groups, when we import the data the order of the groups will be alphabetic. For these
data they will be:

1 1 soya meal per week 

2 4 soya meals per week 

3 7 soya meals per week 

4 No soya meals per week 

For reasons that will become apparent, it’s useful to have the first level as our control
category (i.e., no soya), which is how we ordered the groups when entering the data by hand.
Therefore, we need to reorder the factor levels (look back to R’s Souls’ Tip 3.13). We can
do this by executing:

soyaData$Soya<-factor(soyaData$Soya, levels = levels(soyaData$Soya)[c(4, 1, 2, 3)])

This command uses the factor() function to reorder the levels of the Soya variable. It
re-creates the variable Soya in the soyaData dataframe (soyaData$Soya) based on itself, but
then uses the levels() function to reorder the groups. We simply put the order of the levels
that we’d like in the c() function, so in this case we have asked for the levels to be ordered
4, 1, 2, 3, which means that the current fourth group (no soya) will become the first group,
the current first group will become the second group, and so on. Having executed this
command, our groups will be ordered:

1 No soya meals per week 

2 1 soya meal per week 

3 4 soya meals per week 

4 7 soya meals per week 
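The reordering is easy to try out on a toy factor before applying it to the real data. In the sketch below, Soya is a four-element stand-in (one value per group) rather than the imported column, and byPosition and byName are illustrative names. It shows the positional approach from the text next to an alternative that spells the level names out, which does not depend on knowing the alphabetical positions. One caveat that is worth checking on your own installation: from R 4.0.0 onwards read.delim() leaves text columns as character vectors unless you add stringsAsFactors = TRUE, but factor() accepts either a character vector or a factor, so the second approach works in both cases.

```r
# A tiny stand-in for the imported Soya column (one value per group)
Soya <- factor(c("No Soya Meals", "1 Soya Meal", "4 Soya Meals", "7 Soya Meals"))
levels(Soya)   # default order is alphabetical, so the control group comes last

# Option 1 (as in the text): pick the existing levels by position
byPosition <- factor(Soya, levels = levels(Soya)[c(4, 1, 2, 3)])

# Option 2: spell out the level names in the order you want
byName <- factor(Soya, levels = c("No Soya Meals", "1 Soya Meal",
                                  "4 Soya Meals", "7 Soya Meals"))

levels(byPosition)                               # control group is now first
identical(levels(byPosition), levels(byName))    # both routes agree
```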

Having got the data loaded, we would run some exploratory analyses and because we’re
going to be looking for group differences we need to run these exploratory analyses for
each group. If you do these analyses (as requested in the self-test) you should find the
same results shown in Outputs 15.8 and 15.9.

Output 15.8 shows that the Shapiro-Wilk test is significant for the group that ate no
soya, W(20) = 0.805, p = .001, one soya meal per week, W(20) = 0.826, p = .002, and four
soya meals, W(20) = 0.743, p < .001. The test for those who ate seven meals per week is
not quite significant, W(20) = 0.912, p = .07. As such, the data for all of the groups are
significantly (or close to being) different from normal.

Output 15.9 shows the results of Levene’s test (section 5.7.1). The assumption of 
homogeneity of variance has been violated, F(3, 76) = 2.86, p = .042. As such, these data 
are not normally distributed, and the groups have heterogeneous variances. 
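The normtest.W and normtest.p columns in Output 15.8 are Shapiro-Wilk results, and base R’s shapiro.test() will run the same test on one group at a time. The sketch below does this for the seven-meals group only, using the scores as printed in Table 15.3; sevenMeals and normTest are illustrative names. Because the table shows the scores to two decimal places, the result may differ very slightly from the output produced from the data file.

```r
# Sperm counts (millions) for the seven-soya-meals group, from Table 15.3
sevenMeals <- c(0.31, 0.32, 0.56, 0.57, 0.71, 0.81, 0.87, 1.18, 1.25, 1.33,
                1.34, 1.49, 1.50, 2.09, 2.70, 2.75, 2.83, 3.07, 3.28, 4.11)

# Shapiro-Wilk test of normality for this one group
normTest <- shapiro.test(sevenMeals)

normTest$statistic   # W
normTest$p.value     # a value above .05 means no significant departure from normality
```

Repeating this for each group (or wrapping it in by(), as the chapter does for other functions) gives the per-group picture that the text describes.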

soyaData$Soya: No Soya
  skewness   skew.2SE   kurtosis   kurt.2SE normtest.W normtest.p
  1.546141   1.509598   2.328051   1.172959   0.805256   0.001036

soyaData$Soya: 1 Soya Meal
  skewness   skew.2SE   kurtosis   kurt.2SE normtest.W normtest.p
  1.350566   1.318646   1.422732   0.716825   0.825832   0.002154

soyaData$Soya: 4 Soya Meals
  skewness   skew.2SE   kurtosis   kurt.2SE normtest.W normtest.p
  1.822237   1.779169   2.792615   1.407024   0.742743   0.000136

soyaData$Soya: 7 Soya Meals
  skewness   skew.2SE   kurtosis   kurt.2SE normtest.W normtest.p
  0.608671   0.594286  -0.916165  -0.461598   0.912261   0.070391

Output 15.8


Levene's Test for Homogeneity of Variance (center = median)
      Df F value  Pr(>F)
group  3  2.8606 0.04237 *
      76
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Output 15.9


15.6.3. Doing the Kruskal-Wallis test using R Commander


Import the data, using Data=>Import data=>from text file, clipboard, or URL... (see
section 3.7.3), click on OK and choose the file Soya.dat. To run the Kruskal-Wallis test,
select Statistics=>Non parametric tests=>Kruskal-Wallis test to activate the dialog box in
Figure 15.8. In the box on the left, labelled Groups (pick one), select the variable that
defines the groups that you want to compare; this variable must be a factor. In our case we
want to select the variable Soya. On the right, in the list labelled Response variable (pick
one), choose the outcome variable on which you want to compare groups. In this case,
we’ll pick Sperm. To run the analysis click on OK. We’ll examine the output shortly.



FIGURE 15.8 Dialog box for the Kruskal-Wallis test


15.6.4. Doing the Kruskal-Wallis test using R


The Kruskal-Wallis test is done using the kruskal.test() function, which works in the same
way as the wilcox.test() function that we used for the rank-sum test. The general form of
the function is:

newModel<-kruskal.test(outcome ~ predictor, data = dataFrame, na.action = "an.action")

For the current data, we could therefore execute:

kruskal.test(Sperm ~ Soya, data = soyaData)
This command does the Kruskal-Wallis test on sperm scores predicted from the soya group
to which a person belonged. Note that we have executed the command directly, without
creating a model, which is fine because we don’t really need to use the output of the
function for any reason other than interpretation.

To interpret the Kruskal-Wallis test, it is useful to obtain the mean rank for each group. 
We can do this by adding a variable called Ranks to the dataframe with the rank() function: 

soyaData$Ranks<-rank(soyaData$Sperm) 

This command creates a variable Ranks in the soyaData dataframe that is the ranks for 
the variable Sperm. We can then obtain the mean rank for each group using the by() and 
mean() functions: 

by(soyaData$Ranks, soyaData$Soya, mean) 
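Because the Kruskal-Wallis test only ever looks at ranks, you get the same test statistic whether you give kruskal.test() the raw scores or the ranks themselves. The sketch below exploits this so that it can run without the data file: it feeds in the ranks from Table 15.3 as a list of four vectors (kruskal.test() accepts a list of numeric vectors as well as a formula). The names rankList, kw and meanRanks are illustrative only.

```r
# Ranks for each group, copied from Table 15.3
rankList <- list(
  "No Soya"      = c(4, 9, 17, 18, 22, 30, 31, 33, 41, 46,
                     55, 59, 60, 65, 68, 69, 72, 73, 75, 80),
  "1 Soya Meal"  = c(3, 5, 11, 12, 14, 32, 34, 36, 38, 42,
                     44, 57, 61, 64, 66, 67, 70, 71, 77, 79),
  "4 Soya Meals" = c(6, 10, 19, 21, 24, 27, 35, 37, 40, 48,
                     49, 50, 52, 54, 58, 62, 63, 74, 76, 78),
  "7 Soya Meals" = c(1, 2, 7, 8, 13, 15, 16, 20, 23, 25,
                     26, 28, 29, 39, 43, 45, 47, 51, 53, 56)
)

# Kruskal-Wallis test on the list of groups
kw <- kruskal.test(rankList)
kw$statistic    # reported by R as "Kruskal-Wallis chi-squared"
kw$parameter    # degrees of freedom
kw$p.value

# Mean rank per group, the same quantity that by(..., mean) returns
meanRanks <- sapply(rankList, mean)
meanRanks
```

With the real dataframe you would use the formula version shown in the text; the list version is used here only to keep the example self-contained.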


15.6.5. Output from the Kruskal-Wallis test


Output 15.10 shows the test statistic, H, for the Kruskal-Wallis test (although R labels it
chi-squared, because of its distribution, rather than H), its associated degrees of freedom (in
this case we had 4 groups so the degrees of freedom are 4 - 1, or 3) and the significance.
The crucial thing to look at is the significance value, which is .034; because this value is
less than .05 we could conclude that the number of soya meals eaten per week does
significantly affect sperm counts. Like a one-way ANOVA, though, this test tells us only that
a difference exists; it doesn’t tell us exactly where the differences lie. One way to get an
idea is to look at the mean ranks (Output 15.11). These show that the ranks were lowest
(27.35) in the group that had seven soya meals per week, but fairly similar in the other
three groups, which implies that any differences might be between the seven soya meals
group and the other three groups.

Kruskal-Wallis rank sum test 
data: Sperm by Soya 

Kruskal-Wallis chi-squared = 8.6589, df = 3, p-value = 0.03419 
Output 15.10 

soyaData$Soya: No Soya Meals
[1] 46.35

soyaData$Soya: 1 Soya Meal
[1] 44.15

soyaData$Soya: 4 Soya Meals
[1] 44.15

soyaData$Soya: 7 Soya Meals
[1] 27.35

Output 15.11
FIGURE 15.9 Boxplot for the sperm counts of individuals eating different numbers of soya meals per week


One way to see which groups differ is to look at a boxplot (see section 4.7) of the groups 
(see Figure 15.9). The first thing to note is that there are some outliers (note the circles 
and asterisks that lie above the top whiskers) - these are men who produced a particularly 
rampant amount of sperm. Using the control as our baseline, the medians of the first three 
groups seem quite similar; however, the median of the group that ate seven soya meals per 
week does seem a little lower, so perhaps this is where the difference lies. However, these 
conclusions are subjective. What we really need are some contrasts or post hoc tests like we 
used in ANOVA (see sections 10.4 and 10.5). 


15.6.6. Post hoc tests for the Kruskal-Wallis test


One way to do non-parametric post hoc procedures is essentially the
same as doing Wilcoxon rank-sum tests on all possible comparisons. This
method is described by Siegel and Castellan (1988) and involves taking
the difference between the mean ranks of the different groups and
comparing this to a value based on the value of z (corrected for the number
of comparisons being done) and a constant based on the total sample size
and the sample size in the two groups being compared. The inequality is:

$$ |\bar{R}_u - \bar{R}_v| \geq z_{\alpha/k(k-1)} \sqrt{\frac{N(N+1)}{12}\left(\frac{1}{n_u} + \frac{1}{n_v}\right)} $$   (15.2)
The left-hand side of this inequality is just the difference between the mean rank of the
two groups being compared, but ignoring the sign of the difference (so the two vertical
lines that enclose the difference between mean ranks just indicate that if the difference
is negative then we ignore the negative sign and treat it as positive). For the rest of the
expression, k is the number of groups (in the soya example, 4), N is the total sample size
(in this case 80), n_u is the number of people in the first group that’s being compared (we
have equal group sizes in the soya example so it will be 20 regardless of which groups we
compare), and n_v is the number of people in the second group being compared (again this
will be 20 regardless of which groups we compare because we have equal group sizes in the
soya example). The only other thing we need to know is z_{α/k(k-1)}, and to get this value we
need to decide a level for α, which is the level of significance at which we want to work.
You should know by now that in the social sciences we traditionally work at a .05 level
of significance, so α will be .05. We then calculate k(k - 1), which for these data will be
4(4 - 1) = 12. Therefore, α/k(k - 1) = .05/12 = .00417. So, z_{α/k(k-1)} just means ‘the value
of z for which only α/k(k - 1) other values of z are bigger’ (or in this case ‘the value of z
for which only .00417 other values of z are bigger’). In practical terms this means we go to
the table in the Appendix, look at the column labelled Smaller Portion and find the number
.00417 (or the nearest value to this, which, if you look at the table, is .00415), and we then
look in the same row at the column labelled z. In this case, you should find that the value
of z is 2.64. The next thing to do is to calculate the right-hand side of inequality (15.2):


$$
\begin{aligned}
\text{critical difference} &= z_{\alpha/k(k-1)} \sqrt{\frac{N(N+1)}{12}\left(\frac{1}{n_u} + \frac{1}{n_v}\right)} \\
&= 2.64\sqrt{540(0.1)} \\
&= 2.64\sqrt{54} \\
&= 19.40
\end{aligned}
$$

For this example, because the sample sizes across groups are equal, this critical difference
can be used for all comparisons. However, when sample sizes differ across groups, the
critical difference will have to be calculated for each comparison individually. The next step is
simply to calculate all of the differences between the mean ranks of all of the groups (the
mean ranks can be found in Output 15.11), as in Table 15.4.

Table 15.4 Differences between mean ranks for the soya data

Comparison           \bar{R}_u   \bar{R}_v   \bar{R}_u - \bar{R}_v   |\bar{R}_u - \bar{R}_v|
No Meals - 1 Meal        46.35       44.15                    2.20                      2.20
No Meals - 4 Meals       46.35       44.15                    2.20                      2.20
No Meals - 7 Meals       46.35       27.35                   19.00                     19.00
1 Meal - 4 Meals         44.15       44.15                    0.00                      0.00
1 Meal - 7 Meals         44.15       27.35                   16.80                     16.80
4 Meals - 7 Meals        44.15       27.35                   16.80                     16.80

Inequality (15.2) basically means that if the difference between mean ranks is bigger than
or equal to the critical difference for that comparison, then that difference is significant.
In this case, because we have only one critical difference, it means that if any difference is
bigger than 19.40, then it is significant. As you can see, all differences are below this value,
so we would have to conclude that none of the groups were significantly different.
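If you would rather not look z up in a table, qnorm() will supply it. The following is a minimal sketch of the hand calculation, with the mean ranks typed in from Output 15.11; the object names (meanRanks, criticalDiff, obsDiff) are illustrative. Note that qnorm() returns z to many decimal places, so the critical difference comes out marginally below the 19.40 obtained with z rounded to 2.64.

```r
# Mean ranks from Output 15.11
meanRanks <- c("No Soya" = 46.35, "1 Soya Meal" = 44.15,
               "4 Soya Meals" = 44.15, "7 Soya Meals" = 27.35)

k     <- length(meanRanks)   # number of groups
N     <- 80                  # total sample size
n     <- 20                  # size of each group (equal here)
alpha <- 0.05

# z for which only alpha / (k(k - 1)) of values are bigger
z <- qnorm(alpha / (k * (k - 1)), lower.tail = FALSE)

# Right-hand side of inequality (15.2)
criticalDiff <- z * sqrt((N * (N + 1) / 12) * (1 / n + 1 / n))

# Absolute difference in mean ranks for every pair of groups
pairs   <- combn(names(meanRanks), 2)
obsDiff <- abs(meanRanks[pairs[1, ]] - meanRanks[pairs[2, ]])
names(obsDiff) <- paste(pairs[1, ], pairs[2, ], sep = " - ")

criticalDiff
data.frame(obsDiff = obsDiff, significant = obsDiff >= criticalDiff)
```

With unequal group sizes, n would have to be replaced by the two group sizes for each pair, so the critical difference would become a vector rather than a single number.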

We can do all of these calculations using the kruskalmc() function from the pgirmess 
package. You use this function in exactly the same way as the kruskal.test function, so for 
the current example, we can do the post hoc tests by executing: 

kruskalmc(Sperm ~ Soya, data = soyaData) 

Output 15.12 shows the output of the function and, as you can see, it lists all of the
possible pairs of groups, along with the critical difference that we calculated above, and the
absolute difference between mean ranks that we calculated in Table 15.4. Conveniently,
there is also a column labelled difference that tells us whether the observed difference
is greater than the critical difference (TRUE) or not (FALSE). In other words, it tells us
whether or not the difference is significant. In the current example, none of the differences
are bigger than the critical difference; hence they all say FALSE, which means that the
differences are all non-significant.


Multiple comparison test after Kruskal-Wallis
p.value: 0.05
Comparisons
                           obs.dif critical.dif difference
No Soya Meals-1 Soya Meal      2.2     19.38715      FALSE
No Soya Meals-4 Soya Meals     2.2     19.38715      FALSE
No Soya Meals-7 Soya Meals    19.0     19.38715      FALSE
1 Soya Meal-4 Soya Meals       0.0     19.38715      FALSE
1 Soya Meal-7 Soya Meals      16.8     19.38715      FALSE
4 Soya Meals-7 Soya Meals     16.8     19.38715      FALSE

Output 15.12


One of the problems with comparing every group against all others is that we have to be 
quite strict about accepting a difference as significant, otherwise we will inflate the Type I 
error rate (section 10.2.1). To reduce this problem we could use more focused comparisons. 

In this example, we have a control group that had no soya meals. As such, a nice succinct 
set of comparisons would be to compare each group against the control: 

• Test 1: one soya meal per week compared to no soya meals 

• Test 2: four soya meals per week compared to no soya meals 

• Test 3: seven soya meals per week compared to no soya meals 

This results in three tests, rather than six, so these tests can be less strict than if we compare 
all groups. Fortunately, we can implement this analysis using the kruskalmc() function by 
using the cont option. This option takes the form of cont = ‘one-tailed’ or ‘two-tailed’ 
and, if included, will compare all levels against the first. Therefore, the only complication 
is that we need to make sure that the no-soya group is the first level of the Soya factor. 
Fortunately, we thought ahead and made the no-soya group the first level when we loaded/ 
entered the data into R; however, in other situations you can reorder factor levels if 
necessary (see R’s Souls’ Tip 3.13). Therefore, to compare each group to the no-soya group 
(using a two-tailed test) we simply execute: 

kruskalmc(Sperm ~ Soya, data = soyaData, cont = 'two-tailed') 

Note that the command is exactly the same as before, except that we have added cont = 
‘two-tailed’ to it, which will make it compare all groups to the first group only. 

Output 15.13 shows the results of this test. Note that we have only three tests now and
consequently our critical difference has decreased (the observed differences between the
mean ranks of groups are the same as for the corresponding parts of the previous output).
Looking at the column labelled difference, we can see that there was a significant
difference between the no-soya and seven soya meals a week group. This result contradicts our
earlier finding (Output 15.12) in which the test for the no-soya group compared to the
seven meals group was deemed non-significant; why do you think that is? Well, for our
current tests, we have done three comparisons and so corrected the critical difference for
only these three tests. However, our earlier tests corrected for six tests, which resulted in a
stricter (therefore, larger) critical difference. This example illustrates the benefits of
choosing selective comparisons over blindly comparing everything and anything.


Multiple comparison test after Kruskal-Wallis, treatment vs control (two-tailed)
p.value: 0.05
Comparisons
                           obs.dif critical.dif difference
No Soya Meals-1 Soya Meal      2.2     15.63787      FALSE
No Soya Meals-4 Soya Meals     2.2     15.63787      FALSE
No Soya Meals-7 Soya Meals    19.0     15.63787       TRUE

Output 15.13
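To see why the same observed difference of 19 can be non-significant in one output and significant in the other, it helps to put the two critical differences side by side. The sketch below does this by hand. One thing must be flagged loudly: the tail area used for the control comparisons, α/(k - 1), is my inference from the fact that it reproduces the 15.63787 in Output 15.13; it is not stated in the text and I have not checked it against the kruskalmc() source, so treat it as an assumption. All object names are illustrative.

```r
N     <- 80      # total sample size
n     <- 20      # size of each group
k     <- 4       # number of groups
alpha <- 0.05

# The part of inequality (15.2) that does not depend on the correction
se <- sqrt((N * (N + 1) / 12) * (1 / n + 1 / n))

# All six pairwise comparisons: tail area alpha / (k(k - 1))
critAllPairs <- qnorm(alpha / (k * (k - 1)), lower.tail = FALSE) * se

# ASSUMPTION: three comparisons against the control with tail area
# alpha / (k - 1), chosen because it matches Output 15.13
critVsControl <- qnorm(alpha / (k - 1), lower.tail = FALSE) * se

# Observed difference between the no-soya and seven-meals mean ranks
obsDiff <- 46.35 - 27.35

c(allPairs = critAllPairs, vsControl = critVsControl)
c(allPairs = obsDiff >= critAllPairs, vsControl = obsDiff >= critVsControl)
```

The only thing that changes between the two lines is the tail area handed to qnorm(): fewer comparisons mean a larger tail area, a smaller z, and therefore a smaller critical difference.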


15.6.7. Testing for trends: the Jonckheere-Terpstra test


Sometimes we don’t think that groups will just be different, but we want to hypothesize
a trend. The Jonckheere-Terpstra statistic tests for an ordered pattern to the medians of
the groups you’re comparing. Essentially it does the same thing as the Kruskal-Wallis test
(i.e., test for a difference between the medians of the groups) but it incorporates
information about whether the order of the groups is meaningful. As such, you should use this test
when you expect the groups you’re comparing to produce a meaningful order of medians.
So, in the current example we expect that the more soya a person eats, the more their
sperm count will go down. Therefore, the control group should have the highest sperm
count, those having one soya meal per week should have a lower sperm count, the sperm
count in the four meals per week group should be lower still, and the seven meals per
week group should have the lowest sperm count. Therefore, there is an order to our
medians: they should decrease across the groups. Conversely, there might be situations where
you expect your medians to increase. For example, there’s a phenomenon in psychology
known as the ‘mere exposure effect’, which basically means that the more you’re exposed
to something, the more you’ll like it. Record companies use this to good effect by making
sure songs are played on radio for about two months prior to their release, so on the day of
release, everyone loves the song and is dying to have it and rushes out to buy it, sending it
to number one. 11 Anyway, if you took three groups and exposed them to a song 10 times,
20 times and 30 times respectively and then measured how much people liked the song,
you’d expect the medians to increase. Those who heard it 10 times would like it a bit, but
those who heard it 20 times would like it more, and those who heard it 30 times would
like it the most.

11 In most cases the mere exposure effect seems to have the reverse effect on me: the more I hear the manufactured
rubbish that gets into the charts, the more I want to rid my brain of the mental anguish it creates by making myself
deaf by ramming hot irons into my ears.

The Jonckheere-Terpstra test (actually referred to more often just as the Jonckheere test)
was designed for these situations. In R, it works on the principle that your coding variable
(the one that defines the groups) specifies the order in which you expect the medians to
change (it doesn’t matter whether you expect them to increase or decrease). For our soya
example, our groups are in the correct order because we entered them that way and if you
loaded the data from the file, I showed you already how to change the order (see R’s Souls’
Tip 3.13). The test determines whether the medians of the groups ascend or descend in the
order specified by the coding variable; therefore, given the order of levels for the variable
Soya, it will test whether the median sperm count increases or decreases across the groups.
The Jonckheere-Terpstra test is carried out using the jonckheere.test() function, which is
found in the clinfun package. This function takes the general form:

jonckheere.test(outcome variable, group variable (as numbers)) 

In other words we only need to specify the name of the outcome variable and the name
of the grouping variable. The only slight complication is that for the grouping variable
we need to use the numeric codes rather than the names of the groups (because the
function uses the numbers to determine the order of the groups). However, we can do this by
putting our grouping variable inside the as.numeric() function, which will return the group
codes. Therefore, we can conduct a Jonckheere test by executing:

jonckheere.test(soyaData$Sperm, as.numeric(soyaData$Soya)) 

The results are shown in Output 15.14, which tells us the value of the test statistic, JT, 
which is 912. In large samples (more than about eight per group) this test statistic has a 
sampling distribution that is normal, and a mean and standard deviation that are easily 
defined and calculated (the mean is 1200 and the standard deviation is 116.33). R has 
calculated the p-value for us, which is .013; because this value is less than .05 we have a 
statistically significant trend in the data. We can use the mean ranks (Output 15.11) to see 
that it is a decreasing trend: sperm counts go down as more soya is eaten. 

Jonckheere-Terpstra test 

data: 

JT = 912, p-value = 0.0133 
alternative hypothesis: two.sided 

Output 15.14 
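The mean, standard deviation and p-value quoted above can be reproduced from the test statistic and the group sizes alone. The sketch below uses the standard large-sample expressions for the Jonckheere-Terpstra statistic when there are no tied scores; these formulas are not printed in this chapter, and whether jonckheere.test() arrives at its p-value in exactly this way is an assumption on my part. The object names are illustrative.

```r
JT     <- 912                   # test statistic from Output 15.14
groupN <- c(20, 20, 20, 20)     # group sizes
N      <- sum(groupN)

# Mean and variance of JT under the null hypothesis (no ties)
jtMean <- (N^2 - sum(groupN^2)) / 4
jtVar  <- (N^2 * (2 * N + 3) - sum(groupN^2 * (2 * groupN + 3))) / 72
jtSD   <- sqrt(jtVar)

# Convert JT to a z-score and get a two-sided p-value from the normal distribution
z <- (JT - jtMean) / jtSD
p <- 2 * pnorm(-abs(z))

c(mean = jtMean, sd = round(jtSD, 2), z = round(z, 2), p = round(p, 4))
```

The z-score is negative because the observed statistic is below its expected value. Which direction of trend that corresponds to depends on how the statistic is defined, which is why the text turns to the mean ranks to describe the direction.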


OLIVER TWISTED

Please Sir, can I have some more ... Jonck?

‘I want to know how the Jonckheere-Terpstra test actually works’, complains Oliver.
Of course you do, Oliver, sleep is hard to come by these days. I am only too happy to
oblige, my little syphilitic friend. The additional material for this chapter on the
companion website has a complete explanation of the test and how it works. I bet you’re
glad you asked.


15.6.8. Calculating an effect size


Unfortunately there isn’t an easy way to convert a chi-square statistic that has more than 
one degree of freedom to an effect size r. You could use the significance value of the 
Kruskal-Wallis test statistic to find an associated value of z from a table of probability 
values for the normal distribution (like that in the Appendix). From this you could use the 
conversion to r that we used in section 15.4.6. However, this kind of effect size is rarely 
that useful (because it’s summarizing a general effect). 
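For completeness, the conversion described above can be done without the Appendix table. This is a rough sketch under a stated assumption: I treat the significance value as the ‘smaller portion’ (the upper-tail area) when looking up z, which is how the table was used earlier in this section; the text does not spell this out, so read the result with that in mind. The formula r = z/√N is the conversion referred to from section 15.4.6.

```r
p <- 0.03419    # significance of the Kruskal-Wallis test (Output 15.10)
N <- 80         # total number of participants

# ASSUMPTION: p is treated as the upper-tail ('smaller portion') area
z <- qnorm(p, lower.tail = FALSE)

# Convert z to the effect size r
r <- z / sqrt(N)

round(z, 2)
round(r, 2)
```

As the text says, an effect size for the overall test is of limited use; effect sizes for the focused comparisons are usually more informative.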
15.6.9. Writing and interpreting the results


For the Kruskal-Wallis test, we need only report the test statistic (which, as we saw earlier,
is denoted by H), its degrees of freedom and its significance. So, we could report something
like:

✓ Sperm counts were significantly affected by eating soya meals, H(3) = 8.66, p = .034.

However, we need to report the follow-up tests as well (including their effect sizes):

✓ Sperm counts were significantly affected by eating soya meals, H(3) = 8.66, p = .034.
Focused comparisons of the mean ranks between groups showed that sperm counts
were not significantly different when one soya meal (difference = 2.2) or four soya
meals (difference = 2.2) were eaten per week compared to none. However, when
seven soya meals were eaten per week sperm counts were significantly lower than
when no soya was eaten (difference = 19). In all cases, the critical difference (α = .05
corrected for the number of tests) was 15.64. We can conclude that if soya is eaten
every day it significantly reduces sperm counts compared to eating none; however,
eating soya less frequently than every day has no significant effect on sperm counts
(‘phew!’ says the vegetarian man!).

We might also want to report our trend:

✓ Jonckheere’s test revealed a significant trend in the data: as more soya was eaten, the
median sperm count decreased, J = 912, p = .013.



CRAMMING SAM’S TIPS 


The Kruskal-Wallis test 


• The Kruskal-Wallis test compares several conditions when different participants take part in each condition and the resulting 
data violate an assumption of one-way independent ANOVA. 

• Look at the p-value. If the value is less than .05 then the groups are significantly different. 

• You can follow up the main analysis with post hoc tests (ideally, focused ones). If the column labelled difference in the output 
says ‘true’ then the groups differ significantly. 

• If you predict that the means will increase or decrease across your groups in a certain order then do Jonckheere’s trend test. 

• Report the H-statistic, the degrees of freedom and the significance value for the main analysis. Also report the medians and
their corresponding ranges (or draw a boxplot).


15.7. Differences between several related groups: Friedman’s ANOVA


In Chapter 13 we discovered a technique called one-way related ANOVA that could be used 
to test for differences between several related groups. Although, as we’ve seen, robust versions

Labcoat Leni’s Real Research 15.2


Eggs-traordinary!


Çetinkaya, H., & Domjan, M. (2006). Journal of Comparative Psychology, 120(4), 427-432.


There seems to be a lot of sperm in this book (not literally, I hope) - it’s possible that I have a mild obsession. We 
saw in Labcoat Leni’s Real Research 15.1 that male quail fertilized more eggs if they had been trained to be able 
to predict when a mating opportunity would arise. However, some quail develop fetishes. Really. In the previous 
example the type of compartment acted as a predictor of an opportunity to mate, but in studies where a terrycloth 
object acts as a sign that a mate will shortly become available, some quail start to direct their sexual behaviour 
towards the terrycloth object. (I may regret this analogy, but, in human terms, if you imagine that every time you
were going to have sex with your boyfriend you gave him a green towel a few moments before seducing him, then
after enough seductions he would start rubbing his crotch against any green towel he saw. If you’ve ever wondered
why your boyfriend rubs his crotch on green towels, then I hope this explanation has been enlightening.)
In evolutionary terms, this fetishistic behaviour seems counterproductive because sexual behaviour becomes 
directed towards something that cannot provide reproductive success. However, perhaps this behaviour serves 
to prepare the organism for the ‘real’ mating behaviour. 

Hakan Çetinkaya and Mike Domjan conducted a brilliant study in which they sexually conditioned male quail
(Çetinkaya & Domjan, 2006). All quail experienced the terrycloth stimulus and an opportunity to mate, but for
some the terrycloth stimulus immediately preceded the mating opportunity (paired group) whereas others experienced
it 2 hours after the mating opportunity (this was the control group because the terrycloth stimulus did not
predict a mating opportunity). In the paired group, quail were classified as fetishistic or not depending on whether 
they engaged in sexual behaviour with the terrycloth object. 

During a test trial the quail mated with a female and the researchers measured the percentage of eggs fertilized,
the time spent near the terrycloth object, the latency to initiate copulation, and copulatory efficiency. If this
fetishistic behaviour provides an evolutionary advantage then we would expect the fetishistic quail to fertilize more 
eggs, initiate copulation faster and be more efficient in their copulations. 

The data from this study are in the file Cetinkaya & Domjan (2006).dat. Labcoat Leni wants you to carry out 
a Kruskal-Wallis test to see whether fetishist quail produced a higher percentage of fertilized eggs and 
initiated sex more quickly. 

Answers are in the additional material on the companion website (or look at pages 429-430 in the 
original article). 



of ANOVA exist, there is another alternative to the repeated-measures case: Friedman’s ANOVA 
(Friedman, 1937). As such, it is used for testing differences between conditions when there 
are more than two conditions and the same participants have been used in all conditions (each 
case contributes several scores to the data). If you have violated some assumption of parametric
tests then this test can be a useful way around the problem.

Young people (women especially) can become obsessed with body weight and diets, and, 
because the media are insistent on ramming ridiculous images of stick-thin celebrities down 
our throats (should that be ‘into our eyes’?) and brainwashing us into believing that these emaciated
corpses are actually attractive, we all end up terribly depressed that we’re not perfect
(because we don’t have a couple of slugs stuck to our faces instead of lips). Then corporate 
parasites jump on our vulnerability by making loads of money on diets that will help us attain 
the body beautiful. Well, not wishing to miss out on this great opportunity to exploit people’s 
insecurities, I came up with my own diet called the Andikins diet. 12 The principle is that you 
follow my lifestyle: you eat no meat, drink lots of Darjeeling tea, eat shedloads of lovely 
European cheese, lots of fresh crusty bread, pasta, chocolate at every available opportunity 

12 Not to be confused with the Atkins diet, obviously.

(especially when writing books), then enjoy a few beers at the weekend, play football twice
a week and play your drum kit for an hour a day or until your neighbour threatens to saw 
your arms off and beat you around the head with them for making so much noise. To test 
the efficacy of my wonderful new diet, I took 10 women who considered themselves to be in 
need of losing weight and put them on this diet for two months. Their weight was measured 
in kilograms at the start of the diet and then after one month and two months. 


15.7.1. Theory of Friedman’s ANOVA


The theory for Friedman’s ANOVA is much the same as the other tests we’ve seen in this
chapter: it is based on ranked data. To begin with, you simply place your data for different
conditions into different columns (in this case there were three conditions, so we have
three columns). The data for the diet example are in Table 15.5; note that the data are in
different columns, and so each row represents the weight of a different person. The next
thing we have to do is rank the data for each person. So, we start with person 1, we look at
their scores (in this case person 1 weighed 63.75 kg at the start, 65.38 kg after one month
on the diet, and 81.34 kg after two months on the diet), and then we give the lowest one a
rank of 1, the next highest a rank of 2 and so on (see section 15.4.1 for more detail). When
you’ve ranked the data for the first person, you move onto the next person, and starting
at 1 again, rank their lowest score, then rank the next highest as 2 and so on. You do this
for all people from whom you’ve collected data. You then simply add up the ranks for each
condition ($R_i$, where i is used to denote the particular group).
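The ranking procedure just described can be sketched in a few lines of R. This is only an illustration: the weights are typed in by hand from Table 15.5 rather than read from Diet.dat, and the object names (dietData, ranks, rankSums) are my own choices.

```r
# Weights (kg) typed in from Table 15.5; one row per person
dietData <- data.frame(
  Start  = c(63.75, 62.98, 65.98, 107.27, 66.58, 120.46, 62.01, 71.87, 83.01, 76.62),
  Month1 = c(65.38, 66.24, 67.70, 102.72, 69.45, 119.96, 66.09, 73.62, 75.81, 67.66),
  Month2 = c(81.34, 69.31, 77.89, 91.33, 72.87, 114.26, 68.01, 55.43, 71.63, 68.60)
)

# Rank the three scores within each person (row); apply() returns the
# result with people in columns, so t() flips it back to one row per person
ranks <- t(apply(as.matrix(dietData), 1, rank))

# Add up the ranks for each condition to get the R_i values
rankSums <- colSums(ranks)
rankSums
```

The sums should match the bottom row of Table 15.5 (19, 20 and 21).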



SELF-TEST

Have a go at ranking the data and see if you get the
same results as in Table 15.5.


Table 15.5 Data for the diet example with ranks

                        Weight                        Weight (Ranks)
              Start    Month 1    Month 2      Start    Month 1    Month 2
Person 1      63.75      65.38      81.34        1         2          3
Person 2      62.98      66.24      69.31        1         2          3
Person 3      65.98      67.70      77.89        1         2          3
Person 4     107.27     102.72      91.33        3         2          1
Person 5      66.58      69.45      72.87        1         2          3
Person 6     120.46     119.96     114.26        3         2          1
Person 7      62.01      66.09      68.01        1         2          3
Person 8      71.87      73.62      55.43        2         3          1
Person 9      83.01      75.81      71.63        3         2          1
Person 10     76.62      67.66      68.60        3         1          2
R_i                                             19        20         21
Once the sum of ranks has been calculated for each group, the test statistic, $F_r$, is calculated as:

$$F_r = \left[\frac{12}{Nk(k+1)}\sum_{i=1}^{k} R_i^2\right] - 3N(k+1) \tag{15.3}$$


In this equation, $R_i$ is the sum of ranks for each group, N is the total sample size (in this
case 10) and k is the number of conditions (in this case 3). This equation is very similar to 
that for the Kruskal-Wallis test (compare equations (15.1) and (15.3)). All we need to do 
for each condition is square the sum of ranks and then add up these values. That deals with 
the middle part of the equation; the rest of it involves calculating various values based on 
the total sample size and the number of conditions. For these data we get: 


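$$
\begin{aligned}
F_r &= \left[\frac{12}{(10 \times 3)(3+1)}\left(19^2 + 20^2 + 21^2\right)\right] - (3 \times 10)(3+1) \\
    &= \frac{12}{120}(361 + 400 + 441) - 120 \\
    &= 0.1(1202) - 120 \\
    &= 120.2 - 120 \\
    &= 0.2
\end{aligned}
$$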



When the number of people tested is large (bigger than about 10) this test statistic, like 
the Kruskal-Wallis test in the previous section, has a chi-square distribution (see Chapter 
18) and for this distribution there is one value for the degrees of freedom, which is one less 
than the number of groups (k — 1), in this case 2. 
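As a quick check on the hand calculation, equation (15.3) can be typed more or less directly into R. This is a minimal sketch: the rank sums are taken from Table 15.5, the object names are mine, and the p-value uses the chi-square approximation described above.

```r
# Rank sums from Table 15.5, with N people and k conditions
rankSums <- c(Start = 19, Month1 = 20, Month2 = 21)
N <- 10
k <- 3

# Equation (15.3)
Fr <- (12 / (N * k * (k + 1))) * sum(rankSums^2) - 3 * N * (k + 1)

# Upper-tail probability from a chi-square distribution with k - 1 df
pValue <- pchisq(Fr, df = k - 1, lower.tail = FALSE)

Fr      # 0.2
pValue  # about 0.905
```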


15.7.2. Inputting data and provisional analysis



SELF-TEST

Using what you know about inputting data, try to
enter these data into R and run some exploratory
analyses (see Chapter 5).



When the data are collected using the same participants in each condition, the data are 
entered using different columns. So, the data editor will have three columns of data. The 
first column is for the data from the start of the diet (called something like Start), the 
second column will have values for the weights after one month (called Month1) and
the final column will have the weights at the end of the diet (called Month2). The data can 
be found in the file Diet.dat. 

Output 15.15 shows the results of some exploratory analysis (using the stat.desc() function
from Chapter 5). With a bit of luck you’ll get the same results, which show that the
Shapiro-Wilk test is significant for the baseline data (Start), W(10) = 0.78, p = .009,
and one month into the diet, W(10) = 0.68, p < .001. Therefore the variables Start and
Month1 deviate significantly from normal. The data at the end of the diet do not appear
to differ from normal, though, W(10) = 0.87, p = .121.

                     Start        Month1        Month2
median        69.225000000  6.857500e+01   72.25000000
mean          78.053000000  7.746300e+01   77.06700000
SE.mean        6.397375860  5.886269e+00    5.09347536
CI.mean.0.95  14.471869624  1.331567e+01   11.52224176
var          409.264178889  3.464817e+02  259.43491222
std.dev       20.230278764  1.861402e+01   16.10698334
coef.var       0.259186434  2.402956e-01    0.20899974
skewness       1.054846022  1.329796e+00    1.01105333
skew.2SE       0.767671126  9.677677e-01    0.73580071
kurtosis      -0.514513976  1.196569e-01    0.25546331
kurt.2SE      -0.192810362  4.484056e-02    0.09573301
normtest.W     0.784370035  6.849797e-01    0.87721476
normtest.p     0.009357858  5.796060e-04    0.12120786

Output 15.15
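If you only want the normality tests, base R’s shapiro.test() gives the same kind of W statistic that stat.desc() reports as normtest.W. The sketch below is one way to run it on each column; the data are typed in from Table 15.5 rather than read from Diet.dat, and the object name normality is just an illustration.

```r
# Weights (kg) typed in from Table 15.5
dietData <- data.frame(
  Start  = c(63.75, 62.98, 65.98, 107.27, 66.58, 120.46, 62.01, 71.87, 83.01, 76.62),
  Month1 = c(65.38, 66.24, 67.70, 102.72, 69.45, 119.96, 66.09, 73.62, 75.81, 67.66),
  Month2 = c(81.34, 69.31, 77.89, 91.33, 72.87, 114.26, 68.01, 55.43, 71.63, 68.60)
)

# Run the Shapiro-Wilk test on each column and keep W and its p-value
normality <- sapply(dietData, function(x) {
  test <- shapiro.test(x)
  c(W = unname(test$statistic), p = test$p.value)
})

round(normality, 3)
```

The values should be comparable to the normtest.W and normtest.p rows of Output 15.15.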


15.7.3. Doing Friedman’s ANOVA in R Commander


As always, import the data, using Data⇒Import data⇒from text file, clipboard, or URL...
(see section 3.7.3), click on OK, and choose the file Diet.dat. To run Friedman’s ANOVA,
select Statistics⇒Nonparametric tests⇒Friedman rank-sum test... to activate the dialog box
in Figure 15.10. Once the dialog box is activated, select the three variables that represent
the dependent variable at the different levels of the independent variable from the list. This
is very straightforward: we have only three variables, so select them all and click on OK.


FIGURE 15.10 Dialog box for Friedman’s ANOVA



15.7.4. Friedman’s ANOVA using R


We can do Friedman’s ANOVA using the friedman.test() function. This function is a bit of 
a prima donna because, in order to work, it demands that (1) you give it a matrix rather 
than a dataframe, because it thinks dataframes smell of rotting brains, and (2) it wants all 
of the variables of interest in one data set, and there mustn’t be any additional variables. To 
combat the first issue we need to convert our dataframe into a matrix by putting it into the 
as.matrix() function (see section 3.9.3). As for the second, in the current example our dataframe
does contain only the variables of interest. However, for other analyses you can use
what you learnt in section 3.9 to extract only the data that you need for the Friedman test. 

The other complication is that the function gets confused by missing data. Again, we
have a complete data set in this example so we don’t need to do anything, but if you have
missing data you need to delete any cases that don’t have a complete set of scores. We can
do this easily using the na.omit() function. If we put our dataframe name into that function
and execute we’ll get back the same dataframe but with cases that have any missing data
deleted. Therefore, we could execute:

dietCompleteCases <- na.omit(dietData) 

This would create a dataframe called dietCompleteCases, which is the same as the dietData 
dataframe except that it will have deleted any case (row) for which there is missing data in 
any column. 
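To see what na.omit() does, here is a small made-up example. The toyData dataframe and its values are invented purely for illustration (the real diet data have no missing values).

```r
# A made-up dataframe with two incomplete cases (rows 3 and 4)
toyData <- data.frame(
  Start  = c(70, 82, NA, 65),
  Month1 = c(71, 80, 77, NA),
  Month2 = c(69, 79, 75, 66)
)

# Keep only the rows that have a score in every column
toyComplete <- na.omit(toyData)

nrow(toyData)      # 4 cases before
nrow(toyComplete)  # 2 complete cases remain
```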

To run the Friedman test we simply input the name of our dataframe, but within the 
as.matrix() function, which converts it to a matrix. In this example, we would execute: 

friedman.test(as.matrix(dietData)) 
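If you don’t have Diet.dat to hand, the following sketch recreates the dataframe by typing in the weights from Table 15.5 and then runs the same command. Storing the result in an object (here called dietFriedman, a name I’ve made up) lets you pull out the individual pieces.

```r
# Weights (kg) typed in from Table 15.5
dietData <- data.frame(
  Start  = c(63.75, 62.98, 65.98, 107.27, 66.58, 120.46, 62.01, 71.87, 83.01, 76.62),
  Month1 = c(65.38, 66.24, 67.70, 102.72, 69.45, 119.96, 66.09, 73.62, 75.81, 67.66),
  Month2 = c(81.34, 69.31, 77.89, 91.33, 72.87, 114.26, 68.01, 55.43, 71.63, 68.60)
)

# friedman.test() wants a matrix: rows are people, columns are conditions
dietFriedman <- friedman.test(as.matrix(dietData))

dietFriedman$statistic  # Friedman chi-squared
dietFriedman$parameter  # degrees of freedom
dietFriedman$p.value    # significance
```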


15.7.5. Output from Friedman’s ANOVA


Output 15.16 shows the result of the Friedman test: the main part is the test statistic (which
R calls chi-squared rather than $F_r$ because $F_r$ has a chi-square distribution). The value of this
statistic is 0.2, the same value that we calculated earlier. We’re also told the test statistic’s
degrees of freedom (in this case we had three groups so the degrees of freedom are 3 − 1, or
2), and the significance. The significance value is .905, which is well above .05; therefore
we could conclude that there is no evidence that the Andikins diet has any effect: the
weights didn’t significantly change over the course of the diet.

Friedman rank sum test 
data: just.diet 

Friedman chi-squared = 0.2, df = 2, p-value = 0.9048 

Output 15.16 


15.7.6. Post hoc tests for Friedman’s ANOVA


In normal circumstances we wouldn’t do any follow-up tests because the overall effect from
Friedman’s ANOVA was not significant. However, in case you get a result that is significant
we will have a look at what options you have. As with the Kruskal-Wallis test, there is a function
that enables us to compare all groups, or to compare groups to a baseline. This function,
friedmanmc(), requires the data to be in exactly the same format as the friedman.test() function
and we use it in exactly the same way. Therefore, for the current data we would execute:

friedmanmc(as.matrix(dietData))

The results are in Output 15.17. As with the Kruskal-Wallis test, you need to look at the
column labelled difference; if this says TRUE then the groups differ significantly, but if
it says FALSE, they don’t. In this case, we have a clean sweep of non-significant results
(which, given the main test was ragingly non-significant, isn’t a surprise).

Multiple comparisons between groups after Friedman test
p.value: 0.05
Comparisons
    obs.dif critical.dif difference
1-2       1      10.7062      FALSE
1-3       2      10.7062      FALSE
2-3       1      10.7062      FALSE

Output 15.17
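To get a feel for where the numbers in Output 15.17 come from, the sketch below rebuilds them in base R from the rank sums. It assumes the critical difference follows the formula in Siegel and Castellan (1988), z × √(Nk(k + 1)/6), with α divided by k(k − 1); that assumption reproduces the value in the output, but it is my reconstruction, not the code inside friedmanmc().

```r
# Rank sums from Table 15.5
rankSums <- c(19, 20, 21)
N <- 10       # number of people
k <- 3        # number of conditions
alpha <- .05

# z for alpha corrected for the number of (two-tailed) comparisons
z <- qnorm(1 - alpha / (k * (k - 1)))

# Assumed critical difference between two rank sums
criticalDif <- z * sqrt(N * k * (k + 1) / 6)

# Observed absolute differences for every pair of conditions
pairs <- combn(k, 2)
obsDif <- abs(rankSums[pairs[1, ]] - rankSums[pairs[2, ]])

data.frame(
  comparison   = paste(pairs[1, ], pairs[2, ], sep = "-"),
  obs.dif      = obsDif,
  critical.dif = round(criticalDif, 4),
  difference   = obsDif > criticalDif
)
```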
15.7.7. Calculating an effect size


As I mentioned before, there isn’t an easy way to convert a chi-square statistic that has 
more than one degree of freedom to an effect size r and, in any case, it’s not always that 
helpful to have an effect size for a general effect like that tested by Friedman’s ANOVA. 13 
Therefore, it’s more sensible (in my opinion at least) to calculate effect sizes for any comparisons
you’ve done after the ANOVA. As we saw in section 15.5.5, it’s straightforward to
get an effect size r from the Wilcoxon signed-rank test. Therefore, you could conduct some
Wilcoxon tests and get the effect sizes using the rFromWilcox() function.
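As a rough sketch of that idea, the code below compares the start of the diet with month 2 and converts the result to r. The helper effectSizeR() is a hypothetical stand-in written for this example (it is not the rFromWilcox() function itself); it assumes the usual conversion of the p-value to z and then r = z/√N, with N taken to be the total number of observations (20 here).

```r
# Weights (kg) typed in from Table 15.5 (start and month 2 only)
dietData <- data.frame(
  Start  = c(63.75, 62.98, 65.98, 107.27, 66.58, 120.46, 62.01, 71.87, 83.01, 76.62),
  Month2 = c(81.34, 69.31, 77.89, 91.33, 72.87, 114.26, 68.01, 55.43, 71.63, 68.60)
)

# Hypothetical helper: p-value -> z (normal distribution) -> r = z / sqrt(N)
effectSizeR <- function(wilcoxModel, N) {
  z <- qnorm(wilcoxModel$p.value / 2)
  z / sqrt(N)
}

# Wilcoxon signed-rank test for the two related conditions
startVsMonth2 <- wilcox.test(dietData$Start, dietData$Month2, paired = TRUE)

# N is assumed to be the number of observations (10 people x 2 conditions)
r <- effectSizeR(startVsMonth2, N = 2 * nrow(dietData))
r
```

Because z is taken from the lower tail, r comes out negative (or zero); it is the size that matters.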


15.7.8. Writing and interpreting the results


For Friedman’s ANOVA we need only report the test statistic (which we saw earlier is
denoted by χ²), 14 its degrees of freedom and its significance. So, we could report something
like:

✓ The weight of participants did not significantly change over the two months of the
diet, χ²(2) = 0.20, p > .05.

Although with no significant initial analysis we wouldn’t report post hoc tests for these
data, in case you need to, you should say something like this:

✓ The weight of participants did not significantly change over the two months of the
diet, χ²(2) = 0.20, p > .05. Post hoc tests were used with Bonferroni correction applied.
It appeared that weight didn’t significantly change from the start of the diet to one
month (difference = 1), from the start of the diet to two months (difference = 2),
or from one month to two months (difference = 1). In all cases, the critical difference
(α = .05 corrected for the number of tests) was 10.71. We can conclude that the
Andikins diet, like its creator, is a complete failure.


CRAMMING SAM’S TIPS 


Friedman’s ANOVA 


• Friedman’s ANOVA compares several conditions when the same participants take part in each condition and the resulting 
data violate an assumption of one-way repeated-measures ANOVA. 

• Look at the row labelled p-value. If the value is less than .05 then the conditions are significantly different. 

• You can follow up the main analysis with post hoc tests using the friedmanmc() function. Look at the column labelled difference:
if it says TRUE then the groups differ significantly.

• Report the χ² statistic, its degrees of freedom and significance.

• Report the medians and their ranges (or draw a boxplot). 


13 If you really want to, though, you can (as with the Kruskal-Wallis test) use the significance value of the chi- 
square test statistic to find an associated value of z from a table of probability values for the normal distribution 
(see Appendix) and then use the conversion to r that we’ve seen throughout this chapter. 


14 You might also see it denoted as χ²_F.
What have I discovered about statistics?


This chapter has dealt with an alternative approach to violations of parametric assumptions,
which is to use tests based on ranking the data. We started with the Wilcoxon
rank-sum test, which is used for comparing two independent groups. This test allowed 
us to look in some detail at the process of ranking data. We then moved on to look at the 
Wilcoxon signed-rank test, which is used to compare two related conditions. We moved 
onto more complex situations in which there are several conditions (the Kruskal-Wallis 
test for independent conditions and Friedman’s ANOVA for related conditions). For 
each of these tests we looked at the theory of the test (although these sections could 
be ignored) and then focused on how to conduct them using R, how to interpret the 
results and how to report the results of the test. In the process we discovered that drugs 
make you depressed, soya reduces your sperm count, and my lifestyle is not conducive 
to losing weight. 

We also discovered that my teaching career got off to an inauspicious start. As it 
turned out, one of the reasons why the class did not have a clue what I was talking 
about was that I hadn’t been shown their course handouts and I was trying to teach them 
ANOVA using completely different equations than their lecturer (there are many ways 
to compute an ANOVA). The other reason was that I was a rubbish teacher. This event 
did change my life, though, because the experience was so awful that I did everything in 
my power to make sure that it didn’t happen again. After years of experimentation I can 
now pass on the secret of avoiding students telling you how awful your ANOVA classes 
are: the more penis jokes you tell, the less likely you are to be emotionally crushed by 
dissatisfied students. 


R packages used in this chapter 

clinfun pgirmess 

ggplot2 Rcmdr 

pastecs 


R functions used in this chapter 


as.matrix() 
as.numeric()

by()

data.frame()

friedmanmc()

friedman.test()

gl()

kruskalmc()

kruskal.test()

length()

leveneTest() 

jonckheere.test() 


mean() 

min() 

na.omit() 

qnorm() 

rank() 

rFromWilcox() 

sqrt() 

stat.desc() 

subset() 

sum() 

wilcox.test()

Key terms that I’ve discovered


Friedman's ANOVA 
Jonckheere-Terpstra test 
Kruskal-Wallis test 
Mann-Whitney test 
Monte Carlo method 


Non-parametric tests 
Ranking 

Wilcoxon rank-sum test 
Wilcoxon signed-rank test 


Smart Alex’s tasks 



• Task 1: A psychologist was interested in the cross-species differences between men 
and dogs. She observed a group of dogs and a group of men in a naturalistic setting 
(20 of each). She classified several behaviours as being dog-like (urinating against 
trees and lampposts, attempts to copulate with anything that moved, and attempts to 
lick their own genitals). For each man and dog she counted the number of dog-like 
behaviours displayed in a 24-hour period. It was hypothesized that dogs would display
more dog-like behaviours than men. The data are in the file MenLikeDogs.dat.
Analyse them with a Wilcoxon rank-sum test.

• Task 2: There’s been much speculation over the years about the influence of subliminal
messages on records. To name a few cases, both Ozzy Osbourne and Judas Priest
have been accused of putting backward masked messages on their albums that subliminally
influence poor unsuspecting teenagers into doing things like blowing their
heads off with shotguns. A psychologist was interested in whether backward masked
messages really did have an effect. He took the master tapes of Britney Spears’ ‘Baby
One More Time’ and created a second version that had the masked message ‘deliver
your soul to the dark lord’ repeated in the chorus. He took this version, and the original,
and played one version (randomly) to a group of 32 people. He took the same
group six months later and played them whatever version they hadn’t heard the time
before. So each person heard both the original, and the version with the masked message,
but at different points in time. The psychologist measured the number of goats
that were sacrificed in the week after listening to each version. It was hypothesized
that the backward message would lead to more goats being sacrificed. The data are in
the file DarkLord.dat. Analyse them with a Wilcoxon signed-rank test.

• Task 3: A psychologist was interested in the effects of television programmes on 
domestic life. She hypothesized that through ‘learning by watching’, certain programmes
might actually encourage people to behave like the characters within them.
This in turn could affect the viewer’s own relationships (depending on whether the 
programme depicted harmonious or dysfunctional relationships). She took episodes 
of three popular TV shows and showed them to 54 couples, after which the couple 
were left alone in the room for an hour. The experimenter measured the number 
of times the couple argued. Each couple viewed all three of the TV programmes 
at different points in time (a week apart) and the order in which the programmes 
were viewed was counterbalanced over couples. The TV programmes selected were 
EastEnders (which typically portrays the lives of extremely miserable, argumentative, 
London folk who like nothing more than to beat each other up, lie to each other, 
sleep with each other’s wives and generally show no evidence of any consideration to 
their fellow humans!), Friends (which portrays a group of unrealistically considerate
and nice people who love each other oh so very much - but for some reason I love it
anyway!), and a National Geographic programme about whales (this was supposed to 
act as a control). The data are in the file Eastenders.dat. Access the file and conduct 
Friedman’s ANOVA on the data. © 


• Task 4: A researcher was interested in trying to prevent coulrophobia (fear of clowns) 
in children. She decided to do an experiment in which different groups of children 
(15 in each) were exposed to different forms of positive information about clowns. 
The first group watched some adverts for McDonald’s in which their mascot Ronald 
McDonald is seen cavorting about with children going on about how they should 
love their mums. A second group was told a story about a clown who helped some 
children when they got lost in a forest (although what on earth a clown was doing 
in a forest remains a mystery). A third group was entertained by a real clown, who 
came into the classroom and made balloon animals for the children. 15 A final group 
acted as a control condition and they had nothing done to them at all. The researcher 
took self-report ratings of how much the children liked clowns, resulting in a score 
for each child that could range from 0 (not scared of clowns at all) to 5 (very scared 
of clowns). The data are in the file coulrophobia.dat. Access them and conduct a 
Kruskal-Wallis test. © 

Answers can be found on the companion website and, because these examples are used in 
Field and Hole (2003), you could steal this book or photocopy Chapter 7 to get some very 
detailed answers. 



Further reading 


Siegel, S., & Castellan, N. J. (1988). Nonparametric statistics for the behavioral sciences (2nd ed.).
New York: McGraw-Hill. (This has become the definitive text on non-parametric statistics, and is 
the only book seriously worth recommending as ‘further’ reading. It is probably not a good book 
for anyone with a statistics phobia, but if you’ve coped with my chapter then this book will be an 
excellent next step.) 

Wilcox, R. R. (2005). Introduction to robust estimation and hypothesis testing (2nd ed.). Burlington, 
MA: Elsevier. (Wilcox’s book is quite technical compared to this one, but really is a wonderful 
resource. Wilcox describes how to use an astonishing range of robust tests, many of which we 
discuss throughout this book.) 


Interesting real research 


Çetinkaya, H., & Domjan, M. (2006). Sexual fetishism in a quail (Coturnix japonica) model system:
Test of reproductive success. Journal of Comparative Psychology, 120(4), 427-432.

Matthews, R. C., Domjan, M., Ramsey, M., & Crews, D. (2007). Learning effects on sperm competition
and reproductive fitness. Psychological Science, 18(9), 758-762.


15 Unfortunately, the first time they attempted the study the clown accidentally burst one of the balloons. 
The noise frightened the children and they associated that fear response with the clown. All 15 children are 
currently in therapy for coulrophobia! 





16 Multivariate analysis of variance (MANOVA)



FIGURE 16.1 Fuzzy doing some light reading



16.1. What will this chapter tell me? © 


Having had what little confidence I had squeezed out of me by my formative teaching experiences,
I decided that I could either kill myself, or get a cat. I’d wanted to do both for years,
but when I was introduced to a little 4-week-old bundle of gingerness the choice was made. 
Fuzzy (as I named him) was born on 8 April 1996 and has been my right-hand feline ever 
since. He is like the Cheshire cat in Lewis Carroll’s Alice’s Adventures in Wonderland, 1 in that

1 This is one of my favourite books from my childhood. For those that haven’t read it, the Cheshire cat is a big
fat cat mainly remembered for vanishing and reappearing out of nowhere; on one occasion it vanished leaving
only its smile behind.

he seemingly vanishes and reappears at will: I go to find clothes in my wardrobe and notice
a ginger face peering out at me, I put my pants in the laundry basket and he looks up at me 
from a pile of smelly socks, I go to have a bath and he’s sitting in it, and I shut the bedroom 
door yet wake up to find him asleep next to me. His best vanishing act was a few years 
ago when I moved house. He’d been locked up in his travel basket (which he hates) during 
the move, so once we were in our new house I thought I’d let him out as soon as possible. 
I found a quiet room, checked the doors and windows to make sure he couldn’t escape, 
opened the basket, gave him a cuddle and left him to get to know his new house. When I 
returned five minutes later, he was gone. The door had been shut, the windows closed and 
the walls were solid (I checked). He had literally vanished into thin air and he didn’t even 
leave behind his smile. Before his dramatic disappearance, Fuzzy had stopped my suicidal 
tendencies, and there is lots of research showing that having a pet is good for your mental 
health. If you wanted to test this you could compare people with pets against those without 
to see if they had better mental health. However, the term mental health covers a wide range 
of concepts, including (to name a few) anxiety, depression, general distress and psychosis. 
As such, we have four outcome measures and all the tests we have encountered allow us to 
look at one. Fear not, when we want to compare groups on several outcome variables we 
can extend ANOVA to become MANOVA. That’s what this chapter is all about. 


16.2. When to use MANOVA © 


Over Chapters 9-14, we have seen how the general linear model (GLM) can be 
used to detect group differences on a single dependent variable. However, there 
may be circumstances in which we are interested in several dependent variables, 
and in these cases the simple ANOVA model is inadequate. Instead, we can use an 
extension of this technique known as multivariate analysis of variance (or MANOVA). 
MANOVA can be thought of as ANOVA for situations in which there are several 
dependent variables. The principles of ANOVA extend to MANOVA in that we 
can use MANOVA when there is only one independent variable or when there are 
several, we can look at interactions between independent variables, and we can 
even do contrasts to see which groups differ from each other. ANOVA can be used 
only in situations in which there is one dependent variable (or outcome) and so is known as 
a univariate test (univariate quite obviously means ‘one variable’); MANOVA is designed to 
look at several dependent variables (outcomes) simultaneously and so is a multivariate test 
(multivariate means ‘many variables’). This chapter will explain some basics about MANOVA 
for those of you who want to skip the fairly tedious theory sections and just get on with the 
test. However, for those who want to know more there is a fairly lengthy theory section to 
try to explain the workings of MANOVA. We then look at an example using R and see how 
the output from MANOVA can be interpreted. This leads us to look at another statistical test 
known as discriminant function analysis. 



16.3. Introduction: similarities to 
and differences from ANOVA © 


If we have collected data about several dependent variables then we could simply conduct a 
separate ANOVA for each dependent variable (and if you read research articles you’ll find 
that it is not unusual for researchers to do this). Think back to Chapter 10, and you should 
remember that a similar question was posed regarding why ANOVA was used in preference
to multiple t-tests. The reason why MANOVA is used instead of multiple ANOVAs is
the same: the more tests we conduct on the same data, the more we inflate the
familywise error rate (see section 10.2.1). The more dependent variables we 
have measured, the more ANOVAs we would need to conduct and the greater 
the chance of making a Type I error. 
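To put some numbers on this, the sketch below uses the familiar formula for the familywise error rate, 1 − (1 − α)^n, which assumes that the n tests are independent. The object names are mine.

```r
alpha <- .05    # Type I error rate for each individual test
nTests <- 1:5   # number of ANOVAs (one per dependent variable)

# Probability of at least one Type I error across the family of tests,
# assuming the tests are independent
familywise <- 1 - (1 - alpha)^nTests

round(data.frame(nTests, familywise), 3)
```

With four dependent variables (and so four ANOVAs), for example, the familywise error rate is about .185 rather than .05.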

However, there are other reasons for preferring MANOVA to several ANOVAs. For one
thing, there is important additional information that is gained from a MANOVA. If
separate ANOVAs are conducted on each dependent variable, then any relationship
between dependent variables is ignored. As such, we lose information about any
correlations that might exist between the dependent variables. MANOVA, by including
all dependent variables in the same analysis, takes account of the relationship
between outcome variables. Related to this point, ANOVA can tell us only whether
groups differ along a single dimension, whereas MANOVA has the power to detect
whether groups differ along a combination of dimensions. For example, ANOVA tells us
how scores on a single dependent variable distinguish groups of participants (so, for
example, we might be able to distinguish people who are married, living together or
single by their happiness). MANOVA incorporates information about several outcome
measures and, therefore, informs us of whether groups of participants can be
distinguished by a combination of scores on several dependent measures. For example,
‘happiness’ is a complex construct, so we might want to measure participants’
happiness with work, socially, sexually and within themselves (self-esteem). It may
not be possible to distinguish people who are married, living together or single only
by their happiness at work, but they might be distinguished by a combination of their
happiness across all four domains: work, social, sexual, and the self. So, in this
sense MANOVA has greater power to detect an effect, because it can detect whether
groups differ along a combination of variables, whereas ANOVA can detect only if
groups differ along a single variable (see Jane Superbrain Box 16.1). For these
reasons, MANOVA is preferable to conducting several ANOVAs.


JANE SUPERBRAIN 16.1

The power of MANOVA

I mentioned in the previous section that MANOVA had greater power than ANOVA to
detect effects because it could take account of the correlations between dependent
variables (Huberty & Morris, 1989). However, the issue of power is more complex than
alluded to by my simple statement. Ramsey (1982) found that as the correlation
between dependent variables increased, the power of MANOVA decreased. This led
Tabachnick and Fidell (2007) to recommend that MANOVA ‘works best with highly
negatively correlated DVs, and acceptably well with moderately correlated DVs in
either direction’ and that ‘MANOVA also is wasteful when DVs are uncorrelated’
(p. 268). In contrast, Stevens’s (1980) investigation of the effect of dependent
variable correlations on test power revealed that ‘the power with high
intercorrelations is in most cases greater than that for moderate intercorrelations,
and in some cases it is dramatically higher’ (p. 736). These findings are slightly
contradictory, which leaves us with the puzzling conundrum of what, exactly, the
relationship is between power and intercorrelation of the dependent variables.
Luckily, Cole, Maxwell, Arvey, and Salas (1994) have done a great deal to illuminate
this relationship. They found that the power of MANOVA depends on a combination of
the correlation between dependent variables and the effect size to be detected. In
short, if you are expecting to find a large effect, then MANOVA will have greater
power if the measures are somewhat different (even negatively correlated) and if the
group differences are in the same direction for each measure. If you have two
dependent variables, one of which exhibits a large group difference, and one of which
exhibits a small or no group difference, then power will be increased if these
variables are highly correlated. The take-home message from Cole et al.’s work is
that if you are interested in how powerful the MANOVA is likely to be you should
consider not just the intercorrelation of dependent variables but also the size and
pattern of group differences that you expect to get. However, it should be noted that
Cole et al.’s work is limited to the case where two groups are being compared, and
power considerations are more complex in multiple-group situations.


16.3.1. Words of warning


From my description of MANOVA it is probably looking like a pretty groovy little test that 
allows you to measure hundreds of dependent variables and then just sling them into the 
analysis. This is not the case. It is not a good idea to lump all of your dependent variables 
together in a MANOVA unless you have a good theoretical or empirical basis for doing so. 
I mentioned way back at the beginning of this book that statistical procedures are just a 
way of number crunching and so even if you put rubbish into an analysis you will still reach 
conclusions that are statistically meaningful, but are unlikely to be empirically meaningful. 
In circumstances where there is a good theoretical basis for including some but not all of 
your dependent variables, you should run separate analyses: one for the variables being 
tested on a heuristic basis and one for the theoretically meaningful variables. The point to 
take on board here is not to include lots of dependent variables in a MANOVA just because 
you have measured them. 


16.3.2. The example for this chapter


Throughout the rest of this chapter we’re going to use a single example to look at how 
MANOVA works and then how to conduct one using R. Imagine that we were interested in 
the effects of cognitive behaviour therapy (CBT) on obsessive compulsive disorder (OCD). 
OCD is a disorder characterized by intrusive images or thoughts that the sufferer finds 
abhorrent (in my case this might be the thought of someone carrying out a t-test on data 
that are not normally distributed, but in normal people it could be something like
imagining your parents have died). These thoughts lead the sufferer to engage in
activities to neutralize the unpleasantness of these thoughts (these activities can be
mental, such as doing a MANOVA in my head to make me feel better about the t-test
thought, or physical, such as touching the floor 23 times so that your parents won’t
die). Now, we could compare a group of OCD sufferers after CBT and after behaviour
therapy (BT) with a group of OCD sufferers who are still awaiting treatment (a
no-treatment condition, NT).² There are both behavioural and cognitive elements to
most psychopathologies. For example, in OCD if someone had an obsession with germs and
contamination, this disorder might manifest itself in obsessive hand-washing and would
influence not just how many times they actually wash their hands (behaviour), but also
the number of times they think about washing
their hands (cognitions). If we are interested in seeing how successful a therapy is, it is not 
enough to look only at behavioural outcomes (such as whether obsessive behaviours are 
reduced); it is important to establish whether cognitions are being changed also. Hence, 
in this example two dependent measures were taken: the occurrence of obsession-related 
behaviours (Actions) and the occurrence of obsession-related cognitions (Thoughts). 
These dependent variables were measured on a single day and so represent the number of 
obsession-related behaviours/thoughts in a normal day. 


2 The non-psychologists out there should note that behaviour therapy works on the basis that if you stop the 
maladaptive behaviours the disorder will go away, whereas cognitive therapy is based on the idea that treating the 
maladaptive cognitions will stop the disorder. 
Table 16.1 Data from OCD.dat

                  DV 1: Actions                  DV 2: Thoughts
Group:     CBT (1)   BT (2)   NT (3)      CBT (1)   BT (2)   NT (3)
              5         4        4           14       14       13
              5         4        5           11       15       15
              4         1        5           16       13       14
              4         1        4           13       14       14
              5         4        6           12       15       13
              3         6        4           14       19       20
              7         5        7           12       13       13
              6         5        4           15       18       16
              6         2        6           16       14       14
              4         5        5           11       17       18
X̄           4.90      3.70     5.00        13.40    15.20    15.00
s            1.20      1.77     1.05         1.90     2.10     2.36
s²           1.43      3.12     1.11         3.60     4.40     5.56

X̄grand(Actions) = 4.53          X̄grand(Thoughts) = 14.53
s²grand(Actions) = 2.1195       s²grand(Thoughts) = 4.8780



The data are in Table 16.1 and can be found in the file OCD.dat. Participants belonged 
to group 1 (CBT), group 2 (BT) or group 3 (NT), and within these groups all participants 
had both actions and thoughts measured. 
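
To check the summary rows of Table 16.1, the scores can be typed in and summarized directly. This is a minimal Python sketch using only the standard library, not the R workflow used elsewhere in this chapter; the dictionary layout and the `describe` helper are illustrative choices, and the scores are those listed in Table 16.1:

```python
from statistics import mean, variance

# Scores from Table 16.1, in participant order within each group
actions = {
    "CBT": [5, 5, 4, 4, 5, 3, 7, 6, 6, 4],
    "BT": [4, 4, 1, 1, 4, 6, 5, 5, 2, 5],
    "NT": [4, 5, 5, 4, 6, 4, 7, 4, 6, 5],
}
thoughts = {
    "CBT": [14, 11, 16, 13, 12, 14, 12, 15, 16, 11],
    "BT": [14, 15, 13, 14, 15, 19, 13, 18, 14, 17],
    "NT": [13, 15, 14, 14, 13, 20, 13, 16, 14, 18],
}


def describe(dv):
    """Return per-group (mean, variance), the grand mean and the grand variance."""
    all_scores = [x for scores in dv.values() for x in scores]
    by_group = {g: (mean(s), variance(s)) for g, s in dv.items()}
    return by_group, mean(all_scores), variance(all_scores)


for name, dv in (("Actions", actions), ("Thoughts", thoughts)):
    by_group, grand_mean, grand_var = describe(dv)
    for group, (m, v) in by_group.items():
        print(f"{name:8} {group:3} mean = {m:.2f}  variance = {v:.2f}")
    print(f"{name:8} grand mean = {grand_mean:.2f}  grand variance = {grand_var:.4f}")
```

Note that `statistics.variance` is the sample variance (it divides by n − 1), which is the quantity tabulated as s².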


16.4. Theory of MANOVA



The theory of MANOVA is very complex to understand without knowing matrix algebra, 
and frankly matrix algebra is way beyond the scope of this book (those with maths brains 
can consult Namboodiri, 1984; Stevens, 2002). However, I intend to give a flavour of the 
conceptual basis of MANOVA, using matrices, without requiring you to understand exactly 
how those matrices are used. Those interested in the exact underlying theory of MANOVA 
should read Bray and Maxwell’s (1985) superb monograph. 


16.4.1. Introduction to matrices


A matrix is simply a collection of numbers arranged in columns and rows. In fact,
throughout this book you have been using matrices: every dataframe you have created is
a matrix but with names for each column. In dataframes we have numbers arranged in
columns and rows and this is a matrix. A matrix can have many columns and many rows
and we usually specify the dimensions of the matrix using numbers. So, a 2 × 3 matrix
is a matrix with two rows and three columns, and a 5 × 4 matrix is one with five rows
and four columns (examples below):

$$
\underbrace{\begin{pmatrix} 2 & 5 & 6 \\ 3 & 5 & 8 \end{pmatrix}}_{2 \times 3 \text{ matrix}}
\qquad
\underbrace{\begin{pmatrix} 2 & 4 & 6 & 8 \\ 3 & 4 & 6 & 7 \\ 4 & 3 & 5 & 8 \\ 2 & 5 & 7 & 9 \\ 4 & 6 & 6 & 9 \end{pmatrix}}_{5 \times 4 \text{ matrix}}
$$



In many of our dataframes we have thought of each row as representing the data from a 
single participant and each column as representing data relating to a particular variable. So, 
for the 5x4 matrix, we can imagine a situation where five participants were tested on four 
variables: the first participant scored 2 on the first variable and 8 on the fourth variable. 
The values within a matrix are typically referred to as components or elements. 

A square matrix is one in which there are equal numbers of columns and rows. In this 
type of matrix it is sometimes useful to distinguish between the diagonal components (i.e., 
the values that lie on the diagonal line from the top left component to the bottom right 
component) and the off-diagonal components (the values that do not lie on the diagonal). 
In the matrix below, the diagonal components are 5, 12, 2 and 6 because they lie along the 
diagonal line. The off-diagonal components are all of the other values. A square matrix in 
which the diagonal elements are equal to 1 and the off-diagonal elements are equal to 0 is 
known as an identity matrix: 



[Two matrices shown here: a 4 × 4 square matrix with diagonal components 5, 12, 2 and 6, and an identity matrix]


Hopefully, the concept of a matrix should now be slightly less scary than it was previously: 
it is not some magical mathematical entity, merely a way of representing a data set - just 
like a spreadsheet. 
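
The ideas above (dimensions, components, the diagonal and the identity matrix) can be tried out in a few lines. This is a minimal Python sketch using plain nested lists, with the 5 × 4 example from this section; the helper names `shape`, `diagonal` and `identity` are illustrative, not library functions:

```python
# The 5 x 4 example: five participants (rows) measured on four variables (columns)
scores = [
    [2, 4, 6, 8],
    [3, 4, 6, 7],
    [4, 3, 5, 8],
    [2, 5, 7, 9],
    [4, 6, 6, 9],
]


def shape(matrix):
    """Return (number of rows, number of columns)."""
    return len(matrix), len(matrix[0])


def diagonal(matrix):
    """Diagonal components of a square matrix (top left to bottom right)."""
    return [matrix[i][i] for i in range(len(matrix))]


def identity(n):
    """n x n matrix with 1 on the diagonal and 0 everywhere else."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


print(shape(scores))                # rows, columns
print(scores[0][0], scores[0][3])   # first participant: first and fourth variable
print(diagonal(identity(4)))        # diagonal components of an identity matrix
print([row[0] for row in scores])   # a single column of the matrix
```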

Now, there is a special case of a matrix where there are data from only one entity, and 
this is known as a row vector. Likewise, if there is only one column in a matrix this is 
known as a column vector. In the examples below, the row vector can be thought of as a 
single person’s score on four different variables, whereas the column vector can be thought 
of as five participants’ scores on one variable: 


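$$
\underbrace{\begin{pmatrix} 2 & 6 & 4 & 8 \end{pmatrix}}_{\text{Row vector}}
\qquad
\underbrace{\begin{pmatrix} 8 \\ 6 \\ 10 \\ 15 \\ 6 \end{pmatrix}}_{\text{Column vector}}
$$
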
Armed with this knowledge of what vectors are, we can have a brief look at how they
are used to conduct MANOVA. 


16.4.2. Some important matrices and their functions


As with ANOVA, we are primarily interested in how much variance can be explained by the 
experimental manipulation (which in real terms means how much variance is explained by 
the fact that certain scores appear in certain groups). Therefore, we need to know the sum 
of squares due to the grouping variable (the systematic variation, SS_M), the sum of squares
due to natural differences between participants (the residual variation, SS_R) and of course
the total amount of variation that needs to be explained (SS_T); for more details about these
sources of variation reread Chapters 7 and 10. However, I mentioned that MANOVA also
takes into account several dependent variables simultaneously and it does this by using
a matrix that contains information about the variance accounted for by each dependent
variable. For the univariate F-test (e.g., ANOVA) we calculated the ratio of systematic
variance to unsystematic variance for a single dependent variable. In MANOVA, the test
statistic is derived by comparing the ratio of systematic to unsystematic variance for several
dependent variables. This comparison is made by using the ratio of a matrix representing
the systematic variance of all dependent variables to a matrix representing the
unsystematic variance of all dependent variables. To sum up, the test statistic in both ANOVA and
MANOVA represents the ratio of the effect of the systematic variance to the unsystematic 
variance; in ANOVA these variances are single values, but in MANOVA each is a matrix 
containing many variances and covariances. 

The matrix that represents the systematic variance (or the model sum of squares for all 
variables) is denoted by the letter H and is called the hypothesis sum of squares and cross- 
products matrix (or hypothesis SSCP). The matrix that represents the unsystematic variance 
(the residual sums of squares for all variables) is denoted by the letter E and is called the 
error sum of squares and cross-products matrix (or error SSCP). Finally, there is a matrix that 
represents the total amount of variance present for each dependent variable (the total sums 
of squares for each dependent variable) and this is denoted by T and is called the total sum 
of squares and cross-products matrix (or total SSCP). 

Later, I will show how these matrices are used in exactly the same way as the simple sums 
of squares (SS_M, SS_R and SS_T) in ANOVA to derive a test statistic representing the ratio of
systematic to unsystematic variance in the model. The observant among you may have 
noticed that the matrices I have described are all called sum of squares and cross-products 
(SSCP) matrices. It should be obvious why these matrices are referred to as sum of squares 
matrices, but why is there a reference to cross-products in their name? 



SELF-TEST

Can you remember (from Chapter 6) what a cross-product is?


Cross-products represent a total value for the combined error between two variables (so in 
some sense they represent an unstandardized estimate of the total correlation between two 
variables). As such, whereas the sum of squares of a variable is the total squared difference 
between the observed values and the mean value, the cross-product is the total combined 
error between two variables. I mentioned earlier that MANOVA had the power to account for 
any correlation between dependent variables, and it does this by using these cross-products. 
16.4.3. Calculating MANOVA by hand: a worked example


To begin with, let’s carry out univariate ANOVAs on each of the two dependent variables 
in our OCD example (see Table 16.1). A description of the ANOVA model can be found 
in Chapter 10, and I will draw heavily on the assumption that you have read this chapter; 
if you are hazy on the details of Chapter 10 then now would be a good time to (re)read 
sections 10.2.5-10.2.9. 


16.4.3.1. Univariate ANOVA for DV 1 (Actions)


There are three sums of squares that need to be calculated. First, we need to assess how 
much variability there is to be explained within the data (SS_T). Next, we need to see
how much of this variability can be explained by the model (SS_M). Finally, we have to assess
how much error there is in the model (SS_R). From Chapter 10 we can calculate each of
these values: 

• SS_T(Actions): The total sum of squares is obtained by calculating the difference between
each of the 30 scores and the mean of those scores, then squaring these differences
and adding these squared values up. Alternatively, you can get R to calculate the
variance for the action data (regardless of which group the score falls into) and then
multiply this value by the number of scores minus 1:

$$
\begin{aligned}
SS_T &= s^2_{\text{grand}}(N - 1) \\
&= 2.1195(30 - 1) \\
&= 2.1195 \times 29 \\
&= 61.47
\end{aligned}
$$

• SS_M(Actions): This value is calculated by taking the difference between each group mean
and the grand mean and then squaring them. Multiply these values by the number of
scores in the group and then add them together:

$$
\begin{aligned}
SS_M &= 10(4.90 - 4.53)^2 + 10(3.70 - 4.53)^2 + 10(5.00 - 4.53)^2 \\
&= 10(0.37)^2 + 10(-0.83)^2 + 10(0.47)^2 \\
&= 1.37 + 6.89 + 2.21 \\
&= 10.47
\end{aligned}
$$

• SS_R(Actions): This value is calculated by taking the difference between each score and
the mean of the group from which it came. These differences are then squared and
then added together. Alternatively we can get R to calculate the variance within each
group, multiply each group variance by the number of scores minus 1 and then add
them together:

$$
\begin{aligned}
SS_R &= s^2_{\text{CBT}}(n_{\text{CBT}} - 1) + s^2_{\text{BT}}(n_{\text{BT}} - 1) + s^2_{\text{NT}}(n_{\text{NT}} - 1) \\
&= (1.433)(10 - 1) + (3.122)(10 - 1) + (1.111)(10 - 1) \\
&= (1.433 \times 9) + (3.122 \times 9) + (1.111 \times 9) \\
&= 12.9 + 28.1 + 10.0 \\
&= 51.00
\end{aligned}
$$
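
The three sums of squares above can be checked numerically. The following is a minimal Python sketch (standard library only, with illustrative variable names) that follows the same three recipes using the Actions scores from Table 16.1:

```python
from statistics import mean, variance

# Actions scores from Table 16.1
actions = {
    "CBT": [5, 5, 4, 4, 5, 3, 7, 6, 6, 4],
    "BT": [4, 4, 1, 1, 4, 6, 5, 5, 2, 5],
    "NT": [4, 5, 5, 4, 6, 4, 7, 4, 6, 5],
}

all_scores = [x for scores in actions.values() for x in scores]
grand_mean = mean(all_scores)

# Total: grand variance multiplied by (N - 1)
ss_total = variance(all_scores) * (len(all_scores) - 1)
# Model: squared distance of each group mean from the grand mean, weighted by group size
ss_model = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in actions.values())
# Residual: each group's variance multiplied by (n - 1), summed over groups
ss_resid = sum(variance(s) * (len(s) - 1) for s in actions.values())

print(round(ss_total, 2), round(ss_model, 2), round(ss_resid, 2))
```

Because the code works from unrounded means and variances, its results can differ from the hand calculation in the last decimal place; the model and residual sums of squares still add up to the total.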
The next step is to calculate the average sums of squares (the mean square) of each by
dividing by the degrees of freedom (see section 10.2.8):

                     SS       df      MS
SS_M(Actions)      10.47       2     5.235
SS_R(Actions)      51.00      27     1.889

The final stage is to calculate F by dividing the mean squares for the model by the mean
squares for the error in the model:

$$
F = \frac{MS_M}{MS_R} = \frac{5.235}{1.889} = 2.771
$$


This value can then be evaluated against critical values of F. The point to take home here is 
the calculation of the various sums of squares and to what each one relates. 


16.4.3.2. Univariate ANOVA for DV 2 (Thoughts)

As with the data for dependent variable 1, there are three sums of squares that need to be 
calculated: 

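• SS_T(Thoughts):

$$
\begin{aligned}
SS_T &= s^2_{\text{grand}}(N - 1) \\
&= 4.878(30 - 1) \\
&= 4.878 \times 29 \\
&= 141.46
\end{aligned}
$$

• SS_M(Thoughts):

$$
\begin{aligned}
SS_M &= 10(13.40 - 14.53)^2 + 10(15.2 - 14.53)^2 + 10(15.0 - 14.53)^2 \\
&= 10(-1.13)^2 + 10(0.67)^2 + 10(0.47)^2 \\
&= 12.77 + 4.49 + 2.21 \\
&= 19.47
\end{aligned}
$$

• SS_R(Thoughts):

$$
\begin{aligned}
SS_R &= s^2_{\text{CBT}}(n_{\text{CBT}} - 1) + s^2_{\text{BT}}(n_{\text{BT}} - 1) + s^2_{\text{NT}}(n_{\text{NT}} - 1) \\
&= (3.6)(10 - 1) + (4.4)(10 - 1) + (5.56)(10 - 1) \\
&= (3.6 \times 9) + (4.4 \times 9) + (5.56 \times 9) \\
&= 32.4 + 39.6 + 50.0 \\
&= 122
\end{aligned}
$$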
The next step is to calculate the average sums of squares (the mean square) of each by
dividing by the degrees of freedom (see section 10.2.8):

                      SS       df      MS
SS_M(Thoughts)      19.47       2     9.735
SS_R(Thoughts)     122.00      27     4.519

The final stage is to calculate F by dividing the mean squares for the model by the mean
squares for the error in the model:

$$
F = \frac{MS_M}{MS_R} = \frac{9.735}{4.519} = 2.154
$$


This value can then be evaluated against critical values of F. Again, the point to take home 
here is the calculation of the various sums of squares and to what each one relates. 
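
Both univariate analyses follow the same recipe, so they can be wrapped in one small function. This is a minimal Python sketch (standard library only; `one_way_anova` is an illustrative helper, not a library routine) that returns the sums of squares, degrees of freedom, mean squares and F for each dependent variable. It stops at F and does not compute a p-value:

```python
from statistics import mean, variance


def one_way_anova(groups):
    """Sums of squares, degrees of freedom, mean squares and F for a list of groups."""
    scores = [x for g in groups for x in g]
    grand_mean = mean(scores)
    ss_m = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_r = sum(variance(g) * (len(g) - 1) for g in groups)
    df_m = len(groups) - 1            # number of groups minus 1
    df_r = len(scores) - len(groups)  # number of scores minus number of groups
    ms_m, ms_r = ss_m / df_m, ss_r / df_r
    return {"SS_M": ss_m, "SS_R": ss_r, "df_M": df_m, "df_R": df_r,
            "MS_M": ms_m, "MS_R": ms_r, "F": ms_m / ms_r}


# Scores from Table 16.1, groups in the order CBT, BT, NT
actions = [[5, 5, 4, 4, 5, 3, 7, 6, 6, 4],
           [4, 4, 1, 1, 4, 6, 5, 5, 2, 5],
           [4, 5, 5, 4, 6, 4, 7, 4, 6, 5]]
thoughts = [[14, 11, 16, 13, 12, 14, 12, 15, 16, 11],
            [14, 15, 13, 14, 15, 19, 13, 18, 14, 17],
            [13, 15, 14, 14, 13, 20, 13, 16, 14, 18]]

for name, groups in (("Actions", actions), ("Thoughts", thoughts)):
    result = one_way_anova(groups)
    print(name, {key: round(value, 3) for key, value in result.items()})
```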


16.4.3.3. The relationship between DVs: cross-products


We know already that MANOVA uses the same sums of squares as ANOVA, and in the 
next section we will see exactly how it uses these values. However, I have also mentioned 
that MANOVA takes account of the relationship between dependent variables by using 
the cross-products. There are three different cross-products that are of interest, and these 
three cross-products relate to the three sums of squares that we calculated for the
univariate ANOVAs: that is, there is a total cross-product, a cross-product due to the model and a
residual cross-product. Let’s look at the total cross-product (CP_T) first.

I mentioned in Chapter 6 that the cross-product was the difference between the scores 
and the mean in one group multiplied by the difference between the scores and the mean 
in the other group. In the case of the total cross-product, the mean of interest is the grand 
mean for each dependent variable (see Table 16.2). Hence, we can adapt the cross-product 
equation described in Chapter 6 using the two dependent variables. The resulting equation 
for the total cross-product is as follows: 

$$
CP_T = \sum \left(x_{i(\text{Actions})} - \bar{X}_{\text{grand(Actions)}}\right)\left(x_{i(\text{Thoughts})} - \bar{X}_{\text{grand(Thoughts)}}\right) \tag{16.1}
$$


Therefore, for each dependent variable you take each score and subtract from it the grand 
mean for that variable. This leaves you with two values per participant (one for each 
dependent variable), which should be multiplied together to get the cross-product for each 
participant. The total can then be found by adding the cross-products of all participants. 
Table 16.2 illustrates this process. 
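
The procedure just described can be sketched in a few lines. This is a minimal Python example (standard library only, illustrative variable names) that pairs each participant's two scores from Table 16.1, subtracts the relevant grand mean from each, multiplies the two differences and adds the products up:

```python
from statistics import mean

# Table 16.1 scores in participant order: CBT (10), then BT (10), then NT (10)
actions = [5, 5, 4, 4, 5, 3, 7, 6, 6, 4,
           4, 4, 1, 1, 4, 6, 5, 5, 2, 5,
           4, 5, 5, 4, 6, 4, 7, 4, 6, 5]
thoughts = [14, 11, 16, 13, 12, 14, 12, 15, 16, 11,
            14, 15, 13, 14, 15, 19, 13, 18, 14, 17,
            13, 15, 14, 14, 13, 20, 13, 16, 14, 18]

grand_actions = mean(actions)
grand_thoughts = mean(thoughts)

# One product per participant, summed over all 30 participants
cp_total = sum((a - grand_actions) * (t - grand_thoughts)
               for a, t in zip(actions, thoughts))
print(round(cp_total, 2))
```

The pairing matters: the two lists must be in the same participant order, because each product combines the two scores of one person.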

Table 16.2 Calculation of the total cross-product
(D1 = Actions − X̄grand(Actions); D2 = Thoughts − X̄grand(Thoughts))

Group     Actions   Thoughts      D1        D2      D1 × D2
CBT          5         14        0.47     -0.53      -0.25
             5         11        0.47     -3.53      -1.66
             4         16       -0.53      1.47      -0.78
             4         13       -0.53     -1.53       0.81
             5         12        0.47     -2.53      -1.19
             3         14       -1.53     -0.53       0.81
             7         12        2.47     -2.53      -6.25
             6         15        1.47      0.47       0.69
             6         16        1.47      1.47       2.16
             4         11       -0.53     -3.53       1.87
BT           4         14       -0.53     -0.53       0.28
             4         15       -0.53      0.47      -0.25
             1         13       -3.53     -1.53       5.40
             1         14       -3.53     -0.53       1.87
             4         15       -0.53      0.47      -0.25
             6         19        1.47      4.47       6.57
             5         13        0.47     -1.53      -0.72
             5         18        0.47      3.47       1.63
             2         14       -2.53     -0.53       1.34
             5         17        0.47      2.47       1.16
NT           4         13       -0.53     -1.53       0.81
             5         15        0.47      0.47       0.22
             5         14        0.47     -0.53      -0.25
             4         14       -0.53     -0.53       0.28
             6         13        1.47     -1.53      -2.25
             4         20       -0.53      5.47      -2.90
             7         13        2.47     -1.53      -3.78
             4         16       -0.53      1.47      -0.78
             6         14        1.47     -0.53      -0.78
             5         18        0.47      3.47       1.63
X̄grand     4.53      14.53                 CP_T = Σ(D1 × D2) = 5.47

The total cross-product is a gauge of the overall relationship between the two variables.
However, we are also interested in how the relationship between the dependent variables
is influenced by our experimental manipulation, and this relationship is measured by the
model cross-product (CP_M). The CP_M is calculated in a similar way to the model sum of
squares. First, the difference between each group mean and the grand mean is calculated
for each dependent variable. The cross-product is calculated by multiplying the differences
found for each group. Each product is then multiplied by the number of scores within the
group (as was done with the sum of squares). This principle is illustrated in the following
equation and Table 16.3:

$$
CP_M = \sum n\left[\left(\bar{x}_{\text{grp(Actions)}} - \bar{X}_{\text{grand(Actions)}}\right)\left(\bar{x}_{\text{grp(Thoughts)}} - \bar{X}_{\text{grand(Thoughts)}}\right)\right] \tag{16.2}
$$
Table 16.3 Calculating the model cross-product
(D1 = X̄group − X̄grand for Actions; D2 = X̄group − X̄grand for Thoughts)

            X̄group               X̄group
Group       Actions      D1      Thoughts      D2      D1 × D2    N(D1 × D2)
CBT           4.9       0.37      13.4       -1.13     -0.418       -4.18
BT            3.7      -0.83      15.2        0.67     -0.556       -5.56
NT            5.0       0.47      15.0        0.47      0.221        2.21
X̄grand       4.53                14.53           CP_M = Σ N(D1 × D2) = -7.53
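
The model cross-product can be computed in the same spirit as the model sum of squares. The following is a minimal Python sketch (standard library only, illustrative names) that works from the raw scores in Table 16.1, so the group and grand means are unrounded:

```python
from statistics import mean

# Scores from Table 16.1
actions = {
    "CBT": [5, 5, 4, 4, 5, 3, 7, 6, 6, 4],
    "BT": [4, 4, 1, 1, 4, 6, 5, 5, 2, 5],
    "NT": [4, 5, 5, 4, 6, 4, 7, 4, 6, 5],
}
thoughts = {
    "CBT": [14, 11, 16, 13, 12, 14, 12, 15, 16, 11],
    "BT": [14, 15, 13, 14, 15, 19, 13, 18, 14, 17],
    "NT": [13, 15, 14, 14, 13, 20, 13, 16, 14, 18],
}

grand_actions = mean([x for s in actions.values() for x in s])
grand_thoughts = mean([x for s in thoughts.values() for x in s])

# For each group: (group mean - grand mean) on each DV, multiplied together,
# then weighted by the number of scores in the group
cp_model = sum(
    len(actions[g])
    * (mean(actions[g]) - grand_actions)
    * (mean(thoughts[g]) - grand_thoughts)
    for g in actions
)
print(round(cp_model, 2))
```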


Finally, we also need to know how the relationship between the two dependent variables
is influenced by individual differences in participants’ performances. The residual
cross-product (CP_R) tells us about how the relationship between the dependent variables
is affected by individual differences, or error in the model. CP_R is calculated in
a similar way to the total cross-product, except that the group means are used rather
than the grand mean (see equation (16.3)). So, to calculate each of the difference
scores, we take each score and subtract from it the mean of the group to which it
belongs (see Table 16.4):

$$
CP_R = \sum \left(x_{i(\text{Actions})} - \bar{X}_{\text{group(Actions)}}\right)\left(x_{i(\text{Thoughts})} - \bar{X}_{\text{group(Thoughts)}}\right) \tag{16.3}
$$

The observant among you may notice that the residual cross-product can also be
calculated by subtracting the model cross-product from the total cross-product:

$$
\begin{aligned}
CP_R &= CP_T - CP_M \\
&= 5.47 - (-7.53) = 13
\end{aligned}
$$


However, it is useful to calculate the residual cross-product manually in case of mistakes 
in the calculation of the other two cross-products. The fact that the residual and model 
cross-products should sum to the value of the total cross-product can be used as a useful 
double-check. 
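
The double-check described above is easy to automate. This is a minimal Python sketch (standard library only; `residual_cross_product` is an illustrative helper) that computes the residual cross-product directly from the Table 16.1 scores, group by group, and compares it with the value obtained by subtraction from the figures quoted in the text:

```python
from statistics import mean

# Scores from Table 16.1
actions = {
    "CBT": [5, 5, 4, 4, 5, 3, 7, 6, 6, 4],
    "BT": [4, 4, 1, 1, 4, 6, 5, 5, 2, 5],
    "NT": [4, 5, 5, 4, 6, 4, 7, 4, 6, 5],
}
thoughts = {
    "CBT": [14, 11, 16, 13, 12, 14, 12, 15, 16, 11],
    "BT": [14, 15, 13, 14, 15, 19, 13, 18, 14, 17],
    "NT": [13, 15, 14, 14, 13, 20, 13, 16, 14, 18],
}


def residual_cross_product(x_groups, y_groups):
    """Per-group sums of (score - group mean) products for paired scores."""
    per_group = {}
    for g in x_groups:
        mean_x, mean_y = mean(x_groups[g]), mean(y_groups[g])
        per_group[g] = sum((x - mean_x) * (y - mean_y)
                           for x, y in zip(x_groups[g], y_groups[g]))
    return per_group


per_group = residual_cross_product(actions, thoughts)
cp_resid = sum(per_group.values())
cp_by_subtraction = 5.47 - (-7.53)  # CP_T - CP_M, using the values in the text

print({g: round(v, 2) for g, v in per_group.items()})
print(round(cp_resid, 2), round(cp_by_subtraction, 2))
```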

Each of the different cross-products tells us something important about the relationship 
between the two dependent variables. Although I have used a simple scenario to keep the 
maths relatively simple, these principles can be easily extended to more complex
scenarios. For example, if we had measured three dependent variables then the cross-products
between pairs of dependent variables are calculated (as they were in this example) and
entered into the appropriate SSCP matrix (see next section). As the complexity of the
situation increases, so does the amount of calculation that needs to be done. At times such as
these the benefit of software like R becomes ever more apparent. 


16.4.3.4. The total SSCP matrix (T)


In this example we have only two dependent variables, and so all of the SSCP matrices will 
be 2 x 2 matrices. If there had been three dependent variables then the resulting matrices 
would all be 3 x 3 matrices. The total SSCP matrix, T, contains the total sums of squares for 
each dependent variable and the total cross-product between the two dependent variables. 

Table 16.4 Calculation of CP_R
(D1 = Actions − X̄group(Actions); D2 = Thoughts − X̄group(Thoughts))

Group     Actions      D1      Thoughts      D2      D1 × D2
CBT          5        0.10        14        0.60       0.06
             5        0.10        11       -2.40      -0.24
             4       -0.90        16        2.60      -2.34
             4       -0.90        13       -0.40       0.36
             5        0.10        12       -1.40      -0.14
             3       -1.90        14        0.60      -1.14
             7        2.10        12       -1.40      -2.94
             6        1.10        15        1.60       1.76
             6        1.10        16        2.60       2.86
             4       -0.90        11       -2.40       2.16
X̄CBT       4.9                  13.4                 Σ = 0.40
BT           4        0.30        14       -1.20      -0.36
             4        0.30        15       -0.20      -0.06
             1       -2.70        13       -2.20       5.94
             1       -2.70        14       -1.20       3.24
             4        0.30        15       -0.20      -0.06
             6        2.30        19        3.80       8.74
             5        1.30        13       -2.20      -2.86
             5        1.30        18        2.80       3.64
             2       -1.70        14       -1.20       2.04
             5        1.30        17        1.80       2.34
X̄BT        3.7                  15.2                 Σ = 22.60
NT           4       -1.00        13       -2.00       2.00
             5        0.00        15        0.00       0.00
             5        0.00        14       -1.00       0.00
             4       -1.00        14       -1.00       1.00
             6        1.00        13       -2.00      -2.00
             4       -1.00        20        5.00      -5.00
             7        2.00        13       -2.00      -4.00
             4       -1.00        16        1.00      -1.00
             6        1.00        14       -1.00      -1.00
             5        0.00        18        3.00       0.00
X̄NT        5                    15                   Σ = -10.00

CP_R = Σ(D1 × D2) = 13

You can think of the first column and first row as representing one dependent variable and
the second column and row as representing the second dependent variable:

                      Column 1: Actions     Column 2: Thoughts
Row 1: Actions        SS_T(Actions)         CP_T
Row 2: Thoughts       CP_T                  SS_T(Thoughts)

We calculated these values in the previous sections and so we can simply place the
appropriate values in the appropriate cell of the matrix:

$$
T = \begin{pmatrix} 61.47 & 5.47 \\ 5.47 & 141.47 \end{pmatrix}
$$

From the values in the matrix (and what they represent) it should be clear that the total 
SSCP represents both the total amount of variation that exists within the data and the total 
co-dependence that exists between the dependent variables. You should also note that the 
off-diagonal elements are the same (they are both the total cross-product) because this 
value is equally important for both of the dependent variables. 


16.4.3.5. The residual SSCP matrix (E)


The residual (or error) sum of squares and cross-product matrix, E, contains the residual 
sums of squares for each dependent variable and the residual cross-product between the 
two dependent variables. This SSCP matrix is similar to the total SSCP, except that the 
information relates to the error in the model: 



                      Column 1: Actions     Column 2: Thoughts
Row 1: Actions        SS_R(Actions)         CP_R
Row 2: Thoughts       CP_R                  SS_R(Thoughts)

We calculated these values in the previous sections and so we can simply place the
appropriate values in the appropriate cell of the matrix:

$$
E = \begin{pmatrix} 51 & 13 \\ 13 & 122 \end{pmatrix}
$$

From the values in the matrix (and what they represent) it should be clear that the residual 
SSCP represents both the unsystematic variation that exists for each dependent variable and 
the co-dependence between the dependent variables that is due to chance factors alone. As 
before, the off-diagonal elements are the same (they are both the residual cross-product). 


16.4.3.6. The model SSCP matrix (H)


The model (or hypothesis) sum of squares and cross-product matrix, H, contains the model 
sums of squares for each dependent variable and the model cross-product between the two 
dependent variables: 

                      Column 1: Actions     Column 2: Thoughts
Row 1: Actions        SS_M(Actions)         CP_M
Row 2: Thoughts       CP_M                  SS_M(Thoughts)

We calculated these values in the previous sections and so we can simply place the
appropriate values in the appropriate cell of the matrix:

$$
H = \begin{pmatrix} 10.47 & -7.53 \\ -7.53 & 19.47 \end{pmatrix}
$$


From the values in the matrix (and what they represent) it should be clear that the model 
SSCP represents both the systematic variation that exists for each dependent variable and 
the co-dependence between the dependent variables that is due to the model (i.e., is due 
to the experimental manipulation). As before, the off-diagonal elements are the same (they 
are both the model cross-product). 

Matrices are additive, which means that you can add (or subtract) two matrices together 
by adding (or subtracting) corresponding elements. Now, when we calculated univariate 
ANOVA we saw that the total sum of squares was the sum of the model sum of squares and 
the residual sum of squares (i.e., SS_T = SS_M + SS_R). The same is true in MANOVA except
that we are adding matrices rather than single values: 

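$$
\begin{aligned}
T &= H + E \\
&= \begin{pmatrix} 10.47 & -7.53 \\ -7.53 & 19.47 \end{pmatrix} + \begin{pmatrix} 51 & 13 \\ 13 & 122 \end{pmatrix} \\
&= \begin{pmatrix} 10.47 + 51 & -7.53 + 13 \\ -7.53 + 13 & 19.47 + 122 \end{pmatrix} \\
&= \begin{pmatrix} 61.47 & 5.47 \\ 5.47 & 141.47 \end{pmatrix}
\end{aligned}
$$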


The demonstration that these matrices add up should (hopefully) help you to understand 
that the MANOVA calculations are conceptually the same as for univariate ANOVA - the 
difference is that matrices are used rather than single values. 
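
Element-by-element addition is simple enough to verify directly. This is a minimal Python sketch (nested lists, with an illustrative `mat_add` helper) that adds the model and residual SSCP matrices quoted above and prints the result for comparison with T:

```python
# Model (H) and residual (E) SSCP matrices, using the values from the text
H = [[10.47, -7.53],
     [-7.53, 19.47]]
E = [[51.0, 13.0],
     [13.0, 122.0]]


def mat_add(a, b):
    """Add two matrices of the same dimensions, element by element."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


T = mat_add(H, E)
for row in T:
    print([round(value, 2) for value in row])
```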


16.4.4. Principle of the MANOVA test statistic


In univariate ANOVA we calculate the ratio of the systematic variance to the unsystematic
variance (i.e., we divide SS_M by SS_R).³ The conceptual equivalent would therefore be to
divide the matrix H by the matrix E. There is, however, a problem in that matrices are not
divisible by other matrices. However, there is a matrix equivalent to division, which is to
multiply by what’s known as the inverse of a matrix. So, if we want to divide H by E we
have to multiply H by the inverse of E (denoted as E⁻¹). So, therefore, the test statistic is
based upon the matrix that results from multiplying the model SSCP with the inverse of the
residual SSCP. This matrix is called HE⁻¹.

3 In reality we use the mean squares, but these values are merely the sums of squares corrected for the degrees of 
freedom. 
Calculating the inverse of a matrix is incredibly difficult, and there is no need for you 
to understand how it is done because R will do it for you. However, the interested reader 
should consult either Stevens (2002) or Namboodiri (1984) - these texts provide very 
accessible accounts of how to derive an inverse matrix. For readers who do consult these 
sources, see Oliver Twisted. For the uninterested reader, you’ll have to trust me on the 
following: 


$$
E^{-1} = \begin{pmatrix} 0.0202 & -0.0021 \\ -0.0021 & 0.0084 \end{pmatrix}
$$

$$
HE^{-1} = \begin{pmatrix} 0.2273 & -0.0852 \\ -0.1930 & 0.1794 \end{pmatrix}
$$
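
If you prefer not to take these matrices on trust, the following is a small sketch of how R can do the work. It assumes that H and E are typed in using the rounded values shown earlier, and the names Einv and HEinv are invented for the illustration. Because the inputs are rounded, expect the results to agree with the matrices above to about two decimal places rather than exactly.

```r
# Model (H) and residual (E) SSCP matrices, entered from the rounded values
H <- matrix(c(10.47, -7.53,
              -7.53, 19.47), nrow = 2, byrow = TRUE)
E <- matrix(c(51,  13,
              13, 122), nrow = 2, byrow = TRUE)

# solve() with a single matrix argument returns the inverse of that matrix
Einv <- solve(E)

# %*% is matrix multiplication (a plain * would multiply element by element)
HEinv <- H %*% Einv

round(Einv, 4)
round(HEinv, 4)

# Sanity check: a matrix multiplied by its inverse gives the identity matrix
round(E %*% Einv, 10)
```

Note that the order of multiplication matters with matrices: H %*% Einv and Einv %*% H are not, in general, the same matrix.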


Remember that HE⁻¹ represents the ratio of systematic variance in the model to the 
unsystematic variance in the model, and so the resulting matrix is conceptually the same 
as the F-ratio in univariate ANOVA. There is another problem, though. In ANOVA, when 
we divide the systematic variance by the unsystematic variance we get a single figure: the 
F-ratio. In MANOVA, when we divide the systematic variance by the unsystematic variance 
we get a matrix containing several values. In this example, the matrix contains four values, 
but had there been three dependent variables the matrix would have had nine values. In 
fact, the resulting matrix will always contain p² values, where p is the number of dependent 
variables. The problem is how to convert these matrix values into a meaningful single 
value. This is the point at which we have to abandon any hope of understanding the maths 
behind the test and talk conceptually instead. 


16.4.4.1. Discriminant function variates © 


The problem of having several values with which to assess statistical significance can be 
simplified considerably by converting the dependent variables into underlying dimensions 
or factors (this process will be discussed in more detail in Chapter 17). In Chapter 7, we 
saw how multiple regression worked on the principle of fitting a linear model to a set of 
data to predict an outcome variable (the dependent variable in ANOVA terminology). 
This linear model was made up of a combination of predictor variables (or independent 
variables) each of which had a unique contribution to this linear model. We can do a 
similar thing here, except that we are interested in the opposite problem (i.e., predicting 
an independent variable from a set of dependent variables). So, it is possible to calculate 
underlying linear dimensions of the dependent variables. These linear combinations of 
the dependent variables are known as variates (or sometimes called latent variables or 
factors). In this context we wish to use these linear variates to predict which group a person 
belongs to (i.e., whether they were given CBT, BT or no treatment), so we are using 
them to discriminate groups of people. Therefore, these variates are called discriminant 
functions or discriminant function variates. Although I have drawn a parallel between these 
discriminant functions and the model in multiple regression, there is a difference in that 
we can extract several discriminant functions from a set of dependent variables, whereas in 
multiple regression all independent variables are included in a single model. 

That’s the theory in simplistic terms, but how do we discover these discriminant functions? 
Well, without going into too much detail, we use a mathematical procedure of 
maximization, such that the first discriminant function (V₁) is the linear combination of 
dependent variables that maximizes the differences between groups. 

It follows from this that the ratio of systematic to unsystematic variance (SS_M/SS_R) 
will be maximized for this first variate, but subsequent variates will have smaller values 
of this ratio. Remember that this ratio is an analogue of what the F-ratio represents 
in univariate ANOVA, and so in effect we obtain the maximum possible value of the 
F-ratio when we look at the first discriminant function. This variate can be described in 
terms of a linear regression equation (because it is a linear combination of the dependent 
variables): 

$$
\begin{aligned}
y_i &= b_0 + b_1 X_{1i} + b_2 X_{2i} \\
V_{1i} &= b_0 + b_1 \text{DV}_{1i} + b_2 \text{DV}_{2i} \\
       &= b_0 + b_1 \text{Actions}_i + b_2 \text{Thoughts}_i
\end{aligned}
\qquad (16.4)
$$


Equation (16.4) shows the multiple regression equation for two predictors and then 
extends this to show how a comparable form of this equation can describe discriminant 
functions. The b-values in the equation are weights (just as in regression) that tell us something 
about the contribution of each dependent variable to the variate in question. In 
regression, the values of b are obtained by the method of least squares; in discriminant 
function analysis the values of b are obtained from the eigenvectors (see Jane Superbrain 
Box 16.2) of the matrix HE⁻¹. We can actually ignore b₀ because this serves only to locate 
the variate in geometric space, which isn’t necessary when we’re using it to discriminate 
groups. 

In a situation in which there are only two dependent variables and two groups for the 
independent variable, there will be only one variate. This makes the scenario very simple: 
by looking at the discriminant function of the dependent variables, rather than looking at 
the dependent variables themselves, we can obtain a single value of SS_M/SS_R for the 
discriminant function, and then assess this value for significance. However, in more complex 
cases where there are at least two dependent variables and three or more levels of the 
independent variable (as is the case in our example), there will be more than one variate. 
The number of variates obtained will be the smaller of p (the number of dependent variables) 
and k - 1 (where k is the number of levels of the independent variable). In our example, 
both p and k - 1 are 2, so we should be able to find two variates. I mentioned earlier that 
the b-values that describe the variates are obtained by calculating the eigenvectors of the 
matrix HE⁻¹, and in fact there will be two eigenvectors derived from this matrix: one 
with the b-values for the first variate, and one with the b-values of the second variate. 
Conceptually speaking, eigenvectors are the vectors associated with a given matrix that are 
unchanged by transformation of that matrix to a diagonal matrix (look at Jane Superbrain 
Box 16.2 for a visual explanation of eigenvectors and eigenvalues). A diagonal matrix is 
simply a matrix in which the off-diagonal elements are zero, and by changing HE⁻¹ to a 
diagonal matrix we eliminate all of the off-diagonal elements (thus reducing the number of 
values that we must consider for significance testing). Therefore, by calculating the 
eigenvectors and eigenvalues, we still end up with values that represent the ratio of systematic 
to unsystematic variance (because they are unchanged by the transformation), but there are 
considerably fewer of them. The calculation of eigenvectors is extremely complex (insane 
students can consider reading Namboodiri, 1984), so you can trust me that for the matrix 
HE⁻¹ the eigenvectors obtained are: 


$$
\text{eigenvector}_1 = \begin{pmatrix} 0.603 \\ -0.335 \end{pmatrix}
\qquad
\text{eigenvector}_2 = \begin{pmatrix} 0.425 \\ 0.339 \end{pmatrix}
$$
JANE SUPERBRAIN 16.2 

What are eigenvectors and eigenvalues? 

The definitions and mathematics of eigenvalues and eigenvectors are very complicated and 
most of us need not worry about them (although they do crop up again in this chapter and 
the next). However, although the mathematics is hard, they are quite easy to visualize! 
Imagine we have two variables: the salary a supermodel earns in a year, and how attractive 
she is. Also imagine these two variables are normally distributed and so can be considered 
together as a bivariate normal distribution. If these variables are correlated, then their 
scatterplot forms an ellipse. This is shown in the accompanying scatterplots: if we draw a 
dashed line around the outer values of the scatterplot we get something oval shaped. Now, 
we can draw two lines to measure the length and height of this ellipse. These lines are the 
eigenvectors of the original correlation matrix for these two variables (a vector is just a set 
of numbers that tells us the location of a line in geometric space). Note that the two lines 
we’ve drawn (one for height and one for width of the oval) are perpendicular: that is, they 
are at 90 degrees, which means that they are independent of one another. So, with two 
variables, eigenvectors are just lines measuring the length and height of the ellipse that 
surrounds the scatterplot of data for those variables. If we add a third variable (e.g., 
experience of the supermodel) then all that happens is our scatterplot gets a third 
dimension, the ellipse turns into something shaped like a rugby ball (or American football), 
and because we now have a third dimension (height, width and depth) we get an extra 
eigenvector to measure this extra dimension. If we add a fourth variable, a similar logic 
applies (although it’s harder to visualize): we get an extra dimension, and an eigenvector to 
measure that dimension. Now, each eigenvector has an eigenvalue that tells us its length 
(i.e., the distance from one end of the eigenvector to the other). So, by looking at all of the 
eigenvalues for a data set, we know the dimensions of the ellipse or rugby ball: put more 
generally, we know the dimensions of the data. Therefore, the eigenvalues show how evenly 
(or otherwise) the variances of the matrix are distributed. 

In the case of two variables, the condition of the data is related to the ratio of the larger 
eigenvalue to the smaller. Let’s look at the two extremes: when there is no relationship at 
all between variables, and when there is a perfect relationship. When there is no 
relationship, the scatterplot will, more or less, be contained within a circle (or a sphere if we 
had three variables). If we again draw lines that measure the height and width of this circle 
we’ll find that these lines are the same length. The eigenvalues measure the length, 
therefore the eigenvalues will also be the same. So, when we divide the largest eigenvalue 
by the smallest we’ll get a value of 1 (because the eigenvalues are the same). When the 
variables are perfectly correlated (i.e., there is perfect collinearity) then the scatterplot 
forms a straight line and the ellipse surrounding it will also collapse to a straight line. 
Therefore, the height of the ellipse will be very small indeed (it will approach zero). 
Therefore, when we divide the largest eigenvalue by the smallest we’ll get a value that 
tends to infinity (because the smallest eigenvalue is close to zero). Therefore, an infinite 
condition index is a sign of deep trouble. 


[Scatterplots for Jane Superbrain 16.2; horizontal axis: Beauty] 


Substituting these values into the two equations for the variates and bearing in mind we 
can ignore b₀, we obtain the models described in the following equation: 

$$
\begin{aligned}
V_{1i} &= b_0 + 0.603\,\text{Actions}_i - 0.335\,\text{Thoughts}_i \\
V_{2i} &= b_0 + 0.425\,\text{Actions}_i + 0.339\,\text{Thoughts}_i
\end{aligned}
\qquad (16.5)
$$

It is possible to use the equations for each variate to calculate a score for each person 
on the variate. For example, the first participant in the CBT group carried out 5 obsessive 
actions, and had 14 obsessive thoughts. Therefore, this participant’s score on variate 1 
would be -1.675: 

$$
V_1 = (0.603 \times 5) - (0.335 \times 14) = -1.675
$$

The score for variate 2 would be 6.871: 

$$
V_2 = (0.425 \times 5) + (0.339 \times 14) = 6.871
$$
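
The same arithmetic can be sketched in R. This is only an illustration under the assumption that the b-values are typed in by hand from the two eigenvectors (ignoring b₀); the names weights, firstCBT and variateScores are hypothetical.

```r
# b-values for the two variates, one column per variate
weights <- cbind(V1 = c(Actions = 0.603, Thoughts = -0.335),
                 V2 = c(Actions = 0.425, Thoughts = 0.339))

# Scores of the first participant in the CBT group
firstCBT <- c(Actions = 5, Thoughts = 14)

# A row of scores multiplied by the weight matrix gives both variate scores
variateScores <- firstCBT %*% weights
variateScores
```

Replacing firstCBT with a matrix that has one row per participant (and the columns in the same order) would give the variate scores for everybody in one go.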

If we calculated these variate scores for each participant and then calculated the SSCP 
matrices (e.g., H, E, T and HE⁻¹) that we used previously, we would find that all of them 
have cross-products of zero. The reason for this is that the variates extracted from the data 
are orthogonal, which means that they are uncorrelated. In short, the variates extracted are 
independent dimensions constructed from a linear combination of the dependent variables 
that were measured. 

This data reduction has a very useful property in that if we look at the matrix HE⁻¹ 
calculated from the variate scores (rather than the dependent variables) we find that all of the 
off-diagonal elements (the cross-products) are zero. The diagonal elements of this matrix 
represent the ratio of the systematic variance to the unsystematic variance (i.e., SS_M/SS_R) 
for each of the underlying variates. So, for the data in this example, this means that instead 
of having four values representing the ratio of systematic to unsystematic variance, we 
now have only two. This reduction may not seem a lot. However, in general if we have p 
dependent variables, then ordinarily we would end up with p² values representing the ratio 
of systematic to unsystematic variance; by looking at discriminant functions, we reduce 
this number to p. If there were four dependent variables we would end up with four values 
rather than 16 (which highlights the benefit of this process). 

For the data in our example, the matrix HE⁻¹ calculated from the variate scores is: 

$$
HE^{-1}_{\text{variates}} = \begin{pmatrix} 0.335 & 0.000 \\ 0.000 & 0.073 \end{pmatrix}
$$

It is clear from this matrix that we have two values to consider when assessing the 
significance of the group differences. It probably seems like a complex procedure to reduce 
the data down in this way: however, it transpires that the values along the diagonal of the 
matrix for the variates (namely 0.335 and 0.073) are the eigenvalues of the original HE⁻¹ 
matrix. Therefore, these values can be calculated directly from the data collected without 
first forming the eigenvectors. If you have lost all sense of rationality and want to see how 
these eigenvalues are calculated then see Oliver Twisted. These eigenvalues are conceptually 
equivalent to the F-ratio in ANOVA and so the final step is to assess how large these 
values are compared to what we would expect by chance alone. There are four ways in 
which the values are assessed. 
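
As a quick check that the eigenvalues really can be obtained directly, here is a minimal sketch using the base R function eigen(). It assumes that H and E are entered from the rounded values given earlier, and the name lambda is just a label chosen for this example.

```r
# Model (H) and residual (E) SSCP matrices, entered from the rounded values
H <- matrix(c(10.47, -7.53,
              -7.53, 19.47), nrow = 2, byrow = TRUE)
E <- matrix(c(51,  13,
              13, 122), nrow = 2, byrow = TRUE)

# Eigenvalues of HE^-1, returned from largest to smallest
lambda <- eigen(H %*% solve(E))$values
round(lambda, 3)
```

With these inputs the two eigenvalues round to 0.335 and 0.073, the same values that sit on the diagonal of the matrix calculated from the variate scores.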



OLIVER TWISTED 

Please Sir, can I have some 
more ... maths? 


‘You are a bit stupid. I think it would be fun to check your maths so that 
we can see exactly how much of a village idiot you are’, mocks Oliver. 
Luckily you can. Never one to shy from public humiliation on a mass 
scale, I have provided the matrix calculations for this example on the 
companion website. Find a mistake, go on, you know that you can ... 


16.4.4.2. Pillai-Bartlett trace (V) © 

The Pillai-Bartlett trace (also known as Pillai’s trace) is given by 

$$
V = \sum_{i=1}^{s} \frac{\lambda_i}{1 + \lambda_i} \qquad (16.6)
$$


in which λ represents the eigenvalues for each of the discriminant variates and s represents 
the number of variates. Pillai’s trace is the sum of the proportion of explained variance 
on the discriminant functions. As such, it is similar to the ratio of SS_M/SS_T, which is known 
as R². 

For our data, Pillai’s trace turns out to be 0.319, which can be transformed to a value 
that has an approximate F-distribution: 

$$
V = \frac{0.335}{1 + 0.335} + \frac{0.073}{1 + 0.073} = 0.319
$$
16.4.4.3. Hotelling’s T² © 


The Hotelling-Lawley trace (also known as Hotelling’s T²; Figure 16.2) is simply the sum of 
the eigenvalues for each variate: 

$$
T = \sum_{i=1}^{s} \lambda_i \qquad (16.7)
$$


So for these data its value is 0.408 (0.335 + 0.073). This test statistic is the sum of SS_M/SS_R 
for each of the variates and so it compares directly to the F-ratio in ANOVA. 


FIGURE 16.2 Harold Hotelling enjoying my favourite activity of drinking tea 



16.4.4.4. Wilks’s lambda (Λ) © 

Wilks’s lambda is the product of the unexplained variance on each of the variates: 

$$
\Lambda = \prod_{i=1}^{s} \frac{1}{1 + \lambda_i} \qquad (16.8)
$$

The ∏ symbol is similar to the summation symbol (Σ) that we have encountered already 
except that it means multiply rather than add up. So, Wilks’s lambda represents the ratio 
of error variance to total variance (SS_R/SS_T) for each variate. 

For the data in this example the value is 

$$
\Lambda = \frac{1}{1 + 0.335} \times \frac{1}{1 + 0.073} = 0.698
$$

and it should be clear that large eigenvalues (which in themselves represent a large 
experimental effect) lead to small values of Wilks’s lambda - hence statistical significance is found 
when Wilks’s lambda is small. 










16.4.4.5. Roy’s largest root © 


Roy’s largest root always makes me think of some bearded statistician with a garden spade 
digging up an enormous parsnip (or similar root vegetable); however, it isn’t a parsnip but, 
as the name suggests, is the eigenvalue for the first variate. So, in a sense it is the same as 
the Hotelling-Lawley trace but for the first variate only, that is: 

$$
\Theta = \lambda_{\text{largest}} \qquad (16.9)
$$


As such, Roy’s largest root represents the proportion of explained variance to unexplained 
variance (SS_M/SS_R) for the first discriminant function.⁴ For the data in this example, the 
value of Roy’s largest root is simply 0.335 (the eigenvalue for the first variate). So, this 
value is conceptually the same as the F-ratio in univariate ANOVA. It should be apparent, 
from what we have learnt about the maximizing properties of these discriminant variates, 
that Roy’s root represents the maximum possible between-group difference given the data 
collected. Therefore, this statistic should in many cases be the most powerful. 
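
Because all four test statistics are simple functions of the eigenvalues, they can be reproduced in a few lines of R. The sketch below assumes the two rounded eigenvalues from this example (0.335 and 0.073); the object names are invented for the illustration and this is not the code used for the main analysis later in the chapter.

```r
# Eigenvalues of HE^-1 for the OCD example (rounded values from the text)
lambda <- c(0.335, 0.073)

pillai    <- sum(lambda / (1 + lambda))   # equation (16.6)
hotelling <- sum(lambda)                  # equation (16.7)
wilks     <- prod(1 / (1 + lambda))       # equation (16.8)
roy       <- max(lambda)                  # equation (16.9)

round(c(Pillai = pillai, Hotelling = hotelling, Wilks = wilks, Roy = roy), 3)
```

Note how sum() and prod() mirror the Σ and ∏ symbols in the equations: the first adds the terms for each variate, the second multiplies them.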



16.5. Practical issues when conducting MANOVA © 


There are three main practical issues to be considered before running MANOVA. First of 
all, as always, we have to consider the assumptions of the test. Next, for the main analysis 
there are four commonly used ways of assessing the overall significance of a MANOVA, and 
debate exists about which method is best in terms of power and sample size considerations. 
Finally, we also need to think about what analysis to do after the MANOVA: like ANOVA, 
MANOVA is a two-stage test in which an overall (or omnibus) test is first performed before 
more specific procedures are applied to tease apart group differences. As you will see, there 
is substantial debate over how best to further analyse and interpret group differences when 
the overall MANOVA is significant. We will look at these issues in turn. 


16.5.1. Assumptions and how to check them ® 


MANOVA has similar assumptions to ANOVA but extended to the multivariate case: 

• Independence: Observations should be statistically independent. 

• Random sampling: Data should be randomly sampled from the population of interest 
and measured at an interval level. 

• Multivariate normality: In ANOVA, we assume that our dependent variable is normally 
distributed within each group. In the case of MANOVA, we assume that the dependent 
variables (collectively) have multivariate normality within groups. 

• Homogeneity of covariance matrices: In ANOVA, it is assumed that the variances in 
each group are roughly equal (homogeneity of variance). In MANOVA we must 
assume that this is true for each dependent variable, but also that the correlation 
between any two dependent variables is the same in all groups. This assumption is 
examined by testing whether the population variance-covariance matrices of the different 
groups in the analysis are equal.⁵ 

4 This statistic is sometimes characterized as λ_largest/(1 + λ_largest), but this is not the statistic reported by R. 

5 For those of you who read about SSCP matrices, if you think about the relationship between sums of squares 
and variance, and cross-products and correlations, it should be clear that a variance-covariance matrix is basically 
a standardized form of an SSCP matrix. 
Most of the assumptions can be checked in the same way as for univariate tests (see 
Chapter 10); the additional assumptions of multivariate normality and equality of covariance 
matrices require different procedures. The assumption of multivariate normality can 
be tested in R using a multivariate version of the Shapiro test that we used to test for 
univariate normality. We can also 
look at some graphical displays of multivariate outliers produced by the aq.plot() function 
of the mvoutlier package. 

The assumption of equality of covariance matrices is often tested using Box’s test. This 
test should be non-significant if the matrices are the same. The effect of violating this 
assumption is unclear, except that Hotelling’s T² is robust in the two-group situation when 
sample sizes are equal (Hakstian, Roed, & Lind, 1979). Box’s test is notoriously susceptible 
to deviations from multivariate normality and so can be significant not because the 
matrices are different, but because the assumption of multivariate normality is not tenable. 
Also, as with any significance test, in large samples Box’s test could be significant even 
when covariance matrices are relatively similar. As a general rule, if sample sizes are equal 
then people tend to disregard Box’s test, because (1) it is unstable, and (2) in this situation 
we can assume that Hotelling’s and Pillai’s statistics are robust (see section 16.5.2). For 
these reasons Box’s test has yet to be implemented in R because it is of questionable use 
given its inaccuracy. 

However, if group sizes are different, then robustness of the MANOVA cannot be 
assumed. The more dependent variables you have measured, and the greater the differences 
in sample sizes, the more distorted the probability values become. Tabachnick and 
Fidell (2007) suggest that if the larger samples produce greater variances and covariances 
then the probability values will be conservative (and so significant findings can be trusted). 
However, if it is the smaller samples that produce the larger variances and covariances 
then the probability values will be liberal and so significant differences should be treated 
with caution (although non-significant effects can be trusted). Therefore, the 
variance-covariance matrices for samples should be inspected to assess whether the printed 
probabilities for the multivariate test statistics are likely to be conservative or liberal. In the 
event that you cannot trust the printed probabilities, there is little you can do except 
equalize the samples by randomly deleting cases in the larger groups (although with this loss of 
information comes a loss of power). Of course, if you like a belt and braces approach, you 
can always check your results by using a robust MANOVA too. 
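
One way to inspect the variance-covariance matrices for each group is sketched below. It assumes the OCD data from section 16.6.4 entered by hand, and uses only base R functions (split(), lapply() and cov()); the name covByGroup is made up for the illustration. In this example the group sizes are equal, so the output is for illustration rather than a cause for concern.

```r
# OCD data (the same values as in section 16.6.4)
Group <- gl(3, 10, labels = c("CBT", "BT", "NT"))
Actions <- c(5, 5, 4, 4, 5, 3, 7, 6, 6, 4, 4, 4, 1, 1, 4, 6, 5, 5, 2, 5,
             4, 5, 5, 4, 6, 4, 7, 4, 6, 5)
Thoughts <- c(14, 11, 16, 13, 12, 14, 12, 15, 16, 11, 14, 15, 13, 14, 15,
              19, 13, 18, 14, 17, 13, 15, 14, 14, 13, 20, 13, 16, 14, 18)
ocdData <- data.frame(Group, Actions, Thoughts)

# Split the two outcome columns by group, then compute one
# variance-covariance matrix per group
covByGroup <- lapply(split(ocdData[, c("Actions", "Thoughts")], ocdData$Group),
                     cov)

# Variances are on the diagonal, the covariance is off the diagonal
lapply(covByGroup, round, digits = 3)

# Group sizes, to see whether the design is balanced
table(ocdData$Group)
```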


16.5.2. Choosing a test statistic ® 



Which test statistic 
should I use? 


Only when there is one underlying variate will the four test statistics necessarily 
be the same. Therefore, it is important to know which test statistic is best 
in terms of test power and robustness. A lot of research has investigated the 
power of the four MANOVA test statistics (Olson, 1974, 1976, 1979; Stevens, 
1980). Olson (1974) observed that for small and moderate sample sizes the 
four statistics differ little in terms of power. If group differences are concentrated 
on the first variate (as will often be the case in social science research) 
Roy’s statistic should prove most powerful (because it takes account of only 
that first variate), followed by Hotelling’s trace, Wilks’s lambda and Pillai’s 
trace. However, when groups differ along more than one variate, the power 
ordering is the reverse (i.e., Pillai’s trace is most powerful and Roy’s root is least). One 
final issue pertinent to test power is that of sample size and the number of dependent 
variables. Stevens (1980) recommends using fewer than 10 dependent variables unless 
sample sizes are large. 
In terms of robustness, all four test statistics are relatively robust to violations of multivariate 
normality (although Roy’s root is affected by platykurtic distributions - see Olson, 
1976). Roy’s root is also not robust when the homogeneity of covariance matrix assumption 
is untenable (Stevens, 1979). The work of Olson and Stevens led Bray and Maxwell (1985) 
to conclude that when sample sizes are equal the Pillai-Bartlett trace is the most robust to 
violations of assumptions. However, when sample sizes are unequal this statistic is affected 
by violations of the assumption of equal covariance matrices. As a rule, with unequal group 
sizes, check the homogeneity of covariance matrices; if they seem homogeneous and if the 
assumption of multivariate normality is tenable, then assume that Pillai’s trace is accurate. 


16.5.3. Follow-up analysis 


There is some controversy over how best to follow up the main MANOVA. The traditional 
approach is to follow a significant MANOVA with separate ANOVAs on each of the dependent 
variables. If this approach is taken, you might well wonder why we bother with the 
MANOVA in the first place (earlier on I said that multiple ANOVAs were a bad thing to do). 
Well, the ANOVAs that follow a significant MANOVA are said to be ‘protected’ by the initial 
MANOVA (Bock, 1975). The idea is that the overall multivariate test protects against inflated 
Type I error rates because if that initial test is non-significant (i.e., the null hypothesis is true) 
then any subsequent tests are ignored (any significance must be a Type I error because the null 
hypothesis is true). However, the notion of protection is somewhat fallacious because a 
significant MANOVA, more often than not, reflects a significant difference for one, but not all, 
of the dependent variables. Subsequent ANOVAs are then carried out on all of the dependent 
variables, but the MANOVA protects only the dependent variable for which group 
differences genuinely exist (see Bray and Maxwell, 1985, pp. 40-41). Therefore, you might want 
to consider applying a Bonferroni correction to the subsequent ANOVAs (Harris, 1975). 
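
To make the Bonferroni idea concrete, here is a minimal sketch in R. The two p-values are invented purely for illustration (they are not results from the OCD data), and the object names are hypothetical; p.adjust() is a base R function.

```r
# Hypothetical p-values from two follow-up ANOVAs (made up for illustration)
pValues <- c(Actions = 0.04, Thoughts = 0.20)

# Option 1: divide the significance criterion by the number of tests
alpha <- 0.05
alphaPerTest <- alpha / length(pValues)
pValues < alphaPerTest

# Option 2: keep the criterion at .05 and adjust the p-values instead
# (each p-value is multiplied by the number of tests, up to a maximum of 1)
p.adjust(pValues, method = "bonferroni")
```

Both options lead to the same decisions; the second is convenient because the adjusted p-values can be compared directly with the usual .05 criterion.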

By following up a MANOVA with ANOVAs you assume that the significant MANOVA is 
not due to the dependent variables representing a set of underlying dimensions that 
differentiate the groups. Therefore, some researchers advocate the use of discriminant analysis, 
which finds the linear combination(s) of the dependent variables that best separates (or 
discriminates) the groups. This procedure is more in keeping with the ethos of MANOVA 
because it embraces the relationships that exist between dependent variables and it is 
certainly useful for illuminating the relationship between the dependent variables and group 
membership. The major advantage of this approach over multiple ANOVAs is that it 
reduces and explains the dependent variables in terms of a set of underlying dimensions 
thought to reflect substantive theoretical dimensions. We will consider both approaches. 


16.6. MANOVA using R 


In the remainder of this chapter we will use the OCD data to illustrate how MANOVA is 
done (those of you who skipped the theory section should refer to Table 16.1). 


16.6.1. Packages for MANOVA in R © 


You will need the packages car (for looking at Type III sums of squares), ggplot2 (for graphs), 
MASS (for discriminant function analysis), mvoutlier (for plots to look for multivariate 
outliers), mvnormtest (to test for multivariate normality), pastecs (for descriptive statistics), 
reshape (for reshaping the data) and WRS (for robust tests). The MASS package is automatically 
installed, but you can install any of the others that you don’t already have by executing 
the following commands: 

install.packages("car"); install.packages("ggplot2"); 
install.packages("mvoutlier"); install.packages("mvnormtest"); 
install.packages("pastecs"); install.packages("reshape"); 
install.packages("WRS", repos="http://R-Forge.R-project.org") 

You then need to load these packages by executing these commands: 

library(car); library(ggplot2); library(MASS); library(mvoutlier); 
library(mvnormtest); library(pastecs); library(reshape); library(WRS) 


16.6.2. General procedure for MANOVA © 


To conduct a MANOVA you should follow this general procedure: 

1 Enter data. 

2 Explore your data: begin by graphing the data and computing descriptive statistics. 
You should check multivariate normality and take a look at the variance-covariance 
matrices for each group. 

3 Set contrasts for all predictor variables: you need to decide what contrasts to do and 
to specify them appropriately for all of the independent variables in your analysis. 

4 Compute the MANOVA: you can then run the main multivariate analysis of variance. 
Depending on what you found in the previous step, you might need to run a robust 
version of the test. 

5 Run univariate ANOVAs: having conducted the MANOVA, you can follow it up with 
separate ANOVAs for each dependent variable. 

6 Discriminant function analysis: better than the option above, consider running a 
discriminant function analysis. 

We will work through these steps in turn. 
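
Before we do, here is a rough skeleton of how steps 1, 4 and 5 might look, so that you can see where we are heading. It is a sketch only: it assumes the base R function manova(), enters the OCD data by hand rather than from a file, and skips the exploration and contrast steps. The names outcome and ocdModel are labels chosen for this illustration.

```r
# Step 1: enter the data (the same values as in section 16.6.4)
Group <- gl(3, 10, labels = c("CBT", "BT", "NT"))
Actions <- c(5, 5, 4, 4, 5, 3, 7, 6, 6, 4, 4, 4, 1, 1, 4, 6, 5, 5, 2, 5,
             4, 5, 5, 4, 6, 4, 7, 4, 6, 5)
Thoughts <- c(14, 11, 16, 13, 12, 14, 12, 15, 16, 11, 14, 15, 13, 14, 15,
              19, 13, 18, 14, 17, 13, 15, 14, 14, 13, 20, 13, 16, 14, 18)
ocdData <- data.frame(Group, Actions, Thoughts)

# Step 4: bind the dependent variables into one matrix and fit the model
outcome <- cbind(ocdData$Actions, ocdData$Thoughts)
ocdModel <- manova(outcome ~ Group, data = ocdData)

# Each of the four test statistics can be requested by name
summary(ocdModel, test = "Pillai")
summary(ocdModel, test = "Wilks")
summary(ocdModel, test = "Hotelling-Lawley")
summary(ocdModel, test = "Roy")

# Step 5: separate univariate ANOVAs, one per dependent variable
summary.aov(ocdModel)
```

The value of Pillai’s trace printed by the first summary should be close to the 0.319 that we calculated by hand; any small difference arises because the hand calculation used rounded eigenvalues.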


16.6.3. MANOVA using R Commander © 



You cannot directly do a MANOVA using R Commander. It’s not all bad news, though, 
because if you’ve reached this point in the book without giving up or hurling yourself 
or the book out of the window, then MANOVA will be easy: the commands are pretty 
straightforward compared to some of the things we’ve covered. 


16.6.4. Entering the data © 


The data for the example can be found in the file OCD.dat. You can load this data file by 
setting your working directory and executing: 


ocdData<-read.delim("OCD.dat", header = TRUE) 
If we look at the data (by executing ocdData) we will see that it has been entered in ‘wide’ 
format; that is, levels of a between-group variable go in a single column. 





                   Group  Actions  Thoughts
1                    CBT        5        14
2                    CBT        5        11
3                    CBT        4        16
4                    CBT        4        13
5                    CBT        5        12
6                    CBT        3        14
7                    CBT        7        12
8                    CBT        6        15
9                    CBT        6        16
10                   CBT        4        11
11                    BT        4        14
12                    BT        4        15
13                    BT        1        13
14                    BT        1        14
15                    BT        4        15
16                    BT        6        19
17                    BT        5        13
18                    BT        5        18
19                    BT        2        14
20                    BT        5        17
21  No Treatment Control        4        13
22  No Treatment Control        5        15
23  No Treatment Control        5        14
24  No Treatment Control        4        14
25  No Treatment Control        6        13
26  No Treatment Control        4        20
27  No Treatment Control        7        13
28  No Treatment Control        4        16
29  No Treatment Control        6        14
30  No Treatment Control        5        18


These data were originally entered in Excel, and, as you can see, we have a coding variable 
to represent the treatment condition. Therefore, in Excel, I created a variable called 
Group into which I typed ‘CBT’, ‘BT’ or ‘No Treatment Control’; because I have used 
words rather than numbers, when R imports the data it guesses that this variable is a factor 
(i.e., we don’t need to explicitly convert it to a factor). It will treat the order of categories 
as alphabetic; in other words the factor levels are treated as BT, CBT and No Treatment 
Control rather than the order they were entered into Excel (which was CBT, BT, No 
Treatment Control). Let’s reorder the levels so that the order matches the original data 
using the levels option of the factor() function. While we’re at it, I want to rename ‘No 
Treatment Control’ as ‘NT’ for various reasons. We can do this using the labels option of 
the factor() function. 

ocdData$Group<-factor(ocdData$Group, levels = c("CBT", "BT", 
"No Treatment Control"), labels = c("CBT", "BT", "NT")) 

By executing the above command we take the Group variable from the ocdData dataframe 
and reorder the levels as ‘CBT’, ‘BT’ and ‘No Treatment Control’ (levels = c("CBT", "BT", 
"No Treatment Control")). We then relabel these levels as ‘CBT’, ‘BT’ and ‘NT’ (labels = 
c("CBT", "BT", "NT")). 
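
To see what the levels and labels options are doing, here is a minimal sketch using a stand-in character vector (imported is a hypothetical name, not part of the OCD dataframe) that mimics the Group column as it was typed into Excel. Bear in mind that from R 4.0.0 onwards text columns are not converted to factors on import by default; in that case the same factor() call does the conversion as well as the reordering.

```r
# Stand-in for the Group column as typed into Excel (hypothetical vector)
imported <- rep(c("CBT", "BT", "No Treatment Control"), each = 10)

# Left to itself, factor() orders the levels alphabetically
alphabetical <- factor(imported)
levels(alphabetical)

# levels sets the order; labels renames the levels in that same order
reordered <- factor(imported,
                    levels = c("CBT", "BT", "No Treatment Control"),
                    labels = c("CBT", "BT", "NT"))
levels(reordered)
table(reordered)
```

The order of the labels must match the order given in levels, otherwise the groups would be silently mislabelled.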

The scores for each outcome measure are stored in two columns labelled Actions and 
Thoughts. From this, we can tell, for example, that participant 15 had behaviour therapy 
(BT) and had four obsession-related actions and 15 obsession-related thoughts. 
If we wanted to enter the data directly into R, we would create a coding variable called 
Group using the gl() function (Chapter 3). This function creates a coding variable based 
on the number of groups you want and how many cases are in each group. You can use 
the labels option to list names for each group. For Group, we want three treatment groups 
each containing 10 participants, so we can specify it as: 

Group<-gl(3, 10, labels = c("CBT", "BT", "NT")) 

The numbers in the function tell R that we want three groups of 10 cases, and the labels 
option then specifies the names to attach to these three groups. 

We can create numeric variables containing the number of obsession-related Actions and 
Thoughts in the usual way: 

Actions<-c(5, 5, 4, 4, 5, 3, 7, 6, 6, 4, 4, 4, 1, 1, 4, 6, 5, 5, 2, 5, 4, 5, 
5, 4, 6, 4, 7, 4, 6, 5) 

Thoughts<-c(14, 11, 16, 13, 12, 14, 12, 15, 16, 11, 14, 15, 13, 14, 15, 19, 
13, 18, 14, 17, 13, 15, 14, 14, 13, 20, 13, 16, 14, 18) 

Finally, we can merge these variables into a dataframe called ocdData by executing: 

ocdData<-data.frame(Group, Actions, Thoughts) 
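
Having typed in 60 numbers by hand, it is worth checking that the dataframe looks as intended. The following is a sketch of the kind of checks you might run (it simply repeats the commands above and then inspects the result; none of the checks is required by the analysis):

```r
Group <- gl(3, 10, labels = c("CBT", "BT", "NT"))
Actions <- c(5, 5, 4, 4, 5, 3, 7, 6, 6, 4, 4, 4, 1, 1, 4, 6, 5, 5, 2, 5,
             4, 5, 5, 4, 6, 4, 7, 4, 6, 5)
Thoughts <- c(14, 11, 16, 13, 12, 14, 12, 15, 16, 11, 14, 15, 13, 14, 15, 19,
              13, 18, 14, 17, 13, 15, 14, 14, 13, 20, 13, 16, 14, 18)
ocdData <- data.frame(Group, Actions, Thoughts)

str(ocdData)             # one factor and two numeric variables
table(ocdData$Group)     # number of cases in each treatment group
ocdData[15, ]            # participant 15: BT, 4 actions, 15 thoughts
```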


16.6.5. Exploring the data 


Let’s start by looking at the relationship between thoughts and actions for the different 
conditions. The resulting plot (Figure 16.3) shows no relationship between obsession- 
related thoughts and behaviours in the CBT group, a positive relationship in the BT group 
and a negative relationship in the NT group. 




SELF-TEST 

Use ggplot2 to plot a scatterplot of the number of 
obsession-related actions (x-axis) against obsession-related 
thoughts (y-axis) for each treatment group 
(as separate panels). 


FIGURE 16.3 

Scatterplot of the relationship between obsession-related thoughts and actions in 
different treatment conditions 
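
One way to draw a plot of this kind is sketched below. It assumes that the ggplot2 package is installed; the regression lines (geom_smooth) and the axis titles are my own choices for illustration and may differ from those used for the figure.

```r
library(ggplot2)

Group <- gl(3, 10, labels = c("CBT", "BT", "NT"))
Actions <- c(5, 5, 4, 4, 5, 3, 7, 6, 6, 4, 4, 4, 1, 1, 4, 6, 5, 5, 2, 5,
             4, 5, 5, 4, 6, 4, 7, 4, 6, 5)
Thoughts <- c(14, 11, 16, 13, 12, 14, 12, 15, 16, 11, 14, 15, 13, 14, 15, 19,
              13, 18, 14, 17, 13, 15, 14, 14, 13, 20, 13, 16, 14, 18)
ocdData <- data.frame(Group, Actions, Thoughts)

# Actions on the x-axis, Thoughts on the y-axis, one panel per treatment group
scatter <- ggplot(ocdData, aes(x = Actions, y = Thoughts)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ Group) +
  labs(x = "Number of obsession-related actions",
       y = "Number of obsession-related thoughts")
print(scatter)
```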
Let’s look now at the mean number of obsession-related thoughts and behaviours across 
the three groups. 



SELF-TEST 

Use ggplot2 to plot a bar graph (with error bars) 
of the treatment group on the x-axis and different-coloured 
bars to represent the mean number of 
obsession-related thoughts and behaviours. 



Figure 16.4 shows the resulting plot. For actions, BT appears to reduce the number of 
obsessive behaviours compared to CBT and NT. For thoughts, CBT reduces the number of 
obsessive thoughts compared to BT and NT. 



FIGURE 16.4 

Error bar chart showing the mean numbers of obsession-related thoughts and actions 
across the different treatment conditions (x-axis: Treatment Group, with levels CBT, BT and NT) 


Finally, we can also look at boxplots to see the distribution of scores for the number of 
obsession-related thoughts and actions across the different treatment groups. 



SELF-TEST 

Use ggplot2 to plot boxplots of treatment group 
on the x-axis and obsession-related thoughts and 
actions displayed on the y-axis (in different colours). 



Figure 16.5 shows the resulting graph. It is fairly clear that the range and distribution 
of scores are reasonably similar across groups and across measures (all of the boxes and 
whiskers are a similar vertical length). The only noteworthy point really is that there is some 
evidence of an outlier in the no-treatment group (for Thoughts) and, in the same group, 
scores for Actions seem like they might be a little skewed (there is no lower tail). 
FIGURE 16.5 

Boxplots of the OCD data (legend: Outcome Measure, with Actions and Thoughts) 


Next, we can use the by() function and the stat.desc() function in the pastecs package to 
get descriptive statistics for separate groups (see Chapter 5 for more detail). We execute 
separate commands for Thoughts and Actions: 

by(ocdData$Actions, ocdData$Group, stat.desc, basic = FALSE) 
by(ocdData$Thoughts, ocdData$Group, stat.desc, basic = FALSE) 

The resulting output for Actions (Output 16.1) and Thoughts (Output 16.2) corresponds 
to the values calculated by hand in Table 16.1 and shows much the same as Figure 16.4: 
that BT seemed to lower behaviours compared to the other groups whereas CBT resulted 
in lower numbers of thoughts than the other groups. 

ocdData$Group: CBT 
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var 
       5.000        4.900        0.379        0.856        1.433        1.197        0.244 

ocdData$Group: BT 
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var 
       4.000        3.700        0.559        1.264        3.122        1.767        0.478 

ocdData$Group: NT 
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var 
       5.000        5.000        0.333        0.754        1.111        1.054        0.211 

Output 16.1 






ocdData$Group: CBT 
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var 
      13.500       13.400        0.600        1.357        3.600        1.897        0.142 

ocdData$Group: BT 
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var 
      14.500       15.200        0.663        1.501        4.400        2.098        0.138 

ocdData$Group: NT 
      median         mean      SE.mean CI.mean.0.95          var      std.dev     coef.var 
      14.000       15.000        0.745        1.686        5.556        2.357        0.157 


Output 16.2 
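
If you only want the group means (for example, to compare against the values calculated by hand), a lighter-weight alternative that does not need the pastecs package is sketched below. It uses aggregate() from base R; groupMeans is just a name chosen for illustration.

```r
Group <- gl(3, 10, labels = c("CBT", "BT", "NT"))
Actions <- c(5, 5, 4, 4, 5, 3, 7, 6, 6, 4, 4, 4, 1, 1, 4, 6, 5, 5, 2, 5,
             4, 5, 5, 4, 6, 4, 7, 4, 6, 5)
Thoughts <- c(14, 11, 16, 13, 12, 14, 12, 15, 16, 11, 14, 15, 13, 14, 15, 19,
              13, 18, 14, 17, 13, 15, 14, 14, 13, 20, 13, 16, 14, 18)
ocdData <- data.frame(Group, Actions, Thoughts)

# Mean of both outcome measures within each treatment group
groupMeans <- aggregate(cbind(Actions, Thoughts) ~ Group, data = ocdData, FUN = mean)
groupMeans
```

The means should agree with the mean entries in Outputs 16.1 and 16.2.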

Having looked at the data in summary form, we can start to look at assumptions. To 
check the homogeneity of covariance matrices we don’t do a formal test, but simply 
compare the values within them. To get the variance-covariance matrices for each group 
we can again use the by() function but in combination with the cov() function, which can 
be used to print the covariance matrix to the console. 

by(ocdData[, 2:3], ocdData$Group, cov) 

The above command takes columns 2 and 3 of the ocdData dataframe ( ocdData[, 2:3]), 
which means that we’re selecting the columns that contain the variables Actions and 
Thoughts. The command then applies the function cov() to these columns, but splits the 
output by the variable Group (ocdData$Group). 

Output 16.3 shows the variance-covariance matrices for each group. The diagonal 
elements represent the variances for each outcome measure and the off-diagonals are 
the covariances (i.e., the relationship between thoughts and actions). The variances for 
actions are a little different across groups (1.43, 3.12 and 1.11), with the largest variance 
being nearly three times as big as the smallest. The variances for thoughts are really quite 
similar (3.60, 4.40, and 5.56), with a variance ratio (the largest variance relative to the 
smallest) of about 1.5, which is below the threshold of 2. Looking at the covariances, these 
are also reasonably different (0.04, 2.51, and -1.11) reflecting the different relationships 
between thoughts and actions across the groups that we saw in Figure 16.3. On balance, 
there is evidence to suggest that the matrices are different across groups; however, given 
that the group sizes are equal we probably don’t need to worry too much about these 
differences. However, if we had different group sizes then remember that: (1) if the larger 
samples produce greater variances and covariances then the probability values will be 
conservative (and so significant findings can be trusted); and (2) if it is the smaller 
samples that produce the larger variances and covariances then the probability values will be 
liberal and so significant differences in the MANOVA should be treated with caution. In 
any case, for the current data it would be sensible to carry out a robust analysis as well 
as the normal one. 

ocdData$Group: CBT 

Actions Thoughts 
Actions 1.43333333 0.04444444 
Thoughts 0.04444444 3.60000000 


ocdData$Group: BT 

Actions Thoughts 
Actions 3.122222 2.511111 
Thoughts 2.511111 4.400000 


ocdData$Group: NT 

Actions Thoughts 
Actions 1.111111 -1.111111 
Thoughts -1.111111 5.555556 

Output 16.3 
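
The variance ratios quoted above can be computed rather than read off the matrices by eye. The following is a minimal sketch using base R only; groupVars and varianceRatio are names chosen for illustration.

```r
Group <- gl(3, 10, labels = c("CBT", "BT", "NT"))
Actions <- c(5, 5, 4, 4, 5, 3, 7, 6, 6, 4, 4, 4, 1, 1, 4, 6, 5, 5, 2, 5,
             4, 5, 5, 4, 6, 4, 7, 4, 6, 5)
Thoughts <- c(14, 11, 16, 13, 12, 14, 12, 15, 16, 11, 14, 15, 13, 14, 15, 19,
              13, 18, 14, 17, 13, 15, 14, 14, 13, 20, 13, 16, 14, 18)
ocdData <- data.frame(Group, Actions, Thoughts)

# Variance of each outcome within each group (rows = groups, columns = outcomes)
groupVars <- sapply(ocdData[, c("Actions", "Thoughts")],
                    function(score) tapply(score, ocdData$Group, var))

# Largest variance divided by the smallest, separately for each outcome
varianceRatio <- apply(groupVars, 2, function(v) max(v) / min(v))
round(varianceRatio, 2)
```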

The final assumption that we need to test is multivariate normality. We can do this using 
the mshapiro.test() function in the mvnormtest package. We need to apply this test to the 
groups individually, so the first thing to do is to extract the data for each group. We learnt 
how to do this in section 3.9.1. For example, to get the CBT group data we could execute: 

cbt<-ocdData[1:10, 2:3] 

This command creates a variable called cbt that is a subset of the ocdData dataframe. The 
square brackets indicate that we want a selection of the data, the 1:10 indicates the rows 
that we want to select (i.e., rows 1 to 10 inclusive), and 2:3 indicates the columns that we 
want to select (i.e., columns 2 and 3). This gives us the following data: 
   Actions Thoughts
1        5       14
2        5       11
3        4       16
4        4       13
5        5       12
6        3       14
7        7       12
8        6       15
9        6       16
10       4       11


The mshapiro.test() function needs these data in a format such that actions and thoughts 
appear in rows rather than columns, and the participants appear in columns rather than 
rows. Fortunately this can be done easily using the transpose function t(). This function 
simply transposes the rows and columns; so, by executing: 

cbt<-t(cbt) 

we change the variable cbt so that it is a transposed version of the original variable: 

          1  2  3  4  5  6  7  8  9 10
Actions   5  5  4  4  5  3  7  6  6  4
Thoughts 14 11 16 13 12 14 12 15 16 11

We can do the same for the BT and NT groups. However, it is quicker if we do the transpose 
at the same time as creating the variable (for cbt I split the process into two stages only 
so you could see what was happening). Therefore, executing: 

bt<-t(ocdData[11:20, 2:3]) 
nt<-t(ocdData[21:30, 2:3]) 

creates a variable bt, which is rows 11 to 20 and columns 2 and 3 of the original dataframe, 
and nt, which is rows 21 to 30 and columns 2 and 3 of the original dataframe. In both 
cases we apply the transpose function, t(), to get the extracted data in the correct format 
for mshapiro.test(). To apply the test, we simply execute the function on each of the three 
variables that we have just created: 

mshapiro.test(cbt) 
mshapiro.test(bt) 
mshapiro.test(nt) 
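
Selecting rows by number works here because the groups happen to sit in neat blocks of 10. As a hedged alternative, the sketch below splits the outcome columns by the Group factor and transposes each piece in one step; groupMatrices is a name chosen for illustration. Each element of the resulting list could then be passed to mshapiro.test() (not run here, because it needs the mvnormtest package).

```r
Group <- gl(3, 10, labels = c("CBT", "BT", "NT"))
Actions <- c(5, 5, 4, 4, 5, 3, 7, 6, 6, 4, 4, 4, 1, 1, 4, 6, 5, 5, 2, 5,
             4, 5, 5, 4, 6, 4, 7, 4, 6, 5)
Thoughts <- c(14, 11, 16, 13, 12, 14, 12, 15, 16, 11, 14, 15, 13, 14, 15, 19,
              13, 18, 14, 17, 13, 15, 14, 14, 13, 20, 13, 16, 14, 18)
ocdData <- data.frame(Group, Actions, Thoughts)

# One transposed matrix per group: outcomes in rows, participants in columns
groupMatrices <- lapply(split(ocdData[, c("Actions", "Thoughts")], ocdData$Group),
                        function(scores) t(as.matrix(scores)))
names(groupMatrices)
groupMatrices$CBT
```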

Output 16.4 shows the results of the three tests: if the p value is less than .05 then our 
data deviate from multivariate normality. It’s clear that for the CBT (p = .777) and BT 
(p = .175) groups there is no problem because both results are non-significant; however, for the 
NT group (p = .03) the data deviate significantly from multivariate normality. 


> mshapiro.test(cbt) 

        Shapiro-Wilk normality test 

data:  Z 
W = 0.9592, p-value = 0.7767 

> mshapiro.test(bt) 

        Shapiro-Wilk normality test 

data:  Z 
W = 0.8912, p-value = 0.175 

> mshapiro.test(nt) 

        Shapiro-Wilk normality test 

data:  Z 
W = 0.826, p-value = 0.02998 


Output 16.4 
FIGURE 16.6 

AQ plot of the OCD data (panel titles include ‘Outliers based on 97.5% quantile’ and 
‘Outliers based on adjusted quantile’; axis label: ‘Ordered squared robust distance’) 


We can also look for multivariate outliers using the aq.plot() function from the mvoutlier 
package. All we need to do is enter the columns of the dataframe containing the outcome 
measures into the function. For example, executing: 

aq.plot(ocdData[, 2:3]) 

will produce the plot for the current data; remember that ocdData[, 2:3] means ‘select all of the 
rows (because there is nothing before the comma) but only columns 2 to 3 (because we have 
specified 2:3 after the comma)’. In other words, we’re selecting only the variables Actions 
(column 2) and Thoughts (column 3). The resulting plot is shown in Figure 16.6. These plots show 
the case numbers (i.e., the row number in the dataframe) and you need to look for values in red 
(or, because this book isn’t in colour, blue in Figure 16.6) in all but the top right graph. You can 
see that row 26 might be an outlier. In the top right plot, you are looking for any cases that fall 
to the right of the vertical line labelled 97.5% Quantile. Again, row 26 of the dataframe has been 
identified. These plots, therefore, suggest that row 26 might be an outlier. You could consider 
deleting this case to see if it makes the data multivariate normal, or leave the case in and conduct 
a robust MANOVA to combat the effects of the outlier. 



SELF-TEST 

Delete case 26 from the dataframe and redo the 
Shapiro test of multivariate normality. 
16.6.6. Setting contrasts 


One way to follow up a MANOVA is to look at individual univariate ANOVAs for each 
dependent variable. For these tests, you can specify contrasts just as we have in several 
other chapters (see, for example, section 10.6.7). For this example it makes sense to compare 
each of the treatment groups to the no-treatment control group. This is the treatment 
contrast described in Table 10.6. The no-treatment control group was coded as the last 
category, so we could set this contrast by executing: 

contrasts(ocdData$Group)<-contr.treatment(3, base = 3) 

The contrasts(ocdData$Group) just tells R that we want to set the contrast for the Group 
variable, then contr.treatment sets the contrast to be a treatment contrast. The 3 indicates 
that Group has three levels, and base = 3 sets level 3 (i.e., NT) as the baseline category. 

I like to set the contrasts manually so that I can give them names that will help me to 
interpret the output, so alternatively, we could set the contrasts by executing: 

CBT_vs_NT<-c(1, 0, 0) 
BT_vs_NT<-c(0, 1, 0) 
contrasts(ocdData$Group)<-cbind(CBT_vs_NT, BT_vs_NT) 

One important point here is that we’re using a non-orthogonal contrast, which means that 
we can’t look at Type III sums of squares because their computation requires orthogonal 
contrasts. However, we have only one predictor (Group) so this doesn’t matter, because 
the Type I sums of squares produced by R will be the same as the Type III when there is 
only one variable in the model (refer back to Jane Superbrain Box 11.1 for an explanation 
of why). 
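
To convince yourself that the manual contrasts code the same comparisons as contr.treatment(3, base = 3), you can print both matrices side by side. This is a small sketch for illustration only; the object names builtIn and manual are my own.

```r
# The built-in treatment contrast with the third level (NT) as baseline
builtIn <- contr.treatment(3, base = 3)
builtIn

# The same contrast set by hand, with informative column names
CBT_vs_NT <- c(1, 0, 0)
BT_vs_NT <- c(0, 1, 0)
manual <- cbind(CBT_vs_NT, BT_vs_NT)
manual

# The codes are identical; only the labels differ
all(unname(builtIn) == unname(manual))
```

In both matrices the baseline group (NT, the third row) is coded 0 in every column, so each parameter compares one therapy group with the no-treatment control.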


16.6.7. The MANOVA model 


To create a MANOVA model we use the manova() function, which is just the lm() function 
in disguise. Therefore, we can use what we learnt in Chapter 7 to understand how the function 
works. The function takes exactly the same form as aov(), which we used in Chapter 
10. It has the general form: 

newModel<-manova(outcome ~ predictor(s), data = dataFrame, na.action = an action) 

in which: 

• newModel is an object created that contains information about the model. We can 
get summary statistics for this model by executing summary(newModel) for the main 
MANOVA summary. 

• outcome is a single object containing the variables that you’re trying to predict (i.e., 
the dependent variables). In this example it will be Actions and Thoughts. 

• predictor(s) lists the variable or variables from which you’re trying to predict the 
outcome variables (i.e., the independent variable(s)). In this example it will be the 
variable Group. In more complex designs we can specify several predictors or independent 
variables, just as we have in previous chapters. 

• dataFrame is the name of the dataframe from which your outcome and predictor 
variables come. 

• na.action is an optional command. If you have complete data (as we have here) you 
can ignore it, but if you have missing values (i.e., NAs in the dataframe) then it can be 
useful to use na.action = na.exclude, which will exclude all cases with missing values. 

As with most of the models in this book, we specify a model in the function of the 
form ‘outcome ~ predictor(s)’. In the case of MANOVA there are several outcomes, so 
the model becomes ‘outcomes ~ predictor(s)’. To put multiple outcomes into the model, 
we have to bind the variables together into a single entity using the cbind() function that 
we have encountered many times before. In the current example, we want to combine 
Thoughts and Actions, so we can create a single outcome object by executing: 

outcome<-cbind(ocdData$Actions, ocdData$Thoughts) 

This command creates an object called outcome, which contains the Actions and Thoughts 
variables of the ocdData dataframe pasted together in columns. We use this new object 
as the outcome in our model, and specify any predictors as we have in previous chapters. 
Therefore, for this example, we could estimate the model by executing: 

ocdModel<-manova(outcome ~ Group, data = ocdData) 

This command creates a model called ocdModel, which predicts the object called outcome 
(which, remember, includes the variables Thoughts and Actions) from the independent 
variable Group. If you had several independent variables you could add them in by using 
a plus symbol (remember to also add in the interaction), for example, ‘outcome ~ Group 
+ IV2 + Group:IV2’, or by using an asterisk to automatically include all main effects and 
interactions, for example, ‘outcome ~ Group*IV2’. 
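
As a variation on the commands above (a sketch, not the only way to do it), you can bind the outcomes together inside the model formula itself. Because the variables are then looked up in the dataframe by name, the columns of the outcome keep the names Actions and Thoughts, which I would expect to make later output easier to read than generic response numbers.

```r
Group <- gl(3, 10, labels = c("CBT", "BT", "NT"))
Actions <- c(5, 5, 4, 4, 5, 3, 7, 6, 6, 4, 4, 4, 1, 1, 4, 6, 5, 5, 2, 5,
             4, 5, 5, 4, 6, 4, 7, 4, 6, 5)
Thoughts <- c(14, 11, 16, 13, 12, 14, 12, 15, 16, 11, 14, 15, 13, 14, 15, 19,
              13, 18, 14, 17, 13, 15, 14, 14, 13, 20, 13, 16, 14, 18)
ocdData <- data.frame(Group, Actions, Thoughts)

# cbind() inside the formula: the outcome columns keep their variable names
ocdModel <- manova(cbind(Actions, Thoughts) ~ Group, data = ocdData)
colnames(coef(ocdModel))

# Default multivariate test statistic (Pillai's trace)
summary(ocdModel)
```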

To see the output of the model we use the summary command; by default, R produces 
Pillai’s trace (which is a sensible choice), but we can see the other test statistics 
by including the test = option. For example, to see all four test statistics we would need 
to execute: 

summary(ocdModel, intercept = TRUE) 
summary(ocdModel, intercept = TRUE, test = "Wilks") 
summary(ocdModel, intercept = TRUE, test = "Hotelling") 
summary(ocdModel, intercept = TRUE, test = "Roy") 

The first command produces Pillai’s trace (because test = is omitted), and the rest produce 
the others by overriding the default. Output 16.5 shows the main table of results. Test 
statistics are quoted for the intercept of the model (even MANOVA can be characterized 
as a regression model, although how this is done is beyond the scope of my brain) and 
for the Group variable. For our purposes, the group effects are of interest because they 
tell us whether or not the therapies had an effect on the OCD clients. You’ll see that the 
four multivariate test statistics and their values correspond to those calculated in sections 
16.4.4.2-16.4.4.5. In the next column (approx F) these values are transformed into an 
approximate F-ratio, whose degrees of freedom are given in the num Df and den Df columns. 
The column of real interest, however, is the one containing the significance 
values of these F-ratios. For these data, Pillai’s trace (p = .049), Wilks’s lambda 
(p = .050) and Roy’s largest root (p = .020) all reach the criterion for significance of .05. 
However, Hotelling’s trace (p = .051) is non-significant by this criterion. This scenario is 
interesting, because the test statistic we choose determines whether or not we reject the 
null hypothesis that there are no between-group differences. However, given what we 
know about the robustness of Pillai’s trace when sample sizes are equal, we might be well 
advised to trust the result of that test statistic, which indicates a significant difference. This 
example highlights the additional power associated with Roy’s root (you should note how 
this statistic is considerably more significant than all others) when the test assumptions 
have been met and when the group differences are focused on one variate (which they are 
in this example, as we will see later). 
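
If you want the four significance values next to each other rather than in four separate tables, something like the following sketch would do it. It relies on the stats element of the object returned by summary() for a manova model; pValues and testNames are names chosen for illustration.

```r
Group <- gl(3, 10, labels = c("CBT", "BT", "NT"))
Actions <- c(5, 5, 4, 4, 5, 3, 7, 6, 6, 4, 4, 4, 1, 1, 4, 6, 5, 5, 2, 5,
             4, 5, 5, 4, 6, 4, 7, 4, 6, 5)
Thoughts <- c(14, 11, 16, 13, 12, 14, 12, 15, 16, 11, 14, 15, 13, 14, 15, 19,
              13, 18, 14, 17, 13, 15, 14, 14, 13, 20, 13, 16, 14, 18)
ocdData <- data.frame(Group, Actions, Thoughts)
ocdModel <- manova(cbind(Actions, Thoughts) ~ Group, data = ocdData)

# Pull out the p-value for Group under each of the four test statistics
testNames <- c("Pillai", "Wilks", "Hotelling-Lawley", "Roy")
pValues <- sapply(testNames, function(statistic) {
  summary(ocdModel, test = statistic)$stats["Group", "Pr(>F)"]
})
round(pValues, 3)
```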
Pillai's trace: 

            Df  Pillai approx F num Df den Df  Pr(>F)    
(Intercept)  1 0.98285   745.23      2     26 < 2e-16 ***
Group        2 0.31845     2.56      4     54 0.04904 *  
Residuals   27                                           

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Wilks's lambda: 

            Df   Wilks approx F num Df den Df  Pr(>F)    
(Intercept)  1 0.01715   745.23      2     26 < 2e-16 ***
Group        2 0.69851     2.55      4     52 0.04966 *  
Residuals   27                                           

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Hotelling's trace: 

            Df Hotelling-Lawley approx F num Df den Df Pr(>F)    
(Intercept)  1           57.325   745.23      2     26 <2e-16 ***
Group        2            0.407     2.55      4     50 0.0508 .  
Residuals   27                                                   

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Roy's largest root: 

            Df    Roy approx F num Df den Df  Pr(>F)    
(Intercept)  1 57.325   745.23      2     26 < 2e-16 ***
Group        2  0.335     4.52      2     27 0.02027 *  
Residuals   27                                          

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Output 16.5 



Type II and III sums of squares 


As with other times we have used the lm() function, or some variant of it, R will, by default, produce Type I sums 
of squares but it is usually preferable in (M)ANOVA to look at Type II (or even Type III) sums of squares. The differences 
are explained in Jane Superbrain Box 11.1. When you have one predictor in the model, as we have in the 
current example, Type I, II and III sums of squares will give the same results so it doesn’t matter. However, with 
two or more predictors in the model you might prefer Type II or III sums of squares because they do not depend 
upon the order in which you enter variables into the model. In which case we can use the Anova() function from 
the car package, as we have in previous chapters, to obtain these sums of squares. In the current example, having 
created a model, ocdModel, we could display the Type II or III sums of squares by executing: 


Anova(ocdModel, type = "II") 
Anova(ocdModel, type = "III") 


It’s also worth bearing in mind that Type I, II and III sums of squares yield the same results when you have a 
balanced design (i.e., equal numbers of cases in all combinations of your predictor variables). 
From this result we should probably conclude that the type of therapy employed had a 
significant effect on OCD. The nature of this effect is not clear from the multivariate test 
statistic: first, it tells us nothing about which groups differed from which; and second it 
tells us nothing about whether the effect of therapy was on the obsession-related thoughts, 
the obsession-related behaviours, or a combination of both. To determine the nature of the 
effect, we can look at univariate tests. 


16.6.8. Follow-up analysis: univariate test statistics 


If we want to follow up the analysis with univariate analyses of the individual outcome 
measures, then we can simply execute: 

summary.aov(ocdModel) 

This produces Output 16.6, which shows the ANOVA summary table for the dependent 
variables. The table labelled Response 1 is for the Actions variable and Response 2 is for 
the Thoughts variable. The rows labelled Group show the values of the sums of squares for 
both actions and thoughts (these values correspond to the values of SS_M calculated in sections 
16.4.3.1 and 16.4.3.2, respectively). The row labelled Residuals contains information 
about the residual sums of squares and mean squares for each of the dependent variables: 
these values of SS_R were calculated in sections 16.4.3.1 and 16.4.3.2, and I urge you to 
look back to these sections to consolidate what these values mean. 

 Response 1 : 
            Df Sum Sq Mean Sq F value  Pr(>F)  
Group        2 10.467  5.2333  2.7706 0.08046 .
Residuals   27 51.000  1.8889                  

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 Response 2 : 
            Df  Sum Sq Mean Sq F value Pr(>F)
Group        2  19.467  9.7333  2.1541 0.1355
Residuals   27 122.000  4.5185               



Output 16.6 

The important parts of this table are the columns labelled F value and Pr(>F) in which 
the F-ratios for each univariate ANOVA and their significance values are listed. What 
should be clear from Output 16.6, and the calculations made in sections 16.4.3.1 and 
16.4.3.2, is that the values associated with the univariate ANOVAs conducted after the 
MANOVA are identical to those obtained if one-way ANOVA was conducted on each 
dependent variable. This fact illustrates that MANOVA offers only hypothetical protection 
against inflated Type I error rates: there is no real-life adjustment made to the values 
obtained. 
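
You can check this equivalence for yourself. The sketch below extracts the F-ratios from summary.aov() applied to the MANOVA model and compares them with those from two separate one-way ANOVAs fitted with aov(); fromManova and fromAnova are names chosen for illustration.

```r
Group <- gl(3, 10, labels = c("CBT", "BT", "NT"))
Actions <- c(5, 5, 4, 4, 5, 3, 7, 6, 6, 4, 4, 4, 1, 1, 4, 6, 5, 5, 2, 5,
             4, 5, 5, 4, 6, 4, 7, 4, 6, 5)
Thoughts <- c(14, 11, 16, 13, 12, 14, 12, 15, 16, 11, 14, 15, 13, 14, 15, 19,
              13, 18, 14, 17, 13, 15, 14, 14, 13, 20, 13, 16, 14, 18)
ocdData <- data.frame(Group, Actions, Thoughts)
ocdModel <- manova(cbind(Actions, Thoughts) ~ Group, data = ocdData)

# F-ratio for Group in each univariate table produced from the MANOVA model
fromManova <- sapply(summary.aov(ocdModel), function(tab) tab[["F value"]][1])

# F-ratio for Group from two separate one-way ANOVAs
fromAnova <- c(
  summary(aov(Actions ~ Group, data = ocdData))[[1]][["F value"]][1],
  summary(aov(Thoughts ~ Group, data = ocdData))[[1]][["F value"]][1]
)

round(unname(fromManova), 4)
round(fromAnova, 4)
```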

The values of p in Output 16.6 indicate that there was a non-significant difference 
between therapy groups in terms of both obsession-related thoughts (p = .136) and 
obsession-related behaviours (p = .080). These two results should lead us to conclude that the 
type of therapy has had no significant effect on the levels of OCD experienced by clients. 
Those of you who are still awake may have noticed something odd about this example: the 
multivariate test statistics led us to conclude that therapy had a significant impact on OCD, 
yet the univariate results indicate that therapy has not been successful. 
SELF-TEST 

Why might the univariate tests be non-significant 
when the multivariate tests were significant? 


The reason for the anomaly in these data is simple: the multivariate test takes account 
of the correlation between dependent variables, and so for these data it has more power 
to detect group differences. With this knowledge in mind, the univariate tests are not 
particularly useful for interpretation, because the groups differ along a combination of the 
dependent variables. To see how the dependent variables interact we need to carry out a 
discriminant function analysis, which will be described in due course. 


16.6.9. Contrasts 


I need to begin this section by reminding you that because the univariate ANOVAs were 
both non-significant we should not interpret these contrasts. However, purely to give you 
an example to follow for when your main analysis is significant, we’ll look at the contrasts. 
The contrasts are not part of the MANOVA model, and so to generate the output for them 
you have to create separate linear models for each outcome measure. This is basically the 
same as doing a one-way ANOVA on each outcome measure. So, for Thoughts and Actions 
we could create the following models using the lm() function (on which aov() is based; see Chapter 10): 

actionModel<-lm(Actions ~ Group, data = ocdData) 
thoughtsModel<-lm(Thoughts ~ Group, data = ocdData) 

The first command creates a model, actionModel, based on predicting the variable Actions 
from Group (Actions ~ Group) and the second command does much the same but predicting 
Thoughts. We can get the contrast parameters by using summary.lm(), just as we did 
in Chapter 10: 

summary.lm(actionModel) 
summary.lm(thoughtsModel) 

In section 16.6.6 I suggested carrying out a contrast that compares each of the therapy groups 
to the no-treatment control group. The results of these contrasts are shown in Output 16.7 
(for Actions) and Output 16.8 (for Thoughts). The contrasts will be labelled helpfully if, 
like I did, you set the contrasts manually and give them sensible names. The main thing to 
notice (from the values of Pr(>|t|)) is that when we compare CBT to NT there are no 
significant differences in thoughts (p = .104) or behaviours (p = .872), because both values are 
above the .05 threshold. However, comparing BT to NT, there is no significant difference in 
thoughts (p = .835) but there is a significant difference in behaviours between the groups 
(p = .044, which is less than .05). This is a little unexpected because the univariate ANOVA for 
behaviours was non-significant and so we would not expect there to be group differences. 
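
Putting the pieces of this section together, a self-contained sketch for the Actions contrasts looks like this (the Thoughts model follows the same pattern). It assumes the manually named contrasts from section 16.6.6; actionCoefs is a name chosen for illustration.

```r
Group <- gl(3, 10, labels = c("CBT", "BT", "NT"))
Actions <- c(5, 5, 4, 4, 5, 3, 7, 6, 6, 4, 4, 4, 1, 1, 4, 6, 5, 5, 2, 5,
             4, 5, 5, 4, 6, 4, 7, 4, 6, 5)
ocdData <- data.frame(Group, Actions)

# Each therapy group compared with the no-treatment control (coded 0 throughout)
contrasts(ocdData$Group) <- cbind(CBT_vs_NT = c(1, 0, 0), BT_vs_NT = c(0, 1, 0))

actionModel <- lm(Actions ~ Group, data = ocdData)

# Coefficient table: estimates, standard errors, t-values and p-values
actionCoefs <- coef(summary(actionModel))
round(actionCoefs, 4)
```

The intercept is the mean of the baseline (NT) group, and each contrast estimate is the difference between a therapy group's mean and that baseline.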

Coefficients: 
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      5.0000     0.4346  11.504 6.47e-12 ***
GroupCBT_vs_NT  -0.1000     0.6146  -0.163   0.8720    
GroupBT_vs_NT   -1.3000     0.6146  -2.115   0.0438 *  

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Output 16.7 
Coefficients: 
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     15.0000     0.6722  22.315   <2e-16 ***
GroupCBT_vs_NT  -1.6000     0.9506  -1.683    0.104    
GroupBT_vs_NT    0.2000     0.9506   0.210    0.835    

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Output 16.8 





CRAMMING SAM’S TIPS 


MANOVA 


• MANOVA is used to test the difference between groups across several dependent variables simultaneously. 

• The test assumes multivariate normality and homogeneity of covariance matrices. This latter assumption can be ignored 
when sample sizes are equal because some MANOVA test statistics are robust to violations of this assumption. Multivariate 
normality can be tested with a multivariate version of the Shapiro-Wilk test: if it is significant (p < .05) then the assumption 
is violated. 

• There are four test statistics that can be used in MANOVA (Pillai’s trace, Wilks’s lambda, Hotelling’s trace and Roy’s 
largest root). I recommend using Pillai’s trace. If the p-value of this statistic is less than .05 then the groups differ 
significantly with respect to the dependent variables. 

• ANOVAs can be used to follow up the MANOVA (a different ANOVA for each dependent variable). These ANOVAs can in turn 
be followed up using contrasts (see Chapters 10-14). Personally I don’t recommend this approach and suggest conducting 
a discriminant function analysis. 


16.7. Robust MANOVA 


Wilcox provides functions for two robust methods for MANOVA (Wilcox, 2005), both of 
which are based on ranking the data (see Chapter 15). To access these tests we need to load 
the WRS package (see section 5.8.4.). There are two functions that we will look at: 

• mulrank(): This performs a MANOVA on the ranked data using Munzel and Brunner’s 
method (Munzel & Brunner, 2000). 

• cmanova(): This performs Choi and Marden’s (1997) robust test based on the ranked 
data. It is an extension of the Kruskal-Wallis test that was described in Chapter 15. 

Both of these functions can be used only when you have one predictor (i.e., one independent 
variable). For more complex designs you should accept defeat. Remember that our data 
are currently in this format (I’ve edited out some cases): 


   Group Actions Thoughts
1    CBT       5       14
...
10   CBT       4       11
11    BT       4       14
...
20    BT       5       17
21    NT       4       13
...
30    NT       5       18
FIGURE 16.7 

Restructuring the OCD data for robust MANOVA 


The robust functions need the data to be in wide format rather than long (see Chapter 3). 
Figure 16.7 shows the existing data format and how we need it to look (wide). Essentially we 
want levels of the independent variable (Group) and outcome measures (Thoughts and Actions) 
to be represented in different columns. The outcome measures are already spread across 
different columns (Thoughts and Actions), but the treatment group is differentiated by different 
rows of data (rows 1-10 are those in a CBT group, rows 11-20 are the BT group, and so on). 
Therefore, we need to take the rows representing people who were in the BT and NT groups 
and shift them into columns alongside the columns currently labelled Thoughts and Actions. 

We can do this restructuring using the melt() and cast() functions from the reshape 
package. To get the restructuring to work, we need to add a variable to our dataframe that 
identifies the rows in the wide format. Notice in Figure 16.7 that the data are made up of 
six chunks that represent the three treatment groups and the two outcome measures. We 
want to move the chunks that are currently stacked on top of each other so that they are 
beside each other (Figure 16.7). To do this, R needs to know what row a particular score 
will end up in when we move each block of scores from the stacks into the columns. The 
easiest approach is simply to create a variable (called row) that identifies within each chunk 
the row number of a given score. In other words, it will be a value telling us whether the 
score is the first, second, third, etc. score within the chunk. At the moment, the chunks are 
stacked on top of each other, so we want a variable that is the sequence of numbers 1 to 
10 repeated for the three different treatment groups (because they all contain 10 rows of 
data). We can add this variable to the dataframe by executing: 

ocdData$row<-rep(1:10, 3) 

Executing this command creates a variable row in the dataframe ocdData, that is the numbers 
1 to 10 repeated three times. The structure of the data will be the same as before - it’s just 
that we have a new variable called row that identifies the scores within each treatment group. 
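
As a quick check of what this command produces, the sketch below builds the same index for three groups of 10 cases and compares it with a counter computed within each group. The Group vector is typed in here purely for illustration, and the ave() alternative is only an aside (it is not part of the restructuring described in this section) that may be useful when groups differ in size.

```r
# Illustrative grouping vector: 10 cases in each of three treatment groups
Group <- rep(c("CBT", "BT", "NT"), each = 10)

# The approach used in the text: the numbers 1 to 10, repeated three times
row <- rep(1:10, 3)

# An alternative that counts cases within each group, whatever the group sizes
rowByGroup <- ave(seq_along(Group), Group, FUN = seq_along)

# Both approaches give the same index for these data
all(row == rowByGroup)
```
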

Next we need to make the data molten so that we can cast them into the wide format. To 
do this we use the melt() function (see section 3.9.4). Remember that in this function we 
differentiate variables that identify attributes of the scores (in this example, Group and row 
tell us about a given score, for example, that it was the fifth score in the CBT group) from 
the scores themselves (in this case the columns labelled Actions and Thoughts both contain 
scores). Attributes are specified with the id option, and scores with the measured option. 
Therefore, we can create a molten dataframe called ocdMelt by executing: 

ocdMelt<-melt(ocdData, id = c("Group", "row"), measured = c("Actions", "Thoughts")) 

The data now look like this (I have edited out many cases to save space): 


   Group row variable value
1    CBT   1  Actions     5
10   CBT  10  Actions     4
11    BT   1  Actions     4
19    BT   9  Actions     2
26    NT   6  Actions     4
27    NT   7  Actions     7
33   CBT   3 Thoughts    16
34   CBT   4 Thoughts    13
41    BT   1 Thoughts    14
42    BT   2 Thoughts    15
51    NT   1 Thoughts    13
60    NT  10 Thoughts    18


The variable that differentiates whether the outcome measure was thoughts or actions 
has been labelled variable and the variable that contains the frequencies of thoughts/ 
behaviours is called value. These labels are not that informative, so let’s rename them as 
Outcome_Measure and Frequency using the names() function. 

names(ocdMelt)<-c("Group", "row", "Outcome_Measure", "Frequency") 

Executing this command takes the dataframe ocdMelt and assigns the names in c() to each 
column. As such, our variables all now have names that relate to what they represent. 

Finally, we want to cast our data into the wide format using cast(). To do this we use a 
formula in the form: variables specifying the rows ~ variables specifying the columns. In 
this case, row tells us in which row to place a score, and we want the Group and Outcome_Measure 
variables split across different columns, so we’d use the formula row ~ Group + Outcome_Measure 
(see footnote 6). Therefore, we can make a wide dataframe called ocdRobust by 
executing: 

ocdRobust<-cast(ocdMelt, row ~ Group + Outcome_Measure, value = "Frequency") 

Note that we have applied this command to the molten data set (ocdMelt). The value = 
“Frequency” explicitly tells the function in which column to find the outcome variable 
(without this command the function will take an educated guess, but it’s good practice to 
be specific). 

6 It’s important that you specify Group and Outcome_Measure in this order because this arranges the data correctly 
for Wilcox’s functions. If you use row ~ Outcome_Measure + Group then the resulting data would be structured 
as CBT_Action, BT_Action, NT_Action, CBT_Thoughts, BT_Thoughts, NT_Thoughts. 

The result is that the data have been transformed to the wide format. However, because 
we added the variable row to the dataframe, our new dataframe also contains this variable, 
and for the analysis we don’t want it. We can remove this variable by executing: 

ocdRobust$row<-NULL 

You should find that you now have a wide format set of data: 
ocdRobust 

CBT_Action CBT_Thoughts BT_Action BT_Thoughts NT_Action NT_Thoughts
         5           14         4          14         4          13
         5           11         4          15         5          15
         4           16         1          13         5          14
         4           13         1          14         4          14
         5           12         4          15         6          13
         3           14         6          19         4          20
         7           12         5          13         7          13
         6           15         5          18         4          16
         6           16         2          14         6          14
         4           11         5          17         5          18


It’s important to note the order of the columns: the hierarchy of the independent variables 
is Group followed by Outcome_Measure. In other words, we have taken the six 
groups of scores and first divided them into CBT, BT and NT, then within these groups we 
have subdivided according to which outcome measure was used. 
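
The column order described above can also be checked without the reshape package. The following is a minimal base-R sketch, not the melt()/cast() route used in this section: the scores are typed in from the listings in this section, and the names blocks and ocdWide are hypothetical. The column labels it produces (e.g., CBT_Actions) differ slightly from those in the listing, but the layout is the same.

```r
# OCD data typed in from the listings in this section (30 cases)
ocdData <- data.frame(
  Group = factor(rep(c("CBT", "BT", "NT"), each = 10),
                 levels = c("CBT", "BT", "NT")),
  Actions = c(5, 5, 4, 4, 5, 3, 7, 6, 6, 4,
              4, 4, 1, 1, 4, 6, 5, 5, 2, 5,
              4, 5, 5, 4, 6, 4, 7, 4, 6, 5),
  Thoughts = c(14, 11, 16, 13, 12, 14, 12, 15, 16, 11,
               14, 15, 13, 14, 15, 19, 13, 18, 14, 17,
               13, 15, 14, 14, 13, 20, 13, 16, 14, 18)
)

# For each group in turn, pull out its two outcome columns and label them
blocks <- lapply(levels(ocdData$Group), function(g) {
  block <- ocdData[ocdData$Group == g, c("Actions", "Thoughts")]
  names(block) <- paste(g, names(block), sep = "_")
  rownames(block) <- NULL   # rows are now simply 1 to 10 within each block
  block
})

# Place the three blocks side by side: Group first, outcome measure within Group
ocdWide <- do.call(cbind, blocks)
names(ocdWide)
colMeans(ocdWide)
```

The column means give a quick sanity check, because they should equal the group means on each outcome measure.
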

The functions mulrank() and cmanova() both take the same general form: 

mulrank(number of groups, number of outcome measures, data) 
cmanova(number of groups, number of outcome measures, data) 

We need only specify the dataframe (ocdRobust) and then the number of groups (three in 
this case) and the number of outcome measures (two in this case). Therefore, we can do a 
robust MANOVA based on ranks by executing: 

mulrank(3, 2, ocdRobust) 
cmanova(3, 2, ocdRobust) 


mulrank()                              cmanova()

$test.stat                             $test.stat
[1] 1.637357                           [1] 9.057746

$nu1                                   $df
[1] 3.643484                           [1] 4

$p.value                               $p.value
          [,1]                                   [,1]
[1,] 0.1675409                         [1,] 0.0596722

$N
[1] 30

$q.hat
          [,1]      [,2]
[1,] 0.5533333 0.3666667
[2,] 0.3750000 0.5900000
[3,] 0.5716667 0.5433333

Output 16.9 

The output of both of these commands is shown in Output 16.9. For mulrank() (left-hand 
side of Output 16.9) we are given a test statistic for the type of treatment ($test.stat) as well 
as the corresponding p-value ($p.value). We could conclude that there was no significant 
main effect of the type of treatment on outcomes of OCD, F = 1.64, p = .168. The numbers 
under $q.hat tell us the relative effects (i.e., the typical ranks across the combinations of 
groups in the rows and outcome measures in the columns). We could relabel this grid as: 


        [Actions]  [Thoughts]
[CBT]   0.5533333   0.3666667
[ BT]   0.3750000   0.5900000
[ NT]   0.5716667   0.5433333


This shows that in the NT group the ranks were fairly similar for actions and thoughts 
(0.57 and 0.54). For BT the ranks were lower for actions (0.38) than thoughts (0.59), and 
for CBT the reverse was true: ranks were lower for thoughts (0.37) than actions (0.55). In 
other words, CBT affected thoughts more than actions, and BT affected actions more than 
thoughts. However, the overall effect was not significant. 
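
To see where the relative effects come from, the sketch below recomputes them by hand from the OCD scores (typed in from the listings in this section). It assumes that a relative effect is the mean midrank of a group on an outcome, minus 0.5, divided by the total sample size; this definition reproduces the $q.hat values in Output 16.9, but it is not necessarily how mulrank() arrives at them internally. The function name relativeEffect is hypothetical.

```r
# OCD data typed in from the listings in this section
Group <- factor(rep(c("CBT", "BT", "NT"), each = 10),
                levels = c("CBT", "BT", "NT"))
Actions <- c(5, 5, 4, 4, 5, 3, 7, 6, 6, 4,
             4, 4, 1, 1, 4, 6, 5, 5, 2, 5,
             4, 5, 5, 4, 6, 4, 7, 4, 6, 5)
Thoughts <- c(14, 11, 16, 13, 12, 14, 12, 15, 16, 11,
              14, 15, 13, 14, 15, 19, 13, 18, 14, 17,
              13, 15, 14, 14, 13, 20, 13, 16, 14, 18)

# Rank all 30 scores on one outcome together (tied scores share the average
# rank), average the ranks within each group, then rescale to run from 0 to 1
relativeEffect <- function(score, group) {
  meanRank <- sapply(split(rank(score), group), mean)
  (meanRank - 0.5) / length(score)
}

qHat <- cbind(Actions = relativeEffect(Actions, Group),
              Thoughts = relativeEffect(Thoughts, Group))
round(qHat, 7)
```

A value above 0.5 means that scores in that group tend to rank higher than a randomly chosen score from the whole sample, and a value below 0.5 means they tend to rank lower.
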

The output of cmanova() (right-hand side of Output 16.9) tells us much the same things: 
we get a test statistic ($test.stat), the degrees of freedom ($df) and an associated p-value 
($p.value). We could conclude that there was no significant main effect of the type of treatment 
on outcomes of OCD, H(4) = 9.06, p = .060. 
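
The p-value reported by cmanova() can be checked by hand. The sketch below assumes that the test statistic is referred to a chi-square distribution with $df degrees of freedom (as is the case for the Kruskal-Wallis test); under that assumption it reproduces the value in Output 16.9.

```r
# Values copied from the cmanova() part of Output 16.9
H  <- 9.057746
df <- 4

# Upper-tail probability of a chi-square distribution with df degrees of freedom
pValue <- pchisq(H, df = df, lower.tail = FALSE)
round(pValue, 7)
```
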


16.8. Reporting results from MANOVA 


Reporting a MANOVA is much like reporting an ANOVA. As you can see in Output 16.5, 
the multivariate tests are converted into approximate Fs, and people often report these 
Fs just as they would for ANOVA (i.e., they give details of the F-ratio and the degrees 
of freedom from which it was calculated). For our effect of group, we would report the 
hypothesis df and the error df. Therefore, we could report these analyses as: 

✓ There was a significant effect of therapy on the number of obsessive thoughts and 
behaviours, F(4, 54) = 2.56, p < .05. 

However, in my opinion, the multivariate test statistic should be quoted as well. There 
are four different multivariate tests reported in Output 16.5; I’ll report each one in turn 
(note that the degrees of freedom and value of F change), but in reality you would just 
report one of the four: 

✓ Using Pillai’s trace, there was a significant effect of therapy on the number of obsessive 
thoughts and behaviours, V = 0.32, F(4, 54) = 2.56, p < .05. 

✓ Using Wilks’s lambda statistic, there was a significant effect of therapy on the number 
of obsessive thoughts and behaviours, Λ = 0.70, F(4, 52) = 2.56, p < .05. 

✓ Using Hotelling’s trace statistic, there was not a significant effect of therapy on the 
number of obsessive thoughts and behaviours, T = 0.41, F(4, 50) = 2.55, p > .05. 

✓ Using Roy’s largest root, there was a significant effect of therapy on the number of 
obsessive thoughts and behaviours, Θ = 0.35, F(2, 27) = 4.52, p < .05. 

We can also report the follow-up ANOVAs in the usual way (see Output 16.6): 

✓ Using Pillai’s trace, there was a significant effect of therapy on the number of obsessive 
thoughts and behaviours, V = 0.32, F(4, 54) = 2.56, p < .05. However, separate univariate 
ANOVAs on the outcome variables revealed non-significant treatment effects on 
obsessive thoughts, F(2, 27) = 2.15, p > .05, and behaviours, F(2, 27) = 2.77, p > .05. 

If you have used a robust MANOVA (Output 16.9) then you might report this as follows: 

✓ A MANOVA was conducted on the ranked data using Munzel and Brunner’s (2000) 
method, implemented in R using the mulrank() function (Wilcox, 2005). There was no 
significant main effect of the type of treatment on outcomes of OCD, F = 1.64, p = .168. 

✓ A MANOVA was conducted on the ranked data using Choi and Marden’s (1997) 
method, implemented in R using the cmanova() function (Wilcox, 2005). There was 
no significant main effect of the type of treatment on outcomes of OCD, H(4) = 9.06, 
p = .060. 

16.9. Following up MANOVA with discriminant analysis 


I mentioned earlier on that a significant MANOVA could be followed up using either univariate 
ANOVA or discriminant analysis (sometimes called discriminant function analysis or 
DFA for short). In the example in this chapter, the univariate ANOVAs were not a useful 
way of looking at what the multivariate tests showed because the relationship between 
dependent variables is obviously having an effect. However, these data were designed especially 
to illustrate how the univariate ANOVAs should be treated cautiously and in real life 
a significant MANOVA is likely to be accompanied by at least one significant ANOVA. 
However, this does not mean that the relationship between dependent variables is not 
important, and it is still vital to investigate the nature of this relationship. Discriminant 
analysis is the best way to achieve this, and I strongly recommend that you follow up a 
MANOVA with both univariate tests and discriminant analysis if you want to fully understand 
your data. 

In discriminant analysis we look to see how we can best separate (or discriminate) a set 
of groups using several predictors (so it is a little like logistic regression but where there 
are several groups rather than two; see footnote 7). In some senses it might seem as though we’re doing 
the reverse of the MANOVA: in MANOVA we predicted a set of outcome measures from 
a grouping variable, whereas in DFA we predict a grouping variable from a set of outcome 
measures. However, the basic underlying principles of these tests are the same: remember 
that when we looked at the theory of MANOVA we saw that it works by identifying linear 
variates that best differentiate the groups, and these ‘linear variates’ are the ‘functions’ in 
discriminant function analysis. 

7 In fact, I could have just as easily described discriminant analysis rather than logistic regression in 
Chapter 8 because they are different ways of achieving the same end result. However, logistic regression 
has far fewer restrictive assumptions and is generally more robust, which is why I have limited 
the coverage of discriminant analysis to this chapter. 

Discriminant analysis is quite straightforward in R: you use the lda() function from the 
MASS package. The basic format of this function is: 

newModel<-lda(Group ~ Predictor(s), data = dataFrame, prior = prior probabilities, 
na.action = "na.omit") 

There are a host of other options that you can use (execute ?lda for more information), 
but within the context of MANOVA this is all we really need. Within the function, Group 
is the name of the variable in your dataframe that contains the groups that you’re trying 
to discriminate, and Predictor(s) is a list of continuous variables from which you are trying 
to make the discrimination. This creates a formula for a linear model (just as we have seen 
many times in this book). So, if you’re using a single predictor your model might be specified 
as Group ~ Predictor, but with two or more predictors you simply add each predictor 
in using the ‘+’ symbol; for example, Group ~ Predictor1 + Predictor2 + Predictor3. In our 
case we want to predict the variable Group from the variables Thoughts and Actions so our 
model is Group ~ Actions + Thoughts. The data option just lets you specify the name of 
your dataframe (in this case ocdData). The na.action option determines how missing data 
are handled. By default the function will simply fail, but you can set the option to na.omit 
(as shown above) to delete missing cases instead (R’s Souls’ Tip 7.1). We can ignore this 
option altogether because we have no missing data. There is also an option to set the prior 
probabilities but when our groups have equal sample sizes we can omit this option (R’s 
Souls’ Tip 16.2). For the current data, we could, therefore, execute: 

ocdDFA<-lda(Group ~ Actions + Thoughts, data = ocdData) 

This creates a model called ocdDFA. To see this model execute the name of the model: 

ocdDFA 



R’s Souls’ Tip 16.2 

Prior probabilities 


To do a DFA, you should set the prior probabilities, that is, the probability of belonging to a particular group. This 
value is simply the ‘chance’ of a case being in a particular group. In our OCD example there are 30 cases. If we 
wanted to know the probability that a case was in the CBT group, then we can look at the number of cases in that 
group. There were 10 cases, and 30 in total, therefore the probability of being in the CBT group is 10/30 = .33. 
In other words, a third of the whole sample was in the CBT group. Therefore, in general, the prior probability is: 


$\text{prior probability of group}_i = \dfrac{n_i}{N}$ 

When group sizes are equal (as they are in our OCD example) then the prior probability is the same for every 
group. If you do not set the prior option, lda() uses the group proportions in the data, which are the same for 
every group when the groups are of equal size, so with equal-sized groups you don’t need to 
worry or think about prior probabilities unless you have a good theoretical reason not to base them on the sample 
size of the group. 

However, when you have unequal group sizes it is a good idea to base the prior probabilities on the sample 
size of the group. We can do this using the prior option of lda(). If you’re basing prior probabilities on sample 
sizes, then you can set this option, in general, as: 

prior = c(n1, n2, n3)/N 

in which the ns refer to group sample sizes, and N is the total sample size. Imagine, in our OCD example, that 
the CBT group contained 20 people, the BT group 18, and the NT group 12. This is 50 cases in total. Therefore, 
we would write: 

prior = c(20, 18, 12)/50 

Note that you must be careful to put the sample sizes in the correct order (i.e., 20 will be assumed to be the 
sample size of level 1 of the grouping variable, and 18 the sample size for level 2 and so on). In the context of the 
lda() function, we would execute: 

ocdDFA<-lda(Group ~ Actions + Thoughts, data = ocdData, prior = c(20, 18, 12)/50) 

You can extend this idea to more than three groups. For example, with five groups with sample sizes of 5, 10, 
5, 20, 10 (and, therefore, a total of 50) the prior option would be: 

prior = c(5, 10, 5, 20, 10)/50 
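
Because the priors must add up to 1, it can be safer to let R compute the total than to type it in. The sketch below uses the five hypothetical group sizes from the example above; the object names n and prior are chosen only for illustration.

```r
# Hypothetical group sizes from the five-group example
n <- c(5, 10, 5, 20, 10)

# Each prior is the group's share of the total sample
prior <- n / sum(n)
prior

# The priors should add up to 1
sum(prior)
```

With real data, the group sizes could be taken from a frequency table of the grouping variable, provided that the order of the sizes matches the order of the levels of that variable.
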

Call:
lda(Group ~ Actions + Thoughts, data = ocdData, na.action = "na.omit")

Prior probabilities of groups:
      CBT        BT        NT
0.3333333 0.3333333 0.3333333

Group means:
    Actions Thoughts
CBT     4.9     13.4
BT      3.7     15.2
NT      5.0     15.0

Coefficients of linear discriminants:
                LD1        LD2
Actions   0.6030047 -0.4249451
Thoughts -0.3352478 -0.3392631

Proportion of trace:
   LD1    LD2
0.8219 0.1781

Output 16.10 


Output 16.10 shows us first the prior probabilities, which are .33 for each group; these 
are the group sample size divided by the total sample size (i.e., 10/30 = .33), and because 
the sample sizes of the three groups are equal, the prior probabilities are the same for each 
group too (R’s Souls’ Tip 16.2). Next we are given the group means, which we have already 
computed before running the main analysis so are not particularly interesting to revisit. 

The main part of the output tells us the coefficients of the linear discriminants, which 
in plain English are the values of b in equation (16.4). You’ll notice that these values correspond 
to the values in the eigenvectors derived in section 16.4.4.1 and used in equation 
(16.5). Given that the variates can be expressed in terms of a linear regression equation (see 
equation (16.4)), the coefficients of the linear discriminants are equivalent to the unstandardized 
betas in regression. Hence, the coefficients tell us the relative contribution of each 
variable to the variates. If we look at variate 1 first, thoughts and behaviours have the 
opposite effect (behaviour has a positive relationship with this variate, whereas thoughts 
have a negative relationship). The first variate, then, could be seen as one that differentiates 
thoughts and behaviours (it affects thoughts and behaviours in the opposite way). Both 
thoughts and behaviours have a strong relationship with the second variate. This tells us 
that this variate represents something that affects thoughts and behaviours in a similar way. 
Remembering that ultimately these variates are used to differentiate groups, we could say 
that the first variate differentiates groups by some factor that affects thoughts and behaviours 
differently, whereas the second variate differentiates groups on some dimension that 
affects thoughts and behaviours in the same way. 

Finally the proportion of trace shows us that the first variate accounts for 82.2% of variance 
compared to the second variate, which accounts for only 17.8%. These proportions 
are the eigenvalues for each variate (i.e., the values of the diagonal elements of the matrix 
$HE^{-1}$) expressed as a proportion. 
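
The proportion of trace can be recovered from the fitted model itself. The sketch below refits the model on the OCD scores (typed in from the listings in this section) and uses the svd element that lda() stores; the squares of these singular values are proportional to the eigenvalues, so dividing each by their sum gives the proportions. The object name eigenShare is hypothetical.

```r
library(MASS)   # provides lda()

# OCD data typed in from the listings in this section (30 cases)
ocdData <- data.frame(
  Group = factor(rep(c("CBT", "BT", "NT"), each = 10),
                 levels = c("CBT", "BT", "NT")),
  Actions = c(5, 5, 4, 4, 5, 3, 7, 6, 6, 4,
              4, 4, 1, 1, 4, 6, 5, 5, 2, 5,
              4, 5, 5, 4, 6, 4, 7, 4, 6, 5),
  Thoughts = c(14, 11, 16, 13, 12, 14, 12, 15, 16, 11,
               14, 15, 13, 14, 15, 19, 13, 18, 14, 17,
               13, 15, 14, 14, 13, 20, 13, 16, 14, 18)
)

ocdDFA <- lda(Group ~ Actions + Thoughts, data = ocdData)

# Squared singular values, each expressed as a share of their total
eigenShare <- ocdDFA$svd^2 / sum(ocdDFA$svd^2)
round(eigenShare, 4)
```
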


$x
          LD1         LD2
1   0.4602010 -0.01736741
2   1.4659443  1.00042182
3  -0.8132992 -0.27094845
4   0.1924441  0.74684078
5   1.1306965  0.66115874
6  -0.7458083  0.83252282
7   2.3367058 -0.18873149
8   0.7279579 -0.78157561
9   0.3927101 -1.12083868
10  0.8629396  1.42536694
11 -0.1428037  0.40757770
12 -0.4780514  0.06831463
13 -1.6165699  2.02167613
14 -1.9518176  1.68241305
15 -0.4780514  0.06831463
16 -0.6130332 -2.13862792
17  0.7954487  0.32189566
18 -0.8807901 -1.37441972
19 -1.3488130  1.25746794
20 -0.5455423 -1.03515665
21  0.1924441  0.74684078
22  0.1249532 -0.35663049
23  0.4602010 -0.01736741
24 -0.1428037  0.40757770
25  1.3984534 -0.10304945
26 -2.1542903 -1.62800076
27  2.0014580 -0.52799457
28 -0.8132992 -0.27094845
29  1.0632056 -0.44231253
30 -0.8807901 -1.37441972

Output 16.11 



It is sometimes useful to look at the discriminant scores. These are the scores for each 
person, on each variate, obtained from equation (16.5). These scores can be useful because 
the variates that the analysis identifies may represent underlying social or psychological 
constructs. If these constructs are identifiable, then it is useful for interpretation to know 
what a participant scores on each dimension. To obtain these scores execute: 

predict(ocdDFA) 

The resulting Output 16.11 shows each participant’s score on the first (LD1) and second 
(LD2) variate. 
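
The scores in Output 16.11 can be reproduced by hand from the coefficients in Output 16.10. The sketch below does this for the first four participants; it assumes that each score is centred on the grand mean before being weighted by the coefficients (with equal group sizes, the grand mean is the average of the three group means). The object names are hypothetical.

```r
# Coefficients of linear discriminants, copied from Output 16.10
coefs <- matrix(c( 0.6030047, -0.4249451,
                  -0.3352478, -0.3392631),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("Actions", "Thoughts"), c("LD1", "LD2")))

# Group means from Output 16.10 (columns are Actions and Thoughts)
groupMeans <- rbind(CBT = c(4.9, 13.4), BT = c(3.7, 15.2), NT = c(5.0, 15.0))
grandMean <- colMeans(groupMeans)

# Actions and Thoughts scores of the first four participants (all CBT)
scores <- rbind(c(5, 14), c(5, 11), c(4, 16), c(4, 13))

# Centre each score on the grand mean, then weight by the coefficients
centred <- sweep(scores, 2, grandMean)
variateScores <- centred %*% coefs
round(variateScores, 7)
```
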

Perhaps more useful than the scores themselves is a plot of the scores broken down by 
group membership. This can be obtained by using the plot() function on our model: 

plot(ocdDFA) 

By executing this command you will produce the plot at the top of Figure 16.8. This graph 
plots the variate scores for each person, grouped according to the experimental condition 
to which that person belonged. To interpret this plot, I have broken it down to look at the 
first variate (bottom left of Figure 16.8) separate from the second variate (bottom right of 
Figure 16.8). To discover which groups variate 1 discriminates we need to try to ignore 
variate 2 (that’s why I have blanked out the axis for LD2) and look at how the groups 
change as we move along variate 1. In other words, we ignore the vertical position of each 
point on the plot, and look at how the groups are distributed along the horizontal axis 
(LD1). A crude, but simple, way to do this is to split the horizontal axis down the middle (as I 
have done with a vertical blue line at 0 on the scale) and ask yourself which groups cluster 
on either side of the line. I have circled the BT cases in light blue and the CBT cases in 
black. Hopefully, this will make clear that to the left of the blue vertical line there are a lot 
of cases from the BT group, but to the right of the blue vertical line there are lots of cases 
from the CBT group. In other words, the blue vertical line seems to separate the BT group 
from the CBT group. This tells us that variate 1 discriminates the BT group from the CBT. 

Moving on to the second variate (bottom right of Figure 16.8), we now need to focus on 
LD2 and try to ignore LD1 (I have blanked out the axis to help). In other words, we ignore 
the horizontal position of each point on the plot, and look at how the groups are distributed 
along the vertical axis (LD2). Again, you can get a rough idea of what’s happening by splitting 
the vertical axis down the middle (as I have done with a horizontal blue line at 0, the 
midpoint of the scale) and asking yourself which groups cluster on either side of the line. The 
picture is not as clear as for variate 1, but it seems to me that there are a lot of cases from 
the NT group below the line, but hardly any above. I have highlighted these cases with dark 
blue circles to help you to see. Looking at the BT and CBT cases they tend to fall above the 
line (although not always). This pattern suggests that the second variate differentiates the 
no-treatment group (cases are below the blue horizontal line) from the two interventions 
(the cases are typically above the blue horizontal line), but this difference is not as dramatic 
as for the first variate. Remember that the variates significantly discriminate the groups in 
combination (i.e., when both are considered). 
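
The same splitting-at-zero idea can be expressed as a simple count. The sketch below types in the discriminant scores from Output 16.11 (rounded to three decimal places, which does not change their signs) and tabulates, for each group, how many cases fall on each side of zero on each variate. The object names side1 and side2 are hypothetical.

```r
Group <- factor(rep(c("CBT", "BT", "NT"), each = 10),
                levels = c("CBT", "BT", "NT"))

# Discriminant scores copied from Output 16.11, rounded to 3 decimal places
LD1 <- c( 0.460,  1.466, -0.813,  0.192,  1.131, -0.746,  2.337,  0.728,  0.393,  0.863,
         -0.143, -0.478, -1.617, -1.952, -0.478, -0.613,  0.795, -0.881, -1.349, -0.546,
          0.192,  0.125,  0.460, -0.143,  1.398, -2.154,  2.001, -0.813,  1.063, -0.881)
LD2 <- c(-0.017,  1.000, -0.271,  0.747,  0.661,  0.833, -0.189, -0.782, -1.121,  1.425,
          0.408,  0.068,  2.022,  1.682,  0.068, -2.139,  0.322, -1.374,  1.257, -1.035,
          0.747, -0.357, -0.017,  0.408, -0.103, -1.628, -0.528, -0.271, -0.442, -1.374)

# Count the cases in each group on either side of zero on each variate
side1 <- table(Group, LD1 = ifelse(LD1 > 0, "right of 0", "left of 0"))
side2 <- table(Group, LD2 = ifelse(LD2 > 0, "above 0", "below 0"))
side1
side2
```
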


FIGURE 16.8 Combined-groups plot 

CRAMMING SAM’S TIPS 


Discriminant function analysis 


• Discriminant function analysis can be used after MANOVA to see how the dependent variables discriminate the 
groups. 

• DFA identifies variates (combinations of the dependent variables) that discriminate groups of cases. 

• Look at the Coefficients of linear discriminants to find out how the dependent variables contribute to the variates. High 
scores indicate that a dependent variable is important for a variate, and variables with positive and negative coefficients are 
contributing to the variate in opposite ways. 

• Finally, to find out which groups are discriminated by a variate, look at the plot of discriminant scores. Split the vertical and 
horizontal axes at the midpoint and look at which groups tend to fall on either side of the line. The variate plotted on a given 
axis is discriminating between groups that fall on different sides of the line (i.e., the midpoint). 


16.10. Reporting results from discriminant analysis 


The guiding principle in presenting data is to give the readers enough information to be 
able to judge for themselves what your data mean. Personally, I would suggest reporting 
the percentage of variance explained (which gives the reader the same information as the 
eigenvalue but in a more palatable form) and the coefficients of linear discriminants. All of 
these values can be found in Output 16.10. Finally, although I won’t reproduce it below, 
you could consider including a copy of the discriminant scores plot (Figure 16.8), which 
will help readers to determine how the variates contribute to distinguishing your groups. 
We could, therefore, write something like this: 

✓ The MANOVA was followed up with discriminant analysis, which revealed two discriminant 
functions. The first explained 82.2% of the variance, whereas the second 
explained only 17.8%. The coefficients of the discriminant functions revealed that 
function 1 differentiated obsessive behaviours (b = 0.603) and thoughts (b = -0.335). 
The second variate produced similar coefficients for actions (-0.425) and thoughts 
(-0.339). The discriminant function plot showed that the first function discriminated 
the BT group from the CBT group, and the second function differentiated the no-treatment 
group from the two interventions. 


16.11. Some final remarks 


16.11.1. The final interpretation 


So far we have gathered an awful lot of information about our data, but how can we bring 
all of it together to answer our research question: can therapy improve OCD and, if so, 
which therapy is best? Well, the MANOVA tells us that therapy can have a significant 
effect on OCD symptoms, but the non-significant univariate ANOVAs suggested that this 
improvement is not simply in terms of either thoughts or behaviours. The discriminant 
analysis suggests that the group separation can be best explained in terms of one underlying 
dimension. In this context the dimension is likely to be OCD itself (which we can realistically 
presume is made up of both thoughts and behaviours). So, therapy doesn’t necessarily 
change behaviours or thoughts per se, but it does influence the underlying dimension of 
OCD. So, the answer to the first question seems to be: yes, therapy can influence OCD, but 
the nature of this influence is unclear. 


Labcoat Leni’s Real Research 16.1 

A lot of hot air! 

Marzillier, S. L., & Davey, G. C. L. (2005). Cognition and Emotion, 19, 729-750. 

Have you ever wondered what researchers do in their spare time? Well, some of them spend it tracking down the 
sounds of people burping and farting! It has long been established that anxiety and disgust are linked. Anxious 
people are, typically, easily disgusted. Throughout this book I have talked about how you cannot infer causality 
from relationships between variables. This has been a bit of a conundrum for anxiety researchers: does anxiety 
cause feelings of disgust or does a low threshold for being disgusted cause anxiety? Two colleagues of mine 
at Sussex addressed this in an unusual study in which they induced feelings of anxiety, feelings of disgust, or 
a neutral mood, and they looked at the effect that these induced moods had on feelings of anxiety, sadness, 
happiness, anger, disgust and contempt. To induce these moods, they used three different types of manipulation: 
vignettes (e.g., ‘You’re swimming in a dark lake and something brushes your leg’ for anxiety, and ‘You go 
into a public toilet and find it has not been flushed. The bowl of the toilet is full of diarrhoea’ for disgust), music 
(e.g., some scary music for anxiety, and a tape of burps, farts and vomiting for disgust), videos (e.g., a clip from 
Silence of the Lambs for anxiety and a scene from Pink Flamingos in which Divine eats dog faeces for disgust) 
and memory (remembering events from the past that had made the person anxious, disgusted or neutral). 

Different people underwent anxious, disgust and neutral mood inductions. Within these groups, the induction 
was done using either vignettes and music, videos, or memory recall and music for different people. The outcome 
variables were the change (from before to after the induction) in six moods: anxiety, sadness, happiness, anger, 
disgust and contempt. 

The data are in the file Marzillier and Davey (2005).dat. Draw an error bar graph of the changes in moods 
in the different conditions, then conduct a 3 (Mood: anxiety, disgust, neutral) × 3 (Induction: vignettes + music, 
videos, memory recall + music) MANOVA on these data. Whatever you do, don’t imagine what their fart 
tape sounded like while you do the analysis! 

Answers are in the additional material on the companion website (or look at page 738 of the original 
article). 

The next question is more complex: which therapy is best? Figures 16.3 and 16.4 show 
the relationships between the dependent variables and the group means of the original 
data. The graph of the means (Figure 16.4) shows that for actions, BT reduces the number 
of obsessive behaviours, whereas CBT and NT do not. For thoughts, CBT reduces the 
number of obsessive thoughts, whereas BT and NT do not (check the pattern of the bars). 
Looking now at the relationships between thoughts and actions (Figure 16.3), in the BT 
group there is a positive relationship between thoughts and actions, so the more obsessive 
thoughts a person has, the more obsessive behaviours they carry out. In the CBT group 
there is no relationship at all (thoughts and actions vary quite independently). In the no-treatment 
group there is a negative (and, incidentally, non-significant) relationship between 
thoughts and actions. 

What we have discovered from the discriminant analysis is that BT and CBT can be differentiated 
from the control group based on variate 2, a variate that has a similar effect on 
both thoughts and behaviours. We could say then that BT and CBT are both better than 
a no-treatment group at changing obsessive thoughts and behaviours. We also discovered 
that BT and CBT could be distinguished by variate 1, a variate that had the opposite effects 
on thoughts and behaviours. 

We could conclude that BT is better at changing behaviours and CBT is better at changing 
thoughts. So, the NT group can be distinguished from the CBT and BT groups using 
a variate that affects both thoughts and behaviours. Also, the CBT and BT groups can be 
distinguished by a variate that has opposite effects on thoughts and behaviours. So, some 
therapy is better than none, but the choice of CBT or BT depends on whether you think 
it’s more important to target thoughts (CBT) or behaviours (BT). 


16.11.2. Univariate ANOVA or discriminant analysis? 


This example should have made clear that univariate ANOVA and discriminant analysis are 
ways of answering different questions arising from a significant MANOVA. If univariate 
ANOVAs are chosen, Bonferroni corrections should be applied to the level at which you 
accept significance. The truth is that you should run both analyses to get a full picture of 
what is happening in your data. The advantage of discriminant analysis is that it tells you 
something about the underlying dimensions within your data (which is especially useful 
if you have employed several dependent measures in an attempt to capture some social 
or psychological construct). Even if univariate ANOVAs are significant, the discriminant 
analysis provides useful insight into your data and should be used. I hope that this chapter 
will convince you of this recommendation. 
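
As an illustration of the Bonferroni correction mentioned above, the sketch below takes the two F-ratios reported for the follow-up ANOVAs in section 16.8, converts them to p-values, and applies the correction in two equivalent ways. The object names are hypothetical, and the p-values are computed here from the rounded F-ratios rather than taken from any output in this chapter.

```r
# F-ratios of the two follow-up ANOVAs, each on (2, 27) degrees of freedom
Fratio <- c(Thoughts = 2.15, Actions = 2.77)
pValue <- pf(Fratio, df1 = 2, df2 = 27, lower.tail = FALSE)

# Bonferroni: divide the significance level by the number of tests
alpha <- 0.05 / length(pValue)
pValue < alpha

# Equivalently, inflate the p-values and compare them with .05
p.adjust(pValue, method = "bonferroni") < 0.05
```
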



What have I discovered about statistics? 


In this chapter we’ve cackled maniacally in the ear of MANOVA, force-fed discriminant 
function analysis cod-liver oil, and discovered to our horror that Roy has a large root. 
There are sometimes situations in which several outcomes have been measured in different
groups, and we discovered that in these situations the ANOVA technique can be
extended and is called MANOVA (multivariate analysis of variance). The reasons for 
using this technique rather than running lots of ANOVAs are that we retain control over 
the Type I error rate, and we can incorporate the relationships between outcome variables
into the analysis. Some of you will have then discovered that MANOVA works in
very similar ways to ANOVA, but just with matrices rather than single values. Others 
will have discovered that it’s best to ignore the theory sections of this book. We had a 
look at an example of MANOVA and discovered that, just to make life as confusing 
as possible, you get four test statistics relating to the same effect! Of these, I tried to 
convince you that Pillai’s trace was the safest option. Finally, we had a look at the two 
options for following up MANOVA: running lots of ANOVAs, or doing a discriminant 
function analysis. Of these, discriminant function analysis gives us the most information, 
but can be a bit of a nightmare to interpret. 

We also discovered that pets can be therapeutic. I left the whereabouts of Fuzzy a mystery.
Now admit it, how many of you thought he was dead? He’s not: he is lying next to
me as I type this sentence. After frantically searching the house I went back to the room 
that he had vanished from to check again whether there was a hole that he could have 
wriggled through. As I scuttled around on my hands and knees tapping the walls, a little
ginger (and sooty) face popped out from the fireplace with a look as if to say ‘have you 
lost something?’ (see Figure 16.9). Yep, freaked out by the whole moving experience, he 
had done the only sensible thing and hidden up the chimney! Cats, you gotta love ’em. 


FIGURE 16.9: Fuzzy hiding up a fireplace



R packages used in this chapter 


car
ggplot2
mvnormtest
mvoutlier
pastecs
reshape
WRS


R functions used in this chapter 


aov()
aq.plot()
by()
c()
cast()
cbind()
cmanova()
contrasts()
cov()
factor()
ggplot()
gl()
lda()
lm()
manova()
melt()
mshapiro.test()
mulrank()
names()
plot()
predict()
stat.desc()
summary()
summary.aov()
summary.lm()
t()


Key terms that I’ve discovered


Box’s test
Discriminant function analysis (DFA)
Discriminant function variates
Discriminant scores
Error SSCP (E)
HE⁻¹
Homogeneity of covariance matrices
Hotelling-Lawley trace (T²)
Hypothesis SSCP (H)
Identity matrix
Independence
Multivariate
Multivariate analysis of variance (MANOVA)
Multivariate normality
Pillai-Bartlett trace (V)
Random sampling
Roy’s largest root
Square matrix
Sum of squares and cross-products matrix (SSCP)
Total SSCP (T)
Univariate
Variance-covariance matrix
Wilks’s lambda (Λ)


Smart Alex’s tasks 


• Task 1: A clinical psychologist noticed that several of his manic psychotic patients 
did chicken impersonations in public. He wondered whether this behaviour could 
be used to diagnose this disorder and so compared 10 of his patients against 10 of 
the most normal people he could find: naturally he chose to observe lecturers at the 
University of Sussex. He measured how many chicken impersonations they did over 
the course of a day, and how good their impersonations were (as scored out of 10 
by an independent farmyard noise expert). The data are in the file chicken.dat. Use 
MANOVA and DFA to find out whether these variables could be used to distinguish 
manic psychotic patients from those without the disorder. ® 



• Task 2: I was intrigued by a news story claiming that children who lie would become 
successful citizens (http://bit.ly/ammQNT). I was particularly intrigued because 
although the article cited a lot of well-conducted work by Dr Khang Lee that shows 
that children lie, I couldn’t find anything at all in that well-conducted work that 
supported the journalist’s claim that children who lie become successful citizens. 
However, let’s imagine a Huxleyesque parallel universe in which the government is 
stupid enough to believe the contents of this newspaper story and decides to implement
a systematic programme of infant conditioning. Some infants were trained not
to lie, others were brought up as normal, and a final group was trained in the art of
lying. Thirty years later, they collected data on how successful these children were as 
adults. They measured their salary, and two indices of how successful they were in 
their family and work life, on a 0-10 scale (10 = as successful as could possibly be, 
0 = better luck in your next life). The data are in lying.dat. Use MANOVA and DFA 
to find out whether, in this completely fabricated parallel universe, lying really does 
make you a better citizen. ® 


• Task 3: I was interested in whether students’ knowledge of different aspects of psychology
improved throughout their degree. I took a sample of first years, second
years and third years and gave them five tests (scored out of 15) representing different
aspects of psychology: exper (experimental psychology, such as cognitive and
neuropsychology); stats (statistics); social (social psychology); develop (developmental
psychology); person (personality). Your task is to: (1) carry out an appropriate
general analysis to determine whether there are overall group differences along these
five measures; (2) look at the scale-by-scale analyses of group differences produced 
in the output and interpret the results accordingly; (3) select contrasts that test the 
hypothesis that second and third years will score higher than first years on all scales; 
(4) select tests that compare all groups to each other and briefly compare these results 
with the contrasts; and (5) carry out a separate analysis in which you test whether a 
combination of the measures can successfully discriminate the groups (comment only 
briefly on this analysis). Include only those scales that revealed group differences 
for the contrasts. How do the results help you to explain the findings of your initial 
analysis? The data are in the file psychology.dat. © 

Answers can be found on the companion website. 


Further reading 


Bray, J. H., & Maxwell, S. E. (1985). Multivariate analysis of variance. Sage University Paper Series
on Quantitative Applications in the Social Sciences, 07-054. Newbury Park, CA: Sage. (This
monograph on MANOVA is superb: I cannot recommend anything better.)

Huberty, C. J., & Morris, J. D. (1989). Multivariate analysis versus multiple univariate analysis. 
Psychological Bulletin, 105(2), 302-308. 


Interesting real research 


Marzillier, S. L., & Davey, G. C. L. (2005). Anxiety and disgust: Evidence for a unidirectional
relationship. Cognition and Emotion, 19(5), 729-750.





Exploratory factor analysis 





FIGURE 17.1: Me at Niagara Falls in 1998. I was in the middle of writing the first edition
of the SPSS version of this book at the time. Note how fresh-faced I look


17.1. What will this chapter tell me? © 


I was a year or so into my Ph.D., and, thanks to my initial terrible teaching experiences, I 
had developed a bit of an obsession with over-preparing for classes. I wrote detailed handouts
and started using funny examples. Through my girlfriend at the time I met Dan Wright
(a psychologist, who was in my department but sadly moved to Florida). He had published
a statistics book of his own and was helping his publishers to sign up new authors. On the
basis that my handouts were quirky and that I was too young to realize that writing a textbook
at the age of 23 was academic suicide (really, textbooks take a long time to write and
they are not at all valued compared to research articles), I was duly signed up. The commissioning
editor was a man constantly on the verge of spontaneously combusting with
intellectual energy. He can start a philosophical debate about literally anything: should he
ever be trapped in an elevator he will be compelled to attempt to penetrate the occupants’
minds with probing arguments that the elevator doesn’t exist, that they don’t exist, and 
that their entrapment is an illusory construct generated by their erroneous beliefs in the 
physical world. Ultimately, though, he’d still be a man trapped in an elevator (with several 
exhausted corpses). A combination of his unfaltering self-confidence, my fear of social 
interactions with people I don’t know, and my utter bemusement that anyone would want 
me to write a book made me incapable of saying anything sensible to him. Ever. He must 
have thought that he had signed up an imbecile. He was probably right. (I find him less 
intimidating since thinking up the elevator scenario.) The trouble with agreeing to write 
books is that you then have to write them. For the next two years or so I found myself trying
to juggle my research, a lectureship at the University of London, and writing a book.
Had I been writing a book on heavy metal it would have been fine because all of the information
was moshing away in my memory waiting to stage-dive out. Sadly, however, I had
agreed to write a book on something that I knew nothing about: statistics. I soon discovered
that writing the book was like doing a factor analysis: in factor analysis we take a lot of 
information (variables) and the R program effortlessly reduces this mass of confusion into 
a simple message (fewer variables) that is easier to digest. The program does this (sort of) 
by filtering out the bits of the information overload that we don’t need to know about. It 
takes a few seconds. Similarly, my younger self took a mass of information about statistics 
that I didn’t understand and filtered it down into a simple message that I could understand: 
I became a living, breathing factor analysis ... except that, unlike R, it took me two years 
and some considerable effort. 


17.2. When to use factor analysis © 


In the social sciences we are often trying to measure things that cannot directly be measured
(so-called latent variables). For example, management researchers (or psychologists
even) might be interested in measuring ‘burnout’, which is when someone who has been
working very hard on a project (a book, for example) for a prolonged period of time suddenly
finds themselves devoid of motivation and inspiration, and wants to repeatedly headbutt
their computer screaming ‘please Mike, unlock the door, let me out of the basement, I
need to feel the soft warmth of sunlight on my skin!’. You can’t measure burnout directly: 
it has many facets. However, you can measure different aspects of burnout: you could 
get some idea of motivation, stress levels, whether the person has any new ideas and so 
on. Having done this, it would be helpful to know whether these differences really do 
reflect a single variable. Put another way, are these different variables driven by the same 
underlying variable? This chapter will look at factor analysis (and principal components 
analysis) - a technique for identifying groups or clusters of variables. This technique has 
three main uses: (1) to understand the structure of a set of variables (e.g., pioneers of 
intelligence such as Spearman and Thurstone used factor analysis to try to understand the 
structure of the latent variable ‘intelligence’); (2) to construct a questionnaire to measure 
an underlying variable (e.g., you might design a questionnaire to measure burnout); and 
(3) to reduce a data set to a more manageable size while retaining as much of the original 
information as possible (e.g., we saw in Chapter 7 that multicollinearity can be a problem 
in multiple regression, and factor analysis can be used to solve this problem by combining 
variables that are collinear). Through this chapter we’ll discover what factors are, how we 
find them, and what they tell us (if anything) about the relationship between the variables 
we’ve measured. 


17.3. Factors ©


If we measure several variables, or ask someone several questions about 
themselves, the correlation between each pair of variables (or questions) can 
be arranged in what’s known as an R-matrix. An R-matrix is just a correlation
matrix: a table of correlation coefficients between variables (in fact, we
saw small versions of these matrices in Chapter 6). The diagonal elements of
an R-matrix are all ones because each variable will correlate perfectly with
itself. The off-diagonal elements are the correlation coefficients between
pairs of variables, or questions.¹ The existence of clusters of large correlation
coefficients between subsets of variables suggests that those variables could
be measuring aspects of the same underlying dimension. These underlying
dimensions are known as factors (or latent variables). By reducing a data set from a group
of interrelated variables into a smaller set of factors, factor analysis achieves parsimony by 
explaining the maximum amount of common variance in a correlation matrix using the 
smallest number of explanatory constructs. 
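
A minimal sketch of what an R-matrix looks like in practice, using cor() on simulated data. The four questions (q1 to q4) and the two hidden dimensions that generate them are assumptions made purely for illustration; they are not data from this book.

```r
set.seed(42)
n <- 200

# Two unobserved dimensions (hypothetical).
f1 <- rnorm(n)
f2 <- rnorm(n)

# Four observed questions: q1 and q2 are driven by f1, q3 and q4 by f2.
dat <- data.frame(
  q1 = f1 + rnorm(n, sd = 0.5),
  q2 = f1 + rnorm(n, sd = 0.5),
  q3 = f2 + rnorm(n, sd = 0.5),
  q4 = f2 + rnorm(n, sd = 0.5)
)

# The R-matrix: ones on the diagonal, pairwise correlations elsewhere.
r_matrix <- cor(dat)
print(round(r_matrix, 2))
```

The large correlations should appear within the pairs (q1, q2) and (q3, q4), with values near zero between the pairs, which is the kind of clustering that hints at two underlying factors.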

There are numerous examples of the use of factor analysis in the social sciences. The 
trait theorists in psychology used factor analysis endlessly to assess personality traits. Most 
readers will be familiar with the extroversion-introversion and neuroticism traits measured
by Eysenck (1953). Most other personality questionnaires are based on factor analysis
- notably Cattell’s (1966a) 16 personality factors questionnaire - and these inventories are
frequently used for recruiting purposes in industry (and even by some religious groups).
However, although factor analysis is probably most famous for being adopted by psychologists,
its use is by no means restricted to measuring dimensions of personality. Economists,
for example, might use factor analysis to see whether productivity, profits and workforce 
can be reduced down to an underlying dimension of company growth. 

Let’s put some of these ideas into practice by imagining that we wanted to measure different
aspects of what might make a person popular. We could administer several measures
that we believe tap different aspects of popularity. So, we might measure a person’s social
skills (Social Skills), their selfishness (Selfish), how interesting others find them (Interest),
the proportion of time they spend talking about the other person during a conversation
(Talk1), the proportion of time they spend talking about themselves (Talk2), and their
propensity to lie to people (the Liar scale). We can then calculate the correlation coefficients
for each pair of variables and create an R-matrix. Figure 17.2 shows this matrix.
Any significant correlation coefficients are shown in bold type. It is clear that there are
two clusters of interrelating variables. Therefore, these variables might be measuring some
common underlying dimension. The amount that someone talks about the other person
during a conversation seems to correlate highly with both the level of social skills and how
interesting the other finds that person. Also, social skills correlate well with how interesting
others perceive a person to be. These relationships indicate that the better your social
skills, the more interesting and talkative you are likely to be. However, there is a second 
cluster of variables. The amount that people talk about themselves within a conversation 
correlates with how selfish they are and how much they lie. Being selfish also correlates 
with the degree to which a person tells lies. In short, selfish people are likely to lie and talk 
about themselves. 

In factor analysis we strive to reduce this R-matrix down into its underlying dimensions 
by looking at which variables seem to cluster together in a meaningful way. This data 



1 This matrix is called an R-matrix, or just R , because it contains correlation coefficients and r usually denotes 
Pearson’s correlation (see Chapter 6) - the r turns into a capital letter when it denotes a matrix. Given that this 
book is about some software called R, this is slightly confusing, so be careful - it should be obvious when we are 
talking about the program, and when I’m talking about the correlation matrix, and when it’s not, I’ll tell you. 

FIGURE 17.2: An R-matrix

                 Talk 1   Social Skills   Interest   Talk 2   Selfish    Liar
Talk 1            1.000
Social Skills       …         1.000
Interest           .646        .879        1.000
Talk 2             .074       -.120         .054      1.000
Selfish           -.131        .031        -.101       .441     1.000
Liar               .068        .012         .110       .361      .277    1.000

Factor 1: the cluster of correlations among Talk 1, Social Skills and Interest.
Factor 2: the cluster of correlations among Talk 2, Selfish and Liar.



reduction is achieved by looking for variables that correlate highly with a group of other 
variables, but do not correlate with variables outside of that group. In this example, there 
appear to be two clusters that fit the bill. The first factor seems to relate to general sociability,
whereas the second factor seems to relate to the way in which a person treats others
socially (we might call it ‘consideration’). It might, therefore, be assumed that popularity 
depends not only on your ability to socialize, but also on whether you are genuine towards 
others. 


17.3.1. Graphical representation of factors


Factors (not to be confused with independent variables in factorial ANOVA) are statistical 
entities that can be visualized as classification axes along which measurement variables can 
be plotted. In plain English, this statement means that if you imagine factors as being the
axes of a graph, then we can plot variables along these axes. The coordinates of variables
along each axis represent the strength of relationship between that variable and each factor.
Figure 17.3 shows such a plot for the popularity data (in which there were only two factors).
The first thing to notice is that for both factors, the axis line ranges from -1 to 1, which
are the outer limits of a correlation coefficient. Therefore, the position of a given variable 
depends on its correlation with the two factors. The circles represent the three variables 
that correlate highly with factor 1 (Sociability: horizontal axis) but have a low correlation 
with factor 2 (Consideration: vertical axis). Conversely, the triangles represent variables that 
correlate highly with consideration to others but have a low correlation with sociability. 
From this plot, we can tell that selfishness, the amount a person talks about themselves and 
their propensity to lie all contribute to a factor that could be called consideration of others. 
Conversely, how much a person takes an interest in other people, how interesting they are 
and their level of social skills contribute to a second factor, sociability. This diagram therefore
supports the structure that was apparent in the R-matrix. Of course, if a third factor
existed within these data it could be represented by a third axis (creating a 3-D graph). It 
should also be apparent that if more than three factors exist in a data set, then a 2-D drawing 
cannot represent them all. 

FIGURE 17.3: Example of a factor plot

If each axis on the graph represents a factor, then the variables that go to make up a factor
can be plotted according to the extent to which they relate to a given factor. The coordinates
of a variable, therefore, represent its relationship to the factors. In an ideal world
a variable should have a large coordinate for one of the axes, and low coordinates for any
other factors. This scenario would indicate that this particular variable related to only one
factor. Variables that have large coordinates on the same axis are assumed to measure different
aspects of some common underlying dimension. The coordinate of a variable along
a classification axis is known as a factor loading. The factor loading can be thought of as the
Pearson correlation between a factor and a variable (see Jane Superbrain Box 17.1). From
what we know about interpreting correlation coefficients (see section 6.5.4.3) it should be
clear that if we square the factor loading we obtain a measure of the substantive importance
of a particular variable to a factor.
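
To see loadings emerge from data, here is a hedged sketch using factanal() from R's built-in stats package (maximum-likelihood factor analysis, varimax rotation by default). The data are simulated: the variable names echo the popularity example, but the generating weights and sample size are assumptions chosen for illustration, not the data behind Figure 17.3.

```r
set.seed(1)
n <- 500

# Two hypothetical latent dimensions.
sociability   <- rnorm(n)
consideration <- rnorm(n)

# Six simulated measures: three driven by each latent dimension.
dat <- data.frame(
  Talk1        = 0.9 * sociability   + rnorm(n, sd = 0.45),
  SocialSkills = 0.9 * sociability   + rnorm(n, sd = 0.45),
  Interest     = 0.9 * sociability   + rnorm(n, sd = 0.45),
  Talk2        = 0.8 * consideration + rnorm(n, sd = 0.60),
  Selfish      = 0.8 * consideration + rnorm(n, sd = 0.60),
  Liar         = 0.8 * consideration + rnorm(n, sd = 0.60)
)

fit <- factanal(dat, factors = 2)      # two-factor model
load_mat <- unclass(loadings(fit))     # plain 6 x 2 matrix of loadings

print(round(load_mat, 2))              # coordinates of each variable on each axis
print(round(load_mat^2, 2))            # squared loadings: importance to each factor

# Which factor does each variable load on most strongly?
dominant <- apply(abs(load_mat), 1, which.max)
print(dominant)
```

The order and sign of the extracted factors are arbitrary, so it is the grouping of variables that matters, not which column is labelled first.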


17.3.2. Mathematical representation of factors ©


The axes drawn in Figure 17.3 are straight lines and so can be described mathematically 
by the equation of a straight line. Therefore, factors can also be described in terms of this 
equation. 




SELF-TEST

What is the equation of a straight line?


The following equation reminds us of the equation describing a linear model and then 
applies this to the scenario of describing a factor: 

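\[
\begin{aligned}
Y_i &= b_1 X_{1i} + b_2 X_{2i} + \dots + b_n X_{ni} + \varepsilon_i \\
\text{Factor}_i &= b_1 \text{Variable}_{1i} + b_2 \text{Variable}_{2i} + \dots + b_n \text{Variable}_{ni} + \varepsilon_i
\end{aligned}
\tag{17.1}
\]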

You’ll notice that there is no intercept in the equation, the reason being that the lines intersect
at zero (hence the intercept is also zero). The bs in the equation represent the factor loadings.

Sticking with our example of popularity, we found that there were two factors underlying
this construct: general sociability and consideration. We can, therefore, construct an
equation that describes each factor in terms of the variables that have been measured. The 
equations are as follows: 

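\[
\begin{aligned}
Y_i &= b_1 X_{1i} + b_2 X_{2i} + \dots + b_n X_{ni} + \varepsilon_i \\
\text{Sociability}_i &= b_1 \text{Talk1}_i + b_2 \text{SocialSkills}_i + b_3 \text{Interest}_i \\
&\quad + b_4 \text{Talk2}_i + b_5 \text{Selfish}_i + b_6 \text{Liar}_i + \varepsilon_i \\
\text{Consideration}_i &= b_1 \text{Talk1}_i + b_2 \text{SocialSkills}_i + b_3 \text{Interest}_i \\
&\quad + b_4 \text{Talk2}_i + b_5 \text{Selfish}_i + b_6 \text{Liar}_i + \varepsilon_i
\end{aligned}
\tag{17.2}
\]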

Notice that the equations are identical in form: they both include all of the variables that 
were measured. However, the values of b in the two equations will be different (depending
on the relative importance of each variable to the particular factor). In fact, we can
replace each value of b with the coordinate of that variable on the graph in Figure 17.3 
(i.e., replace the values of b with the factor loading). The resulting equations are as follows: 


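\[
\begin{aligned}
Y_i &= b_1 X_{1i} + b_2 X_{2i} + \dots + b_n X_{ni} + \varepsilon_i \\
\text{Sociability}_i &= 0.87\,\text{Talk1}_i + 0.96\,\text{SocialSkills}_i + 0.92\,\text{Interest}_i \\
&\quad + 0.00\,\text{Talk2}_i - 0.10\,\text{Selfish}_i + 0.09\,\text{Liar}_i + \varepsilon_i \\
\text{Consideration}_i &= 0.01\,\text{Talk1}_i - 0.03\,\text{SocialSkills}_i + 0.04\,\text{Interest}_i \\
&\quad + 0.82\,\text{Talk2}_i + 0.75\,\text{Selfish}_i + 0.70\,\text{Liar}_i + \varepsilon_i
\end{aligned}
\tag{17.3}
\]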


Observe that, for the sociability factor, the values of b are high for Talk1, Social Skills and
Interest. For the remaining variables (Talk2, Selfish and Liar) the values of b are very low
(close to 0). This tells us that three of the variables are very important for that factor (the
ones with high values of b) and three are very unimportant (the ones with low values of b).
We saw that this point is true because of the way that three variables clustered highly on the
factor plot. The point to take on board here is that the factor plot and these equations represent
the same thing: the factor loadings in the plot are simply the b-values in these equations
(but see Jane Superbrain Box 17.1). For the second factor, inconsideration to others,
the opposite pattern can be seen in that Talk2, Selfish and Liar all have high values of b
whereas the remaining three variables have b-values close to 0. In an ideal world, variables
would have very high b-values for one factor and very low b-values for all other factors.

These factor loadings can be placed in a matrix in which the columns represent each factor
and the rows represent the loadings of each variable on each factor. For the popularity
data this matrix would have two columns (one for each factor) and six rows (one for each
variable). This matrix, usually denoted A, is given by:

\[
A = \begin{pmatrix}
0.87 & 0.01 \\
0.96 & -0.03 \\
0.92 & 0.04 \\
0.00 & 0.82 \\
-0.10 & 0.75 \\
0.09 & 0.70
\end{pmatrix}
\]

To understand what the matrix means, try relating the elements to the loadings in equation
(17.3). For example, the top row represents the first variable, Talk1, which had a
loading of .87 for the first factor (Sociability) and a loading of .01 for the second factor
(Consideration). This matrix is called the factor matrix or component matrix (if doing principal
components analysis) - see Jane Superbrain Box 17.1 to find out about the different
forms of this matrix.
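
As a small illustration, the loading matrix A can be typed into R as an ordinary matrix and read row by row. The values are the loadings quoted in the text; the object name and the row and column labels are choices made for this sketch.

```r
# Loadings from the popularity example: rows are variables, columns are factors.
A <- matrix(
  c( 0.87,  0.01,
     0.96, -0.03,
     0.92,  0.04,
     0.00,  0.82,
    -0.10,  0.75,
     0.09,  0.70),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("Talk1", "SocialSkills", "Interest", "Talk2", "Selfish", "Liar"),
    c("Sociability", "Consideration")
  )
)

print(A["Talk1", ])        # top row: .87 on Sociability, .01 on Consideration
print(round(A^2, 2))       # squared loadings

# The factor on which each variable has its largest absolute loading.
strongest <- colnames(A)[apply(abs(A), 1, which.max)]
print(setNames(strongest, rownames(A)))
```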

The major assumption in factor analysis is that these algebraic factors represent real-world
dimensions, the nature of which must be guessed at by inspecting which variables
have high loadings on the same factor. So, psychologists might believe that factors represent
dimensions of the psyche, education researchers might believe they represent abilities, and 
sociologists might believe they represent races or social classes. However, it is an extremely 
contentious point whether this assumption is tenable, and some believe that the dimensions 
derived from factor analysis are real only in the statistical sense - and are real-world fictions. 




JANE SUPERBRAIN 17.1

What’s the difference between a pattern matrix and a structure matrix? ®

Throughout my discussion of factor loadings I’ve been quite vague. Sometimes I’ve said
that these loadings can be thought of as the correlation between a variable and a given
factor, then at other times I’ve described these loadings in terms of regression coefficients
(b). Now, it should be obvious from what we discovered in Chapters 6 and 7 that correlation
coefficients and regression coefficients are quite different things, so what the hell am I
going on about: shouldn’t I make up my mind what the factor loadings actually are?

Well, in vague terms (the best terms for my brain) both correlation coefficients and
regression coefficients represent the relationship between a variable and linear model
in a broad sense, so the key take-home message is that factor loadings tell us about the
relative contribution that a variable makes to a factor. As long as you understand
that much, you have no problems.

However, the factor loadings in a given analysis can be both correlation coefficients and
regression coefficients. Soon we’ll discover that the interpretation of factor analysis is
helped greatly by a technique known as rotation. Without going into details, there are two
types: orthogonal and oblique rotation (see section 17.3.9). When orthogonal rotation is
used, any underlying factors are assumed to be independent, and the factor loading
is the correlation between the factor and the variable, but is also the regression coefficient.
Put another way, the values of the correlation coefficients are the same as the
values of the regression coefficients. However, there are situations in which the underlying
factors are assumed to be related or correlated to each other. In these situations,
oblique rotation is used and the resulting correlations between variables and factors will
differ from the corresponding regression coefficients. In this case, there are, in effect,
two different sets of factor loadings: the correlation coefficients between each variable
and factor (which are put in the factor structure matrix) and the regression
coefficients for each variable on each factor (which are put in the factor pattern matrix).
These coefficients can have quite different interpretations (see Graham, Guthrie,
& Thompson, 2003).
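
The link between the two matrices can be sketched numerically. The box does not give a formula, so this relies on a standard identity from factor analysis: the structure matrix equals the pattern matrix multiplied by the matrix of correlations between factors. All numbers below (the pattern coefficients and the factor correlation of .4) are hypothetical.

```r
# Hypothetical pattern matrix: regression coefficients of 4 variables on 2 factors.
P <- matrix(c(0.8, 0.0,
              0.7, 0.1,
              0.1, 0.7,
              0.0, 0.8),
            ncol = 2, byrow = TRUE,
            dimnames = list(paste0("v", 1:4), c("F1", "F2")))

# Orthogonal case: factors are uncorrelated, so their correlation matrix is the identity.
Phi_orth <- diag(2)
dimnames(Phi_orth) <- list(colnames(P), colnames(P))

# Oblique case: assume the two factors correlate at .4.
Phi_obl <- matrix(c(1.0, 0.4,
                    0.4, 1.0),
                  ncol = 2, dimnames = list(colnames(P), colnames(P)))

# Structure matrix (variable-factor correlations) = pattern matrix %*% factor correlations.
S_orth <- P %*% Phi_orth
S_obl  <- P %*% Phi_obl

print(all.equal(S_orth, P))   # TRUE: with orthogonal factors the two matrices coincide
print(round(S_obl, 2))        # with oblique factors they differ
```

In the oblique case v1 has a pattern coefficient of 0 on F2 but still correlates .32 with it, purely because F2 is correlated with F1.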


17.3.3. Factor scores ©


A factor can be described in terms of the variables measured and the relative importance
of them for that factor (represented by the value of b). Therefore, having discovered which
factors exist, and estimated the equation that describes them, it should be possible to also
estimate a person’s score on a factor, based on their scores for the constituent variables.
These scores are known as factor scores. As such, if we wanted to derive a score of sociability
for a particular person, we could place their scores on the various measures into
equation (17.3). This method is known as a weighted average. In fact, this method is overly
simplistic and rarely used, but it is probably the easiest way to explain the principle. For
example, imagine the six scales all range from 1 to 10 and that someone scored the following:
Talk1 (4), Social Skills (9), Interest (8), Talk2 (6), Selfish (8), and Liar (6). We could
put these values into equation (17.3) to get a score for this person’s sociability and their
consideration to others:


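\[
\begin{aligned}
\text{Sociability} &= 0.87\,\text{Talk1} + 0.96\,\text{SocialSkills} + 0.92\,\text{Interest} \\
&\quad + 0.00\,\text{Talk2} - 0.10\,\text{Selfish} + 0.09\,\text{Liar} \\
&= (0.87 \times 4) + (0.96 \times 9) + (0.92 \times 8) + (0.00 \times 6) \\
&\quad - (0.10 \times 8) + (0.09 \times 6) \\
&= 19.22 \\
\text{Consideration} &= 0.01\,\text{Talk1} - 0.03\,\text{SocialSkills} + 0.04\,\text{Interest} \\
&\quad + 0.82\,\text{Talk2} + 0.75\,\text{Selfish} + 0.70\,\text{Liar} \\
&= (0.01 \times 4) - (0.03 \times 9) + (0.04 \times 8) + (0.82 \times 6) \\
&\quad + (0.75 \times 8) + (0.70 \times 6) \\
&= 15.21
\end{aligned}
\tag{17.4}
\]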


The resulting scores of 19.22 and 15.21 reflect the degree to which this person is sociable 
and their inconsideration to others, respectively. This person scores higher on sociability 
than inconsideration. However, the scales of measurement used will influence the resulting 
scores, and if different variables use different measurement scales, then factor scores for 
different factors cannot be compared. As such, this method of calculating factor scores is 
poor and more sophisticated methods are usually used. 
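
The weighted-average arithmetic can be checked in R by treating the person's six scores as a row vector and multiplying it by the loading matrix. The numbers are those quoted above; the object names are choices made for this sketch.

```r
# Factor loadings from the popularity example (rows: variables, columns: factors).
A <- matrix(
  c( 0.87,  0.01,
     0.96, -0.03,
     0.92,  0.04,
     0.00,  0.82,
    -0.10,  0.75,
     0.09,  0.70),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("Talk1", "SocialSkills", "Interest", "Talk2", "Selfish", "Liar"),
    c("Sociability", "Consideration")
  )
)

# One person's scores on the six scales, in the same order as the rows of A.
scores <- c(Talk1 = 4, SocialSkills = 9, Interest = 8, Talk2 = 6, Selfish = 8, Liar = 6)

# Weighted average: each score multiplied by its loading, summed within each factor.
weighted <- drop(scores %*% A)
print(weighted)   # Sociability 19.22, Consideration 15.21
```

Multiplying any one scale by a constant (say, recording Liar out of 100 instead of 10) would change these totals, which is the scale-dependence problem described above.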


17.3.3.1. The regression method © 

There are several sophisticated techniques for calculating factor scores that use factor score 
coefficients as weights in equation (17.1) rather than using the factor loadings. The form 
of the equation remains the same, but the bs in the equation are replaced with these factor 
score coefficients. Factor score coefficients can be calculated in several ways. The simplest 
way is the regression method. In this method the factor loadings are adjusted to take 
account of the initial correlations between variables; in doing so, differences in units of 
measurement and variable variances are stabilized. 

To obtain the matrix of factor score coefficients (B) we multiply the matrix of factor loadings
by the inverse (R⁻¹) of the original correlation or R-matrix. You might remember from
the previous chapter that matrices cannot be divided (see section 16.4.4.1). Therefore,
if we want to divide by a matrix it cannot be done directly and instead we multiply by
its inverse. Therefore, by multiplying the matrix of factor loadings by the inverse of the
correlation matrix we are, conceptually speaking, dividing the factor loadings by the correlation
coefficients. The resulting factor score matrix, therefore, represents the relationship
between each variable and each factor, taking into account the original relationships
between pairs of variables. As such, this matrix represents a purer measure of the unique 
relationship between variables and factors. 
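
A toy sketch of the matrix step in R, where solve() does the job of 'dividing' by a matrix. The 3 x 3 correlation matrix and the single column of loadings are made up for illustration and have nothing to do with the popularity data.

```r
# Hypothetical correlation matrix for three variables.
R_mat <- matrix(c(1.0, 0.6, 0.3,
                  0.6, 1.0, 0.2,
                  0.3, 0.2, 1.0),
                nrow = 3, byrow = TRUE)

# Hypothetical loadings of the three variables on one factor.
A <- matrix(c(0.8, 0.7, 0.4), ncol = 1)

# Factor score coefficients: the inverse of the R-matrix times the loadings.
B_inverse <- solve(R_mat) %*% A

# Same result without forming the inverse explicitly (solves R_mat %*% B = A).
B_direct <- solve(R_mat, A)

print(all.equal(B_inverse, B_direct))   # TRUE
print(R_mat %*% B_direct)               # multiplying back by R_mat recovers A
```

solve(R_mat, A) is usually the more numerically stable of the two forms, although for a small, well-behaved correlation matrix either will do.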

The matrices for the popularity data are shown below. The resulting matrix of factor score
coefficients, B, comes from the R (the program) output. The matrices R⁻¹ and A can be multiplied
by hand to get the matrix B, and those familiar with matrix algebra - or who have consulted
Namboodiri (1984) or Stevens (2002) - might like to verify the result (see Oliver Twisted). To
get the same degree of accuracy as R you should work to at least five decimal places:


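\[
B = R^{-1}A
\]

\[
\begin{pmatrix}
0.343 & 0.006 \\
0.376 & -0.020 \\
0.362 & 0.020 \\
0.000 & 0.473 \\
-0.037 & 0.437 \\
0.039 & 0.405
\end{pmatrix}
=
\begin{pmatrix}
4.76 & -7.46 & 3.91 & -2.35 & 2.42 & -0.49 \\
-7.46 & 18.49 & -12.42 & 5.45 & -5.54 & 1.22 \\
3.91 & -12.42 & 10.07 & -3.65 & 3.79 & -0.96 \\
-2.35 & 5.45 & -3.65 & 2.97 & -2.16 & 0.02 \\
2.42 & -5.54 & 3.79 & -2.16 & 2.98 & -0.56 \\
-0.49 & 1.22 & -0.96 & 0.02 & -0.56 & 1.27
\end{pmatrix}
\begin{pmatrix}
0.87 & 0.01 \\
0.96 & -0.03 \\
0.92 & 0.04 \\
0.00 & 0.82 \\
-0.10 & 0.75 \\
0.09 & 0.70
\end{pmatrix}
\]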
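If you would like to see the multiplication B = R⁻¹A carried out, here is a minimal Python sketch. It assumes the two-decimal values of R⁻¹ and A printed above, and the names `R_inv`, `A`, `matmul` and `B_approx` are mine, purely for illustration. Because the inputs are rounded so heavily, the product only approximates the published coefficients (which is exactly why the text recommends working to at least five decimal places), but the pattern of large and small weights comes through clearly.

```python
# Rounded (two-decimal) inverse correlation matrix and loadings from the text.
R_inv = [
    [ 4.76,  -7.46,   3.91, -2.35,  2.42, -0.49],
    [-7.46,  18.49, -12.42,  5.45, -5.54,  1.22],
    [ 3.91, -12.42,  10.07, -3.65,  3.79, -0.96],
    [-2.35,   5.45,  -3.65,  2.97, -2.16,  0.02],
    [ 2.42,  -5.54,   3.79, -2.16,  2.98, -0.56],
    [-0.49,   1.22,  -0.96,  0.02, -0.56,  1.27],
]
A = [
    [ 0.87,  0.01],
    [ 0.96, -0.03],
    [ 0.92,  0.04],
    [ 0.00,  0.82],
    [-0.10,  0.75],
    [ 0.09,  0.70],
]


def matmul(left, right):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(l * r for l, r in zip(row, col)) for col in zip(*right)]
            for row in left]


# Approximate factor score coefficients: one row per variable, one column per factor.
B_approx = matmul(R_inv, A)
for row in B_approx:
    print([round(value, 3) for value in row])
```

The first three rows come out with the larger weight in the first column and the last three with the larger weight in the second, which mirrors the published matrix B even though the individual values differ in the second decimal place.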


The pattern of the loadings is the same for the factor score coefficients: that is, the first 
three variables have high loadings for the first factor and low loadings for the second, 
whereas the pattern is reversed for the last three variables. The difference is only in the 
actual value of the weightings, which are smaller because the correlations between variables
are now accounted for. These factor score coefficients can be used to replace the
b-values in equation (17.2):

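\[
\begin{aligned}
\text{Sociability} &= 0.343\,\text{Talk1} + 0.376\,\text{SocialSkills} + 0.362\,\text{Interest} \\
&\quad + 0.000\,\text{Talk2} - 0.037\,\text{Selfish} + 0.039\,\text{Liar} \\
&= (0.343 \times 4) + (0.376 \times 9) + (0.362 \times 8) + (0.000 \times 6) \\
&\quad - (0.037 \times 8) + (0.039 \times 6) \\
&= 7.59 \\
\text{Consideration} &= 0.006\,\text{Talk1} - 0.020\,\text{SocialSkills} + 0.020\,\text{Interest} \\
&\quad + 0.473\,\text{Talk2} + 0.437\,\text{Selfish} + 0.405\,\text{Liar} \\
&= (0.006 \times 4) - (0.020 \times 9) + (0.020 \times 8) + (0.473 \times 6) \\
&\quad + (0.437 \times 8) + (0.405 \times 6) \\
&= 8.768
\end{aligned}
\tag{17.5}
\]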
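The arithmetic in equation (17.5) is just a weighted sum, which the following Python sketch reproduces. The coefficients and the participant's six scores are taken from the equation; the dictionary layout and the function name `factor_score` are illustrative choices of mine rather than anything R produces.

```python
# Factor score coefficients (columns of B) and one participant's raw scores,
# in the order Talk1, SocialSkills, Interest, Talk2, Selfish, Liar.
coefficients = {
    "Sociability":   [0.343,  0.376, 0.362, 0.000, -0.037, 0.039],
    "Consideration": [0.006, -0.020, 0.020, 0.473,  0.437, 0.405],
}
scores = [4, 9, 8, 6, 8, 6]


def factor_score(weights, values):
    """Weighted sum of a person's variable scores for one factor."""
    return sum(w * v for w, v in zip(weights, values))


sociability = factor_score(coefficients["Sociability"], scores)
consideration = factor_score(coefficients["Consideration"], scores)
print(round(sociability, 3), round(consideration, 3))
```

Run as written, the two values agree with the hand calculation in equation (17.5): 7.59 and 8.768.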


Equation (17.5) shows how these coefficient scores are used to produce two factor scores 
for each person. In this case, the participant had the same scores on each variable as were used 
in equation (17.4). The resulting scores are much more similar than when the factor loadings 
were used as weights because the different variances among the six variables have now been 
controlled for. The fact that the values are very similar reflects the fact that this person not 
only scores highly on variables relating to sociability, but is also inconsiderate (i.e., they score 
equally highly on both factors). This technique for producing factor scores ensures that the 
resulting scores have a mean of 0 and a variance equal to the squared multiple correlation 
between the estimated factor scores and the true factor values. However, the downside of the 
regression method is that the scores can correlate not only with factors other than the one 
on which they are based, but also with other factor scores from a different orthogonal factor. 

OLIVER TWISTED

Please Sir, can I have some more ... matrix algebra?

‘The Matrix’, enthuses Oliver, ‘that was a good film. I want to
dress in black and glide through the air as though time has stood
still. Maybe the matrix of factor scores is as cool as the film.’ I think
you might be disappointed, Oliver, but we’ll give it a shot. The matrix
calculations of factor scores are detailed in the additional material for
this chapter on the companion website. Be afraid, be very afraid ...


17.3.3.2. Uses of factor scores


There are several uses of factor scores. First, if the purpose of the factor analysis is to 
reduce a large set of data into a smaller subset of measurement variables, then the factor 
scores tell us an individual’s score on this subset of measures. Therefore, any further analysis
can be carried out on the factor scores rather than the original data. For example, we
could carry out a t-test to see whether females are significantly more sociable than males 
using the factor scores for sociability. A second use is in overcoming collinearity problems 
in regression. If, following a multiple regression analysis, we have identified sources of 
multicollinearity then the interpretation of the analysis is questioned (see section 7.7.2.3). 
In this situation, we can carry out a principal components analysis on the predictor variables
to reduce them down to a subset of uncorrelated factors. The variables causing the
multicollinearity will combine to form a factor. If we then rerun the regression but using 
the factor scores as predictor variables then the problem of multicollinearity should vanish 
(because the variables are now combined into a single factor). 

By now, you should have some grasp of the concept of what a factor is, how it is represented
graphically, how it is represented algebraically, and how we can calculate composite
scores representing an individual’s ‘performance’ on a single factor. I have deliberately 
restricted the discussion to a conceptual level, without delving into how we actually find 
these mythical beasts known as factors. This section will look at how we find factors. 
Specifically, we will examine different types of method, look at the maths behind one 
method (principal components), investigate the criteria for determining whether factors 
are important, and discover how to improve the interpretation of a given solution. 


17.3.4. Choosing a method


The first thing you need to know is that there are several methods for unearthing factors
in your data. The method you choose will depend on what you hope to do with the
analysis. Tinsley and Tinsley (1987) give an excellent account of the different methods
available. There are two things to consider: whether you want to generalize the findings
from your sample to a population and whether you are exploring your data or testing
a specific hypothesis. This chapter describes techniques for exploring data using factor
analysis. Testing hypotheses about the structures of latent variables and their relationships
to each other requires considerable complexity and can be done with packages such as sem
or lavaan in R.² Those interested in hypothesis testing techniques (known as confirmatory
factor analysis) are advised to read Pedhazur and Schmelkin (1991, Chapter 23) for an
introduction.

2 The sem package is the more straightforward, but is slightly less capable of handling unusual situations than
lavaan (sem was written by John Fox, who also wrote R Commander).

Assuming we want to explore our data, we then need to consider whether we want to 
apply our findings to the sample collected (descriptive method) or to generalize our findings
to a population (inferential methods). When factor analysis was originally developed it
was assumed that it would be used to explore data to generate future hypotheses. As such, 
it was assumed that the technique would be applied to the entire population of interest. 
Therefore, certain techniques assume that the sample used is the population, and so results 
cannot be extrapolated beyond that particular sample. Principal components analysis is an 
example of one of these techniques, as is principal factors analysis (principal axis factoring). 
Principal components analysis and principal factors analysis are the preferred methods 
and usually result in similar solutions (see section 17.3.6). When these methods are used, 
conclusions are restricted to the sample collected and generalization of the results can be 
achieved only if analysis using different samples reveals the same factor structure. 

Another approach has been to assume that participants are randomly selected and that 
the variables measured constitute the population of variables in which we’re interested. By 
assuming this, it is possible to develop techniques from which the results can be generalized
from the sample participants to a larger population. However, a constraint is that any
findings hold true only for the set of variables measured (because we’ve assumed this set
constitutes the entire population of variables). Techniques in this category include the
maximum-likelihood method (see Harman, 1976) and Kaiser’s alpha factoring. The choice of
method depends largely on what generalizations, if any, you want to make from your data.³


17.3.5. Communality


Before continuing, it is important that you understand some basic things about the variance
within an R-matrix. It is possible to calculate the variability in scores (the variance) for any
given measure (or variable). You should be familiar with the idea of variance by now and
comfortable with how it can be calculated (if not, see Chapter 2). The total variance for
a particular variable will have two components: some of it will be shared with other variables
or measures (common variance) and some of it will be specific to that measure (unique
variance). We tend to use the term unique variance to refer to variance that can be reliably
attributed to only one measure. However, there is also variance that is specific to one measure
but not reliably so; this variance is called error or random variance. The proportion of
common variance present in a variable is known as the communality. As such, a variable that
has no specific variance (or random variance) would have a communality of 1; a variable
that shares none of its variance with any other variable would have a communality of 0.

In factor analysis we are interested in finding common underlying dimensions within 
the data and so we are primarily interested only in the common variance. Therefore, when 
we run a factor analysis it is fundamental that we know how much of the variance present 
in our data is common variance. This presents us with a logical impasse: to do the factor 
analysis we need to know the proportion of common variance present in the data, yet the 
only way to find out the extent of the common variance is by carrying out a factor analysis. 
There are two ways to approach this problem. The first is to assume that all of the variance 
is common variance. As such, we assume that the communality of every variable is 1. By 
making this assumption we merely transpose our original data into constituent linear components
(known as principal components analysis). The second approach is to estimate the
amount of common variance by estimating communality values for each variable. There
are various methods of estimating communalities, but the most widely used (including
alpha factoring) is to use the squared multiple correlation (SMC) of each variable with all
others. So, for the popularity data, imagine you ran a multiple regression using one measure
(Selfish) as the outcome and the other five measures as predictors: the resulting multiple
R² (see section 7.6.2) would be used as an estimate of the communality for the variable
Selfish. This second approach is used in factor analysis. These estimates allow the factor
analysis to be done. Once the underlying factors have been extracted, new communalities
can be calculated that represent the multiple correlation between each variable and the
factors extracted. Therefore, the communality is a measure of the proportion of variance
explained by the extracted factors.

3 It’s worth noting at this point that principal component analysis is not in fact the same as factor analysis. This
doesn’t stop idiots like me from discussing them as though they are, but more on that later.
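To make the two kinds of communality concrete, here is a hedged Python sketch using the popularity data. It rests on two assumptions that are mine rather than the text's: first, the standard identity that the squared multiple correlation of a variable with all the others equals 1 − 1/(the corresponding diagonal element of R⁻¹); second, that the two extracted factors are uncorrelated, so that a variable's communality after extraction is the sum of its squared loadings. The diagonal values and loadings are the rounded figures shown earlier, and I have assumed the rows are in the order Talk1, SocialSkills, Interest, Talk2, Selfish, Liar.

```python
# Diagonal of the (rounded) inverse correlation matrix, one entry per variable.
r_inv_diagonal = {
    "Talk1": 4.76, "SocialSkills": 18.49, "Interest": 10.07,
    "Talk2": 2.97, "Selfish": 2.98, "Liar": 1.27,
}
# Rounded loadings of each variable on the two extracted factors.
loadings = {
    "Talk1": (0.87, 0.01), "SocialSkills": (0.96, -0.03), "Interest": (0.92, 0.04),
    "Talk2": (0.00, 0.82), "Selfish": (-0.10, 0.75), "Liar": (0.09, 0.70),
}

# Initial estimate: squared multiple correlation with the other five variables.
smc = {name: 1 - 1 / d for name, d in r_inv_diagonal.items()}

# After extraction: variance in each variable explained by the retained factors
# (sum of squared loadings, valid when the factors are uncorrelated).
communality = {name: sum(l * l for l in row) for name, row in loadings.items()}

for name in loadings:
    print(f"{name:13s} SMC = {smc[name]:.3f}  communality = {communality[name]:.3f}")
```

For Selfish, for example, the initial estimate is 1 − 1/2.98, or about .66, which is what the multiple regression described above would give as its R² (up to rounding of the printed matrix).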


17.3.6. Factor analysis vs. principal components analysis


I have just explained that there are two approaches to locating underlying dimensions 
of a data set: factor analysis and principal components analysis. These techniques differ 
in the communality estimates that are used. Simplistically, though, factor analysis derives 
a mathematical model from which factors are estimated, whereas principal components 
analysis merely decomposes the original data into a set of linear variates - see Dunteman
(1989) and Widaman (2007) for more detail on the differences between the procedures.
As such, only factor analysis can estimate the underlying factors, and it relies on various 
assumptions for these estimates to be accurate. Principal components analysis is concerned 
only with establishing which linear components exist within the data and how a particular 
variable might contribute to that component. In terms of theory, this chapter is dedicated 
to principal components analysis rather than factor analysis. The reasons are that principal 
components analysis is a psychometrically sound procedure, is conceptually less complex 
than factor analysis, and bears numerous similarities to discriminant analysis (described in 
the previous chapter). 

However, we should consider whether the techniques provide different solutions to the 
same problem. Based on an extensive literature review, Guadagnoli and Velicer (1988) 
concluded that the solutions generated from principal components analysis differ little 
from those derived from factor analysis techniques. In reality, there are some circumstances 
for which this statement is untrue. Stevens (2002) summarizes the evidence and concludes 
that, with 30 or more variables and communalities greater than .7 for all variables, different
solutions are unlikely; however, with fewer than 20 variables and any low communalities
(< .4), differences can occur.

The flip-side of this argument is eloquently described by Cliff (1987) who observed that 
proponents of factor analysis ‘insist that components analysis is at best a common factor 
analysis with some error added and at worst an unrecognizable hodgepodge of things from 
which nothing can be determined’ (p. 349). Indeed, feeling is strong on this issue, with 
some arguing that when principal components analysis is used it should not be described 
as a factor analysis and that you should not impute substantive meaning to the resulting 
components. However, to non-statisticians the difference between a principal component 
and a factor may be difficult to conceptualize (they are both linear models), and the differences
arise largely from the calculation.⁴


4 For this reason I have used the terms components and factors interchangeably throughout this chapter. Although 
this use of terms will reduce some statisticians (and psychologists) to tears, I’m banking on these people not 
needing to read this book. I acknowledge the methodological differences, but I think it’s easier for students if I 
dwell on the similarities between the techniques and not the differences. 

17.3.7. Theory behind principal components analysis


Principal components analysis works in a very similar way to MANOVA and discriminant
function analysis (see previous chapter). Although it isn’t necessary to understand the
mathematical principles in any detail, readers of the previous chapter may benefit from 
some comparisons between the two techniques. For those who haven’t read that chapter, I 
suggest you flick through it before moving ahead! 

In MANOVA, various sum of squares and cross-product matrices were calculated that 
contained information about the relationships between dependent variables. I mentioned 
before that these SSCP matrices could be easily converted to variance-covariance matrices, 
which represent the same information but in averaged form (i.e., taking account of the 
number of observations). I also said that by dividing each element by the relevant standard 
deviation the variance-covariance matrices become standardized. The result is a correlation
matrix. In principal components analysis we usually deal with correlation matrices
(although it is possible to analyse a variance-covariance matrix too), and the point to 
note is that this matrix pretty much represents the same information as an SSCP matrix in 
MANOVA. The difference is just that the correlation matrix is an averaged version of the 
SSCP that has been standardized. 

In MANOVA, we used several SSCP matrices that represented different components 
of experimental variation (the model variation and the residual variation). In principal 
components analysis the covariance (or correlation) matrix cannot be broken down in this 
way (because all data come from the same group of participants). In MANOVA, we ended 
up looking at the variates or components of the SSCP matrix that represented the ratio 
of the model variance to the error variance. These variates were linear dimensions that 
separated the groups tested, and we saw that the dependent variables mapped onto these 
underlying components. In short, we looked at whether the groups could be separated by 
some linear combination of the dependent variables. These variates were found by calculating
the eigenvectors of the SSCP. The number of variates obtained was the smaller of
p (the number of dependent variables) and k − 1 (where k is the number of groups). In
component analysis we do something similar (I’m simplifying things a little, but it will give 
you the basic idea). That is, we take a correlation matrix and calculate the variates. There 
are no groups of observations, and so the number of variates calculated will always equal 
the number of variables measured ( p ). The variates are described, as for MANOVA, by the 
eigenvectors associated with the correlation matrix. The elements of the eigenvectors are 
the weights of each variable on the variate (see equation (16.5)). These values are the factor 
loadings described earlier. The eigenvalue associated with each eigenvector
provides a single indicator of the substantive importance of each variate (or component).
The basic idea is that we retain factors with relatively large eigenvalues and ignore those 
with relatively small eigenvalues. 

The eigenvalue for a factor can also be calculated by summing the squares of the loadings
for that factor. This isn’t much use if you’re carrying out a factor analysis, because you need
to calculate the eigenvalues to calculate the loadings. But it can be a useful way to help
understand the eigenvalues - the higher the loadings on a factor, the more of the variance
in the variables that the factor explains.
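A tiny worked example may help here. The Python sketch below uses a hypothetical correlation matrix for just two variables that correlate at r = .6 (the value is invented for illustration). For a 2 × 2 correlation matrix the eigenvalues are 1 + r and 1 − r, with eigenvectors (1, 1)/√2 and (1, −1)/√2. I have assumed the usual convention that a loading is the eigenvector element multiplied by the square root of the eigenvalue; it is this scaling that makes the squared loadings for a factor add up to its eigenvalue.

```python
import math

r = 0.6  # hypothetical correlation between two variables
R = [[1.0, r], [r, 1.0]]

# Closed-form eigen decomposition of a 2 x 2 correlation matrix.
eigenvalues = [1 + r, 1 - r]
s = 1 / math.sqrt(2)
eigenvectors = [[s, s], [s, -s]]  # one unit-length eigenvector per factor

# Check that each pair really satisfies R v = lambda v.
residuals = []
for lam, v in zip(eigenvalues, eigenvectors):
    Rv = [sum(R[i][j] * v[j] for j in range(2)) for i in range(2)]
    residuals.append(max(abs(Rv[i] - lam * v[i]) for i in range(2)))

# Loadings: eigenvector elements rescaled by the square root of the eigenvalue.
loadings = [[math.sqrt(lam) * e for e in v] for lam, v in zip(eigenvalues, eigenvectors)]

# Summing the squared loadings for a factor recovers that factor's eigenvalue.
sum_squared = [sum(l * l for l in factor) for factor in loadings]
print(eigenvalues, [round(x, 6) for x in sum_squared])
```

Notice also that the two eigenvalues add up to 2, the number of variables, which is why an eigenvalue can be read as 'how many variables' worth of variance' a factor explains.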

In summary, component analysis works in a similar way to MANOVA. We begin with 
a matrix representing the relationships between variables. The linear components (also 
called variates, or factors) of that matrix are then calculated by determining the eigenvalues 
of the matrix. These eigenvalues are used to calculate eigenvectors, the elements of which 
provide the loading of a particular variable on a particular factor (i.e., they are the b-values
in equation (17.1)). The eigenvalue is also a measure of the substantive importance of the
eigenvector with which it is associated. 

17.3.8. Factor extraction: eigenvalues and the scree plot


Not all factors are retained in an analysis, and there is debate over the 
criterion used to decide whether a factor is statistically important. I 
mentioned above that the eigenvalues associated with a variate indicate 
the substantive importance of that factor. Therefore, it seems logical 
that we should retain only factors with large eigenvalues. Retaining factors
is known as factor extraction. How do we decide whether or not an
eigenvalue is large enough to represent a meaningful factor? Well, one 
technique advocated by Cattell (1966b) is to plot a graph of each eigenvalue
(Y-axis) against the factor with which it is associated (X-axis).
This graph is known as a scree plot (because it looks like a rock face 
with a pile of debris, or scree, at the bottom). I mentioned earlier that 
it is possible to obtain as many factors as there are variables and that each has an 
associated eigenvalue. By graphing the eigenvalues, the relative importance of each 
factor becomes apparent. Typically there will be a few factors with quite high eigenvalues,
and many factors with relatively low eigenvalues, and so this graph has a very
characteristic shape: there is a sharp descent in the curve followed by a tailing off 
(see Figure 17.4). Cattell (1966b) argued that the cut-off point for selecting factors 
should be at the point of inflexion of this curve. The point of inflexion is where the 
slope of the line changes dramatically: so, in Figure 17.4, imagine drawing a straight 
line that summarizes the vertical part of the plot and another that summarizes the 
horizontal part (the blue dashed lines); then the point of inflexion is the data point 
at which these two lines meet. In both examples in Figure 17.4 the point of inflexion
occurs at the third data point (factor); therefore, we would extract two factors.
Thus, you retain (or extract) only factors to the left of the point of inflexion (and do
not include the factor at the point of inflexion itself).⁵ With a sample of more than
200 participants, the scree plot provides a fairly reliable criterion for factor selection 
(Stevens, 2002). 

Although scree plots are very useful, factor selection should not be based on this criterion
alone. Kaiser (1960) recommended retaining all factors with eigenvalues greater
than 1. This criterion is based on the idea that the eigenvalues represent the amount of 
variation explained by a factor and that an eigenvalue of 1 represents a substantial amount 
of variation. Jolliffe (1972, 1986) reports that Kaiser’s criterion is too strict and suggests 
the third option of retaining all factors with eigenvalues greater than .7. The difference 
between how many factors are retained using Kaiser’s methods compared to Jolliffe’s can 
be dramatic. 
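The two cut-offs are easy to compare in code. The following Python sketch applies Kaiser's criterion (eigenvalues greater than 1) and Jolliffe's (greater than .7) to a set of eigenvalues. The six eigenvalues are hypothetical numbers I made up so that they sum to 6, as the eigenvalues of a correlation matrix for six variables must; the function name `count_retained` is likewise just for illustration.

```python
def count_retained(eigenvalues, cutoff):
    """Number of factors whose eigenvalue exceeds the cut-off."""
    return sum(1 for value in eigenvalues if value > cutoff)


# Hypothetical eigenvalues from an analysis of six variables.
eigenvalues = [3.2, 1.4, 0.8, 0.3, 0.2, 0.1]

kaiser = count_retained(eigenvalues, 1.0)    # Kaiser (1960)
jolliffe = count_retained(eigenvalues, 0.7)  # Jolliffe (1972, 1986)

# Each eigenvalue divided by the number of variables is the proportion of
# the total variance that the factor explains.
proportions = [value / len(eigenvalues) for value in eigenvalues]

print(kaiser, jolliffe)
print([round(p, 3) for p in proportions])
```

With these made-up values Kaiser's criterion retains two factors and Jolliffe's retains three, so even in a small example the two rules can disagree.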

You might well wonder how the methods compare. Generally speaking, Kaiser’s criterion
overestimates the number of factors to retain (see Jane Superbrain Box 17.2)
but there is some evidence that it is accurate when the number of variables is less than 
30 and the resulting communalities (after extraction) are all greater than .7. Kaiser’s 
criterion can also be accurate when the sample size exceeds 250 and the average communality
is greater than or equal to .6. In any other circumstances you are best advised
to use a scree plot provided the sample size is greater than 200 (see Stevens, 2002, for 
more detail). 



5 Actually, in his original paper, Cattell advised including the factor at the point of inflexion as well because it is 
‘desirable to include at least one common error factor as a “garbage can”’. The idea is that the point of inflexion 
represents an error factor. However, in practice this garbage can factor is rarely retained; also Thurstone argued 
that it is better to retain too few than too many factors, so most people do not retain the factor at the point of 
inflexion. 

FIGURE 17.4

Examples of scree plots for data that probably have two underlying factors 


However, as is often the case in statistics, the three criteria often provide different 
solutions. In these situations the communalities of the factors need to be considered. In 
principal components analysis we begin with communalities of 1 with all factors retained 
(because we assume that all variance is common variance). At this stage all we have done 
is to find the linear variates that exist in the data - so we have just transformed the data 
without discarding any information. However, to discover what common variance really 
exists between variables we must decide which factors are meaningful and discard any that 
are too trivial to consider. Therefore, we discard some information. The factors we retain 
will not explain all of the variance in the data (because we have discarded some information)
and so the communalities after extraction will always be less than 1. The factors
retained do not map perfectly onto the original variables - they merely reflect the common 
variance present in the data. If the communalities represent a loss of information then they 
are important statistics. The closer the communalities are to 1, the better our factors are 
at explaining the original data. It is logical that the greater the number of factors retained, 
the greater the communalities will be (because less information is discarded); therefore, the 
communalities are good indices of whether too few factors have been retained. In fact, with 
generalized least-squares factor analysis and maximum-likelihood factor analysis you can
get a statistical measure of the goodness of fit of the factor solution (see the next chapter 
for more on goodness-of-fit tests). This basically measures the proportion of variance that 
the factor solution explains (so can be thought of as comparing communalities before and 
after extraction). 

As a final word of advice, your decision on how many factors to extract will depend 
also on why you’re doing the analysis; for example, if you’re trying to overcome multicollinearity
problems in regression, then it might be better to extract too many factors than
too few. 



JANE SUPERBRAIN 17.2

How many factors do I retain?


The discussion of factor extraction in the text is somewhat simplified. In fact, there are
fundamental problems with Kaiser’s criterion (Nunnally & Bernstein, 1994; Preacher &
MacCallum, 2003). For one thing an eigenvalue of 1 means different things in different
analyses: with 100 variables it means that a factor explains 1% of the variance, but with
10 variables it means that a factor explains 10% of the variance. Clearly, these two situations
are very different and a single rule that covers both is inappropriate. An eigenvalue of 1
also means only that the factor explains as much variance as a variable, which rather
defeats the original intention of the analysis to reduce variables down to ‘more substantive’
underlying factors (Nunnally & Bernstein, 1994). Consequently, Kaiser’s criterion often
overestimates the number of factors. On this basis Jolliffe’s criterion is even worse (a factor
explains less variance than a variable!).

There are other ways to determine how many factors to retain, but they are more complex
(which is why I’m discussing them outside of the main text). The best is probably parallel
analysis (Horn, 1965). Essentially each eigenvalue (which represents the size of the factor)
is compared against an eigenvalue for the corresponding factor in many randomly generated
data sets that have the same characteristics as the data being analysed. In doing so, each
eigenvalue is being compared to an eigenvalue from a data set that has no underlying factors.
This is a bit like asking whether our observed factor is bigger than a non-existing factor.
Factors that are bigger than their ‘random’ counterparts are retained. Of parallel analysis,
the scree plot and Kaiser’s criterion, Kaiser’s criterion is, in general, worst and parallel
analysis best (Zwick & Velicer, 1986).
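For the curious, the logic of parallel analysis can be sketched in a few lines of Python. This is a bare-bones illustration under several assumptions of my own, not the procedure any particular R function implements: the 'observed' data are simulated (200 cases on six variables that all share one common factor), the random data sets are drawn from a standard normal distribution, each observed eigenvalue is compared with the mean of its random counterparts, and the eigenvalues are found with a simple Jacobi rotation routine so that nothing beyond the standard library is needed.

```python
import math
import random


def correlation_matrix(data):
    """Pearson correlations between the columns of data (a list of rows)."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centred = [[row[j] - means[j] for j in range(p)] for row in data]
    norms = [math.sqrt(sum(row[j] ** 2 for row in centred)) for j in range(p)]
    return [[sum(row[i] * row[j] for row in centred) / (norms[i] * norms[j])
             for j in range(p)] for i in range(p)]


def eigenvalues(matrix, sweeps=100):
    """Eigenvalues of a symmetric matrix (cyclic Jacobi method), largest first."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < 1e-18:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def parallel_analysis(data, n_random=50, seed=1):
    """Compare observed eigenvalues with mean eigenvalues from random data."""
    rng = random.Random(seed)
    n, p = len(data), len(data[0])
    observed = eigenvalues(correlation_matrix(data))
    totals = [0.0] * p
    for _ in range(n_random):
        noise = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
        for j, value in enumerate(eigenvalues(correlation_matrix(noise))):
            totals[j] += value
    random_means = [t / n_random for t in totals]
    retained = 0
    for obs, rnd in zip(observed, random_means):
        if obs <= rnd:  # stop at the first factor no bigger than chance
            break
        retained += 1
    return observed, random_means, retained


# Simulated 'observed' data: every variable is one common factor plus noise.
rng = random.Random(42)
data = []
for _ in range(200):
    common = rng.gauss(0, 1)
    data.append([common + rng.gauss(0, 1) for _ in range(6)])

observed, random_means, retained = parallel_analysis(data)
print([round(v, 2) for v in observed])
print([round(v, 2) for v in random_means])
print("factors retained:", retained)
```

Because the simulated variables share a single common factor, only the first observed eigenvalue should beat its random counterpart. Published versions of parallel analysis often use a percentile (such as the 95th) of the random eigenvalues rather than the mean, so treat the comparison rule here as one simple option.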


17.3.9. Improving interpretation: factor rotation


Once factors have been extracted, it is possible to calculate to what degree variables load 
on these factors (i.e., calculate the loading of the variable on each factor). Generally, you 
will find that most variables have high loadings on the most important factor and small 
loadings on all other factors. This characteristic makes interpretation difficult, and so a 
technique called factor rotation is used to discriminate between factors. If a factor is a
classification axis along which variables can be plotted, then factor rotation effectively 
rotates these factor axes such that variables are loaded maximally on only one factor. 
Figure 17.5 demonstrates how this process works using an example in which there are 
only two factors. 

Imagine that a sociologist was interested in classifying university lecturers as a demographic
group. She discovered that two underlying dimensions best describe this group:
alcoholism and achievement (go to any academic conference and you’ll see that academics 
drink heavily). The first factor, alcoholism, has a cluster of variables associated with it (dark 
blue circles), and these could be measures such as the number of units drunk in a week, 
dependency and obsessive personality. The second factor, achievement, also has a cluster of 
variables associated with it (light blue circles), and these could be measures relating to salary, 
job status and number of research publications. Initially, the full lines represent the factors, 
and by looking at the coordinates it should be clear that the light blue circles have high 
loadings for factor 2 (they are a long way up this axis) and medium loadings for 
factor 1 (they are not very far up this axis). Conversely, the dark blue circles have 
high loadings for factor 1 and medium loadings for factor 2. By rotating the axes 
(dashed lines), we ensure that both clusters of variables are intersected by the factor
to which they relate most. So, after rotation, the loadings of the variables are
maximized on one factor (the factor that intersects the cluster) and minimized 
on the remaining factor(s). If an axis passes through a cluster of variables, then 
these variables will have a loading of approximately zero on the opposite axis. If 
this idea is confusing, then look at Figure 17.5 and think about the values of the 
coordinates before and after rotation (this is best achieved by turning the book 
when you look at the rotated axes). 

There are two types of rotation that can be done. The first is orthogonal rotation, and 
the left-hand side of Figure 17.5 represents this method. In Chapter 10 we saw that the 
term orthogonal means unrelated, and in this context it means that we rotate factors while 
keeping them independent, or unrelated. Before rotation, all factors are independent (i.e., 
they do not correlate at all) and orthogonal rotation ensures that the factors remain uncorrelated.
That is why in Figure 17.5 the axes are turned while remaining perpendicular.⁶ The
other form of rotation is oblique rotation. The difference with oblique rotation is that the 
factors are allowed to correlate (hence, the axes of the right-hand diagram of Figure 17.5 
do not remain perpendicular). 

The choice of rotation depends on whether there is a good theoretical reason to suppose
that the factors should be related or independent (but see my later comments on this),
and also how the variables cluster on the factors before rotation. On the first point, we 
might not expect alcoholism to be completely independent of achievement (after all, high 
achievement leads to high stress, which can lead to the drinks cabinet!). Therefore, on 
theoretical grounds, we might choose oblique rotation. On the second point, Figure 17.5 
demonstrates how the positioning of clusters is important in determining how successful 
the rotation will be (note the position of the light blue circles). Specifically, if an orthogonal 
rotation was carried out on the right-hand diagram it would be considerably less successful 
in maximizing loadings than the oblique rotation that is displayed. One approach is to run 
the analysis using both types of rotation. Pedhazur and Schmelkin (1991) suggest that if the 
oblique rotation demonstrates a negligible correlation between the extracted factors then 
it is reasonable to use the orthogonally rotated solution. If the oblique rotation reveals a 
correlated factor structure, then the orthogonally rotated solution should be discarded. In
any case, an oblique rotation should be used only if there are good reasons to suppose that
the underlying factors could be related in theoretical terms.

6 This term means that the axes are at right angles to one another.

FIGURE 17.5

Schematic representations of factor rotation. The left graph displays orthogonal rotation
whereas the right graph displays oblique rotation (see text for more details). θ is the angle
through which the axes are rotated

The mathematics behind factor rotation is complex (especially oblique rotation).
However, in oblique rotation, because each factor can be rotated by different
amounts, a factor transformation matrix, Λ, is needed. The factor transformation matrix
is a square matrix and its size depends on how many factors were extracted from the
data. If two factors are extracted then it will be a 2 × 2 matrix, but if four factors
are extracted then it becomes a 4 × 4 matrix. The values in the factor transformation
matrix consist of sines and cosines of the angle of axis rotation (θ). This matrix is
multiplied by the matrix of unrotated factor loadings, A, to obtain a matrix of rotated
factor loadings.

For the case of two factors the factor transformation matrix would be:

\[
\Lambda = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}
\]


Therefore, you should think of this matrix as representing the angle through which the 
axes have been rotated, or the degree to which factors have been rotated. The angle of 
rotation necessary to optimize the factor solution is found in an iterative way (see R’s Souls’ 
Tip 8.1) and different methods can be used. 
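
To make the multiplication concrete, here is a minimal base-R sketch for the two-factor case. The loadings in A and the angle theta are made-up values for illustration (they are not taken from any data set in this chapter), and Lambda is simply the name I have given the factor transformation matrix. It shows that rotating the axes changes the individual loadings but leaves each variable's communality (the row sum of squared loadings) untouched.

```r
# Hypothetical unrotated loadings for four variables on two factors
A <- matrix(c(0.70,  0.40,
              0.65,  0.45,
              0.60, -0.50,
              0.55, -0.55),
            ncol = 2, byrow = TRUE)

theta <- 40 * pi / 180   # rotate the axes through 40 degrees (arbitrary choice)

# Factor transformation matrix built from sines and cosines of theta
Lambda <- matrix(c(cos(theta), -sin(theta),
                   sin(theta),  cos(theta)),
                 ncol = 2, byrow = TRUE)

# Rotated loadings: unrotated loadings multiplied by the transformation matrix
rotated <- A %*% Lambda

round(rotated, 2)
rowSums(A^2)         # communalities before rotation
rowSums(rotated^2)   # communalities after rotation
```

In a real analysis you would not pick theta by hand: the rotation method searches for the angle that optimizes its own criterion.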


17.3.9.1. Choosing a method of factor rotation

The R function that we will use has four methods of orthogonal rotation (varimax, quartimax,
BentlerT and geominT) and five methods of oblique rotation (oblimin, promax, simplimax,
BentlerQ and geominQ). These methods differ in how they rotate the factors and,
therefore, the resulting output depends on which method you select. 

The most important orthogonal rotations are quartimax and varimax. Quartimax rotation
attempts to maximize the spread of factor loadings for a variable across all factors.
Therefore, interpreting variables becomes easier. However, this often results in lots of variables
loading highly on a single factor. Varimax is the opposite in that it attempts to maximize
the dispersion of loadings within factors. Therefore, it tries to load a smaller number
of variables highly on each factor, resulting in more interpretable clusters of factors. For a
first analysis, you should probably select varimax because it is a good general approach that 
simplifies the interpretation of factors. 
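
If you want to see a varimax rotation in isolation, base R has a varimax() function in the stats package. The sketch below applies it to a made-up loading matrix (the numbers are hypothetical, not RAQ results); it is not the function used later in the chapter, just a quick way to look at the rotated loadings and the rotation matrix that produced them.

```r
# Hypothetical unrotated loadings: six variables, two factors
L <- matrix(c(0.60,  0.50,
              0.65,  0.45,
              0.70,  0.40,
              0.55, -0.50,
              0.60, -0.55,
              0.50, -0.45),
            ncol = 2, byrow = TRUE)

vm <- varimax(L)   # stats::varimax(); Kaiser normalization is the default

round(unclass(vm$loadings), 2)   # rotated loadings
round(vm$rotmat, 2)              # the rotation matrix that was found

# For an orthogonal rotation, t(rotmat) %*% rotmat is the identity matrix
round(crossprod(vm$rotmat), 10)
```

After rotation each variable should load mainly on one of the two factors, which is the 'more interpretable clusters' idea described above.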

The two important oblique rotations are promax and oblimin. Promax is a faster 
procedure designed for very large data sets. (If you are interested in adjustments that can 
be made to these rotations, other rotations, and even hand rotations, you can consult the 
GPARotate() function, found in the psych package.) 

In theory, the exact choice of rotation will depend largely on whether or not you think 
that the underlying factors should be related. If you expect the factors to be independent
then you should choose one of the orthogonal rotations (I recommend varimax). If,
however, there are theoretical grounds for supposing that your factors might correlate, 
then direct oblimin should be selected. In practice, there are strong grounds to believe that 
orthogonal rotations are a complete nonsense for naturalistic data, and certainly for any 
data involving humans (can you think of any psychological construct that is not in any way 
correlated with some other psychological construct?). As such, some argue that orthogonal 
rotations should never be used. 
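
One way to see what 'allowing factors to correlate' means is to rotate obliquely and look at the implied factor correlation. The text recommends direct oblimin, which needs the GPArotation package; purely to keep this sketch self-contained I have swapped in promax(), which ships with base R. The loadings are made up, and the way the factor correlations are recovered from the transformation matrix is one common convention rather than anything specific to this chapter.

```r
# Hypothetical unrotated loadings: six variables, two factors
L <- matrix(c(0.60,  0.50,
              0.65,  0.45,
              0.70,  0.40,
              0.55, -0.50,
              0.60, -0.55,
              0.50, -0.45),
            ncol = 2, byrow = TRUE)

pm <- promax(L)   # stats::promax(); an oblique rotation (power m = 4 by default)

round(unclass(pm$loadings), 2)   # obliquely rotated (pattern) loadings

# Factor correlations implied by the oblique transformation matrix
Tinv <- solve(pm$rotmat)
Phi  <- Tinv %*% t(Tinv)
round(Phi, 2)   # off-diagonal element = correlation between the two factors
```

If the off-diagonal value in Phi is negligible, the orthogonal solution is a reasonable description; if it is not, the factors are behaving as correlated constructs.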


17.3.9.2. Substantive importance of factor loadings

Once a factor structure has been found, it is important to decide which variables make 
up which factors. Earlier I said that the factor loadings were a gauge of the substantive 
importance of a given variable to a given factor. Therefore, it makes sense that we use these 
values to place variables with factors. It is possible to assess the statistical significance of 
a factor loading (after all, it is simply a correlation coefficient or regression coefficient); 
however, there are various reasons why this option is not as easy as it seems (see Stevens, 
2002, p. 393). Typically, researchers take a loading of an absolute value of more than 0.3 
to be important. However, the significance of a factor loading will depend on the sample
size. Stevens (2002) produced a table of critical values against which loadings can be
compared. To summarize, he recommends that for a sample size of 50 a loading of 0.722 
can be considered significant, for 100 the loading should be greater than 0.512, for 200 
it should be greater than 0.364, for 300 it should be greater than 0.298, for 600 it should 
be greater than 0.21, and for 1000 it should be greater than 0.162. These values are based 
on an alpha level of .01 (two-tailed), which allows for the fact that several loadings will 
need to be tested (see Stevens, 2002, for further detail). Therefore, in very large samples, 
small loadings can be considered statistically meaningful. (R can provide significance tests 
of factor loadings, but these get rather complex and are rarely used. By applying Stevens’s 
guidelines you should gain some insight into the structure of variables and factors.) 

The significance of a loading gives little indication of the substantive importance of 
a variable to a factor. This value can be found by squaring the factor loading to give an 
estimate of the amount of variance in a factor accounted for by a variable (like R²). In this
respect Stevens (2002) recommends interpreting only factor loadings with an absolute 
value greater than 0.4 (which explain around 16% of the variance in the variable). 
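
The two ideas above (a sample-size-dependent critical value, and the squared loading as shared variance) can be sketched in a few lines of R. The table holds only the six values quoted in the text; critical_loading() is a hypothetical helper of mine, and its rule of falling back on the largest tabled sample size that does not exceed yours is a conservative assumption, not something Stevens prescribes.

```r
# Critical loadings quoted in the text from Stevens (2002), alpha = .01 two-tailed
stevens <- data.frame(
  n        = c(50, 100, 200, 300, 600, 1000),
  critical = c(0.722, 0.512, 0.364, 0.298, 0.21, 0.162)
)

# Hypothetical helper: use the row for the largest tabled n not exceeding ours
critical_loading <- function(n) {
  if (n < min(stevens$n)) stop("sample smaller than the smallest tabled n")
  stevens$critical[max(which(stevens$n <= n))]
}

critical_loading(250)    # falls back on the n = 200 row
critical_loading(2571)   # falls back on the n = 1000 row

# Substantive importance: squared loading = proportion of shared variance
example_loadings <- c(0.3, 0.4, 0.6)   # made-up loadings
round(example_loadings^2, 2)
```

The last line is where the 16% figure comes from: a loading of 0.4, squared, is 0.16.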


17.4. Research example


One of the uses of factor analysis is to develop questionnaires: after all, if you want to measure
an ability or trait, you need to ensure that the questions asked relate to the construct
that you intend to measure. I have noticed that a lot of students become very stressed
about R. Therefore I wanted to design a questionnaire to measure a trait that I termed ‘R
anxiety’. I decided to devise a questionnaire to measure various aspects of students’ anxiety
towards learning R. I generated questions based on interviews with anxious and non-anxious
students and came up with 23 possible questions to include. Each question was a
statement followed by a five-point Likert scale ranging from ‘strongly disagree’ through
‘neither agree nor disagree’ to ‘strongly agree’. The questionnaire is printed in Figure 17.6.

FIGURE 17.6 The R anxiety questionnaire (RAQ)

SD = Strongly Disagree, D = Disagree, N = Neither, A = Agree, SA = Strongly Agree

 1  Statistics make me cry
 2  My friends will think I'm stupid for not being able to cope with R
 3  Standard deviations excite me
 4  I dream that Pearson is attacking me with correlation coefficients
 5  I don’t understand statistics
 6  I have little experience of computers
 7  All computers hate me
 8  I have never been good at mathematics
 9  My friends are better at statistics than me
10  Computers are useful only for playing games
11  I did badly at mathematics at school
12  People try to tell you that R makes statistics easier to understand but it doesn’t
13  I worry that I will cause irreparable damage because of my incompetence with computers
14  Computers have minds of their own and deliberately go wrong whenever I use them
15  Computers are out to get me
16  I weep openly at the mention of central tendency
17  I slip into a coma whenever I see an equation
18  R always crashes when I try to use it
19  Everybody looks at me when I use R
20  I can’t sleep for thoughts of eigenvectors
21  I wake up under my duvet thinking that I am trapped under a normal distribution
22  My friends are better at R than I am
23  If I am good at statistics people will think I am a nerd

The questionnaire was designed to predict how anxious a given individual would be 
about learning how to use R. What’s more, I wanted to know whether anxiety about R 
could be broken down into specific forms of anxiety. In other words, what latent variables 
contribute to anxiety about R? With a little help from a few lecturer friends I collected 
2571 completed questionnaires (at this point it should become apparent that this example 
is fictitious). The data are stored in the file RAQ.dat. Load this file into R and have a look 
at the data. We know that in R, cases (or people’s data) are typically stored in rows and 
variables are stored in columns and so this layout is consistent with past chapters. The second
thing to notice is that there are 23 variables labelled Q01 to Q23.

OLIVER TWISTED
Please Sir, can I have some more ... questionnaires?

‘I'm going to design a questionnaire to measure one’s propensity
to pick a pocket or two’, says Oliver, ‘but how would I go about
doing it?’ You’d read the useful information about the dos and
don’ts of questionnaire design in the additional material for this
chapter on the companion website, that’s how. Rate how useful it
is on a Likert scale from 1 = not useful at all, to 5 = very useful.


17.4.1. Sample size


Correlation coefficients fluctuate from sample to sample, much more so in small samples 
than in large. Therefore, the reliability of factor analysis is also dependent on sample size. 
Much has been written about the necessary sample size for factor analysis, resulting in 
many ‘rules of thumb’. The common rule is to suggest that a researcher has at least 10-15 
participants per variable. Although I’ve heard this rule bandied about on numerous occasions,
its empirical basis is unclear (although Nunnally, 1978, did recommend having 10
times as many participants as variables). Kass and Tinsley (1979) recommended having
between 5 and 10 participants per variable up to a total of 300 (beyond which test parameters
tend to be stable regardless of the participant to variable ratio). Indeed, Tabachnick
and Fidell (2007) agree that ‘it is comforting to have at least 300 cases for factor analysis’ 
(p. 613), and Comrey and Lee (1992) class 300 as a good sample size, 100 as poor and 
1000 as excellent. 

Fortunately, recent years have seen empirical research done in the form of experiments 
using simulated data (so-called Monte Carlo studies). Arrindell and van der Ende (1985) 
used real-life data to investigate the effect of different participant to variable ratios. They 
concluded that changes in this ratio made little difference to the stability of factor solutions.
Guadagnoli and Velicer (1988) found that the most important factors in determining
reliable factor solutions were the absolute sample size and the absolute magnitude of factor
loadings. In short, they argue that if a factor has four or more loadings greater than .6
then it is reliable regardless of sample size. Furthermore, factors with 10 or more loadings
greater than .40 are reliable if the sample size is greater than 150. Finally, factors with a few
low loadings should not be interpreted unless the sample size is 300 or more. MacCallum,
Widaman, Zhang, and Hong (1999) have shown that the minimum sample size or sample
to variable ratio depends on other aspects of the design of the study. In short, their study
indicated that as communalities become lower the importance of sample size increases.
With all communalities above .6, relatively small samples (less than 100) may be perfectly
adequate. With communalities in the .5 range, samples between 100 and 200 can be good
enough provided there are relatively few factors each with only a small number of indicator
variables. In the worst scenario of low communalities (well below .5) and a larger
number of underlying factors they recommend samples above 500. 
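
Guadagnoli and Velicer's three guidelines are easy to lose track of in prose, so here they are as a small decision function. This is only my reading of the rules as summarized above: factor_reliable() is a hypothetical helper, the loadings passed to it are made up, and real decisions about a factor should not be reduced to a single TRUE or FALSE.

```r
# Hypothetical helper encoding the Guadagnoli and Velicer (1988) guidelines
# as summarized in the text; 'loadings' holds the loadings on ONE factor
factor_reliable <- function(loadings, n) {
  a <- abs(loadings)
  if (sum(a > 0.6) >= 4) return(TRUE)               # reliable regardless of n
  if (sum(a > 0.4) >= 10 && n > 150) return(TRUE)   # many moderate loadings
  n >= 300                                          # otherwise need 300 or more
}

strong_factor <- c(0.70, 0.65, 0.62, 0.61, 0.30)   # four loadings above .6
weak_factor   <- c(0.45, 0.35, 0.30)               # a few low loadings

factor_reliable(strong_factor, n = 80)
factor_reliable(weak_factor,   n = 200)
factor_reliable(weak_factor,   n = 350)
```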

What’s clear from this work is that a sample of 300 or more will probably provide a 
stable factor solution, but that a wise researcher will measure enough variables to 
adequately measure all of the factors that theoretically they would expect to find. 

Another alternative is to use the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy
(Kaiser, 1970). The KMO can be calculated for individual and multiple variables and represents
the ratio of the sum of squared correlations between variables to the sum of squared
correlations plus the sum of squared partial correlations between variables. The KMO
statistic varies between 0 and 1. A value of 0 indicates
that the sum of partial correlations is large relative to the sum of correlations, indicating
diffusion in the pattern of correlations (hence, factor analysis is likely to be inappropriate).
A value close to 1 indicates that patterns of correlations are relatively compact and so factor
analysis should yield distinct and reliable factors. Kaiser (1974) recommends a bare minimum
of .5 (values below this should lead you to either collect more data or rethink which
variables to include). Furthermore, values between .5 and
.7 are mediocre, values between .7 and .8 are good, values between .8 and .9 are great and
values above .9 are superb (Hutcheson & Sofroniou, 1999).
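
For the curious, the KMO can be computed from a correlation matrix in a few lines of base R. This is a sketch under the usual textbook definition (squared correlations divided by squared correlations plus squared partial correlations, with the partial correlations taken from the inverse of the correlation matrix); kmo_sketch() and the example matrix are my own inventions, and the function is not a substitute for a tested implementation.

```r
# Hypothetical helper: overall and per-variable KMO from a correlation matrix
kmo_sketch <- function(R) {
  S <- solve(R)                              # inverse of the correlation matrix
  P <- -S / sqrt(outer(diag(S), diag(S)))    # partial correlations
  diag(P) <- 0                               # ignore the diagonal
  R0 <- R
  diag(R0) <- 0
  r2 <- R0^2                                 # squared correlations
  p2 <- P^2                                  # squared partial correlations
  list(overall    = sum(r2) / (sum(r2) + sum(p2)),
       individual = colSums(r2) / (colSums(r2) + colSums(p2)))
}

# Made-up correlation matrix for four variables
Rmat <- matrix(c(1.0, 0.6, 0.5, 0.2,
                 0.6, 1.0, 0.4, 0.3,
                 0.5, 0.4, 1.0, 0.1,
                 0.2, 0.3, 0.1, 1.0),
               nrow = 4, byrow = TRUE)

kmo_sketch(Rmat)
```

Notice that the function needs to invert the correlation matrix, which is one reason the KMO cannot be obtained when that matrix is singular.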


17.4.2. Correlations between variables


When I was an undergraduate, my statistics lecturer always used to say ‘if you put garbage in, 
you get garbage out’. This saying applies particularly to factor analysis, because R will usually 
find a factor solution for a set of variables. However, the solution is unlikely to have any real 
meaning if the variables analysed are not sensible. The first thing to do when conducting a 
factor analysis or principal components analysis is to look at the correlations of the variables,
which can be checked using the cor() function (see Chapter 6) to create a correlation matrix
of all variables. There are essentially two potential problems: (1) correlations that are not
high enough; and (2) correlations that are too high. In both cases
the remedy is to remove variables from the analysis. We will look at each problem in turn.

If our test questions measure the same underlying dimension (or dimensions) then we 
would expect them to correlate with each other (because they are measuring the same thing). 
Even if questions measure different aspects of the same things (e.g., we could measure overall
anxiety in terms of sub-components such as worry, intrusive thoughts and physiological
arousal), there should still be high correlations between the variables relating to these sub-traits.
We can test for this problem first by visually scanning the correlation matrix and looking
for correlations below about .3: if any variables have lots of correlations below this value
then consider excluding them. It should be immediately clear that this approach is very 
subjective: I’ve used fuzzy terms such as ‘about .3’ and ‘lots of’, but I have to because every 
data set is different. Analysing data really is a skill, not a matter of following a recipe book. 
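
Scanning a big correlation matrix by eye gets tiring, so you can ask R to count the low correlations for each variable. The sketch below uses simulated data (three items driven by one common factor plus a fourth item that is pure noise); the variable names and the .3 cut-off are illustrative assumptions, and the counts are a prompt for your judgement rather than a decision rule.

```r
set.seed(1)
n <- 500
f <- rnorm(n)                          # one simulated common factor

dat <- data.frame(
  q1 = 0.8 * f + rnorm(n, sd = 0.6),
  q2 = 0.8 * f + rnorm(n, sd = 0.6),
  q3 = 0.8 * f + rnorm(n, sd = 0.6),
  q4 = rnorm(n)                        # unrelated to the other three items
)

r <- cor(dat)
diag(r) <- NA                          # ignore each variable's correlation with itself

# For each variable, how many of its correlations fall below .3 in absolute value?
low_counts <- colSums(abs(r) < 0.3, na.rm = TRUE)
low_counts
```

A variable whose count is close to the number of other variables (as for q4 here) is a candidate for exclusion.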

If you want an objective test of whether correlations (overall) are too small then you
can test for a very extreme scenario. If the variables in our correlation matrix did not correlate
at all, then our correlation matrix would be an identity matrix (i.e., the off-diagonal
components are zero - see section 16.4.2). Bartlett’s test examines whether the population
correlation matrix resembles an identity matrix. If the population correlation matrix
resembles an identity matrix then it means that every variable correlates very badly with
all other variables (i.e., all correlation coefficients are close to zero). If it were an identity
matrix then it would mean that all variables are perfectly independent of one another (all
correlation coefficients are zero). Given that we are looking for clusters of variables that
measure similar things, it should be obvious why this scenario is problematic: if no variables
correlate then there are no clusters to find. Bartlett’s test tells us whether our correlation
matrix is significantly different from an identity matrix. Therefore, if it is significant
then it means that the correlations between variables are (overall) significantly different
from zero. So, if Bartlett’s test is significant then it is good news. However, as with any
significance test, it depends on sample sizes and in factor analysis we typically use very large
samples. Therefore, although a non-significant Bartlett’s test is certainly cause for concern,
a significant test does not necessarily mean that correlations are big enough to make the
analysis meaningful. If you do identify any variables that seem to have very low correlations
with lots of other variables, then exclude them from the factor analysis.
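
Bartlett's test is simple enough to compute by hand, which helps to demystify it. The sketch below uses the standard chi-square approximation based on the log of the determinant of the correlation matrix; bartlett_sketch() and the example matrix are hypothetical, written only to show where the test statistic and its degrees of freedom come from.

```r
# Hypothetical helper: Bartlett's test that a correlation matrix is an identity matrix
# R = correlation matrix, n = sample size
bartlett_sketch <- function(R, n) {
  p     <- ncol(R)
  chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
  df    <- p * (p - 1) / 2
  list(chisq = chisq, df = df,
       p.value = pchisq(chisq, df, lower.tail = FALSE))
}

# Made-up correlation matrix for four variables, with a sample of 200
Rmat <- matrix(c(1.0, 0.6, 0.5, 0.2,
                 0.6, 1.0, 0.4, 0.3,
                 0.5, 0.4, 1.0, 0.1,
                 0.2, 0.3, 0.1, 1.0),
               nrow = 4, byrow = TRUE)

bartlett_sketch(Rmat, n = 200)

# An identity matrix has determinant 1, so its test statistic is zero
bartlett_sketch(diag(4), n = 200)$chisq
```

Because the sample size multiplies the log determinant, even weak correlations produce a large chi-square when n is big, which is the caveat raised above.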

The opposite problem is when variables correlate too highly. Although mild multicollinearity
is not a problem for factor analysis it is important to avoid extreme multicollinearity
(i.e., variables that are very highly correlated) and singularity (variables that are perfectly
correlated). As with regression, multicollinearity causes problems in factor analysis because
it becomes impossible to determine the unique contribution to a factor of the variables that
are highly correlated. Multicollinearity does not
cause a problem for principal components analysis. Therefore, as well as scanning the correlation
matrix for low correlations, we could also look out for very high correlations (r >
.8). The problem with a heuristic such as this is that the effect of two variables correlating 
with r = .9 might be less than the effect of, say, three variables that all correlate at r = .6. In 
other words, eliminating such highly correlating variables might not be getting at the cause 
of the multicollinearity (Rockwell, 1975). 
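
Picking out the very high correlations can also be automated. This sketch lists every pair of variables whose correlation exceeds .8 in absolute value; the matrix and variable names are made up, and (as just noted) a clean result here does not rule out multicollinearity spread across several variables.

```r
# Made-up correlation matrix with one very highly correlated pair (a and b)
vars <- c("a", "b", "c", "d")
Rm <- matrix(c(1.00, 0.85, 0.30, 0.20,
               0.85, 1.00, 0.35, 0.25,
               0.30, 0.35, 1.00, 0.40,
               0.20, 0.25, 0.40, 1.00),
             nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Look only above the diagonal so that each pair is reported once
high <- which(abs(Rm) > 0.8 & upper.tri(Rm), arr.ind = TRUE)

data.frame(var1 = rownames(Rm)[high[, "row"]],
           var2 = colnames(Rm)[high[, "col"]],
           r    = Rm[high])
```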

Multicollinearity can be detected by looking at the determinant of the R-matrix, denoted 
|R| (see Jane Superbrain Box 17.3). One simple heuristic is that the determinant of the 
R-matrix should be greater than 0.00001. 

If you have reason to believe that the correlation matrix has multicollinearity then you
could look through the correlation matrix for variables that correlate very highly (r > .8)
and consider eliminating one of the variables (or more, depending on the extent of the
problem) before proceeding. You may have to try some trial and error to work out which
variables are creating the problem (it’s not always the two with the highest correlation, it
could be a larger number of variables with correlations that are not obviously too large).

JANE SUPERBRAIN 17.3
What is the determinant?

The determinant of a matrix is an important diagnostic
tool in factor analysis, but the question of what it is is not
easy to answer because it has a mathematical definition
and I’m not a mathematician. Rather than pretending that
I understand the maths, all I’ll say is that a good explanation
of how the determinant is derived can be found at
http://mathworld.wolfram.com. However, we can bypass
the maths and think about the determinant conceptually.

The way that I think of the determinant is as describing
the ‘area’ of the data. In Jane Superbrain Box 16.2 we
saw the two diagrams below.

At the time I used these to describe eigenvectors and
eigenvalues (which describe the shape of the data). The
determinant is related to eigenvalues and eigenvectors, but
instead of describing the height and width of the data it
describes the overall area. So, in the left diagram below, the
determinant of those data would represent the area inside
the dashed ellipse. These variables have a low correlation
so the determinant (area) is big; the biggest value it can be
is 1. In the right diagram, the variables are perfectly correlated
or singular, and the ellipse (dashed line) has been
squashed down to basically a straight line. In other words,
the opposite sides of the ellipse have actually met each
other and there is no distance between them at all. Put
another way, the area, or determinant, is zero. Therefore,
the determinant tells us whether the correlation matrix is
singular (determinant is 0), or if all variables are completely
unrelated (determinant is 1), or somewhere in between.
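
You can watch the determinant shrink as variables become more related by trying it on the smallest possible case. With only two variables the determinant of the correlation matrix works out to 1 − r², so it runs from 1 (unrelated) down to 0 (singular). The helper corr2() below is hypothetical and exists only to build the 2 × 2 matrix.

```r
# Hypothetical helper: 2 x 2 correlation matrix for a given correlation r
corr2 <- function(r) matrix(c(1, r,
                              r, 1), nrow = 2, byrow = TRUE)

det(corr2(0))     # completely unrelated variables
det(corr2(0.5))   # moderate correlation
det(corr2(0.9))   # very high correlation
det(corr2(1))     # perfectly correlated (singular)

# The heuristic from the text: the determinant should exceed 0.00001
det(corr2(0.9)) > 0.00001
```

With more than two variables there is no such simple formula, but the interpretation is the same: the closer the value is to zero, the more redundancy there is in the matrix.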


17.4.3. The distribution of data


As well as looking for interrelations, you should ensure that variables have roughly normal 
distributions and are measured at an interval level (which Likert scales are, perhaps wrongly, 
assumed to be). The assumption of normality is most important if you wish to generalize 
the results of your analysis beyond the sample collected. You can do factor analysis on non-continuous
data; for example, if you had dichotomous variables you should construct the
correlation matrix from polychoric correlation coefficients (these can be calculated using
the polychor() function, found in the polycor package, which we used in Chapter 6). 7


17.5. Running the analysis with R Commander


If you look through the menus, you’ll find ‘factor analysis’. The factor analysis that’s available
in R Commander is a little limited: it does only one kind of extraction (maximum
likelihood) and, although this is a good method when it works, it often doesn’t work.
Understanding why it didn’t work and what to do about it is difficult (and the solution is 
often to just use a different sort of extraction). For this reason, we don’t recommend factor 
analysis with R Commander. 


17.6. Running the analysis with R


17.6.1. Packages used in this chapter


There are several packages we will use in this chapter. You will need the packages corpcor,
GPArotation (for rotating) and psych (for the factor analysis). If you don’t have these packages
installed you’ll need to install them and load them.

install.packages("corpcor"); install.packages("GPArotation"); install.packages("psych")

Then you need to load the packages by executing these commands:

library(corpcor); library(GPArotation); library(psych)


17.6.2. Initial preparation and analysis


To run a factor analysis or a principal components analysis you can either use the raw data,
or you can calculate a correlation matrix, and use that. If you have a massive number of
cases (and by massive, I mean at least 100,000, and probably closer to 1,000,000) you’re
better off calculating a correlation matrix first, and then factor-analysing that. If you don’t
have a massive number of cases, it doesn’t matter which you do. It’s also worth noting at
this stage that sometimes the analysis doesn’t work, usually because the correlation matrix
that you’re trying to analyse is weird (R’s Souls’ Tip 17.1).

7 Note that there is an h in the polychor function, that’s because we’re calculating polychoric correlations, using a
package that calculates polychoric and polyserial correlations. (Also note that it’s written by John Fox, author of
several other packages we use in this book.)
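
To convince yourself that the raw-data route and the correlation-matrix route really do lead to the same place, here is a small self-contained check. It uses simulated data and base R's factanal() function (a maximum-likelihood factor analysis from the stats package), not the psych function used in the rest of this chapter; the two-factor structure, the variable names and the sample size are all assumptions made up for the demonstration.

```r
set.seed(42)
n  <- 500
f1 <- rnorm(n)   # two simulated, uncorrelated factors
f2 <- rnorm(n)

# Eight simulated variables: four driven by each factor, plus noise
X <- cbind(
  sapply(1:4, function(i) 0.7 * f1 + rnorm(n, sd = sqrt(0.51))),
  sapply(1:4, function(i) 0.7 * f2 + rnorm(n, sd = sqrt(0.51)))
)
colnames(X) <- paste0("v", 1:8)

# Route 1: hand over the raw data
fit_raw <- factanal(X, factors = 2)

# Route 2: compute the correlation matrix first, then analyse that
fit_cor <- factanal(covmat = cor(X), factors = 2, n.obs = n)

# Largest difference between the two sets of loadings
max(abs(unclass(fit_raw$loadings) - unclass(fit_cor$loadings)))
```

The correlation-matrix route has to be told the sample size (n.obs) because the matrix on its own carries no record of how many cases produced it.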

First, we’ll load the data into a dataframe called raqData. Set your working directory to 
the location of the file (see section 3.4.4) and execute: 

raqData<-read.delim("raq.dat", header = TRUE) 

We want to include all of the variables in our data set in our factor analysis. We can cal¬ 
culate the correlation matrix, using the cor() function (see Chapter 6): 

raqMatrix<-cor(raqData) 



Warning messages about non-positive definite matrix


On rare occasions, you might have a non-positive definite matrix. When you have this, R will give unhelpful warnings,
such as:


Warning messages: 

1: In log(det(m.inv.r)) : NaNs produced 

2: In log(det(r)) : NaNs produced 


What R is trying to tell you, in its own friendly way, is that the determinant of the R (correlation) matrix is negative,
and hence it cannot find the log of the determinant (‘NaN’ is R’s way of saying “not a number”). This problem
is usually described as a non-positive definite matrix.

What is a non-positive definite matrix? As we have seen, factor analysis works by looking at your correlation
matrix. This matrix has to be ‘positive definite’ for the analysis to work. What does that mean in plain English?
It means lots of horrible things mathematically (e.g., the eigenvalues and determinant of the matrix have to be
positive) and about the best explanation I’ve seen is at http://www2.gsu.edu/~mkteer/npdmatri.html. In more
basic terms, factors are like lines floating in space, and eigenvalues measure the length of those lines. If your
eigenvalue is negative then it means that the length of your line/factor is negative too. It’s a bit like me asking you
how tall you are, and you responding ‘I’m minus 175 cm tall’. That would be nonsense. By analogy, if a factor
has negative length, then that too is nonsense. When R decomposes the correlation matrix to look for factors, if
it comes across a negative eigenvalue it starts thinking ‘oh dear, I’ve entered some weird parallel universe where
the usual rules of maths no longer apply and things can have negative lengths, and this probably means that time
runs backwards, my mum is my dad, my sister is a dog, my head is a fish, and my toe is a frog called Gerald’. It
still has a go at producing results, but those results probably won't make much sense. (We'd like it if it said ‘these
results are probably nonsense’, rather than being a bit subtle about it, so you have to be really careful.)

Things like the KMO test and the determinant rely on a positive definite matrix; if you don’t have one they can’t 
be computed. 
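
If you suspect this problem, you can check the eigenvalues yourself before running anything else. The sketch below uses a deliberately impossible 'correlation' matrix that I made up (variable 1 correlates strongly and positively with variables 2 and 3, yet those two correlate strongly and negatively with each other); the eigen() and det() calls are base R.

```r
# Made-up, internally inconsistent 'correlation' matrix
bad <- matrix(c( 1.0,  0.9,  0.9,
                 0.9,  1.0, -0.9,
                 0.9, -0.9,  1.0),
              nrow = 3, byrow = TRUE)

# Eigenvalues of the matrix (only.values = TRUE skips the eigenvectors)
ev <- eigen(bad, symmetric = TRUE, only.values = TRUE)$values
ev

all(ev > 0)   # TRUE only for a positive definite matrix
det(bad)      # a negative determinant is what makes log(det()) return NaN
```

If all(ev > 0) comes back FALSE for your own correlation matrix, the warnings shown above are the likely result.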

Why have I got a non-positive definite matrix? The most likely answer is that you have too many variables 
and too few cases of data, which makes the correlation matrix a bit unstable. It could also be that you have 
too many highly correlated items in your matrix (singularity, for example, tends to mess things up). In any case 
it means that your data are bad, naughty data, and not to be trusted; if you let them loose then you have only 
yourself to blame for the consequences. 

What can I do? Other than cry, there's not that much you can do. You could try to limit your items, or selectively
remove items (especially highly correlated ones) to see if that helps. Collecting more data can help too.
There are some mathematical fudges you can do, but they’re not as tasty as vanilla fudge and they are hard to
implement easily.

Executing this command creates a matrix of correlation coefficients called raqMatrix. We
can use this matrix in the analysis (although we don’t have to). It’s a good idea to have a 
look at the correlation matrix, for the reasons we discussed earlier. To make our eyes hurt 
a little less, let’s use the round() function to display only 2 decimal places of the correlation 
matrix that we have just created: 

round(raqMatrix, 2) 

The R-matrix (or correlation matrix) produced using the cor() function is displayed in
Output 17.1. You should be comfortable with the idea that to do a factor analysis we need
to have variables that correlate fairly well, but not perfectly. Also, any variables that correlate
with no others should be eliminated. Therefore, we can use this correlation matrix
to check the pattern of relationships. First, scan the matrix for correlations greater than
.3, then look for variables that only have a small number of correlations greater than this
value. Then scan the correlation coefficients themselves and look for any greater than .9.
If any are found then you should be aware that a problem could arise because of multicollinearity
in the data.

Output 17.1

As well as looking at the correlation matrix, we should run Bartlett’s test and the 
KMO on the correlation matrix. Bartlett’s test is run using the cortest.bartlett() function 
from the psych package. We can run this test either on the raw data or on the correlation 
matrix. To run it from the raw data simply input the dataframe (in this case raqData) 
into the function: 

cortest.bartlett(raqData) 

To run it from the correlation matrix (in this case raqMatrix), input the name of the correlation
matrix but also provide the sample size (in this case 2571):

cortest.bartlett(raqMatrix, n = 2571) 

Both methods will give you the results in Output 17.2. If you ran the test from the raw
data, you’ll get the warning R was not square, finding R from data, which is nothing to
worry about: it just means that because we didn’t give the function a correlation matrix, it’s
calculating it from the raw data (that’s what we expect it to do). For factor analysis to work
we need some relationships between variables and if the R-matrix were an identity matrix
then all correlation coefficients would be zero. Therefore, we want this test to be significant
(i.e., have a significance value less than .05). A significant test tells us that the R-matrix
is not an identity matrix; therefore, there are some relationships between the variables we
hope to include in the analysis. For these data, Bartlett’s test is highly significant, χ²(253) =
19,334, p < .001, and therefore factor analysis is appropriate.

R was not square, finding R from data
$chisq
[1] 19334.49

$p.value
[1] 0

$df
[1] 253

Output 17.2

Next we’d also like the KMO. None of the packages in R currently have a straightforward
way to calculate the KMO. However, one of the nice things about R is that
people can write programs to do anything that R doesn’t currently do, and G. Jay
Kerns, from Youngstown State University (see http://tolstoy.newcastle.edu.au/R/e2/help/07/08/22816.html)
has written one called kmo(), which calculates the KMO and a
variety of other things. The function itself is easy to use manually (see Oliver Twisted), 
but because it is not part of a package we have included it in our DSUR package so that 
you can use it directly (assuming you have loaded the DSUR package). You can use the 
function by simply entering the name of your dataframe into it and executing. 

kmo(raqData) 

The results of the KMO test are shown in Output 17.3. We came across the KMO 
statistic in section 17.4.1 and saw that Kaiser (1974) recommends a bare minimum of .5 
and that values between .5 and .7 are mediocre, values between .7 and .8 are good, values
between .8 and .9 are great and values above .9 are superb (Hutcheson & Sofroniou,
1999). For these data the overall value is .93, which falls into the range of being superb (or 
‘marvellous’ as the report puts it), so we should be confident that the sample size and the 
data are adequate for factor analysis. 


OLIVER TWISTED
Please Sir, can I have some more ... kmo?

‘Stop spanking my monkey!’, cries an hysterical Oliver, ‘it’s never done
you any harm, and it’s orange.’ I was talking about the Kaiser-Meyer-Olkin
test, Oliver. ‘Oh, sorry’, he says with a sigh of relief, ‘I thought
KMO stood for Kill My Orang-utan’. Erm, OK, Oliver has finally lost
the plot, which I'm fairly sure is what you’ll do if you inspect the kmo()
function on the companion website. Although we have included it in
our DSUR package, you can also copy it and execute it manually.


KMO can be calculated for multiple and individual variables. The value of KMO should 
be above the bare minimum of .5 for all variables (and preferably higher) as well as overall. 
The KMO values for individual variables are produced by the kmo() function too. For these 
data all values are well above .5, which is good news. If you find any variables with values 
below .5 then you should consider excluding them from the analysis (or run the analysis 
with and without that variable and note the difference). Removal of a variable affects the 
KMO statistics, so if you do remove a variable be sure to rerun the kmo() function on the 
new data. 

$overall
[1] 0.9302245

$report
[1] "The KMO test yields a degree of common variance marvelous."

$individual

Output 17.3
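
If you are curious about what a KMO calculation involves, the overall and individual values can be sketched in base R. This is only a minimal illustration and not the kmo() function from the DSUR package: the data (toy) are simulated, every object name is made up for the example, and it assumes the standard definition of the KMO as the sum of squared correlations divided by the sum of squared correlations plus the sum of squared partial correlations.

```r
# Simulated toy data: four variables that share one common source (f)
set.seed(42)
n <- 500
f <- rnorm(n)
toy <- data.frame(x1 = f + rnorm(n), x2 = f + rnorm(n),
                  x3 = f + rnorm(n), x4 = f + rnorm(n))

R <- cor(toy)       # correlation matrix
Rinv <- solve(R)    # its inverse

# Partial correlations between each pair, controlling for the other variables
P <- -Rinv / sqrt(outer(diag(Rinv), diag(Rinv)))

# Squared values, ignoring the diagonal
r2 <- R^2; diag(r2) <- 0
p2 <- P^2; diag(p2) <- 0

kmo.overall <- sum(r2) / (sum(r2) + sum(p2))
kmo.individual <- colSums(r2) / (colSums(r2) + colSums(p2))

kmo.overall
kmo.individual
```

The individual values use the same ratio as the overall one, but the sums are restricted to the column belonging to each variable.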

Finally, we’d like the determinant of the correlation matrix. To find the determinant,
we use the det() function, into which we place the name of a correlation matrix. We have
computed this matrix already for the current data (raqMatrix) so we can execute:

det(raqMatrix) 

If we hadn’t already created the matrix, we could get the determinant by putting the cor() 
function for the raw data into the det() function: 

det(cor(raqData))

Either method produces the same value: 

[1] 0.0005271037

This value is greater than the necessary value of 0.00001 (see section 17.5). As such, our
determinant does not seem problematic. After checking the determinant, you can, if necessary,
eliminate variables that you think are causing the problem. In summary, all questions
in the RAQ correlate reasonably well with all others and none of the correlation coefficients
are excessively large; therefore, we won’t eliminate any questions at this stage.
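
As a quick check on how these numbers hang together, Bartlett’s chi-square can be recovered from the determinant. The sketch below assumes the usual textbook form of the statistic, chi-square = -[(N - 1) - (2p + 5)/6] × ln|R| with p(p - 1)/2 degrees of freedom, and uses the sample size (2571) and number of variables (23) for the RAQ data; the object names are invented for the example.

```r
n <- 2571             # number of observations
p <- 23               # number of variables (questions)
detR <- 0.0005271037  # determinant of the correlation matrix

# Bartlett's test of sphericity computed from the determinant
chisq <- -((n - 1) - (2 * p + 5) / 6) * log(detR)
df <- p * (p - 1) / 2

round(chisq, 2)
df
```

The result should agree closely with the chi-square and degrees of freedom in Output 17.2.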



CRAMMING SAM’S TIPS 


Preliminary analysis 


• Scan the correlation matrix; look for variables that don’t correlate with any other variables, or correlate very highly (r = .9)
with one or more other variables. In factor analysis, check that the determinant of this matrix is bigger than 0.00001; if it is
then multicollinearity isn’t a problem.

• Check the KMO and Bartlett’s test: the KMO statistic should be greater than .5 as a bare minimum; if it isn’t, collect more
data. Bartlett’s test of sphericity should be significant (the significance value should be less than .05).

17.6.3. Factor extraction using R


For our present purposes we will use principal components analysis, which strictly speaking
isn’t factor analysis; however, the two procedures may often yield similar results (see
section 17.3.6). Principal component analysis is carried out using the principal() function,
in the psych package. This function takes the general form:

pcModel<-principal(dataframe/R-matrix, nfactors = number of factors, rotate =
"method of rotation", scores = TRUE/FALSE)

This command creates a principal components model called pcModel, by specifying either 
a dataframe of raw data or a correlation matrix. There are three main options: 

• nfactors allows you to specify how many factors/components you want to extract (see section
17.3.8) as a number. If you don’t specify nfactors, then one component is extracted.

• rotate allows you to specify a method of factor rotation (see section 17.3.9) using a text 
string. If you don’t declare a method of rotation, the default of varimax rotation is used. 

• scores allows you to obtain factor scores (TRUE) or not (FALSE). The default is FALSE. 

I mentioned earlier that when conducting principal components analysis we begin by
establishing the linear variates within the data and then decide how many of these variates
to retain (or ‘extract’). Therefore, our starting point is to create a principal components
model that has the same number of factors as there are variables in the data: by doing this
we are just reducing the data set down to its underlying factors. By extracting as many factors
as there are variables we can inspect their eigenvalues and make decisions about which
factors to extract. (Note that if you use factor analysis, rather than principal components
analysis, you need to extract fewer factors than you have variables - so if you have 23 variables,
extracting 18, or so, factors should be OK.)

To create this model we execute one of these commands: 

pc1 <- principal(raqData, nfactors = 23, rotate = "none")
pc1 <- principal(raqMatrix, nfactors = 23, rotate = "none")

The first command creates the model from the raw data and the second from the correlation
matrix: both methods will give you identical results, but we will show both throughout.
These commands create a model called pc1, which extracts 23 factors - the same as



A cure for lazy-itis 


Sometimes, I’m too lazy to count the variables in my data set, in which case I can ask R to count them for me,
using the length() function, which counts the number of items in an object. Therefore, we can obtain the number
of variables in a dataframe using:

length(dataFrame)

Similarly, we can apply this function to one column of a matrix to find out how many rows the matrix has
(which, for a correlation matrix, is the number of variables):

length(matrix[,1])

Therefore, we can use these commands within the principal() function to automatically specify the number of
factors as the number of variables in the dataframe/matrix by executing:

pc2 <- principal(raqData, nfactors=length(raqData), rotate="none")

pc2 <- principal(raqMatrix, nfactors=length(raqMatrix[,1]), rotate="none")
the number of variables. If you have a large data set or are just too lazy to remember how
many variables you have then you can change the command slightly to get R to calculate 
the number of variables in the dataframe or correlation matrix automatically (see R’s Souls’ 
Tip 17.2). A final thing to note is that we have set the rotation method to “none”, which 
means that we won’t carry out factor rotation because we don’t need to at this stage. 
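
To see why length() has to be used differently for a dataframe and a matrix, here is a small sketch with made-up data (toyData and toyMatrix are hypothetical names, not objects from the chapter). It also shows ncol(), a base R function that gives the number of columns for either kind of object and so could be used instead.

```r
# A tiny dataframe with three variables, and its correlation matrix
toyData <- data.frame(q1 = c(1, 2, 3, 4),
                      q2 = c(2, 1, 4, 3),
                      q3 = c(4, 3, 1, 2))
toyMatrix <- cor(toyData)

length(toyData)         # a dataframe is a list of columns: counts variables
length(toyMatrix)       # a matrix is counted cell by cell: rows x columns
length(toyMatrix[, 1])  # items in the first column: the number of rows
ncol(toyData)           # number of columns of the dataframe
ncol(toyMatrix)         # number of columns of the matrix
```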

We can look at the results of the principal components analysis by executing its name: 

pc1

Output 17.4 shows the results of the first principal components model. The first part of 
this is the unrotated loadings. Currently these are not interesting, but they represent the 
loading from each factor or component to each variable. 


Principal Components Analysis 

Call: principal(r = raq, nfactors = 23, rotate = "none") 
Standardized loadings based upon correlation matrix 
      PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8
...                          -0.34  0.22  0.44 -0.03
Q11  0.65  0.25 -0.21 -0.40  0.13  0.18 -0.01  0.03
Q12  0.67 -0.05  0.05  0.25  0.04 -0.08 -0.14  0.08
Q13  0.67  0.08  0.28 -0.01  0.13  0.03 -0.21  0.05
Q14  0.66  0.02  0.20  0.14  0.08 -0.03 -0.10 -0.06
Q15  0.59  0.01  0.12 -0.11 -0.07  0.29  0.32 -0.12
Q16  0.68  0.01 -0.14  0.08 -0.32  0.00  0.12 -0.14
Q17  0.64  0.33 -0.21 -0.34  0.10  0.05 -0.02  0.03
Q18  0.70  0.03  0.30  0.13  0.15 -0.09 -0.10  0.06
Q19 -0.43  0.39  0.10 -0.01 -0.15  0.07  0.05  0.68
Q20  0.44 -0.21 -0.40  0.30  0.33 -0.01  0.34  0.03
Q21  0.66 -0.06 -0.19  0.28  0.24 -0.15  0.18  0.10
Q22 -0.30  0.47 -0.12  0.38  0.07  0.12  0.31  0.12
Q23 -0.14  0.37 -0.02  0.51  0.02  0.62 -0.28 -0.22



     PC17  PC18  PC19  PC20  PC21  PC22  PC23  h2        u2
Q01 -0.05 -0.17  0.16 -0.01 -0.21  0.05  0.01   1   0.0e+00
Q02 -0.08  0.00  0.01 -0.02 -0.02  0.03  0.02   1  -3.1e-15
Q03  0.43  0.08  0.09  0.05  0.01  0.00  0.05   1  -1.6e-15
Q04  0.19  0.05 -0.21  0.04  0.09 -0.02  0.02   1  -1.1e-15
Q05 -0.04  0.01 -0.04  0.00 -0.02  0.02  0.01   1  -2.0e-15
Q06 -0.14  0.05  0.09 -0.07  0.04 -0.32 -0.11   1   0.0e+00
Q07  0.03 -0.15  0.20  0.16  0.14  0.24  0.09   1   1.1e-16
Q08  0.10  0.07  0.12 -0.15  0.06  0.16 -0.36   1  -2.2e-16
Q09 -0.19 -0.02 -0.08 -0.03  0.04 -0.01  0.03   1  -4.4e-16
Q10  0.07 -0.01  0.00  0.04 -0.03  0.02 -0.04   1  -4.4e-16
Q11 -0.05  0.07  0.07 -0.18  0.06  0.00  0.41   1  -8.9e-16
Q12 -0.08  0.04  0.36  0.00 -0.04 -0.10 -0.02   1  -2.2e-16
Q13 -0.06 -0.32 -0.30 -0.06  0.16  0.08 -0.05   1   0.0e+00
Q14  0.34 -0.09  0.06  0.02  0.03 -0.01  0.05   1  -4.4e-16
Q15 -0.12 -0.10 -0.04 -0.07 -0.19  0.10  0.00   1  -4.4e-16
Q16 -0.03  0.22 -0.02 -0.04  0.35 -0.12 -0.01   1  -2.0e-15
Q17  0.04   [?] -0.10  0.42 -0.15 -0.23   [?]   1  -4.4e-16
Q18 -0.06  0.45 -0.15  0.08 -0.18  0.23  0.01   1  -8.9e-16
Q19 -0.06  0.01  0.05 -0.02  0.02  0.04 -0.02   1  -6.7e-16
Q20 -0.09  0.00  0.04  0.18  0.10  0.06 -0.04   1  -8.9e-16
Q21  0.20 -0.03 -0.11 -0.31 -0.20 -0.13 -0.01   1  -2.0e-15
Q22  0.04 -0.06  0.02  0.00  0.01 -0.01  0.01   1   0.0e+00
Q23 -0.03  0.05 -0.03  0.01 -0.01 -0.02  0.00   1   0.0e+00



                 PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8   PC9
SS loadings     7.29  1.74  1.32  1.23  0.99  0.90  0.81  0.78  0.75
Proportion Var  0.32  0.08  0.06  0.05  0.04  0.04  0.04  0.03  0.03
Cumulative Var  0.32  0.39  0.45  0.50  0.55  0.59  0.62  0.65  0.69

                PC10  PC11  PC12  PC13  PC14  PC15  PC16  PC17  PC18
SS loadings     0.72  0.68  0.67  0.61  0.58  0.55  0.52  0.51  0.46
Proportion Var  0.03  0.03  0.03  0.03  0.03  0.02  0.02  0.02  0.02
Cumulative Var  0.72  0.75  0.78  0.80  0.83  0.85  0.88  0.90  0.92

                PC19  PC20  PC21  PC22  PC23
SS loadings     0.42  0.41  0.38  0.36  0.33
Proportion Var  0.02  0.02  0.02  0.02  0.01
Cumulative Var  0.94  0.95  0.97  0.99  1.00






Test of the hypothesis that 23 factors are sufficient. 


The degrees of freedom for the null model are 253 and the objective 
function was 7.55 

The degrees of freedom for the model are -23 and the objective 
function was 0 

The number of observations was 2571 with Chi Square = 0 with prob < NA 

Fit based upon off diagonal values = 1 

Output 17.4 

On the far right of the factor loading matrix are two columns, labelled h2 and u2. h2 is
the communalities (which are sometimes called h²). These communalities are all equal to 1
because we have extracted 23 components, the same as the number of variables: we’ve explained
all of the variance in every variable. When we extract fewer factors (or components) we’ll
have lower communalities. Next to the communality column is the uniqueness column,
labelled u2. This is the amount of unique variance for each variable, and it’s 1 minus the
communality; because all of the communalities are 1, all of the uniquenesses are 0.⁸

The next thing to look at after the factor loading matrix is the eigenvalues. The eigenvalues
associated with each factor represent the variance explained by that particular linear
component. R calls these SS loadings (sums of squared loadings), because they are the sum
of the squared loadings. (You can also find them in a variable associated with the model
called values, so in our case we could access this variable using pc1$values).
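
The link between eigenvalues, sums of squared loadings and communalities can be checked directly in base R. The following is a minimal sketch using simulated data and the eigen() function rather than principal(); the data and object names are made up, and it assumes the usual definition of unrotated component loadings as eigenvectors scaled by the square root of their eigenvalues.

```r
# Simulated toy data: two related variables and two unrelated ones
set.seed(1)
n <- 200
f <- rnorm(n)
toy <- cbind(v1 = f + rnorm(n), v2 = f + rnorm(n),
             v3 = rnorm(n), v4 = rnorm(n))

# Eigen decomposition of the correlation matrix
e <- eigen(cor(toy))

# Unrotated loadings: each eigenvector scaled by the square root of its eigenvalue
loadings <- e$vectors %*% diag(sqrt(e$values))

ss.loadings <- colSums(loadings^2)  # sum of squared loadings per component
h2 <- rowSums(loadings^2)           # communality per variable

rbind(eigenvalue = e$values, ss.loadings = ss.loadings)
h2
```

Because all four components are kept, the sums of squared loadings reproduce the eigenvalues and every communality is 1 (give or take rounding error).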

R also displays the eigenvalues in terms of the proportion of variance explained. Factor 1
explains 7.29 units of variance out of a possible 23 (the number of factors) so as a proportion
this is 7.29/23 = 0.32; this is the value that R reports. We can convert these proportions
to percentages by multiplying by 100; so, factor 1 explains 32% of the total variance.
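
The same arithmetic can be done for several components at once. This short sketch types in the first four (rounded) eigenvalues from Output 17.4 by hand; the object names are made up for the example.

```r
# First four eigenvalues (SS loadings) as printed in Output 17.4
ss <- c(PC1 = 7.29, PC2 = 1.74, PC3 = 1.32, PC4 = 1.23)
n.vars <- 23  # total variance available: one unit per variable

prop.var <- ss / n.vars      # proportion of variance for each component
cum.var <- cumsum(prop.var)  # running total across components

round(prop.var, 2)
round(cum.var, 2)
round(100 * prop.var)        # the same proportions as percentages
```

The rounded results should match the Proportion Var and Cumulative Var rows of the output for these four components.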


⁸ Some of them are very, very slightly different from zero; for example, question 2 has a uniqueness that is
reported as -3.1e-15, which means -.0000000000000031. This is caused by a rounding error (because R stores
numbers to only about 15 significant digits).

It should be clear that the first few factors explain relatively large amounts of variance
(especially factor 1) whereas subsequent factors explain only small amounts of variance. 

The eigenvalues show us that four components (or factors) have eigenvalues greater than
1, suggesting that we extract four components if we use Kaiser’s criterion. By Jolliffe’s
criterion (retain factors with eigenvalues greater than 0.7) we should retain 10 factors, but
there is little to recommend this criterion over Kaiser’s. We should also consider the scree
plot. As mentioned above, the eigenvalues are stored in a variable called pc1$values, and
we can draw a quick scree plot using the plot() function, by executing:

plot(pc1$values, type = "b")

This command simply plots the eigenvalues (y) against the factor number (x). By default,
the plot() function will plot points (type = "p"). We want to see a line so that we can look at
the trend (we could ask for this by specifying type = "l"), but ideally we want to look at both
a line and points on the same graph, which is why we specify type = "b".
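
Kaiser’s and Jolliffe’s criteria, described before the scree plot, are easy to count rather than read off by eye. The sketch below types in the rounded eigenvalues from Output 17.4 (with the real model you could use pc1$values instead); the object names are made up for the example.

```r
# Eigenvalues (SS loadings) for all 23 components, as printed in Output 17.4
ev <- c(7.29, 1.74, 1.32, 1.23, 0.99, 0.90, 0.81, 0.78, 0.75, 0.72,
        0.68, 0.67, 0.61, 0.58, 0.55, 0.52, 0.51, 0.46, 0.42, 0.41,
        0.38, 0.36, 0.33)

n.kaiser <- sum(ev > 1)      # Kaiser's criterion: eigenvalues greater than 1
n.jolliffe <- sum(ev > 0.7)  # Jolliffe's criterion: eigenvalues greater than 0.7

n.kaiser
n.jolliffe
```

Comparing a vector with a number gives TRUE or FALSE for each eigenvalue, and sum() counts the TRUEs.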




FIGURE 17.7 Scree plot from principal components analysis of RAQ data. The second plot shows the point of inflexion at the fourth component.


Figure 17.7 shows the scree plot; I show it once as R produces it and then again with
lines showing a plateau and (what I consider to be) the point of inflexion. This curve is difficult
to interpret because it begins to tail off after three factors, but there is another drop
after four factors before a stable plateau is reached. Therefore, we could probably justify
retaining either two or four factors. Given the large sample, it is probably safe to assume
Kaiser’s criterion. The evidence from the scree plot and from the eigenvalues suggests a
four-component solution may be the best.

Now that we know how many components we want to extract, we can rerun the analysis,
specifying that number. To do this, we use an identical command to the previous
model but we change nfactors = 23 to be nfactors = 4 because we now want only four factors.
(We should also change the name of the resulting model so that we don’t overwrite
the previous one):

pc2 <- principal(raqData, nfactors = 4, rotate = "none") 
pc2 <- principal(raqMatrix, nfactors = 4, rotate = "none") 

Again, the first command is to run the analysis from the raw data and the second is if you’re 
using the correlation matrix. In both cases the commands create a model called pc2 that is 
the same as before except that we’ve extracted only 4 factors (not 23). We can look at this 
model by executing its name: 

pc2 

Output 17.5 shows the second principal components model. Again, the output contains
the unrotated factor loadings, but only for the first four factors. Notice that these
are unchanged from the previous factor loading matrix. Also notice that the eigenvalues
(SS loadings), proportions of variance explained and cumulative proportion of variance
explained are also unchanged (except now there are only four of them, because we only
have four components). However, the communalities (the h2 column) and uniquenesses
(the u2 column) are changed. Remember that the communality is the proportion of common
variance within a variable (see section 17.3.4). Principal components analysis works
on the initial assumption that all variance is common; therefore, before extraction the
communalities are all 1. In effect, all of the variance associated with a variable is assumed
to be common variance. Once factors have been extracted, we have a better idea of how
much variance is, in reality, common. The communalities in the output reflect this common
variance. So, for example, we can say that 43% of the variance associated with question
1 is common, or shared, variance. Another way to look at these communalities is in terms
of the proportion of variance explained by the underlying factors. Before extraction, there
were as many factors as there are variables, so all variance is explained by the factors and
communalities are all 1. However, after extraction some of the factors are discarded and
so some information is lost. The retained factors cannot explain all of the variance present
in the data, but they can explain some. The amount of variance in each variable that can
be explained by the retained factors is represented by the communalities after extraction.
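
To make the idea concrete, a communality after extraction is just the sum of a variable’s squared loadings on the retained components, and the uniqueness is whatever is left over. The sketch below uses the loadings for questions 6 and 8 as they are printed in Output 17.5; because those loadings are rounded to two decimal places the answers are approximate, and the object names are made up for the example.

```r
# Loadings on the four retained components (rounded values from Output 17.5)
L <- rbind(Q06 = c(0.56, 0.10,  0.57, -0.05),
           Q08 = c(0.55, 0.40, -0.32, -0.42))
colnames(L) <- c("PC1", "PC2", "PC3", "PC4")

h2 <- rowSums(L^2)  # communality: variance explained by the retained components
u2 <- 1 - h2        # uniqueness: variance left unexplained

round(cbind(h2, u2), 2)
```

The values should be close to the h2 and u2 columns that R reports for these two questions.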

Now that we have the communalities, we can go back to Kaiser’s criterion to see whether 
we still think that four factors should have been extracted. In section 17.3.8 we saw that 
Kaiser’s criterion is accurate when there are fewer than 30 variables and communalities 
after extraction are greater than .7 or when the sample size exceeds 250 and the average 
communality is greater than .6. Of the communalities in Output 17.5, only one exceeds 
.7. The average of these communalities can be found by adding them up and dividing by 
the number of communalities (11.573/23 = .503). So, on both grounds Kaiser’s rule may 
not be accurate. However, in this instance we should consider the huge sample that we 
have, because the research into Kaiser’s criterion gives recommendations for much smaller 
samples. It’s also worth remembering that we have already inspected the scree plot, which 
should be a good guide in a sample as large as ours. However, given the ambiguity in the 
scree plot (there was also a case for retaining only two factors) you might like to rerun the 
analysis specifying that R extract only two factors and compare the results. 
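
The two checks on Kaiser’s criterion can be run on the communalities themselves. This sketch types in the h2 column of Output 17.5 by hand (the object name h2 is made up); because the printed values are rounded to two decimal places, the average differs very slightly from the one quoted above.

```r
# Communalities after extraction (h2 column of Output 17.5), questions 1 to 23
h2 <- c(0.43, 0.41, 0.53, 0.47, 0.34, 0.65, 0.55, 0.74, 0.48, 0.33,
        0.69, 0.51, 0.54, 0.49, 0.38, 0.49, 0.68, 0.60, 0.34, 0.48,
        0.55, 0.46, 0.41)

n.above <- sum(h2 > 0.7)  # how many communalities exceed .7?
mean.h2 <- mean(h2)       # average communality

n.above
round(mean.h2, 2)
```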

Principal Components Analysis 

Call: principal(r = raq, nfactors = 4, rotate = "none") 

Standardized loadings based upon correlation matrix 



      PC1   PC2   PC3   PC4    h2    u2
Q01  0.59  0.18 -0.22  0.12  0.43  0.57
Q02 -0.30  0.55  0.15  0.01  0.41  0.59
Q03 -0.63  0.29  0.21 -0.07  0.53  0.47
Q04  0.63  0.14 -0.15  0.15  0.47  0.53
Q05  0.56  0.10 -0.07  0.14  0.34  0.66
Q06  0.56  0.10  0.57 -0.05  0.65  0.35
Q07  0.69  0.04  0.25  0.10  0.55  0.45
Q08  0.55  0.40 -0.32 -0.42  0.74  0.26
Q09 -0.28  0.63 -0.01  0.10  0.48  0.52
Q10  0.44  0.03  0.36 -0.10  0.33  0.67
Q11  0.65  0.25 -0.21 -0.40  0.69  0.31
Q12  0.67 -0.05  0.05  0.25  0.51  0.49
Q13  0.67  0.08  0.28 -0.01  0.54  0.46
Q14  0.66  0.02  0.20  0.14  0.49  0.51
Q15  0.59  0.01  0.12 -0.11  0.38  0.62
Q16  0.68  0.01 -0.14  0.08  0.49  0.51
Q17  0.64  0.33 -0.21 -0.34  0.68  0.32
Q18  0.70  0.03  0.30  0.13  0.60  0.40
Q19 -0.43  0.39  0.10 -0.01  0.34  0.66
Q20  0.44 -0.21 -0.40  0.30  0.48  0.52
Q21  0.66 -0.06 -0.19  0.28  0.55  0.45
Q22 -0.30  0.47 -0.12  0.38  0.46  0.54
Q23 -0.14  0.37 -0.02  0.51  0.41  0.59

                 PC1   PC2   PC3   PC4
SS loadings     7.29  1.74  1.32  1.23
Proportion Var  0.32  0.08  0.06  0.05
Cumulative Var  0.32  0.39  0.45  0.50

Test of the hypothesis that 4 factors are sufficient. 

The degrees of freedom for the null model are 253 and the objective 
function was 7.55 

The degrees of freedom for the model are 167 and the objective 
function was 1.03 

The number of observations was 2571 with Chi Square = 2634.37 with prob < 0

Fit based upon off diagonal values = 0.96 

Output 17.5 

There’s another thing that we can look at to see if we’ve extracted the correct number
of factors: this is the reproduced correlation matrix and the difference between the reproduced
correlation matrix and the correlation matrix in the data.

The reproduced correlations are obtained with the factor.model() function. The
factor.model() function needs to know the factor loading matrix. The factor loading matrix is
labelled as an object called loadings in the principal components model; therefore we can
access it by specifying pc2$loadings (which translates as ‘the loadings object associated with
the pc2 model’). Therefore, we can get the reproduced correlations by executing:

factor.model(pc2$loadings)

The difference between the reproduced and actual correlation matrices is referred to as 
the residuals, and these are obtained with the factor.residuals() function. You again need 
to provide the factor loading matrix but also the correlation matrix to which you want 
to compare it (in this case the original correlation matrix, raqMatrix). We can, therefore, 
obtain the residuals by executing: 


factor.residuals(raqMatrix, pc2$loadings) 



        Q01    Q02    Q03    Q04    Q05    Q06    Q07    Q08    Q09
Q01   0.435 -0.112 -0.372  0.447  0.376  0.218  0.366  0.412 -0.042
Q02  -0.112  0.414  0.380 -0.134 -0.122 -0.033 -0.148  0.002  0.430
Q03  -0.372  0.380  0.530 -0.399 -0.345 -0.200 -0.373 -0.270  0.352
Q04   0.447 -0.134 -0.399  0.469  0.399  0.278  0.419  0.390 -0.073
Q05   0.376 -0.122 -0.345  0.399  0.343  0.273  0.380  0.312 -0.080
Q06   0.218 -0.033 -0.200  0.278  0.273  0.654  0.528  0.183 -0.108
Q07   0.366 -0.148 -0.373  0.419  0.380  0.528  0.545  0.267 -0.161
Q08   0.412  0.002 -0.270  0.390  0.312  0.183  0.267  0.739  0.055
Q09  -0.042  0.430  0.352 -0.073 -0.080 -0.108 -0.161  0.055  0.484
Q10   0.172 -0.061 -0.181  0.212  0.205  0.461  0.382  0.180 -0.116
Q11   0.423 -0.097 -0.357  0.419  0.348  0.290  0.363  0.691 -0.071
Q12   0.402 -0.219 -0.440  0.448  0.397  0.388  0.495  0.228 -0.195
Q13   0.347 -0.122 -0.342  0.395  0.360  0.545  0.533  0.313 -0.147
Q14   0.362 -0.155 -0.373  0.411  0.370  0.477  0.514  0.249 -0.159
Q15   0.311 -0.158 -0.337  0.343  0.306  0.406  0.425  0.339 -0.174
Q16   0.440 -0.217 -0.458  0.466  0.400  0.300  0.439  0.390 -0.175
Q17   0.439 -0.048 -0.331  0.434  0.359  0.290  0.365  0.695 -0.009
Q18   0.368 -0.149 -0.376  0.424  0.388  0.562  0.570  0.250 -0.168
Q19  -0.204  0.357  0.403 -0.231 -0.207 -0.147 -0.254 -0.104  0.363
Q20   0.342 -0.301 -0.440  0.353  0.292 -0.021  0.219  0.164 -0.218
Q21   0.449 -0.254 -0.488  0.480  0.412  0.244  0.430  0.282 -0.191
Q22  -0.025  0.333  0.275 -0.050 -0.060 -0.209 -0.179 -0.099  0.417
Q23   0.045  0.246  0.158  0.042  0.028 -0.082 -0.037 -0.136  0.323

Output 17.6 

Output 17.6 shows an edited version of the reproduced correlation matrix that was
requested using the factor.model() function in the first table. The diagonal of this matrix contains
the communalities after extraction for each variable (you can check the values against
Output 17.5). Output 17.7 contains an extract from the matrix of residuals: the difference
between the fitted model and the real data. The diagonal of this matrix is the uniquenesses.



        Q01    Q02    Q03    Q04    Q05    Q06    Q07    Q08    Q09
Q01   0.565  0.013  0.035 -0.011  0.027 -0.001 -0.061 -0.081 -0.050
Q02   0.013  0.586 -0.062  0.022  0.003 -0.041 -0.011 -0.052 -0.115
Q03   0.035 -0.062  0.470  0.019  0.035 -0.027 -0.009  0.011 -0.052
Q04  -0.011  0.022  0.019  0.531  0.002  0.000 -0.010 -0.041 -0.051
Q05   0.027  0.003  0.035  0.002  0.657 -0.016 -0.041 -0.044 -0.016
Q06  -0.001 -0.041 -0.027  0.000 -0.016  0.346 -0.014  0.040 -0.005
Q07  -0.061 -0.011 -0.009 -0.010 -0.041 -0.014  0.455  0.030  0.033
Q08  -0.081 -0.052  0.011 -0.041 -0.044  0.040  0.030  0.261 -0.039
Q09  -0.050 -0.115 -0.052 -0.051 -0.016 -0.005  0.033 -0.039  0.516
Q10   0.042 -0.023 -0.013  0.003  0.053 -0.139 -0.098 -0.021 -0.018
Q11  -0.066 -0.046  0.006 -0.051 -0.050  0.038 -0.018 -0.061 -0.045
Q12  -0.057  0.024  0.030 -0.006 -0.050 -0.076 -0.072  0.024  0.027
Q13   0.008 -0.021  0.024 -0.051 -0.058 -0.078 -0.091  0.001 -0.021
Q14  -0.024 -0.009  0.002 -0.060 -0.055 -0.075 -0.074  0.032  0.038
Q15  -0.065 -0.007  0.025 -0.009 -0.045 -0.047 -0.033 -0.039 -0.012
Q16   0.059  0.050  0.039 -0.050 -0.005 -0.056 -0.051 -0.068 -0.014
Q17  -0.069 -0.039  0.003 -0.052 -0.049 -0.008  0.025 -0.105 -0.027
Q18  -0.020 -0.015  0.001 -0.042 -0.066 -0.048 -0.069  0.030  0.018
Q19   0.015 -0.153 -0.061  0.045  0.041 -0.020 -0.015 -0.056 -0.114
Q20  -0.128  0.099  0.115 -0.110 -0.092  0.122  0.002  0.011  0.060
Q21  -0.120  0.049  0.071 -0.070 -0.078  0.029  0.053  0.014  0.055
Q22  -0.079 -0.102 -0.071 -0.049 -0.072  0.043  0.010  0.020 -0.161
Q23  -0.049 -0.147 -0.008 -0.076 -0.070  0.013 -0.033  0.086 -0.152


Output 17.7 
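
If you want to see where the reproduced correlations come from, they can be rebuilt by multiplying the loading matrix by its own transpose. The sketch below does this in base R for questions 1 to 4 only, using the rounded loadings printed in Output 17.5, so the results match Output 17.6 only approximately. It is an illustration of the idea and not the code inside factor.model(); the object names are made up.

```r
# Loadings of questions 1-4 on the four retained components (Output 17.5)
L <- rbind(Q01 = c( 0.59, 0.18, -0.22,  0.12),
           Q02 = c(-0.30, 0.55,  0.15,  0.01),
           Q03 = c(-0.63, 0.29,  0.21, -0.07),
           Q04 = c( 0.63, 0.14, -0.15,  0.15))

# Reproduced correlations: loadings multiplied by their transpose
reproduced <- L %*% t(L)
round(reproduced, 3)

# The corresponding corner of Output 17.6, typed in for comparison
book <- matrix(c( 0.435, -0.112, -0.372,  0.447,
                 -0.112,  0.414,  0.380, -0.134,
                 -0.372,  0.380,  0.530, -0.399,
                  0.447, -0.134, -0.399,  0.469),
               nrow = 4, byrow = TRUE)

max(abs(reproduced - book))  # largest discrepancy, caused by rounded loadings
```

The diagonal of the reproduced matrix is the sum of squared loadings for each question, which is why it contains the communalities.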

The correlations in the reproduced matrix differ from those in the R-matrix because they
stem from the model rather than the observed data. If the model were a perfect fit to the
data then we would expect the reproduced correlation coefficients to be the same as the
original correlation coefficients. Therefore, to assess the fit of the model we can look at
the differences between the observed correlations and the correlations based on the model.
For example, if we take the correlation between questions 1 and 2, the correlation based on
the observed data is -.099 (taken from Output 17.1). The correlation based on the model
is -.112, which is slightly larger in magnitude. We can calculate the difference as follows:


$$\text{residual} = r_{\text{observed}} - r_{\text{from model}}$$

$$\text{residual}_{Q_1 Q_2} = (-0.099) - (-0.112) = 0.013$$

You should notice that this difference is the value quoted in Output 17.7 for questions 1
and 2. Therefore, Output 17.7 contains the differences between the observed correlation 
coefficients and the ones predicted from the model. For a good model these values will 
all be small. There are several ways we can define how small we want the residuals to be. 

One approach is to see how large the residuals are, compared to the original correlations. 
The very worst the model could be (if we extracted no factors at all) would be the size of the 
correlations in the original data. Thus one approach is to compare the size of the residuals 
with the size of the correlations. If the correlations were small to start with, we’d expect 
very small residuals. If the correlations were large to start with, we wouldn’t mind if the 
residuals were relatively larger. So one measure of the residuals is to compare the residuals 
with the original correlations - because residuals are positive and negative, they should be 
squared before doing that. A measure of the fit of the model is therefore the sum of the 
squared residuals divided by the sum of the squared correlations. As this is considered a 
measure of fit and sometimes people like measures of fit to go from 0 to 1, we subtract 
the value from 1. This statistic is given at the bottom of the main output (Output 17.5) as: 

Fit based upon off diagonal values = 0.96 

Values over 0.95 are often considered indicators of good fit, and as our value is 0.96, this 
indicates that four factors are sufficient. 
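
The fit statistic described above can be sketched by hand. The example below uses only the corner of Outputs 17.6 and 17.7 that covers questions 1 to 4, and recovers the observed correlations by adding the residuals back onto the reproduced correlations. Because it uses 6 correlations rather than all 253, the value it gives is not the 0.96 reported for the full model; the object names are made up for the example.

```r
# Reproduced correlations for questions 1-4 (corner of Output 17.6)
reproduced <- matrix(c( 0.435, -0.112, -0.372,  0.447,
                       -0.112,  0.414,  0.380, -0.134,
                       -0.372,  0.380,  0.530, -0.399,
                        0.447, -0.134, -0.399,  0.469),
                     nrow = 4, byrow = TRUE)

# Residuals for the same questions (corner of Output 17.7)
resid.mat <- matrix(c( 0.565,  0.013,  0.035, -0.011,
                       0.013,  0.586, -0.062,  0.022,
                       0.035, -0.062,  0.470,  0.019,
                      -0.011,  0.022,  0.019,  0.531),
                    nrow = 4, byrow = TRUE)

# Observed correlation = reproduced correlation + residual (diagonal becomes 1)
observed <- reproduced + resid.mat

# Use only the values above the diagonal
off <- upper.tri(observed)
fit.off <- 1 - sum(resid.mat[off]^2) / sum(observed[off]^2)
fit.off
```

The value for the first pair, observed[1, 2], comes out as -0.099, the observed correlation between questions 1 and 2 quoted earlier.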

There are many other ways of looking at residuals, which we’ll now explore. We couldn’t find
an R function to do these other things, but we will write one as we go along.⁹ A simple approach
to residuals is just to say that we want the residuals to be small. In fact, we want most values to
be less than 0.05. We can work out how many residuals are large by this criterion fairly easily
in R. First, we need to extract the residuals into a new object. We need to do this because at the
moment the matrix of residuals is symmetrical (so the residuals are repeated above and below
the diagonal of the matrix), and also the diagonal of the matrix does not contain residuals. To
begin, let’s create an object called residuals that contains the factor residuals by executing:

residuals<-factor.residuals(raqMatrix, pc2$loadings) 

We can then extract the upper triangle of this matrix using the upper.tri() function. This has 
the effect of extracting only the elements above the diagonal (so we discard the diagonal 
elements and the elements below the diagonal): 

residuals<-as.matrix(residuals[upper.tri(residuals)]) 

This command re-creates the object residuals by using only the upper triangle of the original
matrix. The as.matrix() function just makes sure that the residuals are stored as a matrix
(they’re actually stored as a single column of data). We now have an object called residuals
that contains the residuals stored in a column. This is handy because it makes it easy to
calculate various things. For example, if we want to know how many large residuals there
are (i.e., residuals with absolute values greater than 0.05) then we can execute:

large.resid<-abs(residuals) > 0.05 

which uses the abs() function to first compute the absolute value of the column of residuals
(this is so we ignore whether the residual is positive or negative). The > 0.05 in the command
means that large.resid will be TRUE (or 1) if the residual is greater than 0.05, and
FALSE (or 0) if the residual is less than or equal to 0.05. We end up with a column the same
length as the column of residuals but containing values of TRUE (if the residual is
large) or FALSE (if it is small). We can then use the sum() function to add up the number
of TRUE responses in the matrix:

sum(large.resid) 

9 R has over 3000 packages. For relatively simple things, it’s often easier to write a small function yourself than 
try to find whether a function already exists. Or, you can find a friend that can write a function for you. We will 
show you how, because we’re your friends. 

FIGURE 17.8 Histogram of the model residuals


The result is 91. If we want to know this as a proportion of the total number of residuals 
we can simply execute: 

sum(large.resid)/nrow(residuals) 

Executing this command will return the number of large residuals (sum(large.resid)) divided 
by the total number of residuals: nrow() tells us how many items (i.e., residuals) there are 
in total. This will return a value of 0.3596, or 36%. There are no hard and fast rules about 
what proportion of residuals should be below 0.05; however, if more than 50% are greater 
than 0.05 you probably have grounds for concern. For our data, we have 36% so we need 
not worry. 
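
To see these steps in miniature, here is a minimal sketch using a small made-up residual matrix. The object names (toyResid, upper, large, nLarge, propLarge) and all the values are invented for this illustration; they are not part of the RAQ analysis. It extracts the upper triangle, flags absolute residuals above 0.05, and then computes the count and the proportion:

```r
# Toy symmetric "residual" matrix (values invented for illustration only)
toyResid <- matrix(c( 0.00,  0.02, -0.08,  0.01,
                      0.02,  0.00,  0.06, -0.03,
                     -0.08,  0.06,  0.00,  0.04,
                      0.01, -0.03,  0.04,  0.00),
                   nrow = 4, byrow = TRUE)

# Keep only the elements above the diagonal, stored as a one-column matrix
upper <- as.matrix(toyResid[upper.tri(toyResid)])

# TRUE where the absolute residual exceeds 0.05, FALSE otherwise
large <- abs(upper) > 0.05

nLarge    <- sum(large)            # TRUE counts as 1, so this is a count
propLarge <- nLarge / nrow(upper)  # proportion of residuals that are large

nLarge
propLarge
```

A 4 x 4 matrix has six elements above the diagonal, so propLarge here is a proportion out of six; with the 23 RAQ items there are 253 such elements.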

Another way to look at the residuals is to look at their mean. Rather than looking at the 
mean, we should square the residuals, find the mean, and then find the square root. This 
is the root-mean-square residual. Again, this is easy to calculate from our residuals object. 
We can execute: 

sqrt(mean(residuals ^ 2)) 

This command squares each item in the residuals object (residuals ^ 2), then uses the mean() 
function to compute the mean of these squared residuals. The sqrt() function is then used 
to compute the square root of that mean. The resulting value is 0.055; that’s our root-mean-square 
residual. A little lower would have been nice, but this is not dreadful. If this were much 
higher (say 0.08) we might want to consider extracting more factors. 
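
Why square before averaging? A short sketch with invented numbers (toyResiduals is a hypothetical vector, not our data) shows the problem with the plain mean: positive and negative residuals cancel out, whereas the root-mean-square residual reflects their typical size.

```r
# Four made-up residuals that are symmetric around zero
toyResiduals <- c(0.06, -0.06, 0.03, -0.03)

# The plain mean is zero because positives and negatives cancel
plainMean <- mean(toyResiduals)

# Square, average, then take the square root: the root-mean-square residual
rmsr <- sqrt(mean(toyResiduals ^ 2))

plainMean
rmsr
```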

Finally, it’s worth looking at the distributions of the residuals - we expect the residuals 
to be approximately normally distributed - if there are any serious outliers, even if the 
other values are all good, we should probably look further into that. We can again use our 
residuals object to plot a quick histogram using the hist() function: 

hist(residuals) 

Figure 17.8 shows the histogram of the residuals. They do seem approximately normal and 
there are no outliers. We could wrap these commands up in a nice function called 
residual.stats() so that we can use it again in other factor analyses (R’s Souls’ Tip 17.3). 


R’s Souls’ Tip 17.3: Creating a residual.stats() function


We saw (in R’s Souls’ Tip 6.2) that you can write your own functions in R. If we wanted to wrap all of the factor 
analysis residual commands into a function we can do this fairly easily by executing: 


residual.stats<-function(matrix){
    residuals<-as.matrix(matrix[upper.tri(matrix)])
    large.resid<-abs(residuals) > 0.05
    numberLargeResids<-sum(large.resid)
    propLargeResid<-numberLargeResids/nrow(residuals)
    rmsr<-sqrt(mean(residuals ^ 2))

    cat("Root means squared residual = ", rmsr, "\n")
    cat("Number of absolute residuals > 0.05 = ", numberLargeResids, "\n")
    cat("Proportion of absolute residuals > 0.05 = ", propLargeResid, "\n")
    hist(residuals)
}


The first line creates the function by naming it residual.stats and telling it to expect a matrix as input. The commands
within { } are explained within the main text: they extract the residuals from the matrix entered into the 
function, compute the number (numberLargeResids) and proportion (propLargeResid) of absolute values greater 
than 0.05, compute the root mean squared residual (rmsr), and plot a histogram. The commands using the cat() 
function simply specify the text and values to appear in the output. 

Having executed the function, we could use it on our residual matrix in one of two ways. First, we could calculate
the residual matrix using the factor.residuals() function, and label the resulting matrix resids. Then pop this 
matrix into the residual.stats() function: 

resids <- factor.residuals(raqMatrix, pc2$loadings) 
residual.stats(resids) 

The second way is to combine these steps and calculate the residuals matrix directly inside the residual.stats() 
function: 

residual.stats(factor.residuals(raqMatrix, pc2$loadings)) 

The output would be as follows (and the histogram in Figure 17.8): 

Root means squared residual = 0.05549286 

Number of absolute residuals > 0.05 = 91 

Proportion of absolute residuals > 0.05 = 0.3596838 



CRAMMING SAM’S TIPS 


Factor extraction 


• To decide how many factors to extract, look at the eigenvalues and the scree plot. 

• If you have fewer than 30 variables then using eigenvalues greater than 1 is OK (Kaiser's criterion) as long as your communalities
are all over .7. Likewise, if your sample size exceeds 250 and the average of the communalities is .6 or greater then 
this is also fine. Alternatively, with 200 or more participants the scree plot can be used. 

• Check the residuals and make sure that fewer than 50% have absolute values greater than 0.05, and that the model fit is 
greater than 0.90. 
17.6.4. Rotation


We have already seen that the interpretability of factors can be improved through rotation. 
Rotation maximizes the loading of each variable on one of the extracted factors while minimizing
the loading on all other factors. This process makes it much clearer which variables 
relate to which factors. Rotation works through changing the absolute values of the variables 
while keeping their differential values constant. I’ve discussed the various rotation options in 
section 17.3.9.1, but, to summarize, the exact choice of rotation will depend on whether or 
not you think that the underlying factors should be related. If there are theoretical grounds 
to think that the factors are independent (unrelated) then you should choose one of the 
orthogonal rotations (I recommend varimax). However, if theory suggests that your factors 
might correlate then one of the oblique rotations (oblimin or promax) should be selected. 


17.6.4.1. Orthogonal rotation (varimax)

To carry out a varimax rotation, we change the rotate option in the principal() function 
from “none” to “varimax” (we could also exclude it altogether because varimax is the 
default if the option is not specified): 

pc3 <- principal(raqData, nfactors = 4, rotate = "varimax") 
pc3 <- principal(raqMatrix, nfactors = 4, rotate = "varimax") 

The first command is to run the analysis from the raw data and the second is if you’re using 
the correlation matrix. In both cases the commands create a model called pc3 that is the 
same as the previous model (pc2) except that we have used varimax rotation on the model. 
We can look at this model by executing its name: 

pc3 

Output 17.8 shows the first part of the rotated component matrix (also called the 
rotated factor matrix), which is a matrix of the factor loadings for each variable on each 
factor. This matrix contains the same information as the component matrix in Output 
17.5, except that it is calculated after rotation. Notice that the loadings have changed, but 
the h2 (communality) and u2 (uniqueness) columns have not. Rotation changes factors to 
distribute the variance differently, but it cannot account for more or less variance in the 
variables than it could before rotation. Also notice that the eigenvalues (SS loadings) have 
changed. One of the aims of rotation is to even up the eigenvalues; however, the sum of the 
eigenvalues (and the proportion of variance accounted for) cannot change during rotation. 
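
The claim that rotation redistributes variance without changing the communalities or the total can be checked with a small sketch. The loadings below are hypothetical (six made-up variables on two components, not the RAQ items), and the sketch uses the varimax() function from base R's stats package rather than principal(), so treat it as an illustration of the principle rather than a reproduction of our analysis:

```r
# Hypothetical unrotated loadings: six variables on two components
unrotated <- matrix(c(0.70,  0.40,
                      0.65,  0.45,
                      0.60,  0.50,
                      0.60, -0.45,
                      0.65, -0.50,
                      0.70, -0.40),
                    ncol = 2, byrow = TRUE)

# varimax() is part of the stats package that ships with R
rot     <- varimax(unrotated)
rotated <- unclass(rot$loadings)   # strip the "loadings" class to get a plain matrix

# Communalities (h2): sum of squared loadings across components, per variable
h2Before <- rowSums(unrotated ^ 2)
h2After  <- rowSums(rotated ^ 2)

# SS loadings: sum of squared loadings down each component
ssBefore <- colSums(unrotated ^ 2)
ssAfter  <- colSums(rotated ^ 2)

all.equal(h2Before, h2After)             # communalities are unchanged
all.equal(sum(ssBefore), sum(ssAfter))   # total variance explained is unchanged
rbind(ssBefore, ssAfter)                 # but its split across components can differ
```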

Interpreting the factor loading matrix is a little complex, and we can make it easier by 
using the print.psych() function. This does two things: first, it removes loadings that are 
below a certain value that we specify (by using the cut option); and second, it reorders 
the items to try to put them into their factors, which we request using the sort option. 
Generally you should be very careful with the cut-off value - if you think that a loading of 
.4 will be interesting, you should use a lower cut-off (say, .3), because you don’t want to 
miss a loading that was .39. Execute this command: 

print.psych(pc3, cut = 0.3, sort = TRUE) 

This command prints the factor loading matrix associated with the model pc3, but displaying
only loadings above .3 (cut = 0.3) and sorting items by the size of their loadings (sort 
= TRUE). 
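
If you are curious what cutting and sorting amounts to, the following base-R sketch mimics the idea on four rows and two columns taken from Output 17.8. The function cutAndSort() is a hypothetical helper written for this illustration; it is an assumption about the general approach, not the code that print.psych() actually runs, and its ordering rule may differ from the package's.

```r
# A few loadings from Output 17.8 (items Q01, Q06, Q07, Q03; components RC3, RC1)
loadings <- matrix(c( 0.24,  0.50,
                      0.80, -0.01,
                      0.64,  0.33,
                     -0.20, -0.57),
                   ncol = 2, byrow = TRUE,
                   dimnames = list(c("Q01", "Q06", "Q07", "Q03"),
                                   c("RC3", "RC1")))

# Hypothetical helper: group items by their dominant component, order them by
# the size of that loading, and blank out (NA) loadings smaller than the cut-off
cutAndSort <- function(loadings, cut = 0.3) {
    dominant <- apply(abs(loadings), 1, which.max)  # column with the largest |loading|
    biggest  <- apply(abs(loadings), 1, max)        # size of that loading
    ord      <- order(dominant, -biggest)
    out      <- round(loadings[ord, , drop = FALSE], 2)
    out[abs(out) < cut] <- NA                       # suppressed loadings
    out
}

tidy <- cutAndSort(loadings, cut = 0.3)
print(tidy, na.print = "")
```

Note that a loading of .33 survives a cut-off of .3 but would vanish with a cut-off of .4, which is why a slightly lower cut-off than the value you care about is the safer choice.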

Principal Components Analysis 

Call: principal(r = raqData, nfactors = 4, residuals = TRUE, rotate = 
"varimax") 
Standardized loadings based upon correlation matrix 
      RC3   RC1   RC4   RC2   h2   u2
Q01  0.24  0.50  0.36  0.06 0.43 0.57
Q02 -0.01 -0.34  0.07  0.54 0.41 0.59
Q03 -0.20 -0.57 -0.18  0.37 0.53 0.47
Q04  0.32  0.52  0.31  0.04 0.47 0.53
Q05  0.32  0.43  0.24  0.01 0.34 0.66
Q06  0.80 -0.01  0.10 -0.07 0.65 0.35
Q07  0.64  0.33  0.16 -0.08 0.55 0.45

             RC3  RC1  RC4  RC2
SS loadings 3.73 3.34 2.55 1.95

Output 17.8 

The resulting matrix is in Output 17.9. Compare this matrix to the unrotated solution 
(Output 17.5). Before rotation, most variables loaded highly on the first factor and the 
remaining factors didn’t really get a look in. However, the rotation of the factor structure
has clarified things considerably: there are four factors and variables load very highly 
onto only one factor (with the exception of one question). The suppression of loadings less 
than .3 and ordering variables by loading size also make interpretation considerably easier 
(because you don’t have to scan the matrix to identify substantive loadings). 

The next step is to look at the content of questions that load onto the same factor to try 
to identify common themes. If the mathematical factor produced by the analysis represents 
some real-world construct then common themes among highly loading questions can help 
us identify what the construct might be. The questions that load highly on factor 1 are 
Q6 (I have little experience of computers) with the highest loading of .80, Q18 (R always 
crashes when I try to use it), Q13 (I worry I will cause irreparable damage ...), Q7 (All 
computers hate me), Q14 (Computers have minds of their own ...), Q10 (Computers are 
only for games), and Q15 (Computers are out to get me) with the lowest loading of .46. 
All these items seem to relate to using computers or R. Therefore we might label this factor 
fear of computers. 

Looking at factor 2, we have Q20 (I can’t sleep for thoughts of eigenvectors), with a loading 
of .68, Q21 (I wake up under my duvet ...), Q3 (Standard deviations excite me),¹⁰ Q12 
(People try to tell you that R makes statistics easier ...), Q4 (I dream that Pearson is attacking
me), Q16 (I weep openly at the mention of central tendency), Q1 (Statistics makes me 
cry) and Q5 (I don’t understand statistics), with the lowest loading of .43 - this item also 
loads moderately on some of the other factors. The questions that load highly on factor 2 
all seem to relate to different aspects of statistics; therefore, we might label this factor fear 
of statistics. 

10 Note that this variable has a negative loading - this means that a high score on the factor is associated with a 
lower score on this item. 

Principal Components Analysis 

Call: principal(r = raqData, nfactors = 4, rotate = "varimax") 
Standardized loadings based upon correlation matrix 

    item   RC3   RC1   RC4   RC2   h2   u2
Q06    6  0.80                   0.65 0.35
Q18   18  0.68  0.33             0.60 0.40
Q13   13  0.65                   0.54 0.46
Q07    7  0.64  0.33             0.55 0.45
Q14   14  0.58  0.36             0.49 0.51
Q10   10  0.55                   0.33 0.67
Q15   15  0.46                   0.38 0.62
Q20   20        0.68             0.48 0.52
Q21   21        0.66             0.55 0.45
Q03    3       -0.57        0.37 0.53 0.47
Q12   12  0.47  0.52             0.51 0.49
Q04    4  0.32  0.52  0.31       0.47 0.53
Q16   16  0.33  0.51  0.31       0.49 0.51
Q01    1        0.50  0.36       0.43 0.57
Q05    5  0.32  0.43             0.34 0.66
Q08    8              0.83       0.74 0.26
Q17   17              0.75       0.68 0.32
Q11   11              0.75       0.69 0.31
Q09    9                    0.65 0.48 0.52
Q22   22                    0.65 0.46 0.54
Q23   23                    0.59 0.41 0.59
Q02    2       -0.34        0.54 0.41 0.59
Q19   19       -0.37        0.43 0.34 0.66

                RC3  RC1  RC4  RC2
SS loadings    3.73 3.34 2.55 1.95
Proportion Var 0.16 0.15 0.11 0.08
Cumulative Var 0.16 0.31 0.42 0.50


Test of the hypothesis that 4 factors are sufficient. 


The degrees of freedom for the null model are 253 and the 
objective function was 7.55 

The degrees of freedom for the model are 167 and the objective 
function was 1.03 

The number of observations was 2571 with Chi Square = 2634.37 
with prob < 0 

Fit based upon off diagonal values = 0.96 

Output 17.9 

Factor 3 has only three items loading on it: Q8 (I have never been good at mathematics), 
Q17 (I slip into a coma when I see an equation), and Q11 (I did badly at mathematics at 
school). The three questions that load highly on factor 3 all seem to relate to mathematics; 
therefore, we might label this factor fear of mathematics. 

Finally, the questions that load highly on factor 4 are Q9 (My friends are better at statistics
than me), Q22 (My friends are better at R), Q2 (My friends will think I’m stupid) and 
Q19 (Everybody looks at me). All these items contain some component of social evaluation 
from friends; therefore, we might label this factor peer evaluation. 

This analysis seems to reveal that the initial questionnaire, in reality, is composed of four 
subscales: fear of computers, fear of statistics, fear of maths and fear of negative peer evaluation.
There are two possibilities here. The first is that the RAQ failed to measure what it 
set out to (namely, R anxiety) but does measure some related constructs. The second is that 
these four constructs are sub-components of R anxiety; however, the factor analysis does 
not indicate which of these possibilities is true. 


17.6.4.2. Oblique rotation


When we did the orthogonal rotation, we told R that we expected the components that 
we extracted to be uncorrelated. This was a bit of a strange thing to say. All of our factors 
related to fear: fear of computers, fear of statistics, fear of negative peer evaluation and 
fear of mathematics. It’s likely that these will be correlated: people with fear of one thing 
might have fear of other things. If this is the case an oblique rotation is called for. 

The command for an oblique rotation is very similar to that for an orthogonal rotation; 
we just change the rotate option from “varimax” to “oblimin”: 

pc4 <- principal(raqData, nfactors = 4, rotate = "oblimin") 
pc4 <- principal(raqMatrix, nfactors = 4, rotate = "oblimin") 

The first command is to run the analysis from the raw data and the second is if you’re using 
the correlation matrix. In both cases the commands create a model called pc4 that is the 
same as the model pc2 except that we have used oblimin rotation on the model. As with the 
previous model, we can look at the factor loadings from this model in a nice easy-to-digest 
format by executing: 

print.psych(pc4, cut = 0.3, sort = TRUE) 

The output from this analysis is shown in Output 17.10. The same four factors seem to 
have emerged although they are in a different order. Factor 1 seems to represent fear of 
computers, factor 2 represents fear of peer evaluation, factor 3 represents fear of statistics 
and factor 4 represents fear of mathematics. 

Principal Components Analysis 

Call: principal(r = raqData, nfactors = 4, rotate = "oblimin") 
Standardized loadings based upon correlation matrix 

                TC1  TC4  TC3  TC2
SS loadings    3.90 2.88 2.94 1.85
Proportion Var 0.17 0.13 0.13 0.08
Cumulative Var 0.17 0.29 0.42 0.50

With factor correlations of 
      TC1   TC4   TC3   TC2
TC1  1.00  0.44  0.36 -0.18
TC4  0.44  1.00  0.31 -0.10
TC3  0.36  0.31  1.00 -0.17
TC2 -0.18 -0.10 -0.17  1.00

Test of the hypothesis that 4 factors are sufficient. 

The degrees of freedom for the null model are 253 and the 
objective function was 7.55 

The degrees of freedom for the model are 167 and the objective 
function was 1.03 

The number of observations was 2571 with Chi Square = 2634.37 
with prob < 0 

Fit based upon off diagonal values = 0.96 

Output 17.10 

Also in this output you’ll find a correlation matrix between the factors. This matrix 
contains the correlation coefficients between factors - R didn’t bother to show this to us 
when it did an orthogonal rotation, because the correlations were all zero. Factor 2 (TC2) 
has little relationship with any other factors (the correlation coefficients are low), but all 
other factors are interrelated to some degree (notably TC3 with both TC1 and TC4, and 
TC4 with TC1). The fact that these correlations exist tells us that the constructs measured 
can be interrelated. If the constructs were independent then we would expect oblique 
rotation to provide an identical solution to an orthogonal rotation and the component 
correlation matrix should be an identity matrix (i.e., all factors have correlation coefficients
of 0 with one another). Therefore, this final matrix gives us a guide to whether it is reasonable to 
assume independence between factors: for these data it appears that we cannot assume 
independence. Therefore, the results of the orthogonal rotation should not be trusted: the 
obliquely rotated solution is probably more meaningful. 

When an oblique rotation is conducted the factor matrix is split into two matrices: the 
pattern matrix and the structure matrix (see Jane Superbrain Box 17.1). For orthogonal rotation
these matrices are the same. The pattern matrix contains the factor loadings and is 
comparable to the factor matrix that we interpreted for the orthogonal rotation. The structure
matrix takes into account the relationship between factors (in fact it is a product of the 
pattern matrix and the matrix containing the correlation coefficients between factors). Most 
researchers interpret the pattern matrix, because it is usually simpler; however, there are situations
in which values in the pattern matrix are suppressed because of relationships between 
the factors. Therefore, the structure matrix is a useful double-check and Graham et al. (2003) 
recommend reporting both (with some useful examples of why this can be important). 

Getting the structure matrix out of R is a little bit more complex than getting the pattern 
matrix. You need to multiply the factor loading matrix by the correlation matrix of the 
factors. We’ve come across the loadings; these are called pc4$loadings. The correlations of 
the factors are called Phi (Greek letter φ, which rhymes with pie) and so are stored in 
pc4$Phi. Given that we have these two matrices, we can get the structure matrix by multiplying
them; however, this is not a regular multiplication, this is a matrix multiplication, so 
instead of writing * we write %*%. The structure matrix is therefore given by executing: 

pc4$loadings %*% pc4$Phi 
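
To make the matrix multiplication concrete, here is a minimal sketch with invented numbers: three hypothetical items, two hypothetical factors (F1 and F2) that correlate at .5. None of these values come from the RAQ model; the object names pattern, Phi and structureMat are assumptions made for the example.

```r
# Hypothetical pattern matrix: unique contribution of each factor to each item
pattern <- matrix(c(0.80, 0.00,
                    0.10, 0.70,
                    0.40, 0.40),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(c("item1", "item2", "item3"), c("F1", "F2")))

# Hypothetical factor correlation matrix (Phi): the two factors correlate at .5
Phi <- matrix(c(1.0, 0.5,
                0.5, 1.0),
              ncol = 2,
              dimnames = list(c("F1", "F2"), c("F1", "F2")))

# Structure matrix = pattern matrix %*% Phi (matrix multiplication, not *)
structureMat <- pattern %*% Phi
structureMat

# With uncorrelated factors Phi is an identity matrix, so the structure and
# pattern matrices coincide, as they do for an orthogonal rotation
all.equal(pattern %*% diag(2), pattern, check.attributes = FALSE)
```

Notice that item1 has a pattern loading of 0 on F2 but a structure loading of .40, purely because F2 correlates with F1. This is the sense in which the structure matrix mixes in the relationships between factors.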

The kind of people that write R think that this is straightforward, but we realize it’s 
not, especially when you’re starting out. Also, doing this calculation produces a rather 
unfriendly looking structure matrix that isn’t sorted by the size of factor loadings. So, 
we’ve written a function for you, called factor.structure(); you can source it from our DSUR 
package. The function takes this general form: 

factor.structure(pcModel, cut = 0.2, decimals = 2) 

All you need to do is enter the name of the principal components model into the function 
and execute. Just like the print.psych() function we have included an option (cut) so you 
can specify a value below which you don’t want to see the loading (the default is .2), and 
also an option, decimals, that allows you to change the number of decimal places you see 
(the default is 2). For our current model we could execute: 

factor.structure(pc4, cut = 0.3) 

Output 17.11 shows the structure matrix. The picture becomes more complicated in the 
structure matrix because with the exception of factor 2, several variables load quite highly 
onto more than one factor. This has occurred because of the relationship between factors 1 
and 3 and between factors 3 and 4. This example should highlight why the pattern matrix 
is preferable for interpretative reasons: because it contains information about the unique 
contribution of a variable to a factor. 



      TC1   TC4   TC3   TC2
Q06  0.78
Q18  0.76  0.36  0.42
Q13  0.72  0.43  0.33
Q07  0.72  0.38  0.42
Q14  0.67  0.35  0.44
Q12  0.6   0.33  0.59
Q10  0.56
Q15  0.55  0.44  0.31
Q08        0.85
Q17  0.44  0.82  0.3
Q11  0.43  0.82
Q21  0.46  0.37  0.7
Q20              0.68
Q03 -0.39 -0.36 -0.64  0.41
Q16  0.5   0.5   0.58
Q04  0.47  0.49  0.56
Q01  0.4   0.5   0.53
Q05  0.44  0.4   0.47
Q22                    0.66
Q09                    0.66
Q23                    0.58
Q02             -0.39  0.55
Q19             -0.44  0.45


Output 17.11 


On a theoretical level the dependence between our factors does not cause concern; we 
might expect a fairly strong relationship between fear of maths, fear of statistics and fear 
of computers. Generally, the less mathematically and technically minded people struggle 
with statistics. However, we would not expect these constructs to correlate with fear of 
peer evaluation (because this construct is more socially based). In fact, this factor is the one 
that correlates fairly badly with all others - so on a theoretical level, things have turned 
out rather well! 


17.6.5. Factor scores


Having reached a suitable solution and rotated that solution, we can look at the factor 
scores. Factor scores are obtained by adding scores = TRUE to the principal() function. 
Therefore, to get factor scores for our model pc4, we would rerun the analysis by 
executing: 

pc5 <- principal(raqData, nfactors = 4, rotate = "oblimin", scores = TRUE) 
CRAMMING SAM’S TIPS 


Interpretation 


• If you’ve conducted orthogonal rotation then look at the table labelled rotated component matrix. For each variable, note 
the component for which the variable has the highest loading. Also, for each component, note the variables that load highly 
onto it (by ‘high’ I mean loadings should be above .4 when you ignore the plus or minus sign). Try to make sense of what 
the factors represent by looking for common themes in the items that load onto them. 

• If you’ve conducted oblique rotation then calculate and look at the pattern matrix. For each variable, note the component for 
which the variable has the highest loading. Also, for each component, note the variables that load highly onto it (by ‘high’ I 
mean loadings should be above .4 when you ignore the plus or minus sign). Double-check what you find by doing the same 
thing for the structure matrix. Try to make sense of what the factors represent by looking for common themes in the items 
that load onto them. 


By setting the scores option to TRUE the factor scores are added to the principal component
model in an object called scores; therefore, we can access these scores by using 
pc5$scores (which translates as the scores object attached to the model pc5 that we just 
created). To view the factor scores, you could execute: 

pc5$scores 

However, there are rather a lot of them (2571 actually), so let’s look at the first 10 rows, 
by using the head() function and executing: 

head(pc5$scores, 10) 




SELF-TEST 

✓ Using what you learnt in Chapter 6, or Section 
17.6.2, calculate the correlation matrix for the factor 
scores. Compare this to the correlations of the 
factors in Output 17.10. 



              TC1        TC4         TC3        TC2
 [1,]  0.37296709  1.8808424  0.95979596  0.3910711
 [2,]  0.63334164  0.2374679  0.29090777 -0.3504080
 [3,]  0.39712768 -0.1056263 -0.09333769  0.9249353
 [4,] -0.78741595  0.2956628 -0.77703307  0.2605666
 [5,]  0.04425942  0.6815179  0.59786611 -0.6912687
 [6,] -1.70018648  0.2091685  0.02784164  0.6653081
 [7,]  0.66139239  0.4224096  1.52552021 -0.9805434
 [8,]  0.59491329  0.4060248  1.06465956 -1.0932598
 [9,] -2.34971189 -3.6134797 -1.42999472 -0.5443773
[10,]  0.93504597  0.2285419  0.96735727 -1.5712753


Output 17.12 

Output 17.12 shows the factor scores for the first 10 participants. Factor scores can be used 
in this way to assess the relative fear of one person compared to another. We can also use factor 
scores in regression when groups of predictors correlate so highly that there is multicollinearity. 

Before we can do any analysis with our factor scores, we need to add the factor scores 
into our dataframe. To do this, we use the cbind() function, which we have used numerous 
times before: 

raqData <- cbind(raqData, pc5$scores) 
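
As a quick illustration of what cbind() does here, the sketch below binds a small matrix of scores onto a small dataframe. Both objects (toyData and toyScores) and all their values are made up for the example; the only assumption carried over from our analysis is that the scores arrive as a matrix with one named column per factor and one row per participant.

```r
# A made-up dataframe with two questionnaire items and four participants
toyData <- data.frame(Q01 = c(2, 4, 3, 5),
                      Q02 = c(1, 2, 2, 4))

# A made-up matrix of factor scores with named columns, one row per participant
toyScores <- matrix(c(-1.1,  0.3, -0.2,  1.0,
                       0.5, -0.7,  0.9, -0.7),
                    ncol = 2,
                    dimnames = list(NULL, c("TC1", "TC2")))

# cbind() pastes the score columns onto the right-hand side of the dataframe
toyData <- cbind(toyData, toyScores)

names(toyData)
head(toyData)
```

Because cbind() matches rows purely by position, the scores must be in the same row order as the dataframe they are attached to.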
SELF-TEST 

✓ Can you think of another way of obtaining the 
structure matrix (the correlations between factors 
and items) now you’ve learned about factor scores? 



17.6.6. Summary


To sum up, the analyses revealed four underlying scales in our questionnaire that may, 
or may not, relate to genuine sub-components of R anxiety. It also seems as though an 
obliquely rotated solution was preferred due to the interrelationships between factors. The 
use of factor analysis is purely exploratory; it should be used only to guide future hypotheses,
or to inform researchers about patterns within data sets. A great many decisions are left 
to the researcher using factor analysis, and I urge you to make informed decisions, rather 
than basing decisions on the outcomes you would like to get. In section 17.8 we consider 
whether or not our scale is reliable. 


17.7. How to report factor analysis


As with any analysis, when reporting factor analysis we need to provide our readers with 
enough information to make an informed opinion about our data. As a bare minimum we 
should be very clear about our criteria for extracting factors and the method of rotation 
used. We must also produce a table of the rotated factor loadings of all items and flag (in 
bold) values above a criterion level (I would personally choose .40, but I discussed the 
various criteria you could use in section 17.3.9.2). You should also report the percentage 
of variance that each factor explains and possibly the eigenvalue too. Table 17.1 shows an 
example of such a table for the RAQ data; note that I have also reported the sample size 
in the title. 
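
The percentage of variance for each factor follows directly from its eigenvalue. In a principal components analysis of a correlation matrix each standardized item contributes one unit of variance, so the total variance equals the number of items (23 here). A minimal sketch, using the rounded SS loadings from Output 17.9 (the object names ssLoadings, nItems and pctVariance are just labels chosen for this example):

```r
# Rounded eigenvalues (SS loadings) after varimax rotation, from Output 17.9
ssLoadings <- c(computers = 3.73, statistics = 3.34,
                peerEvaluation = 1.95, maths = 2.55)

# Total variance in a PCA of a correlation matrix = number of items
nItems <- 23

# Percentage of variance explained by each factor
pctVariance <- 100 * ssLoadings / nItems

round(pctVariance, 2)
round(sum(pctVariance), 2)   # total percentage explained by the four factors
```

Because the eigenvalues used here are already rounded to two decimal places, the results can differ in the second decimal place from values that R computes from the unrounded eigenvalues.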

In my opinion, a table of factor loadings and a description of the analysis are a bare minimum,
though. You could consider (if it’s not too large) including the table of correlations 
from which someone could reproduce your analysis (should they want to). You could also 
consider including some information on sample size adequacy. 

For this example we might write something like this: 

✓ A principal components analysis (PCA) was conducted on the 23 items with orthogonal
rotation (varimax). The Kaiser-Meyer-Olkin measure verified the sampling 
adequacy for the analysis, KMO = .93 (‘superb’ according to Kaiser, 1974), and all 
KMO values for individual items were > .77, which is well above the acceptable 
limit of .5. Bartlett’s test of sphericity, χ²(253) = 19,334, p < .001, indicated that 
correlations between items were sufficiently large for PCA. An initial analysis was 
run to obtain eigenvalues for each component in the data. Four components had 
eigenvalues over Kaiser’s criterion of 1 and in combination explained 50.32% of the 
variance. The scree plot was slightly ambiguous and showed inflexions that would 
justify retaining both two and four components. Given the large sample size, and 
the convergence of the scree plot and Kaiser’s criterion on four components, four 
components were retained in the final analysis. Table 17.1 shows the factor loadings 
after rotation. The items that cluster on the same components suggest that component
1 represents a fear of computers, component 2 a fear of statistics, component 
3 a fear of maths and component 4 peer evaluation concerns. 
Table 17.1 Summary of exploratory factor analysis results for the R anxiety questionnaire 
(N = 2571) 

                                                Varimax rotated factor loadings
                                                     Fear of     Fear of        Peer     Fear of
Item                                               computers  statistics  evaluation       maths
I have little experience of computers                    .80        -.01        -.07         .10
R always crashes when I try to use it                    .68         .33        -.08         .13
I worry that I will cause irreparable damage             .65         .23        -.10         .23
  because of my incompetence with computers
All computers hate me                                    .64         .33        -.08         .16
Computers have minds of their own and                    .58         .36        -.07         .14
  deliberately go wrong whenever I use them
Computers are useful only for playing games              .55         .00        -.12         .13
Computers are out to get me                              .46         .22        -.19         .29
I can’t sleep for thoughts of eigenvectors              -.04         .68        -.14         .08
I wake up under my duvet thinking that I am              .29         .66        -.07         .16
  trapped under a normal distribution
Standard deviations excite me                           -.20        -.57         .37        -.18
People try to tell you that R makes statistics           .47         .52        -.08         .10
  easier to understand but it doesn’t
I dream that Pearson is attacking me with                .32         .52         .04         .31
  correlation coefficients
I weep openly at the mention of central tendency         .33         .51        -.12         .31
Statistics makes me cry                                  .24         .50         .06         .36
I don’t understand statistics                            .32         .43         .02         .24
I have never been good at mathematics                    .13         .17         .01         .83
I slip into a coma whenever I see an equation            .27         .22        -.04         .75
I did badly at mathematics at school                     .26         .21        -.14         .75
My friends are better at statistics than me             -.09        -.20         .65         .12
My friends are better at R than I am                    -.19         .03         .65        -.10
If I’m good at statistics my friends will think         -.02         .17         .59        -.20
  I’m a nerd
My friends will think I’m stupid for not being          -.01        -.34         .54         .07
  able to cope with R
Everybody looks at me when I use R                      -.15        -.37         .43        -.03

Eigenvalues                                             3.73        3.34        1.95        2.55
% of variance                                          16.22       14.52        8.48       11.10
α                                                        .82         .82         .57         .82


Note: Factor loadings over .40 appear in bold. 
Finally, if you have used oblique rotation you should consider reporting a table of both the 
structure and pattern matrix because the loadings in these tables have different interpretations
(see Jane Superbrain Box 17.1). 



Labcoat Leni’s Real Research 17.1 


World wide addiction? 


Nichols, L. A., & Nicki, R. (2004). Psychology of Addictive Behaviors, 18(4), 381-384. 


The Internet is now a household tool. In 2007 it was estimated that around 179 million people worldwide used the 
Internet (over 100 million of those were in the USA and Canada). From the increasing popularity (and usefulness) 
of the Internet has emerged a new phenomenon: Internet addiction. This is now a serious and recognized problem, 
but until very recently it was very difficult to research this topic because there was not a psychometrically sound 
measure of Internet addiction. That is, until Laura Nichols and Richard Nicki developed the Internet Addiction Scale, 
IAS (Nichols & Nicki, 2004). (Incidentally, while doing some research on this topic I encountered an Internet addiction
recovery website that I won’t name but that offered a whole host of resources that would keep you online for 
ages, such as questionnaires, an online support group, videos, articles, a recovery blog and podcasts. It struck 
me that this was a bit like having a recovery centre for heroin addiction where the addict arrives to be greeted by a 
nice-looking counsellor who says ‘there’s a huge pile of heroin in the corner over there, just help yourself’.) 

Anyway, Nichols and Nicki developed a 36-item questionnaire to measure internet addiction. It contained items 
such as ‘I have stayed on the Internet longer than I intended to’ and ‘My grades/work have suffered because of 
my Internet use’, which could be responded to on a 5-point scale (Never, Rarely, Sometimes, Frequently, Always). 
They collected data from 207 people to validate this measure. 

The data from this study are in the file Nichols & Nicki (2004).dat. The authors dropped two items because 
they had low means and variances, and dropped three others because of relatively low correlations with other 
items. They performed a principal components analysis on the remaining 31 items. Labcoat Leni wants you to run 
some descriptive statistics to work out which two items were dropped for having low means/variances, 
then inspect a correlation matrix to find the three items that were dropped for having low correlations. 
Finally, he wants you to run a principal components analysis on the data. 

Answers are in the additional material on the companion website (or look at the original article).


17.8. Reliability analysis


17.8.1. Measures of reliability


If you’re using factor analysis to validate a questionnaire, it is useful to check the reliability 
of your scale. 



SELF-TEST 

✓ Thinking back to Chapter 1, what are reliability and
test-retest reliability?

Reliability means that a measure (or in this case questionnaire) should consistently
reflect the construct that it is measuring. One way to think of this is
that, other things being equal, a person should get the same score on a questionnaire
if they complete it at two different points in time (we have already
discovered that this is called test-retest reliability). So, someone who is terrified
of statistics and who scores highly on our RAQ should score similarly highly
if we tested them a month later (assuming they hadn’t gone into some kind of
statistics-anxiety therapy in that month). Another way to look at reliability is to
say that two people who are the same in terms of the construct being measured
should get the same score. So, if we took two people who were equally statistics-phobic,
then they should get more or less identical scores on the RAQ. Likewise, if we took two
people who loved statistics, they should both get equally low scores. It should be apparent
that if we took someone who loved statistics and someone who was terrified of it, and
they got the same score on our questionnaire, then it wouldn’t be an accurate measure of
statistical anxiety. In statistical terms, the usual way to look at reliability is based on the idea
that individual items (or sets of items) should produce results consistent with the overall
questionnaire. So, if we take someone scared of statistics, then their overall score on the
RAQ will be high; if the RAQ is reliable then if we randomly select some items from it the
person’s score on those items should also be high.

The simplest way to do this in practice is to use split-half reliability. This method randomly
splits the data set into two. A score for each participant is then calculated based on
each half of the scale. If a scale is very reliable a person’s score on one half of the scale
should be the same (or similar) to their score on the other half: therefore, across several
participants, scores from the two halves of the questionnaire should correlate perfectly
(well, very highly). The correlation between the two halves is the statistic computed in the
split-half method, with large correlations being a sign of reliability. The problem with this
method is that there are several ways in which a set of data can be split into two and so
the results could be a product of the way in which the data were split. To overcome this
problem, Cronbach (1951) came up with a measure that is loosely equivalent to splitting
data in two in every possible way and computing the correlation coefficient for each split.
The average of these values is equivalent to Cronbach’s alpha, α, which is the most common
measure of scale reliability.¹¹
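
The split-half idea can be sketched in a few lines of base R. This is only an illustration on simulated data: the object names (trait, items, half1, half2) and the way the responses are generated are assumptions made for the example, not the RAQ data.

```r
# Minimal sketch of split-half reliability on simulated (not real) data.
set.seed(42)
n <- 500                      # hypothetical number of respondents
k <- 10                       # hypothetical number of items
trait <- rnorm(n)             # the construct each item is meant to tap

# Each item = shared trait + item-specific noise (an n x k matrix)
items <- sapply(1:k, function(i) trait + rnorm(n))

# Randomly split the items into two halves and score each half
shuffled <- sample(k)
half1 <- rowSums(items[, shuffled[1:(k / 2)]])
half2 <- rowSums(items[, shuffled[(k / 2 + 1):k]])

# The split-half statistic is the correlation between the two half scores
splitHalf <- cor(half1, half2)
splitHalf
```

Calling sample(k) again produces a different split and therefore a slightly different correlation, which is the weakness of the method described above.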

Cronbach’s α is:

$$\alpha = \frac{N^2\,\overline{\mathrm{Cov}}}{\sum s^2_{\mathrm{item}} + \sum \mathrm{Cov}_{\mathrm{item}}} \tag{17.6}$$


which may look complicated, but actually isn’t. The first thing to note is that for each item
on our scale we can calculate two things: the variance within the item, and the covariance
between a particular item and any other item on the scale. Put another way, we can
construct a variance-covariance matrix of all items. In this matrix the diagonal elements
will be the variance within a particular item, and the off-diagonal elements will be covariances
between pairs of items. The top half of the equation is simply the number of items
(N) squared multiplied by the average covariance between items (the average of the off-diagonal
elements in the aforementioned variance-covariance matrix). The bottom half is
just the sum of all the item variances and item covariances (i.e., the sum of everything in
the variance-covariance matrix).


¹¹ Although this is the easiest way to conceptualize Cronbach’s α, whether or not it is exactly equal to the average
of all possible split-half reliabilities depends on exactly how you calculate the split-half reliability (see the glossary
for computational details). If you use the Spearman-Brown formula, which takes no account of item standard
deviations, then Cronbach’s α will be equal to the average split-half reliability only when the item standard deviations
are equal; otherwise α will be smaller than the average. However, if you use a formula for split-half reliability
that does account for item standard deviations (such as Flanagan, 1937; Rulon, 1939) then α will always
equal the average split-half reliability (see Cortina, 1993).

There is a standardized version of the coefficient too, which essentially uses the same 
equation except that correlations are used rather than covariances, and the bottom half of 
the equation uses the sum of the elements in the correlation matrix of items (including the 
ones that appear on the diagonal of that matrix). The normal alpha is appropriate when 
items on a scale are summed to produce a single score for that scale (the standardized alpha 
is not appropriate in these cases). The standardized alpha is useful, though, when items on 
a scale are standardized before being summed. 
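
As a rough illustration of equation (17.6), the function below computes α directly from the variance-covariance matrix of the items, or from the correlation matrix for the standardized version. The function name cronbachAlpha and the simulated toyItems are assumptions made for this sketch; in practice you would use the alpha() function from the psych package described later.

```r
# Sketch: Cronbach's alpha from the variance-covariance matrix of the items.
cronbachAlpha <- function(items, standardized = FALSE) {
  # Correlations replace covariances for the standardized version
  m <- if (standardized) cor(items) else cov(items)
  N <- ncol(m)
  meanOffDiag <- mean(m[upper.tri(m)])   # average covariance between items
  (N^2 * meanOffDiag) / sum(m)           # sum(m) = all variances + all covariances
}

set.seed(1)
trait <- rnorm(300)
toyItems <- sapply(1:5, function(i) trait + rnorm(300))  # simulated, hypothetical data

rawAlpha <- cronbachAlpha(toyItems)
stdAlpha <- cronbachAlpha(toyItems, standardized = TRUE)
c(raw = rawAlpha, standardized = stdAlpha)
```

Because the simulated items all have similar variances, the raw and standardized values come out close to each other; they diverge when item variances differ a lot.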


17.8.2. Interpreting Cronbach’s α (some cautionary tales ...)


You’ll often read in books and journal articles, or be told by people, that a value of .7 to .8 is an
acceptable value for Cronbach’s α; values substantially lower indicate an unreliable scale.
Kline (1999) notes that although the generally accepted value of .8 is appropriate for cognitive
tests such as intelligence tests, for ability tests a cut-off point of .7 is more suitable.
He goes on to say that when dealing with psychological constructs values below even .7
can, realistically, be expected because of the diversity of the constructs being measured.

However, Cortina (1993) notes that such general guidelines need to be used with caution
because the value of α depends on the number of items on the scale. You’ll notice that
the top half of the equation for α includes the number of items squared. Therefore, as the
number of items on the scale increases, α will increase. As a result, it’s possible to get a large
value of α simply because you have a lot of items on the scale! For example, Cortina reports data
from two scales, both of which have α = .8. The first scale has only three items, and the
average correlation between items was a respectable .57; however, the second scale had 10
items, with an average correlation between these items of a less respectable .28. Clearly the
internal consistency of these scales differs enormously, yet they are both equally reliable.
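
Cortina’s two scales can be checked with a small calculation. The sketch below uses the standardized form of the equation (average correlation instead of average covariance), so the bottom half is the sum of a correlation matrix with k ones on the diagonal and k(k − 1) off-diagonal elements. The function name alphaFromR is an assumption made for this example.

```r
# Sketch: standardized alpha as a function of the number of items (k)
# and the average inter-item correlation (rBar).
alphaFromR <- function(k, rBar) {
  # top: k^2 * mean correlation; bottom: sum of every element of the correlation matrix
  (k^2 * rBar) / (k + k * (k - 1) * rBar)
}

threeItems <- alphaFromR(k = 3,  rBar = .57)   # three items, average r = .57
tenItems   <- alphaFromR(k = 10, rBar = .28)   # ten items, average r = .28
round(c(threeItems, tenItems), 2)
```

Both values come out at roughly .8, even though the average inter-item correlation in the second scale is about half that of the first.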

A second common interpretation of alpha is that it measures ‘unidimensionality’, or the
extent to which the scale measures one underlying factor or construct. This interpretation
stems from the fact that when there is one factor underlying the data, α is a measure of the
strength of that factor (see Cortina, 1993). However, Grayson (2004) demonstrates that
data sets with the same α can have very different structures. He showed that α = .8 can
be achieved in a scale with one underlying factor, with two moderately correlated factors
and with two uncorrelated factors. Cortina (1993) has also shown that with more than
12 items, and fairly high correlations between items (r > .5), α can reach values around
and above .7 (.65 to .84). These results compellingly show that α should not be used as a
measure of ‘unidimensionality’. Indeed, Cronbach (1951) suggested that if several factors
exist then the formula should be applied separately to items relating to different factors.
In other words, if your questionnaire has subscales, α should be applied separately to these
subscales.

The final warning is about items that have a reverse phrasing. For example, in our RAQ
that we used in the factor analysis part of this chapter, we had one item (question 3) that
was phrased the opposite way around to all other items. The item was ‘standard deviations
excite me’. Compare this to any other item and you’ll see it requires the opposite response.
For example, item 1 is ‘statistics make me cry’. Now, if you don’t like statistics then you’ll
strongly agree with this statement and so will get a score of 5 on our scale. For item 3, if
you hate statistics then standard deviations are unlikely to excite you so you’ll strongly
disagree and get a score of 1 on the scale. These reverse-phrased items are important for
reducing response bias; participants will actually have to read the items in case they are
phrased the other way around. For factor analysis, this reverse phrasing doesn’t matter:
all that happens is you get a negative factor loading for any reversed items (in fact, look
at Output 17.10 and you’ll see that item 3 has a negative factor loading).
However, in reliability analysis these reverse-scored items do make a difference.
To see why, think about the equation for Cronbach’s α. In this equation,
the top half incorporates the average covariance between items. If an
item is reverse-phrased then it will have a negative relationship with other
items, hence the covariances between this item and other items will be negative.
The average covariance is obviously the sum of covariances divided
by the number of covariances, and by including a bunch of negative values
we reduce the sum of covariances, and hence we also reduce Cronbach’s α,
because the top half of the equation gets smaller. In extreme cases, it is even
possible to get a negative value for Cronbach’s α, simply because the magnitude
of negative covariances is bigger than the magnitude of positive ones.
A negative Cronbach’s α doesn’t make much sense, but it does happen, and if it does, ask
yourself whether you included any reverse-phrased items.
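
To make the arithmetic concrete, here is a small sketch using two made-up covariance matrices for a three-item scale: one in which all items covary positively, and one in which item 3 has been left unreversed so that its covariances with the other items are negative. The matrices and the function name alphaFromCov are assumptions made purely for illustration.

```r
# Sketch: how an unreversed item drags alpha down (hypothetical covariance matrices).
alphaFromCov <- function(m) {
  N <- ncol(m)
  (N^2 * mean(m[upper.tri(m)])) / sum(m)
}

# Three items, each with variance 1 and covariance .5 with the others
covOK <- matrix(.5, nrow = 3, ncol = 3)
diag(covOK) <- 1

# Same items, but item 3 is reverse-phrased and has not been reversed:
# its covariances with items 1 and 2 change sign
covReversed <- covOK
covReversed[3, 1:2] <- -.5
covReversed[1:2, 3] <- -.5

alphaFromCov(covOK)        # 0.75
alphaFromCov(covReversed)  # -0.75
```

In the second matrix the two negative covariances outweigh the single positive one, so the average covariance, and with it α, becomes negative.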


17.8.3. Reliability analysis with R Commander


As with factor analysis, it’s possible to use R Commander to obtain reliability estimates.
However, the procedure is not as flexible as the alpha() function in the psych package, so
that’s the one we use.


17.8.4. Reliability analysis using R


Let’s test the reliability of the RAQ using the data in RAQ.dat. Remember also that I said 
we should conduct reliability analysis on any subscales individually. If we use the results 
from our orthogonal rotation (look back at), then we have four subscales: 

1 Subscale 1 (Fear of computers): items 6, 7, 10, 13, 14, 15, 18

2 Subscale 2 (Fear of statistics): items 1, 3, 4, 5, 12, 16, 20, 21

3 Subscale 3 (Fear of mathematics): items 8, 11, 17

4 Subscale 4 (Peer evaluation): items 2, 9, 19, 22, 23

(Don’t forget that question 3 has a negative sign; we’ll need to remember to deal with that.)
First, we’ll create four new data sets, each containing the items for one subscale. We don’t need
to do that, but it saves a lot of typing later on. We can create these data sets by simply selecting
the appropriate columns of the full dataframe (raqData) as described in section 3.9.1.

computerFear <- raqData[, c(6, 7, 10, 13, 14, 15, 18)]
statisticsFear <- raqData[, c(1, 3, 4, 5, 12, 16, 20, 21)]
mathFear <- raqData[, c(8, 11, 17)]
peerEvaluation <- raqData[, c(2, 9, 19, 22, 23)]

Each of these commands takes the raqData dataframe and retains all of the rows (hence nothing
is specified before the comma), and any columns specified in the c() function after the comma.
For example, the first command creates an object called computerFear that contains only
columns 6, 7, 10, 13, 14, 15, and 18 of the dataframe raqData.

Reliability analysis is done with the alpha() function, which is found in the psych package.
You might have a problem here, because there is also a function in ggplot2 called alpha,
and if you’ve loaded ggplot2 first, that version will have priority. This was covered in R’s
Souls’ Tip 3.4, but to remind you, if you get the wrong alpha() function, you can specify
the package, using:

psych::alpha() 

An additional complication that we need to deal with is that pesky item 3, which is
negatively scored. We can do one of two things with this item. We can reverse the variable
in the data set, or we can tell alpha() that it is negative, using the keys option. This latter
option is better because we leave the initial data unchanged (which is useful because we
don’t get into awkward situations in which we save the data and then can’t recall at a later
date whether the data contain the reverse-scored or the original scores).

To use the keys option we give alpha() a vector of 1s and -1s, which matches the number
of variables in the data set, using a 1 for a positively scored item and a -1 for a negatively
scored item. So for computerFear, which has only positively scored items, we would use:

keys = c(1, 1, 1, 1, 1, 1, 1)

but for statisticsFear, which has item 3 (the negatively scored item) as its second item, we
would use:

keys = c(1, -1, 1, 1, 1, 1, 1, 1)

For three of our four subscales we don’t need to use the keys option because all items are
positively scored, but for statisticsFear we need to. To use the alpha() function we simply
input the name of the dataframe for each subscale, and, where necessary, include the keys
option. Therefore, we could run the reliability analysis for our four subscales by executing:

alpha(computerFear)
alpha(statisticsFear, keys = c(1, -1, 1, 1, 1, 1, 1, 1))
alpha(mathFear)
alpha(peerEvaluation)
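
If you did want to reverse the item in the data instead (the first of the two options above), the sketch below shows what that amounts to on a 1 to 5 response scale. The helper reverseItems() and the tiny data set toy are hypothetical, written only for this illustration; they are not part of the psych package and are not a description of how alpha() works internally.

```r
# Sketch: reverse-scoring items flagged with -1 in a keys vector (1-5 scale assumed).
# reverseItems() is a hypothetical helper, not a function from psych.
reverseItems <- function(data, keys, low = 1, high = 5) {
  stopifnot(length(keys) == ncol(data), all(keys %in% c(1, -1)))
  for (j in which(keys == -1)) {
    data[[j]] <- (low + high) - data[[j]]   # 1 <-> 5, 2 <-> 4, 3 stays 3
  }
  data
}

toy <- data.frame(Q01 = c(5, 4, 2), Q03 = c(1, 2, 4), Q04 = c(5, 5, 1))  # made-up responses
reversed <- reverseItems(toy, keys = c(1, -1, 1))
reversed
```

The function returns a modified copy, so the original toy dataframe is left unchanged, which avoids the confusion about which version of the scores has been saved.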


17.8.5. Interpreting the output


Output 17.13 shows the results of this basic reliability analysis for the fear of computing
subscale. First, and perhaps most important, the value of raw_alpha at the very top is Cronbach’s
α: the overall reliability of the scale (you should look at the raw alpha rather than the
standardized one, although they’re usually very similar). To reiterate, we’re looking for values
in the range of .7 to .8 (or thereabouts) bearing in mind what we’ve already noted about effects
from the number of items. In this case α is slightly above .8, and is certainly in the region
indicated by Kline (1999), so this probably indicates good reliability.

Along with alpha, there is a measure labelled G6, short for Guttman’s lambda 6; this can
be calculated from the squared multiple correlation (hence it’s labelled smc).¹² The average_r
is the average inter-item correlation (from which we can calculate standardized alpha).

Also in this top section are some scale characteristics. If we calculated someone’s score
by taking the average of all of their items (which is the same as adding up the scores and
dividing by the number of items), we would have a variable with an overall mean of 3.4
and standard deviation of 0.71.¹³


¹² Fact fiends might be interested to know that Guttman came up with Cronbach’s alpha before Cronbach, and
called it lambda 3.

¹³ You can test this by running:

describe(apply(raqData[, c(6, 7, 10, 13, 14, 15, 18)], 1, mean))

which gives you a mean of 3.42 and sd = 0.71.

Next, we get a table giving the statistics for the scale if we deleted each item in turn.
The values in the column labelled raw_alpha are the values of the overall α if that item
isn’t included in the calculation. As such, they reflect the change in Cronbach’s α that
would be seen if a particular item were deleted. The overall α is .82, and so all values in
this column should be around that same value. What we’re actually looking for is values of
alpha greater than the overall α. If you think about it, if the deletion of an item increases
Cronbach’s α then this means that the deletion of that item improves reliability (remembering
that scales with more items tend to be more reliable, so removing an item would usually lower
alpha). Therefore, any items that have values of α in this column greater than the overall
α may need to be deleted from the scale to improve its reliability. None of the items here
would increase alpha if they were deleted: the highest value in the column (.82, for Q10) only
matches the overall α. This table also contains the standardized alpha if the item is removed, the
G6 if the item is removed and the mean correlation if the item is removed.
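
The logic of this table can be reproduced by hand. The sketch below simulates a six-item scale in which the sixth item is deliberately pure noise, then recomputes α with each item left out in turn. Everything here (the simulated toy data, the deliberately bad item V6, the helper alphaFromCov) is an assumption made for illustration; it is not the RAQ data and not the internal code of alpha().

```r
# Sketch: 'alpha if an item is dropped', computed by hand on simulated data.
alphaFromCov <- function(m) {
  N <- ncol(m)
  (N^2 * mean(m[upper.tri(m)])) / sum(m)
}

set.seed(7)
trait <- rnorm(400)
toy <- as.data.frame(sapply(1:6, function(i) trait + rnorm(400)))  # columns V1 to V6
toy$V6 <- rnorm(400)            # make the sixth item pure noise (hypothetical bad item)

overall <- alphaFromCov(cov(toy))

# Leave each item out in turn and recompute alpha on the remaining items
ifDropped <- sapply(seq_along(toy), function(j) alphaFromCov(cov(toy[, -j])))
names(ifDropped) <- names(toy)

round(overall, 2)
round(ifDropped, 2)
```

In this simulation only the value for V6 exceeds the overall α, which is the pattern that would flag an item for possible removal.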

The next table in the output is labelled item statistics. The values in the first column
labelled r are the correlations between each item and the total score from the questionnaire,
sometimes called item-total correlations. There’s a problem with this statistic, and
that is that the item is included in the total. That is, if we correlate item 6 with the mean
of all items, we’re correlating the item with itself, so of course it will correlate. We can
correct this by correlating each item with all of the other items. Two versions of this are
presented, r.cor and r.drop: r.cor is a little complex, so we won’t go into it (but the help file
for alpha explains it); r.drop is the correlation of that item with the scale total if that item
isn’t included in the scale total. Sometimes this is called the item-rest correlation (because
it’s how the item correlates with the rest of the items) and sometimes it’s called the corrected
item-total correlation.
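
The difference between r and r.drop can be seen in a short sketch on simulated data. The object names (toy, itemTotal, itemRest) are assumptions for this example, and the calculation is a plain base R illustration of the two definitions given above rather than the exact routine used by alpha().

```r
# Sketch: item-total (r) versus item-rest (r.drop) correlations on simulated data.
set.seed(11)
trait <- rnorm(400)
toy <- as.data.frame(sapply(1:5, function(i) trait + rnorm(400)))

total <- rowSums(toy)

# r: each item against a total that still contains the item
itemTotal <- sapply(toy, function(item) cor(item, total))

# r.drop: each item against the total of the remaining items only
itemRest <- sapply(seq_along(toy), function(j) cor(toy[[j]], total - toy[[j]]))
names(itemRest) <- names(toy)

round(cbind(r = itemTotal, r.drop = itemRest), 2)
```

Every r.drop value is smaller than the corresponding r, because the item no longer contributes to the score it is being correlated with.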

Reliability analysis
Call: alpha(x = computerFear)

  raw_alpha std.alpha G6(smc) average_r mean   sd
       0.82      0.82    0.81       0.4  3.4 0.71

 Reliability if an item is dropped:
    raw_alpha std.alpha G6(smc) average_r
Q06      0.79      0.79    0.77      0.38
Q07      0.79      0.79    0.77      0.38
Q10      0.82      0.82    0.80      0.44
Q13      0.79      0.79    0.77      0.39
Q14      0.80      0.80    0.77      0.39
Q15      0.81      0.81    0.79      0.41
Q18      0.79      0.78    0.76      0.38

 Item statistics
       n    r r.cor r.drop mean   sd
Q06 2571 0.74  0.68   0.62  3.8 1.12
Q07 2571 0.73  0.68   0.62  3.1 1.10
Q10 2571 0.57  0.44   0.40  3.7 0.88
Q13 2571 0.73  0.67   0.61  3.6 0.95
Q14 2571 0.70  0.64   0.58  3.1 1.00
Q15 2571 0.64  0.54   0.49  3.2 1.01
Q18 2571 0.76  0.72   0.65  3.4 1.05

Non missing response frequency for each item
       1    2    3    4    5 miss
Q06 0.06 0.10 0.13 0.44 0.27    0
Q07 0.09 0.24 0.26 0.34 0.07    0
Q10 0.02 0.10 0.18 0.57 0.14    0
Q13 0.03 0.12 0.25 0.48 0.12    0
Q14 0.07 0.18 0.38 0.31 0.06    0
Q15 0.06 0.18 0.30 0.39 0.07    0
Q18 0.06 0.12 0.31 0.37 0.14    0

Output 17.13


In a reliable scale all items should correlate with the total. So, we’re looking for items
that don’t correlate with the overall score from the scale: if any of these values of r.drop
are less than about .3 then we’ve got problems, because it means that a particular item does
not correlate very well with the scale overall. Items with low correlations may have to be
dropped. For these data, all items have corrected item-total correlations above .3, which is
encouraging. The table also shows the mean and standard deviation of each item.

The final table in the alpha output is a table of frequencies. It tells us what proportion of
people gave each response to each of the items. This is useful to make sure that everyone
in your sample is not giving the same response. An item where everyone (or almost
everyone) gives the same response will almost certainly have poor
reliability statistics.
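
A row of this frequency table is straightforward to reproduce for a single item. The sketch below uses ten made-up responses on a 1 to 5 scale; the vector responses is hypothetical and is there only to show the calculation.

```r
# Sketch: proportion of respondents choosing each response option for one item.
responses <- c(1, 2, 2, 3, 4, 4, 4, 5, 5, 4)      # made-up answers on a 1-5 scale

# factor() with explicit levels keeps options that nobody chose (they appear as 0)
freq <- prop.table(table(factor(responses, levels = 1:5)))
round(freq, 2)
```

If almost all of the proportion sat in one column, that would be the warning sign described above.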

As a final point, it’s worth noting that if items do need to be removed at this stage then 
you should rerun your factor analysis as well to make sure that the deletion of the item has 
not affected the factor structure. 


Reliability analysis
Call: alpha(x = statisticsFear, keys = c(1, -1, 1, 1, 1, 1, 1, 1))

  raw_alpha std.alpha G6(smc) average_r mean  sd
       0.82      0.82    0.81      0.37  3.1 0.5

 Reliability if an item is dropped:
    raw_alpha std.alpha G6(smc) average_r
Q01      0.80      0.80    0.79      0.37
Q03      0.80      0.80    0.79      0.37
Q04      0.80      0.80    0.78      0.36
Q05      0.81      0.81    0.80      0.38
Q12      0.80      0.80    0.79      0.36
Q16      0.79      0.80    0.78      0.36
Q20      0.82      0.82    0.80      0.40
Q21      0.79      0.80    0.78      0.36

 Item statistics
       n    r r.cor r.drop mean   sd
Q01 2571 0.67  0.60   0.54  3.6 0.83
Q03 2571 0.67  0.60   0.55  3.4 1.08
Q04 2571 0.70  0.64   0.58  3.2 0.95
Q05 2571 0.63  0.55   0.49  3.3 0.96
Q12 2571 0.69  0.63   0.57  2.8 0.92
Q16 2571 0.71  0.67   0.60  3.1 0.92
Q20 2571 0.56  0.47   0.42  2.4 1.04
Q21 2571 0.71  0.67   0.61  2.8 0.98

Non missing response frequency for each item
       1    2    3    4    5 miss
Q01 0.02 0.07 0.29 0.52 0.11    0
Q03 0.03 0.17 0.34 0.26 0.19    0
Q04 0.05 0.17 0.36 0.37 0.05    0
Q05 0.04 0.18 0.29 0.43 0.06    0
Q12 0.09 0.23 0.46 0.20 0.02    0
Q16 0.06 0.16 0.42 0.33 0.04    0
Q20 0.22 0.37 0.25 0.15 0.02    0
Q21 0.09 0.29 0.34 0.26 0.02    0

Output 17.14

OK, let’s move on to the fear of statistics subscale (items 1, 3, 4, 5, 12, 16, 20 and 21). I
won’t go through the R output in detail again, but it is shown in Output 17.14. The overall
α is .82, and none of the items here would increase the reliability if they were deleted.
The values in the column labelled r.drop are again all above .3, which is good. In all, this
indicates that all items are positively contributing to the overall reliability. The overall α is
above .8, which indicates good reliability.

Reliability analysis
Call: alpha(x = statisticsFear)

  raw_alpha std.alpha G6(smc) average_r mean  sd
       0.61      0.64    0.71      0.18  3.1 0.5

 Reliability if an item is dropped:
    raw_alpha std.alpha G6(smc) average_r
Q01      0.52      0.56    0.64      0.15
Q03      0.80      0.80    0.79      0.37
Q04      0.50      0.55    0.64      0.15
Q05      0.52      0.57    0.66      0.16
Q12      0.52      0.56    0.65      0.15
Q16      0.51      0.55    0.63      0.15
Q20      0.56      0.60    0.68      0.18
Q21      0.50      0.55    0.63      0.15

 Item statistics
       n     r r.cor r.drop mean   sd
Q01 2571  0.68  0.62   0.51  3.6 0.83
Q03 2571 -0.37 -0.64  -0.55  3.4 1.08
Q04 2571  0.69  0.65   0.53  3.2 0.95
Q05 2571  0.65  0.57   0.47  3.3 0.96
Q12 2571  0.67  0.62   0.50  2.8 0.92
Q16 2571  0.70  0.66   0.53  3.1 0.92
Q20 2571  0.55  0.45   0.35  2.4 1.04
Q21 2571  0.70  0.66   0.54  2.8 0.98

Non missing response frequency for each item
       1    2    3    4    5 miss
Q01 0.02 0.07 0.29 0.52 0.11    0
Q03 0.03 0.17 0.34 0.26 0.19    0
Q04 0.05 0.17 0.36 0.37 0.05    0
Q05 0.04 0.18 0.29 0.43 0.06    0
Q12 0.09 0.23 0.46 0.20 0.02    0
Q16 0.06 0.16 0.42 0.33 0.04    0
Q20 0.22 0.37 0.25 0.15 0.02    0
Q21 0.09 0.29 0.34 0.26 0.02    0

Output 17.15

Just to illustrate the importance of reverse-scoring items before running reliability analysis,
Output 17.15 shows the reliability analysis for the fear of statistics subscale but done
on the original data (i.e., without item 3 being reverse scored by using the keys option).
Note that the overall α is considerably lower (.61 rather than .82). Also, note that item 3
has a negative item-total correlation (which is a good way to spot if you have a potential
reverse-scored item in the data that hasn’t been reverse scored). Finally, note that for item
3, the α if item deleted is .8. That is, if this item were deleted then the reliability would
improve from about .6 to about .8. This, I hope, illustrates that failing to reverse-score
items that have been phrased oppositely to other items on the scale will mess up your reliability
analysis.

Moving swiftly on to the fear of maths subscale (items 8, 11 and 17), Output 17.16
shows the output from the analysis. As with the previous two subscales, the overall
α is around .8, which indicates good reliability. The values of alpha if the item were
deleted indicate that none of the items here would increase the reliability if they were
deleted because all values in this column are less than the overall reliability of .82. The
values of the corrected item-total correlations (r.drop) are again all above .3, which
is good.

Reliability analysis
Call: alpha(x = mathFear)

  raw_alpha std.alpha G6(smc) average_r mean   sd
       0.82      0.82    0.75       0.6  3.7 0.75

 Reliability if an item is dropped:
    raw_alpha std.alpha G6(smc) average_r
Q08      0.74      0.74    0.59      0.59
Q11      0.74      0.74    0.59      0.59
Q17      0.77      0.77    0.63      0.63

 Item statistics
       n    r r.cor r.drop mean   sd
Q08 2571 0.86  0.76   0.68  3.8 0.87
Q11 2571 0.86  0.75   0.68  3.7 0.88
Q17 2571 0.85  0.72   0.65  3.5 0.88

Non missing response frequency for each item
       1    2    3    4    5 miss
Q08 0.03 0.06 0.19 0.58 0.15    0
Q11 0.02 0.06 0.22 0.53 0.16    0
Q17 0.03 0.10 0.27 0.52 0.08    0

Output 17.16

Finally, if you run the analysis for the final subscale of peer evaluation, you should get
Output 17.17. Unlike the previous subscales, the overall α is quite low at .57 and although
this is in keeping with what Kline says we should expect for this kind of social science
data, it is well below the other scales. The values of alpha if the item is dropped indicate
that none of the items here would increase the reliability if they were deleted because all
values in this column are less than the overall reliability of .57. The values of r.drop are all
around .3, and in fact for item 23 the value is below .3. This indicates fairly bad internal
consistency and identifies item 23 as a potential problem. The scale has five items, compared
to seven, eight and three on the other scales, so its reduced reliability is not going to
be dramatically affected by the number of items (in fact, it has more items than the fear of
maths subscale). If you look at the items on this subscale, they cover quite diverse themes
of peer evaluation, and this might explain the relative lack of consistency. This might lead
us to rethink this subscale.

Reliability analysis
Call: alpha(x = peerEvaluation)

  raw_alpha std.alpha G6(smc) average_r mean   sd
       0.57      0.57    0.53      0.21  3.4 0.65

 Reliability if an item is dropped:
    raw_alpha std.alpha G6(smc) average_r
Q02      0.52      0.52    0.45      0.21
Q09      0.48      0.48    0.41      0.19
Q19      0.52      0.53    0.46      0.22
Q22      0.49      0.49    0.43      0.19
Q23      0.56      0.57    0.50      0.25

 Item statistics
       n    r r.cor r.drop mean   sd
Q02 2571 0.61  0.45   0.34  4.4 0.85
Q09 2571 0.66  0.53   0.39  3.2 1.26
Q19 2571 0.60  0.42   0.32  3.7 1.10
Q22 2571 0.64  0.50   0.38  3.1 1.04
Q23 2571 0.53  0.31   0.24  2.6 1.04

Non missing response frequency for each item
       1    2    3    4    5 miss
Q02 0.01 0.04 0.08 0.31 0.56    0
Q09 0.08 0.28 0.23 0.20 0.20    0
Q19 0.02 0.15 0.22 0.33 0.29    0
Q22 0.05 0.26 0.34 0.26 0.10    0
Q23 0.12 0.42 0.27 0.12 0.06    0

Output 17.17


CRAMMING SAM’S TIPS


Reliability 


• Reliability is really the consistency of a measure. 

• Reliability analysis can be used to measure the consistency of a questionnaire. 

• Remember to deal with reverse-scored items. Use the keys option when you run the analysis. 

• Run separate reliability analyses for all subscales of your questionnaire. 

• Cronbach’s α indicates the overall reliability of a questionnaire and values around .8 are good (or .7 for ability tests and
suchlike).

• The raw alpha when an item is dropped tells you whether removing an item will improve the overall reliability: values greater
than the overall reliability indicate that removing that item will improve the overall reliability of the scale. Look for items that
dramatically increase the value of α.

• If you do remove items, rerun your factor analysis to check that the factor structure still holds! 


17.9. Reporting reliability analysis


You can report the reliabilities in the text using the symbol α. Because Cronbach’s α
can’t be larger than 1, we drop the zero before the decimal point (if we
are following APA format):

• The fear of computers, fear of statistics and fear of maths subscales of the RAQ all
had high reliabilities, all Cronbach’s α = .82. However, the fear of negative peer
evaluation subscale had relatively low reliability, Cronbach’s α = .57.
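
If you are pasting values from R into a report, the leading zero can be stripped automatically. The helper below is a hypothetical convenience function written for this sketch (the name formatAlpha is an assumption, not something from psych or base R).

```r
# Sketch: format a reliability for reporting without the leading zero.
# formatAlpha() is a hypothetical helper name.
formatAlpha <- function(x, digits = 2) {
  # formatC() gives e.g. "0.82"; sub() removes the zero before the decimal point
  sub("^(-?)0\\.", "\\1.", formatC(x, format = "f", digits = digits))
}

formatAlpha(0.8235)   # ".82"
formatAlpha(0.57)     # ".57"
```

The optional minus sign in the pattern means that a negative value such as -0.75 is returned as "-.75" rather than losing its sign.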

However, the most common way to report reliability analysis when it follows a factor
analysis is to report the values of Cronbach’s α as part of the table of factor loadings. For
example, in Table 17.1 notice that in the last row of the table I have quoted the value of
Cronbach’s α for each subscale in turn.



What have I discovered about statistics?


This chapter has made us tiptoe along the craggy rock face that is factor analysis. This is a 
technique for identifying clusters of variables that relate to each other. One of the difficult 
things with statistics is realizing that they are subjective: many books (this one included, 
I suspect) create the impression that statistics are like a cook book and if you follow the 
instructions you’ll get a nice tasty chocolate cake (yum!). Factor analysis perhaps more 
than any other test in this book illustrates how incorrect this is. The world of statistics is 
full of arbitrary rules that we probably shouldn’t follow (.05 being the classic example) 
and nearly all of the time, whether you realize it or not, we should act upon our own 
discretion. So, if nothing else, I hope you’ve discovered enough to give you sufficient discretion 
about factor analysis to act upon! We saw that the first stage of factor analysis is 
to scan your variables to check that they relate to each other to some degree but not too 
strongly. The factor analysis itself has several stages: check some initial issues (e.g., sample 
size adequacy), decide how many factors to retain, and finally decide which items load on 
which factors (and try to make sense of the meaning of the factors). Having done all that, 
you can consider whether the items you have are reliable measures of what you’re trying 
to measure. 

We also discovered that at the age of 23 I took it upon myself to become a living 
homage to the digestive system. I furiously devoured articles and books on statistics 
(some of them I even understood), I mentally chewed over them, I broke them down 
with the stomach acid of my intellect, I stripped them of their goodness and nutrients, I 
compacted them down, and after about two years I forced the smelly brown remnants of 
those intellectual meals out of me in the form of a book. I was mentally exhausted at the 
end of it; ‘It’s a good job I’ll never have to do that again’, I thought. 


R packages used in this chapter 


corpcor 

GPArotation 


psych 

R functions used in this chapter 

abs() 
alpha() 
as.matrix() 
c() 
cat() 
cbind() 
cor() 
cortest.bartlett() 
det() 
factor.model() 
factor.residuals() 
factor.structure() 
ggplot2() 
hist() 
kmo() 
mean() 
nrow() 
plot() 
polychor() 
principal() 
print.psych() 
residual.stats() 
round() 
sqrt() 
sum() 
upper.tri() 


Key terms that I’ve discovered 



Alpha factoring 

Bartlett’s test of sphericity 

Common variance 

Communality 

Component matrix 

Confirmatory factor analysis (CFA) 

Cronbach’s α 

Direct oblimin 

Extraction 

Factor 

Factor analysis 
Factor loading 
Factor matrix 
Factor scores 

Factor transformation matrix, 
Kaiser’s criterion 


Kaiser-Meyer-Olkin (KMO) measure of 
sampling adequacy 
Latent variable 
Oblique rotation 
Orthogonal rotation 
Pattern matrix 

Principal components analysis (PCA) 

Promax 

Quartimax 

Random variance 

Rotation 

Scree plot 

Singularity 

Split-half reliability 

Structure matrix 

Unique variance 

Varimax 


Smart Alex’s tasks 


• Task 1: The University of Sussex is constantly seeking to employ the best people 
possible as lecturers (no, really, it is). Anyway, they wanted to revise a questionnaire 
based on Bland’s theory of research methods lecturers. This theory predicts that good 
research methods lecturers should have four characteristics: (1) a profound love 
of statistics; (2) an enthusiasm for experimental design; (3) a love of teaching; and 
(4) a complete absence of normal interpersonal skills. These characteristics should 
be related (i.e., correlated). The ‘Teaching of Statistics for Scientific Experiments’ 
(TOSSE) already existed, but the university revised this questionnaire and it became 
the ‘Teaching of Statistics for Scientific Experiments - Revised’ (TOSSE-R). They 
gave this questionnaire to 239 research methods lecturers around the world to see if 
it supported Bland’s theory. The questionnaire is in Figure 17.9, and the data are in 
TOSSE.R.dat. Conduct a factor analysis (with appropriate rotation) to see the factor 
structure of the data. 


FIGURE 17.9 The Teaching of Statistics for Scientific Experiments - Revised (TOSSE-R) 

SD = Strongly Disagree, D = Disagree, N = Neither, A = Agree, SA = Strongly Agree 

Each item is rated on the scale SD / D / N / A / SA. 

1. I once woke up in the middle of a vegetable patch hugging a turnip that I’d mistakenly dug up thinking it was Roy’s largest root 
2. If I had a big gun I’d shoot all the students I have to teach 
3. I memorize probability values for the F-distribution 
4. I worship at the shrine of Pearson 
5. I still live with my mother and have little personal hygiene 
6. Teaching others makes me want to swallow a large bottle of bleach because the pain of my burning oesophagus would be light relief in comparison 
7. Helping others to understand sums of squares is a great feeling 
8. I like control conditions 
9. I calculate three ANOVAs in my head before getting out of bed every morning 
10. I could spend all day explaining statistics to people 
11. I like it when people tell me I’ve helped them to understand factor rotation 
12. People fall asleep as soon as I open my mouth to speak 
13. Designing experiments is fun 
14. I’d rather think about appropriate dependent variables than go to the pub 
15. I soil my pants with excitement at the mere mention of factor analysis 
16. Thinking about whether to use repeated or independent measures thrills me 
17. I enjoy sitting in the park contemplating whether to use participant observation in my next experiment 
18. Standing in front of 300 people in no way makes me lose control of my bowels 
19. I like to help students 
20. Passing on knowledge is the greatest gift you can bestow on an individual 
21. Thinking about Bonferroni corrections gives me a tingly feeling in my groin 
22. I quiver with excitement when thinking about designing my next experiment 
23. I often spend my spare time talking to the pigeons ... and even they die of boredom 
24. I tried to build myself a time machine so that I could go back to the 1930s and follow Fisher around on my hands and knees licking the floor on which he’d just trodden 
25. I love teaching 
26. I spend lots of time helping students 
27. I love teaching because students have to pretend to like me or they’ll get bad marks 
28. My cat is my only friend 






• Task 2: Dr Sian Williams (University of Brighton) devised a questionnaire to measure 
organizational ability. She predicted five factors to do with organizational ability: 
(1) preference for organization; (2) goal achievement; (3) planning approach; (4) 
acceptance of delays; and (5) preference for routine. These dimensions are theoretically 
independent. Williams’s questionnaire (Figure 17.10) contains 28 items using a 
7-point Likert scale (1 = strongly disagree, 4 = neither, 7 = strongly agree). She gave 
it to 239 people. Run a principal components analysis on the data in Williams.dat. 

Answers can be found on the companion website. 


FIGURE 17.10 Williams’s organizational ability questionnaire 

1. I like to have a plan to work to in everyday life 
2. I feel frustrated when things don’t go to plan 
3. I get most things done in a day that I want to 
4. I stick to a plan once I have made it 
5. I enjoy spontaneity and uncertainty 
6. I feel frustrated if I can’t find something I need 
7. I find it difficult to follow a plan through 
8. I am an organized person 
9. I like to know what I have to do in a day 
10. Disorganized people annoy me 
11. I leave things to the last minute 
12. I have many different plans relating to the same goal 
13. I like to have my documents filed and in order 
14. I find it easy to work in a disorganized environment 
15. I make ‘to do’ lists and achieve most of the things on it 
16. My workspace is messy and disorganized 
17. I like to be organized 
18. Interruptions to my daily routine annoy me 
19. I feel that I am wasting my time 
20. I forget the plans I have made 
21. I prioritize the things I have to do 
22. I like to work in an organized environment 
23. I feel relaxed when I don’t have a routine 
24. I set deadlines for myself and achieve them 
25. I change rather aimlessly from one activity to another during the day 
26. I have trouble organizing the things I have to do 
27. I put tasks off to another day 
28. I feel restricted by schedules and plans 


Further reading 


Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal 
of Applied Psychology, 78, 98-104. (A very readable paper on Cronbach’s α.) 

Dunteman, G. E. (1989). Principal components analysis. Sage University Paper Series on Quantitative 
Applications in the Social Sciences, 07-069. Newbury Park, CA: Sage. (This monograph is quite 
high level but comprehensive.) 

Pedhazur, E., & Schmelkin, L. (1991). Measurement, design and analysis. Hillsdale, NJ: Erlbaum. 
(Chapter 22 is an excellent introduction to the theory of factor analysis.) 

Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.). Boston: Allyn & 
Bacon. (Chapter 13 is a technical but wonderful overview of factor analysis.) 


Interesting real research 


Nichols, L. A., & Nicki, R. (2004). Development of a psychometrically sound internet addiction 
scale: A preliminary step. Psychology of Addictive Behaviors, 18(4), 381-384. 





Categorical data 



FIGURE 18.1 Midway through writing the second edition of my SPSS book, things had gone a little strange 



18.1. What will this chapter tell me? 


We discovered in the previous chapter that I wrote a book. An earlier edition of this book, 
which focused on SPSS. There are a lot of good things about writing books. The main benefit 
is that your parents are impressed. Well, they’re not that impressed actually, because 
they think that a good book sells as many copies as Harry Potter and that people should 
queue outside bookshops for the latest enthralling instalment of Discovering Statistics .... 
My parents are, consequently, quite baffled about how this book is seen as reasonably successful, 
yet I don’t get invited to dinner by the Queen. Nevertheless, given that my family 
don’t really understand what I do, books are tangible proof that I do something. The size 
of this book and the fact it has equations in it is an added bonus because it makes me look 
cleverer than I actually am. However, there is a price to pay, which is immeasurable mental 
anguish. In England we don’t talk about our emotions, because we fear that if they get out 
into the open, civilization as we know it will collapse, so I definitely will not mention that 
the writing process for the second edition of my SPSS book was so stressful that I came 
within one of Fuzzy’s whiskers of a total meltdown. It took me two years to recover, just in 
time to start thinking about the third edition and an adaptation for R. Still, it was worth it 
because the feedback suggests that some people find the books vaguely useful. Of course, 
the publishers don’t care about helping people, they care only about raking in as much 
cash as possible to feed their cocaine habits and champagne addictions. Therefore, they 
are obsessed with sales figures and comparisons with other books. They have databases 
that have sales figures of this book and its competitors in different ‘markets’ (you are not 
a person, you are a ‘consumer’, and you don’t live in a country, you live in a ‘market’) and 
they gibber and twitch at their consoles creating frequency distributions (with 3-D effects) 
of these values. The data they get are frequency data (the number of books sold in a certain 
timeframe). Therefore, if they wanted to compare sales of this book to its competitors, in 
different countries, they would need to read this chapter because it’s all about analysing 
data, for which we know only the frequency with which events occur. Of course, they 
won’t read this chapter, but they should ... 


18.2. Packages used in this chapter 


We’ll use the gmodels package in this chapter to do chi-square, and MASS for loglinear 
analysis. MASS should be installed by default in R so you should only need to install 
gmodels by executing: 

install.packages("gmodels") 

However, you will need to load both packages by executing: 
library(gmodels); library(MASS) 


18.3. Analysing categorical data 


Sometimes, we are interested not in test scores, or continuous measures, but in categorical 
variables. These are not variables involving cats (although the examples in this chapter 
might convince you otherwise), but are what we have mainly used as grouping variables. 
They are variables that describe categories of entities (see section 1.5.1.2). We’ve come 
across these types of variables in virtually every chapter of this book. There are different 
types of categorical variable (see section 6.5.7), but in theory a person, or case, should 
fall into only one category. Good examples of categorical variables are gender (with few 
exceptions, people can be only biologically male or biologically female),¹ pregnancy (a 
woman can be only pregnant or not pregnant) and voting in an election (as a general rule 
you are allowed to vote for only one candidate). In all cases (except logistic regression) so 
far, we’ve used such categorical variables to predict some kind of continuous outcome, but 
there are times when we want to look at relationships between lots of categorical variables. 
This chapter looks at two techniques for doing this. We begin with the simple case of two 
categorical variables and discover the chi-square statistic (which we’re not really discovering 
because we’ve unwittingly come across it countless times before). We then extend this 
model to look at relationships between several categorical variables. 

¹ Before anyone rips my arms from their sockets and beats me around the head with them, I am aware that 
numerous chromosomal and hormonal conditions exist that complicate the matter. Also, people can have a 
different gender identity than their biological gender. 


18.4. Theory of analysing categorical data 


We will begin by looking at the simplest situation that you could encounter; that is, analysing 
two categorical variables. If we want to look at the relationship between two categorical 
variables then we can’t use the mean or any similar statistic because we don’t have any 
variables that have been measured continuously. Trying to calculate the mean of a categorical 
variable is completely meaningless because the numeric values you attach to different 
categories are arbitrary, and the mean of those numeric values will depend on how many 
members each category has. Therefore, when we’ve measured only categorical variables, 
we analyse frequencies. That is, we analyse the number of things that fall into each combination 
of categories. If we take an example, a researcher was interested in whether animals 
could be trained to line-dance. He took 200 cats and tried to train them to line-dance by 
giving them either food or affection as a reward for dance-like behaviour. At the end of the 
week he counted how many animals could line-dance and how many could not. There are 
two categorical variables here: Training (the animal was trained using either food or affection, 
not both) and Dance (the animal either learnt to line-dance or it did not). By combining 
categories, we end up with four different categories. All we then need to do is to count 
how many cats fall into each category. We can tabulate these frequencies as in Table 18.1 
(which shows the data for this example), and this is known as a contingency table. 


Table 18.1 Contingency table showing how many cats will line-dance after being trained with 
different rewards 

                                          Training 
                            Food as reward   Affection as reward   Total 
Could they dance?   Yes           28                  48             76 
                    No            10                 114            124 
                    Total         38                 162            200 


18.4.1. Pearson’s chi-square test 


If we want to see whether there’s a relationship between two categorical variables (i.e., 
does the number of cats that line-dance relate to the type of training used?) we can use the 
Pearson’s chi-square test (Fisher, 1922; Pearson, 1900). This is an extremely elegant statistic 
based on the simple idea of comparing the frequencies you observe in certain categories 
to the frequencies you might expect to get in those categories by chance. All the way back 
in Chapters 2, 7 and 10 we saw that if we fit a model to any set of data we can evaluate 
that model using a very simple equation (or some variant of it): 

$$\text{Deviation} = \sum (\text{observed} - \text{model})^2$$

This equation was the basis of our sums of squares in regression and ANOVA. Now, 
when we have categorical data we can use the same equation. There is a slight variation 
in that we divide by the model scores as well, which is actually much the same process as 
dividing the sum of squares by the degrees of freedom in ANOVA. So, basically, what we’re 
doing is standardizing the deviation for each observation. If we add all of these standardized 
deviations together the resulting statistic is Pearson’s chi-square (χ²) given by: 


$$\chi^2 = \sum \frac{(\text{observed}_{ij} - \text{model}_{ij})^2}{\text{model}_{ij}} \tag{18.1}$$


in which i represents the rows in the contingency table and j represents the columns. The 
observed data are, obviously, the frequencies in Table 18.1, but we need to work out what 
the model is. In ANOVA the model we use is group means, but as I’ve mentioned we 
can’t work with means when we have only categorical variables so we work with frequencies 
instead. Therefore, we use ‘expected frequencies’. One way to estimate the expected 
frequencies would be to say ‘well, we’ve got 200 cats in total, and four categories, so the 
expected value is simply 200/4 = 50’. This would be fine if, for example, we had the same 
number of cats that had affection as a reward and food as a reward; however, we didn’t: 
38 got food and 162 got affection as a reward. Likewise there are not equal numbers that 
could and couldn’t dance. To take account of this, we calculate expected frequencies for 
each of the cells in the table (in this case there are four cells) and we use the column and 
row totals for a particular cell to calculate the expected value: 


$$\text{Model}_{ij} = E_{ij} = \frac{\text{row total}_i \times \text{column total}_j}{n}$$

where n is simply the total number of observations (in this case 200). We can calculate 
these expected frequencies for the four cells within our table (row total and column total 
are abbreviated to RT and CT respectively): 


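$$
\begin{aligned}
\text{Model}_{\text{Food, Yes}} &= \frac{RT_{\text{Yes}} \times CT_{\text{Food}}}{n} = \frac{76 \times 38}{200} = 14.44 \\
\text{Model}_{\text{Food, No}} &= \frac{RT_{\text{No}} \times CT_{\text{Food}}}{n} = \frac{124 \times 38}{200} = 23.56 \\
\text{Model}_{\text{Affection, Yes}} &= \frac{RT_{\text{Yes}} \times CT_{\text{Affection}}}{n} = \frac{76 \times 162}{200} = 61.56 \\
\text{Model}_{\text{Affection, No}} &= \frac{RT_{\text{No}} \times CT_{\text{Affection}}}{n} = \frac{124 \times 162}{200} = 100.44
\end{aligned}
$$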
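If you’d rather not do this arithmetic by hand, the same expected frequencies can be sketched in a few lines of base R. This is only an illustration: the object names observed and expected are my own choices for this example, not something defined elsewhere in the chapter.

```r
# Observed counts from Table 18.1 (rows: Dance; columns: Training).
# matrix() fills column by column, so the first two numbers are the Food column.
observed <- matrix(c(28, 10, 48, 114), nrow = 2,
                   dimnames = list(Dance = c("Yes", "No"),
                                   Training = c("Food", "Affection")))

n <- sum(observed)  # total number of cats (200)

# Expected count for each cell: (row total x column total) / n
expected <- outer(rowSums(observed), colSums(observed)) / n
print(expected)
```

The outer() call multiplies every row total by every column total in one go, which is the same calculation as the four equations above.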



Given that we now have these model values, all we need to do is take each value in each 
cell of our data table, subtract from it the corresponding model value, square the result, 
and then divide by the corresponding model value. Once we’ve done this for each cell in 
the table, we just add them up! 


$$
\begin{aligned}
\chi^2 &= \frac{(28-14.44)^2}{14.44} + \frac{(10-23.56)^2}{23.56} + \frac{(48-61.56)^2}{61.56} + \frac{(114-100.44)^2}{100.44} \\
&= \frac{(13.56)^2}{14.44} + \frac{(-13.56)^2}{23.56} + \frac{(-13.56)^2}{61.56} + \frac{(13.56)^2}{100.44} \\
&= 12.73 + 7.80 + 2.99 + 1.83 \\
&= 25.35
\end{aligned}
$$

This statistic can then be checked against a distribution with known properties. All we 
need to know is the degrees of freedom, and these are calculated as (r − 1)(c − 1), in which 
r is the number of rows and c is the number of columns. Another way to think of it is the 
number of levels of each variable minus one, multiplied together. In this case we get 
df = (2 − 1)(2 − 1) = 1. If you were doing the test by hand, you would find a critical value for the 
chi-square distribution with df = 1 and if the observed value was bigger than this critical value 
you would say that there was a significant relationship between the two variables. These 
critical values are produced in the Appendix, and for df = 1 the critical values are 3.84 (p = 
.05) and 6.63 (p = .01), and so because the observed chi-square is bigger than these values 
it is significant at p < .01. However, if you use R, it will simply produce an estimate of the 
precise probability of obtaining a chi-square statistic at least as big as (in this case) 25.35 if 
there were no association in the population between the variables. 
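To see the whole hand calculation in one place, here is a rough sketch in base R that computes the statistic, the degrees of freedom, the critical values and the p-value. The object names (observed, expected, chi_sq, deg_free and so on) are made up for this illustration. Because R does not round the expected frequencies along the way, its chi-square comes out a shade higher than the hand-calculated 25.35.

```r
# Observed counts from Table 18.1 (columns: Food, Affection; rows: Yes, No)
observed <- matrix(c(28, 10, 48, 114), nrow = 2)
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson's chi-square: sum of (observed - model)^2 / model over all cells
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) x (columns - 1)
deg_free <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Critical values of chi-square at p = .05 and p = .01
critical <- qchisq(c(0.95, 0.99), deg_free)

# Probability of a chi-square at least this big if there is no association
p_value <- pchisq(chi_sq, deg_free, lower.tail = FALSE)

print(round(chi_sq, 2))
print(round(critical, 2))
print(p_value)
```

Comparing chi_sq with the values in critical is the by-hand approach; p_value is the ‘precise probability’ that R reports instead.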


18.4.2. Fisher’s exact test 


There is one problem with the chi-square test, which is that the sampling distribution of 
the test statistic has an approximate chi-square distribution. The larger the sample is, the 
better this approximation becomes, and in large samples the approximation is good enough 
to not worry about the fact that it is an approximation. However, in small samples the 
approximation is not good enough, making significance tests of the chi-square distribution 
inaccurate. This is why you often read that to use the chi-square test the expected frequencies 
in each cell must be greater than 5 (see section 18.5). When the expected frequencies 
are greater than 5, the sampling distribution is probably close enough to a perfect chi- 
square distribution for us not to worry. However, when the expected frequencies are too 
low, it probably means that the sample size is too small and that the sampling distribution 
of the test statistic is too deviant from a chi-square distribution to be of any use. 

Fisher came up with a method for computing the exact probability of the chi-square statistic 
that is accurate when sample sizes are small. This method is called Fisher’s exact test 
(Fisher, 1922) even though it’s not so much a test as a way of computing the exact probability 
of the chi-square statistic. This procedure is normally used on 2 × 2 contingency tables 
(i.e., two variables each with two options) and with small samples. However, it can be used 
on larger contingency tables and with large samples, but on larger contingency tables it 
becomes computationally intensive and you might find R taking a long time to give you an 
answer. In large samples there is really no point because it was designed to overcome the 
problem of small samples, so you don’t need to use it when samples are large. 
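As a quick illustration (not something you need for the cat data, where the sample is large), base R has a fisher.test() function that takes a contingency table directly. The sketch below assumes the counts from Table 18.1; the names observed and fisher_result are just labels I have picked for the example.

```r
# Observed counts from Table 18.1 (rows: Dance; columns: Training)
observed <- matrix(c(28, 10, 48, 114), nrow = 2,
                   dimnames = list(Dance = c("Yes", "No"),
                                   Training = c("Food", "Affection")))

# Fisher's exact test on the 2 x 2 table
fisher_result <- fisher.test(observed)

# The exact (two-sided) p-value
print(fisher_result$p.value)
```

With 200 cats the exact p-value tells the same story as Pearson’s chi-square; the function earns its keep when the expected frequencies are small.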


18.4.3. The likelihood ratio 


An alternative to Pearson’s chi-square is the likelihood ratio statistic, which is based on 
maximum-likelihood theory. The general idea behind this theory is that you collect some 
data and create a model for which the probability of obtaining the observed set of data is 
maximized, then you compare this model to the probability of obtaining those data under 
the null hypothesis. The resulting statistic is, therefore, based on comparing observed frequencies 
with those predicted by the model: 


$$L\chi^2 = 2\sum \text{observed}_{ij} \ln\left(\frac{\text{observed}_{ij}}{\text{model}_{ij}}\right) \tag{18.2}$$

in which i and j are the rows and columns of the contingency table and ln is the natural 
logarithm (this is the standard mathematical function that we came across in Chapter 8, 
and you can find it on your calculator, usually labelled as ln or log_e). Using the same model 
and observed values as in the previous section, this would give us: 


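$$
\begin{aligned}
L\chi^2 &= 2\left[28 \times \ln\left(\frac{28}{14.44}\right) + 10 \times \ln\left(\frac{10}{23.56}\right) + 48 \times \ln\left(\frac{48}{61.56}\right) + 114 \times \ln\left(\frac{114}{100.44}\right)\right] \\
&= 2\left[28 \times 0.662 + 10 \times (-0.857) + 48 \times (-0.249) + 114 \times 0.127\right] \\
&= 2\left[18.54 - 8.57 - 11.94 + 14.44\right] \\
&= 24.94
\end{aligned}
$$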

As with Pearson’s chi-square, this statistic has a chi-square distribution with the same 
degrees of freedom (in this case 1). As such, it is tested in the same way: we could look 
up the critical value of chi-square for the number of degrees of freedom that we have. As 
before, the value we have here will be significant because it is bigger than the critical values 
of 3.84 (p = .05) and 6.63 (p = .01). For large samples this statistic will be roughly the same 
as Pearson’s chi-square, but is preferred when samples are small. 
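The likelihood ratio is just as easy to sketch in base R as Pearson’s chi-square. The object names below (observed, expected, lr_chi_sq) are invented for this illustration. R carries more decimal places than the hand calculation, so expect a value very close to, but not exactly, 24.94.

```r
# Observed counts from Table 18.1 (columns: Food, Affection; rows: Yes, No)
observed <- matrix(c(28, 10, 48, 114), nrow = 2)
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Likelihood ratio: 2 x sum of observed x ln(observed / model), as in equation (18.2)
lr_chi_sq <- 2 * sum(observed * log(observed / expected))

# Tested against a chi-square distribution with (r - 1)(c - 1) = 1 degree of freedom
lr_p_value <- pchisq(lr_chi_sq, df = 1, lower.tail = FALSE)

print(round(lr_chi_sq, 2))
print(lr_p_value)
```

Note that log() in R is the natural logarithm, which is what equation (18.2) needs.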


18.4.4. Yates’s correction 


When you have a 2 × 2 contingency table (i.e., two categorical variables each with two 
categories) then Pearson’s chi-square tends to produce significance values that are too 
small (in other words, it tends to make a Type I error). Therefore, Yates suggested a correction 
to the Pearson formula (usually referred to as Yates’s continuity correction). The 
basic idea is that when you calculate the deviation from the model (the observed_ij − model_ij 
in equation (18.1)) you subtract 0.5 from the absolute value of this deviation before 
you square it. In plain English, this means you calculate the deviation, ignore whether it 
is positive or negative, subtract 0.5 from the value and then square it. Pearson’s equation 
then becomes: 

$$\chi^2 = \sum \frac{\left(\left|\text{observed}_{ij} - \text{model}_{ij}\right| - 0.5\right)^2}{\text{model}_{ij}}$$


For the data in our example this just translates into: 

$$
\begin{aligned}
\chi^2 &= \frac{(13.56-0.5)^2}{14.44} + \frac{(13.56-0.5)^2}{23.56} + \frac{(13.56-0.5)^2}{61.56} + \frac{(13.56-0.5)^2}{100.44} \\
&= 11.81 + 7.24 + 2.77 + 1.70 \\
&= 23.52
\end{aligned}
$$
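The corrected statistic can be sketched in base R too, and checked against chisq.test(), which applies the continuity correction to 2 × 2 tables when correct = TRUE. The object names (observed, expected, yates_chi_sq, builtin) are my own for this illustration.

```r
# Observed counts from Table 18.1 (columns: Food, Affection; rows: Yes, No)
observed <- matrix(c(28, 10, 48, 114), nrow = 2)
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Yates's correction: subtract 0.5 from each absolute deviation before squaring
yates_chi_sq <- sum((abs(observed - expected) - 0.5)^2 / expected)

# Cross-check against base R's chi-square test with the continuity correction
builtin <- chisq.test(observed, correct = TRUE)

print(round(yates_chi_sq, 2))
print(builtin$statistic)
```

Both routes should agree with the hand-calculated 23.52.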

The key thing to note is that it lowers the value of the chi-square statistic and, therefore, 
makes it less significant. Although this seems like a nice solution to the problem there is a 
fair bit of evidence that this overcorrects and produces chi-square values that are too small! 
Howell (2006) provides an excellent discussion of the problem with Yates’s correction for 
continuity, if you’re interested; all I will say is that, although it’s worth knowing about, it’s 
probably best ignored. 

18.5. Assumptions of the chi-square test 


It should be obvious that the chi-square test does not rely on assumptions such as having 
continuous normally distributed data like most of the other tests in this book (categorical 
data cannot be normally distributed because they aren’t continuous). However, the chi- 
square test still has two important assumptions: 

• Pretty much all of the tests we have encountered in this book have made an assumption 
about the independence of data and the chi-square test is no exception. For 
the chi-square test to be meaningful it is imperative that each person, item or entity 
contributes to only one cell of the contingency table. Therefore, you cannot use a chi- 
square test on a repeated-measures design (e.g., if we had trained some cats with food 
to see if they would dance and then trained the same cats with affection to see if they 
would dance, we couldn’t analyse the resulting data with Pearson’s chi-square test). 

• The expected frequencies should be greater than 5. Although it is acceptable in larger 
contingency tables to have up to 20% of expected frequencies below 5, the result is a 
loss of statistical power (so the test may fail to detect a genuine effect). Even in larger 
contingency tables no expected frequencies should be below 1. Howell (2006) gives 
a nice explanation of why violating this assumption creates problems. If you find 
yourself in this situation consider using Fisher’s exact test (section 18.4.2). 

Finally, although it’s not an assumption, it seems fitting to mention in a section in which 
a gloomy and foreboding tone is being used that proportionately small differences in cell 
frequencies can result in statistically significant associations between variables if the sample 
is large enough (although it might need to be very large indeed). Therefore, we must look 
at row and column percentages to interpret any effects we get. These percentages will 
reflect the patterns of data far better than the frequencies themselves (because these frequencies 
will be dependent on the sample sizes in different categories). 
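Both of these checks, the expected frequencies and the column percentages, take only a few lines of base R. The sketch below assumes the counts from Table 18.1; the object names (observed, expected, col_pct) are chosen only for this example.

```r
# Observed counts from Table 18.1 (rows: Dance; columns: Training)
observed <- matrix(c(28, 10, 48, 114), nrow = 2,
                   dimnames = list(Dance = c("Yes", "No"),
                                   Training = c("Food", "Affection")))

# Expected frequencies, as stored by chisq.test()
expected <- chisq.test(observed, correct = FALSE)$expected

# Are all expected frequencies greater than 5?
print(all(expected > 5))

# Percentage of cats that did and did not dance within each type of training
col_pct <- round(100 * prop.table(observed, margin = 2), 1)
print(col_pct)
```

Setting margin = 2 gives percentages within each column (type of training); margin = 1 would give them within each row instead.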


18.6. Doing the chi-square test using R 


There are two ways in which categorical data can be entered: enter the raw scores, or enter 
weighted cases. We’ll look at both in turn. 


18.6.1. Entering data: raw scores 


If we input the raw scores, it means that every row of the data editor represents each entity 
about which we have data (in this example, each row represents a cat). So, you would 
create two codings (Training and Dance). Training would contain two values - one to indicate 
food was a reward, and one to indicate affection was a reward. Dance would contain 
Yes, or No, depending on whether the cat danced. There were 200 cats in all and so there are 
200 rows of data. This is how the data are stored in cats.dat. You can load this data file by 
setting your working directory to the location of the file (see section 3.4.4) and executing: 

catData<-read.delim("cats.dat", header = TRUE) 

The resulting data look like this (heavily edited because you don’t need to see all 200 rows 
to get the idea): 

               Training Dance 
1        Food as Reward   Yes 
2        Food as Reward   Yes 
3        Food as Reward   Yes 
4        Food as Reward   Yes 
5        Food as Reward   Yes 
29       Food as Reward    No 
30       Food as Reward    No 
31       Food as Reward    No 
32       Food as Reward    No 
33       Food as Reward    No 
39  Affection as Reward   Yes 
40  Affection as Reward   Yes 
41  Affection as Reward   Yes 
42  Affection as Reward   Yes 
43  Affection as Reward   Yes 
87  Affection as Reward    No 
88  Affection as Reward    No 
89  Affection as Reward    No 
90  Affection as Reward    No 
91  Affection as Reward    No 



SELF-TEST 

Using what you have learnt about data entry in R, 
can you work out how you would enter these data 
directly into R? 



18.6.2. Entering data: the contingency table 


An alternative method of data entry is to enter the contingency table directly. This is much 
easier if someone tells you that there were 38 cats that were given food as a reward, and 28 
of them danced, and 162 cats given affection as a reward, and 48 of them danced. There 
are several ways to enter frequency data in this way, one of which is to create two variables, 
one of which contains the two values for those given food, and one of which contains the 
two values for those given affection, and then combine them together, using the cbind() 
function. For example, for the current data we would execute: 

food <- c(10, 28) 

affection <- c(114, 48) 

catsTable <- cbind(food, affection) 

The resulting data look like this: 

     food affection 
[1,]   10       114 
[2,]   28        48 

The columns represent whether the training was done using food or affection, and the 
rows show whether the animal danced (the second row) or not (the first row). As you can 
see, this method of data entry is a fair bit easier than entering the raw data. 
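One small risk with this approach is forgetting which row is which, because the rows are labelled only [1,] and [2,]. As an optional extra (my suggestion rather than a required step), you could label the rows with the base R rownames() function:

```r
food <- c(10, 28)
affection <- c(114, 48)
catsTable <- cbind(food, affection)

# Label the rows so the table documents itself:
# first row = did not dance, second row = danced
rownames(catsTable) <- c("No", "Yes")
print(catsTable)
```

The counts are unchanged; only the row labels differ from the table above.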


18.6.3. Running the analysis with R Commander 


As always, import the data, using Data=>Import data=>from text file, clipboard, or URL... 
(see section 3.7.3), click on OK and choose the file cats.dat. To do a chi-square test, select 
Statistics=>Proportions=>Two-sample proportions test... to open the dialog box in Figure 
18.2. Pick the variable in the list labelled Groups (pick one) that defines the difference 
between the groups (in our case, that’s Training), and the outcome variable from the list 
labelled Response Variable (pick one) (in our case, that’s whether or not the cat danced, 
Dance). 

You should probably leave the default option of a two-sided test as it is (although if you 
have predicted a direction of the effect you could choose to test whether or not the differ¬ 
ence will be bigger (Difference > 0) or smaller (Difference < 0) than zero. You can choose 
the type of test: Normal approximation produces the Pearson chi-square test whereas 
Normal approximation with continuity correction produces the Yates correction to the 
chi-square test. 

When you click on 1 ok Output 18.1 is produced. The first part of the output shows 
a table that gives the percentages for each of the types of training; so when affection was 
used as a reward, 29.6% of the cats danced, but when food was used as a reward, 73.7% 
of the cats danced. 

The second part of the output shows the chi-squared test. The chi-square value of 25.36, 
with 1 degree of freedom, is highly significant because the p-value is less than .05 (in fact, 
it is .000000477). You can rerun the analysis with Yates’s correction. 


FIGURE 18.2 The chi-square test using R Commander

                     Dance
Training                No   Yes  Total  Count
  Affection as Reward  70.4  29.6   100    162
  Food as Reward       26.3  73.7   100     38

2-sample test for equality of proportions without continuity correction

data:  .Table
X-squared = 25.3557, df = 1, p-value = 4.767e-07
alternative hypothesis: two.sided
95 percent confidence interval:
 0.2838731 0.5972186
sample estimates:
   prop 1    prop 2
0.7037037 0.2631579

Output 18.1
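
The layout of Output 18.1 is that of R’s prop.test() function, so the same numbers can be obtained without the menus. The sketch below assumes that this is the function behind the dialog box; the vector names noDance and groupSize are made up for illustration. Note that the proportions in the output (0.70 and 0.26) are the proportions of cats that did not dance, because ‘No’ is the first level of Dance.

```r
# Counts of cats that did NOT dance, and the size of each training group
noDance   <- c(Affection = 114, Food = 10)
groupSize <- c(Affection = 162, Food = 38)

# correct = FALSE requests the test without the continuity correction
propResult <- prop.test(x = noDance, n = groupSize, correct = FALSE)
print(propResult)

# correct = TRUE (the default) applies Yates's continuity correction
propYates <- prop.test(x = noDance, n = groupSize, correct = TRUE)
print(propYates$statistic)
```

The uncorrected statistic should match the X-squared value in Output 18.1.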


18.6.4. Running the analysis using R


To run a chi-square analysis we can use the CrossTable() function (note the capital letters), 
in the gmodels package. This function takes two general forms depending on whether or 
not we’re inputting the raw data or a contingency table. For raw data, the function takes 
the basic form: 

CrossTable(predictor, outcome, fisher = TRUE, chisq = TRUE, expected = TRUE,
sresid = TRUE, format = "SAS"/"SPSS")

and for a contingency table:

CrossTable(contingencyTable, fisher = TRUE, chisq = TRUE, expected = TRUE,
sresid = TRUE, format = "SAS"/"SPSS")

These commands are identical except for how the variables are specified. In the first case we
enter the name of the predictor (in this case Training) and the outcome (in this case Dance)
variables and in the second case we enter the name of the object containing the contingency
table (in this case catsTable). There are several other options we can ask for: we obtain the
chi-square test by adding chisq = TRUE, and the Fisher exact test by adding fisher = TRUE.
We can also add expected = TRUE to see the expected values of each cell of the contingency
table, which is useful to ensure that the related assumption has been satisfied. We use the
sresid option to obtain standardized residuals, which are useful for breaking down a significant
effect if we get one, and for these residuals to be displayed we need to include the option
format = "SPSS". These options and some others are described in R’s Souls’ Tip 18.1.
Therefore, to run the chi-square test on our cat data, we could execute:

CrossTable(catsData$Training, catsData$Dance, fisher = TRUE, chisq = TRUE, 
expected = TRUE, sresid = TRUE, format = "SPSS") 

on the raw scores (i.e., the catsData dataframe), or: 

CrossTable(catsTable, fisher = TRUE, chisq = TRUE, expected = TRUE, sresid = 
TRUE, format = "SPSS") 

on the contingency table data (i.e., the catsTable object). These options will give us the
basic chi-square test (chisq = TRUE), Fisher’s exact test (fisher = TRUE), expected values
(expected = TRUE) and standardized residuals (sresid = TRUE in combination with format
= "SPSS").


R’s Souls’ Tip 18.1: Other CrossTable() options

The CrossTable() function has several other options that you might find useful (execute ?CrossTable to investigate
further):

• digits = x: You can specify the number of digits after the decimal point for cell proportions in the output.

• mcnemar = TRUE: This option will produce the results of McNemar’s test. This tests differences between
two related groups when nominal data have been collected. It’s typically used when we’re looking for
changes in people’s scores and it compares the proportion of people who changed their response in one
direction (i.e., scores increased) to those who changed in the opposite direction (scores decreased). So,
this test needs to be used when we’ve got two related dichotomous variables.

• prop.c = FALSE: This stops the column proportions being displayed in the output.

• prop.t = FALSE: This stops the total proportions being displayed in the output.

• prop.chisq = FALSE: This stops the chi-square contribution of each cell being displayed in the output.

• resid = TRUE: This produces Pearson residuals in the resulting contingency table.

• sresid = TRUE: This produces standardized residuals in the resulting contingency table.

• asresid = TRUE: This produces adjusted standardized residuals in the resulting contingency table.

• format = "SAS"/"SPSS": This sets the output to mimic that of SAS (default) or SPSS. To see residuals you
need to set the format to SPSS.


18.6.5. Output from the CrossTable() function


The output produced by R first shows the contingency table (Output 18.2). For each
combination of training (food or affection) and dancing (yes or no) we are given (in this
order) the number of cats, the expected frequency, the chi-square contribution of the cell,
the row proportion, the column proportion, the total proportion, and the standardized
residual. You can adapt the command you execute to produce a slightly simpler version of
this table if you prefer - see R’s Souls’ Tip 18.2.



R’s Souls’ Tip 18.2: Simplifying the contingency table


The table in Output 18.2 has rather more information than we want: we probably need only the numbers, and the
proportion of cats that danced for each type of training. That is, we probably don’t need the column proportion,
total proportion and chi-square contribution. We can remove these by adding prop.c = FALSE, prop.t = FALSE,
prop.chisq = FALSE to the command (R’s Souls’ Tip 18.1), so if you find the table confusing execute:


CrossTable(catsData$Training, catsData$Dance, fisher = TRUE, chisq = TRUE, expected
= TRUE, prop.c = FALSE, prop.t = FALSE, prop.chisq = FALSE, sresid = TRUE, format =
"SPSS")


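This version of the command will result in a simpler table in the output.

   Cell Contents
|-------------------------|
|                   Count |
|         Expected Values |
| Chi-square contribution |
|             Row Percent |
|          Column Percent |
|           Total Percent |
|            Std Residual |
|-------------------------|

Total Observations in Table:  200

                    | catsData$Dance
  catsData$Training |       Yes |        No | Row Total |
--------------------|-----------|-----------|-----------|
     Food as Reward |        28 |        10 |        38 |
                    |    14.440 |    23.560 |           |
                    |    12.734 |     7.804 |           |
                    |   73.684% |   26.316% |   19.000% |
                    |   36.842% |    8.065% |           |
                    |   14.000% |    5.000% |           |
                    |     3.568 |    -2.794 |           |
--------------------|-----------|-----------|-----------|
Affection as Reward |        48 |       114 |       162 |
                    |    61.560 |   100.440 |           |
                    |     2.987 |     1.831 |           |
                    |   29.630% |   70.370% |   81.000% |
                    |   63.158% |   91.935% |           |
                    |   24.000% |   57.000% |           |
                    |    -1.728 |     1.353 |           |
--------------------|-----------|-----------|-----------|
       Column Total |        76 |       124 |       200 |
                    |   38.000% |   62.000% |           |
--------------------|-----------|-----------|-----------|

Output 18.2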

The first value in each cell is the number of cases that fall into that combination of
categories, so these values are rather like our original contingency table. We can see that in
total 76 cats danced (38% of the total) and of these 28 were trained using food (36.8% of the
total that danced) and 48 were trained with affection (63.2% of the total that danced). Further,
124 cats didn’t dance at all (62% of the total) and of those that didn’t dance, 10 were trained
using food as a reward (8.1% of the total that didn’t dance) and a massive 114 were trained
using affection (91.9% of the total that didn’t dance). The proportion of cats within the
Dance variable (i.e., the column proportions) can be read from the fifth row in each cell. We
can also look at the percentages within the training categories by looking at the fourth rows
within each cell of the table. These values tell us, for example, that of those trained with
food as a reward, 73.7% danced and 26.3% did not. Similarly, for those trained with
affection only 29.6% danced compared to 70.4% that didn’t. In summary, when food was used
as a reward most cats would dance, but when affection was used most cats refused to dance.

Before moving on to look at the test statistic itself it is vital that we check that the
assumption for chi-square has been met. The assumption is that in 2 x 2 tables (which is
what we have here), all expected frequencies should be greater than 5. The second row of
each cell shows the expected frequencies, which incidentally are the same as we calculated
earlier; it should be clear that the smallest expected count is 14.44 (for cats that were
trained with food and did dance). This value exceeds 5 and so the assumption has been
met. If you found an expected count lower than 5 the best remedy is to collect more data
to try to boost the number of cases falling into each category.
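
If you want to see where the expected frequencies come from, they can be reproduced by hand: each one is the row total multiplied by the column total, divided by the total number of observations. The following is a minimal sketch using base R only; the object names observed and expected are chosen here for illustration.

```r
# Observed frequencies, laid out as in Output 18.2
observed <- matrix(
  c(28, 10,
    48, 114),
  nrow = 2, byrow = TRUE,
  dimnames = list(Training = c("Food", "Affection"),
                  Dance    = c("Yes", "No"))
)

# Expected frequency = (row total x column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
print(round(expected, 2))

# The assumption concerns the smallest expected frequency
smallestExpected <- min(expected)
print(smallestExpected)
```

The smallest value should agree with the ‘Minimum expected frequency’ that CrossTable() reports.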

Statistics for All Table Factors

Pearson's Chi-squared test
Chi^2 = 25.35569    d.f. = 1    p = 4.767434e-07

Pearson's Chi-squared test with Yates' continuity correction
Chi^2 = 23.52028    d.f. = 1    p = 1.236041e-06

Fisher's Exact Test for Count Data
Sample estimate odds ratio: 6.579265

Alternative hypothesis: true odds ratio is not equal to 1
p = 1.311709e-06
95% confidence interval: 2.837773 16.42969

Alternative hypothesis: true odds ratio is less than 1
p = 0.9999999
95% confidence interval: 0 14.25436

Alternative hypothesis: true odds ratio is greater than 1
p = 7.7122e-07
95% confidence interval: 3.193221 Inf

Minimum expected frequency: 14.44

Output 18.3

As we saw earlier, Pearson’s chi-square test examines whether there is an association
between two categorical variables (in this case the type of training and whether the animal
danced or not). The next part of the output produced by the CrossTable() function is the
chi-square statistic and its significance value (Output 18.3). The Pearson chi-square statistic
tests whether the two variables are independent. If the p-value is small enough (conven¬ 
tionally less than .05) then we reject the null hypothesis that the variables are independent 
and gain confidence in the hypothesis that they are in some way related. The value of the 
chi-square statistic is given in the output (and the degrees of freedom) as is the significance 
value. The value of the chi-square statistic is 25.356, which is within rounding error of 
what we calculated in section 18.4.1. This value is highly significant (p < .001), indicating 
that the type of training used had a significant effect on whether an animal 
would dance. The table included the chi-square contribution for each cell. If 
we were to add these up, we would find that they would sum to the total for 
chi-square, so: 1.831 + 2.987 + 7.804 + 12.734 = 25.356. 
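
The addition above can be checked in a few lines. This sketch computes each cell’s contribution as (observed - expected)²/expected, sums them, and compares the total with base R’s chisq.test(); using chisq.test() rather than CrossTable() is a choice made here so that no extra package is needed.

```r
# Observed frequencies, laid out as in Output 18.2
observed <- matrix(
  c(28, 10,
    48, 114),
  nrow = 2, byrow = TRUE,
  dimnames = list(Training = c("Food", "Affection"),
                  Dance    = c("Yes", "No"))
)
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square contribution of each cell, and their sum
contribution <- (observed - expected)^2 / expected
chiSquare <- sum(contribution)
print(round(contribution, 3))
print(chiSquare)

# p-value for 1 degree of freedom: (2 - 1) x (2 - 1)
pValue <- pchisq(chiSquare, df = 1, lower.tail = FALSE)
print(pValue)

# Cross-check against the built-in test without Yates's correction
builtIn <- chisq.test(observed, correct = FALSE)
print(builtIn$statistic)
```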

A series of other statistics are also included in the output. The next part is 
the chi-square with Yates’s correction (see section 18.4.4) and its value is the 
same as the value we calculated earlier (23.52). As I mentioned earlier, this test 
is probably best ignored anyway, but it does confirm the result from the main 
chi-square test. 

The final test is the Fisher’s exact test. You only need to look at the first version
of this, labelled Alternative hypothesis: true odds ratio is not equal to 1,
where it gets a p-value of .0000013, which is less than .001, and therefore
the Fisher’s exact test also shows that we should reject the null hypothesis.
(You might notice that the p-value for the Fisher’s exact test is very close to that of Yates’s
corrected chi-square, and that both are somewhat larger than the Pearson chi-square p-value.)

The highly significant result indicates that there is an association between the type of 
training and whether the cat danced or not. What we mean by an association is that the 
pattern of responses (i.e., the proportion of cats that danced to the proportion that did not) 
in the two training conditions is significantly different. This significant finding reflects the 
fact that when food is used as a reward, about 74% of cats learn to dance and 26% do not, 
whereas when affection is used, the opposite is true (about 70% refuse to dance and 30% 
do dance). Therefore, we can conclude that the type of training used significantly influ¬ 
ences the cats: they will dance for food but not for love! Having lived with a lovely cat for 
many years now, this supports my cynical view that they will do nothing unless there is a 
bowl of cat-food waiting for them at the end of it! 


18.6.6. Breaking down a significant chi-square test with standardized residuals


In a 2 × 2 contingency table like the one we have in this example, the nature of the
association can be quite clear from just the cell percentages or counts, but in larger
contingency tables it can be useful to do a finer-grained investigation of the table. In a way,
you can think of a significant chi-square test in much the same way as a significant
interaction in ANOVA: it is an effect that needs to be broken down further. One very easy way to
break down a significant chi-square test is to use data that we already have - the
standardized residual.

Just like regression, the residual is simply the error between what the model predicts (the
expected frequency) and the data actually observed (the observed frequency):

$$\text{residual}_{ij} = \text{observed}_{ij} - \text{model}_{ij}$$

in which i and j represent the two variables (i.e., the rows and columns in the contingency
table). This is the same as every other residual or deviation that we have encountered in
this book (compare this equation to, for example, equation (2.4)). To standardize this
equation, we simply divide by the square root of the expected frequency:

$$\text{standardized residual} = \frac{\text{observed}_{ij} - \text{model}_{ij}}{\sqrt{\text{model}_{ij}}}$$

Does this equation look familiar? Well, it’s basically part of equation (18.1). The only 
difference is that rather than looking at squared deviations, we’re looking at the pure 
deviation. Remember that the rationale for squaring deviations in the first place is simply 
to make them positive so that they don’t cancel out when we add them. The chi-square sta¬ 
tistic is based on adding together values, so it is important that the deviations are squared 
so that they don’t cancel out. However, if we’re not planning to add up the deviations 
or residuals then we can inspect them in their unsquared form. There are two important 
things about these standardized residuals: 

1 Given that the chi-square statistic is the sum of the squares of these standardized
residuals, if we want to decompose what contributes to the overall association that the
chi-square statistic measures, then looking at the individual standardized residuals is
a good idea because they have a direct relationship with the test statistic.

2 These standardized residuals behave like any other (see section 7.7.1.1) in the sense
that each one is a z-score. This is very useful because it means that just by looking at
a standardized residual we can assess its significance (see section 1.7.4). As we have
learnt many times before, if the value lies outside of ±1.96 then it is significant at p <
.05, if it lies outside ±2.58 then it is significant at p < .01 and if it lies outside ±3.29
then it is significant at p < .001.
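
Both points can be illustrated with a short sketch that applies the formula above to the cat data. It uses base R only, and the object names (stdResidual, criticalZ) are chosen here for illustration.

```r
# Observed frequencies, laid out as in Output 18.2
observed <- matrix(
  c(28, 10,
    48, 114),
  nrow = 2, byrow = TRUE,
  dimnames = list(Training = c("Food", "Affection"),
                  Dance    = c("Yes", "No"))
)
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Standardized residual = (observed - model) / sqrt(model)
stdResidual <- (observed - expected) / sqrt(expected)
print(round(stdResidual, 3))

# Point 1: the squared standardized residuals sum to the chi-square statistic
print(sum(stdResidual^2))

# Point 2: two-tailed critical z-values for p < .05, .01 and .001
criticalZ <- qnorm(1 - c(.05, .01, .001) / 2)
print(round(criticalZ, 2))

# Which cells are significant at p < .05?
significant <- abs(stdResidual) > criticalZ[1]
print(significant)
```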

If you included the sresid = TRUE option in the CrossTable() function (which we
encouraged you to do) you will find these standardized residuals in each cell in the contingency
table. In Output 18.2 these residuals are the bottom value within each cell. As such, there
are four residuals: one for each combination of the type of training and whether the cats
danced. When food was used as a reward the standardized residual was significant for both
those that danced (z = 3.57) and those that didn’t dance (z = -2.79) because both values
are bigger than 1.96 (when you ignore the minus sign). The plus or minus sign tells us
something about the direction of the effect, as do the counts and expected counts within
the cells. We can interpret these standardized residuals as follows: when food was used as
a reward significantly more cats than expected danced, and significantly fewer cats than
expected did not dance. When affection was used as a reward the standardized residual was
not significant for either those that danced (z = -1.73) or those that didn’t dance (z = 1.35)
because both are smaller than 1.96 (when you ignore the minus sign). This tells us that
when affection was used as a reward the numbers of cats that danced and did not dance did
not differ significantly from what we would expect.
In a nutshell, the cells for when food was used as a reward both significantly contribute to
the overall chi-square statistic. Put another way, the association between the type of reward
and dancing is mainly driven by when food is a reward.


18.6.7. Calculating an effect size


The most common and possibly most useful measure of effect size for categorical data is 
the odds ratio, which we encountered in Chapter 8. Odds ratios are most interpretable 
in 2 x 2 contingency tables and are probably not useful for larger contingency tables. 
However, this isn’t as restrictive as you might think because, as I’ve said more times than I 
care to recall in the GLM chapters, effect sizes are only ever useful when they summarize 
a focused comparison. A 2 x 2 contingency table is the categorical data equivalent of a 
focused comparison! 

The odds ratio in its basic form is simple enough to calculate. If we look at our example, 
we can first calculate the odds that a cat danced given that they had food as a reward. This 
is simply the number of cats that were given food and danced, divided by the number of 
cats given food that didn’t dance: 


$$\text{odds}_{\text{dancing after food}} = \frac{\text{number that had food and danced}}{\text{number that had food but didn't dance}} = \frac{28}{10} = 2.8$$

Next we calculate the odds that a cat danced given that they had affection as a reward.
This is simply the number of cats that were given affection and danced, divided by the
number of cats given affection that didn’t dance:

$$\text{odds}_{\text{dancing after affection}} = \frac{\text{number that had affection and danced}}{\text{number that had affection but didn't dance}} = \frac{48}{114} = 0.421$$

The odds ratio is simply the odds of dancing after food divided by the odds of dancing
after affection:

$$\text{odds ratio} = \frac{\text{odds}_{\text{dancing after food}}}{\text{odds}_{\text{dancing after affection}}} = \frac{2.8}{0.421} = 6.65$$

What this tells us is that if a cat was trained with food the odds of their dancing were
6.65 times higher than if they had been trained with affection. As you can see, this is an
extremely elegant and easily understood metric for expressing the effect you’ve got.

The above description shows the basic odds ratio, which is particularly useful for
getting a sense of what the measure represents; however, there are other more sophisticated
ways to estimate the odds ratio and its associated confidence interval. Luckily, if we include
fisher = TRUE in our CrossTable() function then the output will include one such method.
In Output 18.3 we are told that the odds ratio is 6.58 (note that this is slightly smaller
than our calculation, because it is a conditional maximum-likelihood estimate rather than
the simple ratio of the sample odds), and that it has a confidence interval of 2.84 to 16.43.
Remember from Chapter 8 that the important thing is that the confidence interval does not
cross 1. A value of 1 means that the odds of dancing after food would be exactly
the same as dancing after affection, a value less than 1 means that the odds of dancing are
smaller after food than after affection, and a value greater than 1 means that the odds of
dancing are greater after food than after affection. In other words, 1 is the point at which
the direction of the effect changes. Therefore, if the confidence interval crosses 1 it means
that the population value of the observed effect might be in the same direction as in your
sample, but it could also be in the opposite direction.

18.6.8. Reporting the results of chi-square


When reporting Pearson’s chi-square we simply state the value of the test statistic with its
associated degrees of freedom and the significance value. The test statistic, as we’ve seen,
is denoted by χ². The output tells us that the value of χ² was 25.36, that the degrees of
freedom on which this was based were 1, and that it was significant at p < .001. It’s also
useful to reproduce the contingency table, and my vote would go to quoting the odds ratio
and its confidence interval too. As such, we could report:

• There was a significant association between the type of training and whether or not
cats would dance, χ²(1) = 25.36, p < .001. This seems to represent the fact that, based
on the odds ratio, the odds of cats dancing were 6.58 (2.84, 16.43) times higher if
they were trained with food than if trained with affection.

CRAMMING SAM’S TIPS


The chi-square test


• If you want to test the relationship between two categorical variables you can do this with the chi-square test.

• Look at the value of the chi-square test; if the p-value is less than .05 then there is a significant relationship between your
two variables.

• Check to make sure that no expected frequencies are less than 5.

• Look at the crosstabulation table to work out what the relationship between the variables is. Better still, look out for significant
standardized residuals (values outside of ±1.96), and calculate the odds ratio.

• Report the χ² statistic, the degrees of freedom and the significance value. Also report the contingency table.



Labcoat Leni’s Real Research 18.1

Is the black American happy?


Beckham, A. S. (1929). Journal of Abnormal and Social Psychology, 24, 186-190. 

When I was doing my psychology degree I spent a lot of time reading about the civil rights movement in the 
USA. Although I was supposed to be reading psychology, I became more interested in Malcolm X and Martin 
Luther King Jr. This is why I find Beckham’s 1929 study of black Americans such an interesting piece of research. 
Beckham was a black American academic who founded the psychology laboratory at Howard University, 
Washington, DC, and his wife Ruth was the first black woman ever to be awarded a Ph.D. (also in psychology) 
at the University of Minnesota. The article needs to be placed within the era in which it was published. To put
some context on the study, it was published 35 years before the Jim Crow laws were finally overthrown by the
Civil Rights Act of 1964, and in a time when black Americans were segregated, openly discriminated against and
were victims of the most abominable violations of civil liberties and human rights. For a richer context I suggest
reading James Baldwin’s superb book The Fire Next Time. Even the language of the study and the data from it
are an uncomfortable reminder of the era in which it was conducted.

Beckham sought to measure the psychological state of black Americans with three questions put to 3443 
black Americans from different walks of life. He asked them whether they thought black Americans were happy, 
whether they personally were happy as a black American, and whether black Americans should be happy. They 
could answer only yes or no to each question. By today’s standards the study is quite simple, and he did no 
formal statistical analysis of his data (Fisher’s article containing the popularized version of the chi-square test 
was published only 7 years earlier in a statistics journal that would not have been read by psychologists). I love 
this study, though, because it demonstrates that you do not need elaborate methods to answer important and 
far-reaching questions; with just three questions, Beckham told the world an enormous amount about very real 
and important psychological and sociological phenomena. 

The frequency data (number of yes and no responses within each employment category) from this 
study are in the file Beckham1929.dat. Labcoat Leni wants you to carry out three chi-square tests 
(one for each question that was asked). What conclusions can you draw? 

Answers are in the additional material on the companion website.

18.7. Several categorical variables: loglinear analysis


So far we’ve looked at situations in which there are only two categorical variables. However, 
often we want to analyse more complex contingency tables in which there are three or more 
variables. For example, what about if we took the example we’ve just used but also collected 
data from a sample of 70 dogs? We might want to compare the behaviour in dogs to that in 
cats. We would now have three variables: Animal (dog or cat), Training (food as reward or 
affection as reward) and Dance (did they dance or not?). This couldn’t be analysed with the 
Pearson chi-square and instead has to be analysed with a technique called loglinear analysis. 


18.7.1. Chi-square as regression


To begin with, let’s have a look at how our simple chi-square example can be expressed 
as a regression model. Although we already know about as much as we need to about the 
chi-square test, if we want to understand more complex situations life becomes consider¬ 
ably easier if we consider our model as a general linear model (i.e., regression). All of the 
general linear models we’ve considered in this book take the general form of: 

$$\text{outcome}_i = (\text{model}) + \text{error}_i$$

For example, when we encountered multiple regression in Chapter 7 we saw that this
model was written as (see equation (7.9)):

$$Y_i = (b_0 + b_1 X_{1i} + b_2 X_{2i} + \cdots + b_n X_{ni}) + \varepsilon_i$$

Also, when we came across one-way ANOVA, we adapted this regression model to
conceptualize our Viagra example, as (see equation (10.2)):

$$\text{libido}_i = b_0 + b_2\,\text{high}_i + b_1\,\text{low}_i + \varepsilon_i$$

The t-test was conceptualized in a similar way. In all cases the same basic equation is used;
it’s just the complexity of the model that changes. With categorical data we can use the same 
model in much the same way as with regression to produce a linear model. In our current 
example we have two categorical variables: Training (food or affection) and Dance (yes 
they did dance or no they didn’t dance). Both variables have two categories and so we can 
represent each one with a single dummy variable (see section 7.12.1) in which one category 
is coded as 0 and the other as 1. So for training, we could code ‘food’ as 0 and ‘affection’ 
as 1, and we could code the dancing variable as 0 for ‘yes’ and 1 for ‘no’ (see Table 18.2). 


Table 18.2 Coding scheme for dancing cats

Training    Dance   Dummy (Training)   Dummy (Dance)   Interaction   Frequency
Food        Yes     0                  0               0             28
Food        No      0                  1               0             10
Affection   Yes     1                  0               0             48
Affection   No      1                  1               1             114

This situation might be familiar if you think back to factorial ANOVA (section 12.3) in
which we also had two variables as predictors. In that situation we saw that when there are
two variables the general linear model became (think back to equation (12.1)):

$$\text{outcome}_i = (b_0 + b_1 A_i + b_2 B_i + b_3 AB_i) + \varepsilon_i$$

in which A represents the first variable, B represents the second and AB represents the
interaction between the two variables. Therefore, we can construct a linear model using
these dummy variables that is exactly the same as the one we used for factorial ANOVA
(above). The interaction term will simply be the Training variable multiplied by the Dance
variable (look at Table 18.2, and if it doesn’t make sense look back to section 12.3 because
the coding is exactly the same as this example):

$$\text{outcome}_i = (\text{model}) + \text{error}_i$$

$$\text{outcome}_{ij} = (b_0 + b_1\,\text{Training}_i + b_2\,\text{Dance}_j + b_3\,\text{Interaction}_{ij}) + \varepsilon_{ij} \tag{18.3}$$

However, because we’re using categorical data, to make this model linear we have to
actually use log values (see Chapter 8) and so the actual model becomes:²

$$\ln(O_i) = \ln(\text{model}) + \ln(\varepsilon_i)$$

$$\ln(O_{ij}) = (b_0 + b_1\,\text{Training}_i + b_2\,\text{Dance}_j + b_3\,\text{Interaction}_{ij}) + \ln(\varepsilon_{ij}) \tag{18.4}$$


Training, Dance and Interaction can take the values 0 and 1, depending on which com¬ 
bination of categories we’re looking at (Table 18.2). Therefore, to work out what the 
b-values represent in this model we can do the same as we did for the t-test and ANOVA 
and look at what happens when we replace Training and Dance with values of 0 and 1. 
To begin with, let’s see what happens when we look at when Training and Dance are both 
zero. This represents the category of cats that got food reward and did line-dance. When 
we used this sort of model for the t-test and ANOVA the outcomes we used were taken 
from the observed data: we used the group means (e.g., see sections 9.4.2 and 10.2.3). 
However, with categorical variables, means are rather meaningless because we haven’t 
measured anything on an ordinal or interval scale, instead we merely have frequency data. 
Therefore, we use the observed frequencies (rather than observed means) as our outcome 
instead. In Table 18.1 we saw that there were 28 cats that had food for a reward and did 
line-dance. If we use this as the observed outcome then the model can be written as (if we 
ignore the error term for the time being): 


$$\ln(O_{ij}) = b_0 + b_1\,\text{Training}_i + b_2\,\text{Dance}_j + b_3\,\text{Interaction}_{ij}$$

For cats that had food reward and did dance, the Training and Dance variables and the
interaction will all be 0 and so the equation reduces down to:

$$\begin{aligned}
\ln(O_{\text{Food,Yes}}) &= b_0 + (b_1 \times 0) + (b_2 \times 0) + (b_3 \times 0) \\
\ln(O_{\text{Food,Yes}}) &= b_0 \\
\ln(28) &= b_0 \\
b_0 &= 3.332
\end{aligned}$$

Therefore, b₀ in the model represents the log of the observed value when all of the
categories are zero. As such it’s the log of the observed value of the base category (in this case cats
that got food and danced).

² Actually, the convention is to denote b₀ as θ and the b-values as λ, but I think these notational changes serve only
to confuse people so I’m sticking with b because I want to emphasize the similarities to regression and ANOVA.

Now, let’s see what happens when we look at cats that had affection as a reward and
danced. In this case, the Training variable is 1 and the Dance variable and the interaction are
still 0. Also, our outcome now changes to be the observed value for cats that received affection
and danced (from Table 18.1 we can see the value is 48). Therefore, the equation becomes:

$$\begin{aligned}
\ln(O_{\text{Affection,Yes}}) &= b_0 + (b_1 \times 1) + (b_2 \times 0) + (b_3 \times 0) \\
\ln(O_{\text{Affection,Yes}}) &= b_0 + b_1 \\
b_1 &= \ln(O_{\text{Affection,Yes}}) - b_0
\end{aligned}$$

Remembering that b₀ is the log of the observed frequency for cats that had food and danced, we get:

$$\begin{aligned}
b_1 &= \ln(O_{\text{Affection,Yes}}) - \ln(O_{\text{Food,Yes}}) \\
&= \ln(48) - \ln(28) \\
&= 3.871 - 3.332 \\
&= 0.539
\end{aligned}$$

The important thing is that b₁ is the difference between the log of the observed frequency
for cats that received affection and danced, and the log of the observed frequency for cats that
received food and danced. Put another way, within the group of cats that danced it
represents the difference between those trained using food and those trained using affection.

Now, let’s see what happens when we look at cats that had food as a reward and did not 
dance. In this case, the Training variable is 0, the Dance variable is 1 and the interaction is again 
0. Our outcome now changes to be the observed frequency for cats that received food but did 
not dance (from Table 18.1 we can see the value is 10). Therefore, the equation becomes: 

\[
\begin{aligned}
\ln(O_{\text{Food,No}}) &= b_0 + (b_1 \times 0) + (b_2 \times 1) + (b_3 \times 0) \\
\ln(O_{\text{Food,No}}) &= b_0 + b_2 \\
b_2 &= \ln(O_{\text{Food,No}}) - b_0
\end{aligned}
\]

Remembering that $b_0$ is the log of the observed value for cats that had food and danced, we get:

\[
\begin{aligned}
b_2 &= \ln(O_{\text{Food,No}}) - \ln(O_{\text{Food,Yes}}) \\
&= \ln(10) - \ln(28) \\
&= 2.303 - 3.332 \\
&= -1.029
\end{aligned}
\]


The important thing is that $b_2$ is the difference between the log of the observed frequency
for cats that received food and danced, and the log of the observed frequency for cats that 
received food and didn’t dance. Put another way, within the group of cats that received food 
as a reward it represents the difference between cats that didn’t dance and those that did. 

Finally, we can look at cats that had affection and didn’t dance. In this case, the Training and
Dance variables are both 1 and the interaction (which is the value of Training multiplied by
the value of Dance) is also 1. We can also replace $b_0$, $b_1$ and $b_2$ with what we now know
they represent. The outcome is the log of the observed frequency for cats that received
affection but didn’t dance (this observed value is 114; see Table 18.1). Therefore, the equation
becomes (I’ve used the shorthand of A for affection, F for food, Y for yes, and N for no):

\[
\begin{aligned}
\ln(O_{A,N}) &= b_0 + (b_1 \times 1) + (b_2 \times 1) + (b_3 \times 1) \\
\ln(O_{A,N}) &= b_0 + b_1 + b_2 + b_3 \\
\ln(O_{A,N}) &= \ln(O_{F,Y}) + \big(\ln(O_{A,Y}) - \ln(O_{F,Y})\big) + \big(\ln(O_{F,N}) - \ln(O_{F,Y})\big) + b_3 \\
\ln(O_{A,N}) &= \ln(O_{A,Y}) + \ln(O_{F,N}) - \ln(O_{F,Y}) + b_3 \\
b_3 &= \ln(O_{A,N}) - \ln(O_{F,N}) + \ln(O_{F,Y}) - \ln(O_{A,Y}) \\
&= \ln(114) - \ln(10) + \ln(28) - \ln(48) \\
&= 1.895
\end{aligned}
\]

So, $b_3$ in the model really compares the difference between affection and food when the
cats didn’t dance to the difference between affection and food when the cats did dance.
Put another way, it compares the effect of Training when cats didn’t dance to the effect of
Training when they did dance.
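The four coefficients derived above are nothing more than sums and differences of logged cell counts, so they are easy to check by hand. The following is a minimal sketch in R using the observed frequencies quoted from Table 18.1; the object names (obs, b0, b1 and so on) are invented here purely for illustration.

```r
# Observed frequencies for the cats, as quoted from Table 18.1
obs <- c(FoodYes = 28, AffectionYes = 48, FoodNo = 10, AffectionNo = 114)

# Baseline cell: food as reward, danced
b0 <- log(obs[["FoodYes"]])

# Effect of Training among cats that danced
b1 <- log(obs[["AffectionYes"]]) - log(obs[["FoodYes"]])

# Effect of Dance among cats rewarded with food
b2 <- log(obs[["FoodNo"]]) - log(obs[["FoodYes"]])

# Interaction: what is left over in the affection/no-dance cell
b3 <- log(obs[["AffectionNo"]]) - log(obs[["FoodNo"]]) +
  log(obs[["FoodYes"]]) - log(obs[["AffectionYes"]])

round(c(b0 = b0, b1 = b1, b2 = b2, b3 = b3), 3)
```

Because nothing is rounded until the end, the value for b2 comes out as -1.030 rather than the -1.029 obtained from the rounded logs above; the difference is only rounding error.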

The final model is therefore: 


\[
\ln(O_{ij}) = 3.332 + 0.539\,\text{Training} - 1.029\,\text{Dance} + 1.895\,\text{Interaction} + \ln(\varepsilon_{ij})
\]



The important thing to note here is that everything is exactly the same as factorial ANOVA 
except that we dealt with log-transformed values (in fact compare this section to section 
12.3 to see just how similar everything is). In case you still don’t believe me that this works 
as a general linear model, I’ve prepared a file called CatRegression.dat, which contains the
two variables Dance (0 = yes, 1 = no) and Training (0 = food, 1 = affection) and the interaction
(Interaction). There is also a variable called Observed that contains the observed
frequencies in Table 18.1 for each combination of Dance and Training. Finally, there is a 
variable called LnObserved, which is the natural logarithm of these observed frequencies 
(remember that throughout this section we’ve dealt with the log observed values). 




SELF-TEST

Run a multiple regression analysis using CatRegression.dat with LnObserved as the
outcome, and Training, Dance and Interaction as your three predictors.


Output 18.4 shows the resulting coefficients table from this regression. The important
thing to note is that the constant, $b_0$, is 3.332 as calculated above, the beta value for type
of training, $b_1$, is 0.539 and for dance, $b_2$, is -1.030, both of which are within rounding
error of what was calculated above. Also the coefficient for the interaction, $b_3$, is 1.895 as
predicted. There is one interesting point, though: all of the standard errors are zero (or
very, very close to zero), or, put differently, there is no error at all in this model (which is
also why the t-values are absurdly large and the significance tests are meaningless). This is
because the various combinations of coding variables completely explain the observed values.
This is known as a saturated model, and I will return to this point later, so bear it in mind.
For the time being, I hope this convinces you that chi-square can be conceptualized as a linear model.
Coefficients:
              Estimate   Std. Error     t value   Pr(>|t|)
(Intercept)  3.332e+00    1.289e-15   2.585e+15     <2e-16 ***
Training     5.390e-01    1.622e-15   3.322e+14     <2e-16 ***
Dance       -1.030e+00    2.513e-15  -4.097e+14     <2e-16 ***
Interaction  1.895e+00    2.774e-15   6.830e+14     <2e-16 ***


Output 18.4 
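For readers without the data file to hand, the regression behind Output 18.4 can be approximated from the cell counts alone. This is only a sketch under stated assumptions: it assumes CatRegression.dat holds one row per cat, and it rebuilds a stand-in dataframe (called cats here, a made-up name) from the Table 18.1 frequencies using the coding implied by the derivation (Training: 0 = food, 1 = affection; Dance: 0 = danced, 1 = did not dance).

```r
# Cell counts from Table 18.1 with the dummy coding used in the derivation
cells <- data.frame(
  Training = c(0, 1, 0, 1),   # 0 = food, 1 = affection
  Dance    = c(0, 0, 1, 1),   # 0 = danced, 1 = did not dance
  Observed = c(28, 48, 10, 114)
)

# Assumption: one row per cat, so repeat each cell row by its frequency
cats <- cells[rep(seq_len(nrow(cells)), times = cells$Observed), ]
cats$Interaction <- cats$Training * cats$Dance
cats$LnObserved  <- log(cats$Observed)

# Saturated model: both main effects plus the interaction
saturated <- lm(LnObserved ~ Training + Dance + Interaction, data = cats)
round(coef(saturated), 3)
```

The coefficients should agree with the hand calculations. Asking for summary(saturated) is less informative: because the fit is essentially perfect, R warns that the summary may be unreliable, which is the same point made about the near-zero standard errors.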

OK, this is all very well, but the heading of this section did rather imply that I would 
show you how the chi-square test can be conceptualized as a linear model. Well, basically, 
the chi-square test looks at whether two variables are independent; therefore, it has no 
interest in the combined effect of the two variables, only their unique effect. Thus, we can 
conceptualize chi-square in much the same way as the saturated model, except that we 
don’t include the interaction term. If we remove the interaction term, our model becomes: 


\[
\ln(\text{Model}_{ij}) = b_0 + b_1\,\text{Training}_i + b_2\,\text{Dance}_j
\]


With this new model, we cannot predict the observed values like we did for the saturated 
model because we’ve lost some information (namely, the interaction term). Therefore, the 
outcome from the model changes, and therefore the beta-values change too. We saw earlier 
that the chi-square test is based on ‘expected frequencies’. Therefore, if we’re conceptualizing
the chi-square test as a linear model, our outcomes will be these expected values. If you
look back to the beginning of this chapter you’ll see we already have the expected frequencies
based on this model. We can recalculate the beta values based on these expected values:


\[
\ln(E_{ij}) = b_0 + b_1\,\text{Training}_i + b_2\,\text{Dance}_j
\]


For cats that had food reward and did dance, the Training and Dance variables will be 0 
and so the equation reduces down to: 


\[
\begin{aligned}
\ln(E_{\text{Food,Yes}}) &= b_0 + (b_1 \times 0) + (b_2 \times 0) \\
\ln(E_{\text{Food,Yes}}) &= b_0 \\
b_0 &= \ln(14.44) \\
&= 2.67
\end{aligned}
\]


Therefore, $b_0$ in the model represents the log of the expected value when all of the categories
are zero.

When we look at cats that had affection as a reward and danced, the Training variable is 
1 and the Dance variable is still 0. Also, our outcome now changes to be the expected value 
for cats that received affection and danced: 

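\[
\begin{aligned}
\ln(E_{\text{Affection,Yes}}) &= b_0 + (b_1 \times 1) + (b_2 \times 0) \\
\ln(E_{\text{Affection,Yes}}) &= b_0 + b_1 \\
b_1 &= \ln(E_{\text{Affection,Yes}}) - b_0 \\
&= \ln(E_{\text{Affection,Yes}}) - \ln(E_{\text{Food,Yes}}) \\
&= \ln(61.56) - \ln(14.44) \\
&= 1.45
\end{aligned}
\]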
The important thing is that $b_1$ is the difference between the log of the expected frequency
for cats that received affection and did dance and the log of the expected frequency for cats
that received food and danced. In fact, the value is the same as the one we get from the column
marginals, that is, the difference between the log of the total number of cats getting affection
and the log of the total number of cats getting food: ln(162) - ln(38) = 1.45. Put simply, it
represents the main effect of the type of Training.

When we look at cats that had food as a reward and did not dance, the Training variable 
is 0 and the Dance variable is 1. Our outcome now changes to be the expected frequency 
for cats that received food but did not dance: 

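\[
\begin{aligned}
\ln(E_{\text{Food,No}}) &= b_0 + (b_1 \times 0) + (b_2 \times 1) \\
\ln(E_{\text{Food,No}}) &= b_0 + b_2 \\
b_2 &= \ln(E_{\text{Food,No}}) - b_0 \\
&= \ln(E_{\text{Food,No}}) - \ln(E_{\text{Food,Yes}}) \\
&= \ln(23.56) - \ln(14.44) \\
&= 0.49
\end{aligned}
\]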

Therefore, $b_2$ is the difference between the log of the expected frequencies for cats that
received food and didn’t or did dance. In fact, the value is the same as the one we get from the
row marginals, that is, the difference between the log of the total number of cats that didn’t
dance and the log of the total number that did: ln(124) - ln(76) = 0.49. In simpler terms, it is
the main effect of whether or not the cat danced.

We can double-check all of this by looking at the final cell: 

\[
\begin{aligned}
\ln(E_{\text{Affection,No}}) &= b_0 + (b_1 \times 1) + (b_2 \times 1) \\
\ln(E_{\text{Affection,No}}) &= b_0 + b_1 + b_2 \\
\ln(100.44) &= 2.67 + 1.45 + 0.49 \\
4.61 &= 4.61
\end{aligned}
\]


The final chi-square model is therefore: 

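\[
\begin{aligned}
\ln(O_{ij}) &= \ln(\text{model}) + \ln(\varepsilon_{ij}) \\
\ln(O_{ij}) &= 2.67 + 1.45\,\text{Training} + 0.49\,\text{Dance} + \ln(\varepsilon_{ij})
\end{aligned}
\]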


We can rearrange this to get some residuals (the error term): 
\[
\ln(\varepsilon_{ij}) = \ln(O_{ij}) - \ln(\text{model})
\]


In this case, the model is merely the expected frequencies that were calculated for the
chi-square test, so the residuals are the differences between the logs of the observed and
expected frequencies.




SELF-TEST

To show that this all actually works, run another multiple regression analysis using
CatRegression.dat. This time the outcome is the log of expected frequencies (LnExpected)
and Training and Dance are the predictors (the interaction is not included).
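A rough version of this second regression can also be run without the data file. The sketch below assumes nothing beyond the Table 18.1 counts: it computes the chi-square expected frequencies from the row and column totals, logs them, and regresses them on the two dummy variables. The object names (observed, expected, cells, independence) are illustrative, and the analysis uses one row per cell rather than one row per cat.

```r
# Observed counts for the cats (rows: Dance, columns: Training)
observed <- matrix(c(28, 10, 48, 114), nrow = 2,
                   dimnames = list(Dance = c("Yes", "No"),
                                   Training = c("Food", "Affection")))

# Chi-square expected frequencies: row total x column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Dummy coding as in the text: Training 0 = food; Dance 0 = danced
cells <- data.frame(
  Training = c(0, 1, 0, 1),
  Dance    = c(0, 0, 1, 1),
  Expected = c(expected["Yes", "Food"], expected["Yes", "Affection"],
               expected["No", "Food"],  expected["No", "Affection"])
)
cells$LnExpected <- log(cells$Expected)

# Main effects only: no interaction term
independence <- lm(LnExpected ~ Training + Dance, data = cells)
round(coef(independence), 2)
```

The intercept and the two slopes should reproduce the values 2.67, 1.45 and 0.49 derived by hand, because the logged expected frequencies are exactly additive in Training and Dance.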
This demonstrates how chi-square can work as a linear model, just like regression and
ANOVA, in which the beta values tell us something about the relative differences in frequencies
across categories of our two variables. If nothing else made sense, I want you to
leave this section aware that chi-square (and analysis of categorical data generally) can be
expressed as a linear model (although we have to use log values). We can express categories
of a variable using dummy variables, just as we did with regression and ANOVA, and
the resulting beta values can be calculated in exactly the same way as for regression and 
ANOVA. In ANOVA, these beta values represented differences between the means of a 
particular category compared against a baseline category. With categorical data, the beta 
values represent the same thing, the only difference being that rather than dealing with 
means, we’re dealing with expected values. Grasping this idea (that regression, t-tests, 
ANOVAs and categorical data analysis are basically the same) will help (me) considerably 
in the next section. 



18.7.2. Loglinear analysis


In the previous section, after nearly reducing my brain to even more of a rotting vegetable 
than it already is trying to explain how categorical data analysis is just another form of 
regression, I ran the data through an ordinary regression in R to prove that I wasn’t talking 
complete gibberish. At the time I rather glibly said ‘oh, by the way, there’s no error in the 
model, that’s odd isn’t it?’ and sort of passed this off by telling you that it was a ‘saturated’ 
model and not to worry too much about it because I’d explain it all later just as soon as I’d 
worked out what the hell was going on. That seemed like a good avoidance tactic at the 
time, but unfortunately I now have to explain what I was going on about. 

To begin with, I hope you’re now happy with the idea that categorical data can be 
expressed in the form of a linear model provided that we use log values (this, incidentally, 
is why the technique we’re discussing is called loglinear analysis). From what you hopefully 
already know about ANOVA and linear models generally, you should also be cosily tucked 
up in bed with the idea that we can extend any linear model to include any number of predictors
and any resulting interaction terms between predictors. Therefore, if we can represent
a simple two-variable categorical analysis in terms of a linear model, then it shouldn’t
amaze you to discover that if we have more than two variables this is no problem: we 
can extend the simple model by adding whatever variables and the resulting interaction 
terms. This is all you really need to know. So, just as in multiple regression and ANOVA, 
if we think of things in terms of a linear model, then conceptually it becomes very easy to 
understand how the model expands to incorporate new variables. So, for example, if we 
have three predictors (A, B and C) in ANOVA (think back to section 14.4) we end up with
three two-way interactions (AB, AC, BC) and one three-way interaction (ABC). Therefore,
the resulting linear model of this is just: 

\[
\text{outcome}_{ijk} = (b_0 + b_1 A_i + b_2 B_j + b_3 C_k + b_4 AB_{ij} + b_5 AC_{ik} + b_6 BC_{jk} + b_7 ABC_{ijk}) + \varepsilon_{ijk}
\]

In exactly the same way, if we have three variables in a categorical data analysis we get 
an identical model, but with an outcome in terms of logs:

\[
\ln(O_{ijk}) = (b_0 + b_1 A_i + b_2 B_j + b_3 C_k + b_4 AB_{ij} + b_5 AC_{ik} + b_6 BC_{jk} + b_7 ABC_{ijk}) + \ln(\varepsilon_{ijk})
\]

Obviously the calculation of beta values and expected values from the model becomes considerably
more cumbersome and confusing, but that’s why we invented computers - so that
we don’t have to worry about it. Loglinear analysis works on these principles. However, as
we’ve seen in the two-variable case, when our data are categorical and we include all of the
available terms (main effects and interactions) we get no error: our predictors can perfectly 
predict our outcome (the expected values). So, if we start with the most complex model 
possible, we will get no error. The job of loglinear analysis is to try to fit a simpler model 
to the data without any substantial loss of predictive power. Therefore, loglinear analysis 
typically works on a principle of backward elimination (yes, the same kind of backward 
elimination that we can use in multiple regression - see section 7.6.4.3). So we begin with 
the saturated model, and then we remove a predictor from the model and, using this new 
model, we predict our data (calculate expected frequencies, just like the chi-square test) 
and then see how well the model fits the data (i.e., are the expected frequencies close to the 
observed frequencies?). If the fit of the new model is not very different from the more complex
model, then we abandon the complex model in favour of the new one. Put another
way, we assume the term we removed was not having a significant impact on the ability of 
our model to predict the observed data. 

However, we don’t just remove terms randomly, we do it hierarchically. So, we start with 
the saturated model and then remove the highest-order interaction, and assess the effect 
that this has. If removing the interaction term has no effect on the model then it’s obviously
not having much of an effect; therefore, we get rid of it and move on to remove any
lower-order interactions. If removing these interactions has no effect then we carry on to 
any main effects until we find an effect that does affect the fit of the model if it is removed. 

To put this in more concrete terms, at the beginning of the section on loglinear analysis I 
asked you to imagine we’d extended our training and line-dancing example to incorporate 
a sample of dogs. So, we now have three variables: Animal (dog or cat), Training (food or 
affection) and Dance (did they dance or not?). Just as in ANOVA this results in three main 
effects: 

• Animal 

• Training 

• Dance 

three interactions involving two variables: 

• Animal x Training 

• Animal x Dance 

• Training x Dance 

and one interaction involving all three variables: 

• Animal x Training x Dance 

When I talk about backward elimination, all I mean is that loglinear analysis starts by 
including all of these effects; we then take the highest-order interaction (in this case the 
three-way interaction of Animal x Training x Dance) and remove it. We construct a new 
model without this interaction, and from the model calculate expected frequencies. We 
(well, the computer) then compare these expected frequencies (or model frequencies) to
the observed frequencies using the standard equation for the likelihood ratio statistic (see 
section 18.4.3). If the new model significantly changes the likelihood ratio statistic, then 
removing this interaction term has a significant effect on the fit of the model and we know 
that this effect is statistically important. If this is the case then we will stop there and say 
that we have a significant three-way interaction! We won’t test any other effects because 
with categorical data all lower-order effects are consumed within higher-order effects. 
If, however, removing the three-way interaction doesn’t significantly affect the fit of the 
model then we move on to lower-order interactions. Therefore, we look at the Animal x
Training, Animal x Dance and Training x Dance interactions in turn and construct models 
in which these terms are not present. For each model we again calculate expected values 
and compare them to the observed data using a likelihood ratio statistic (see footnote 3). Again, if any one
of these models does result in a significant change in the likelihood ratio then the term is 
retained and we won’t move on to look at any main effects involved in that interaction 
(so if the Animal x Training interaction is significant it won’t look at the main effects 
of Animal or Training). However, if the likelihood ratio is unchanged then the analysis 
removes the offending interaction term and moves on to look at main effects. 
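The first step of this hierarchical procedure can be sketched in R. The example below is an illustration rather than the analysis as the chapter later runs it: it rebuilds a three-way table from the cell counts reported in Outputs 18.5 and 18.6, fits the saturated model and the model without the three-way interaction using loglm() from the MASS package, and compares their likelihood ratio statistics. The object names (cellCounts, threeWay, saturated, noThreeWay) are invented for this sketch.

```r
library(MASS)  # provides loglm()

# Cell counts taken from Outputs 18.5 (cats) and 18.6 (dogs)
cellCounts <- data.frame(
  Animal   = rep(c("Cat", "Dog"), each = 4),
  Training = rep(rep(c("Food", "Affection"), each = 2), times = 2),
  Dance    = rep(c("Yes", "No"), times = 4),
  Freq     = c(28, 10, 48, 114,   # cats
               20, 14, 29, 7)     # dogs
)
threeWay <- xtabs(Freq ~ Animal + Training + Dance, data = cellCounts)

# Step 1: saturated model (all main effects and interactions)
saturated <- loglm(~ Animal * Training * Dance, data = threeWay)

# Step 2: remove the highest-order (three-way) interaction
noThreeWay <- loglm(~ (Animal + Training + Dance)^2, data = threeWay)

# Step 3: has removing the term significantly worsened the fit?
lrChange <- noThreeWay$lrt - saturated$lrt
dfChange <- noThreeWay$df - saturated$df
pValue   <- pchisq(lrChange, df = dfChange, lower.tail = FALSE)
c(lrChange = lrChange, df = dfChange, p = pValue)
```

If the change is significant, the procedure stops and the three-way interaction is retained; otherwise the same comparison is repeated for each two-way interaction in turn.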

I mentioned that the likelihood ratio statistic (see section 18.4.3) is used to assess each
model. It should be clear how that equation can be adapted to fit any model: the observed
values are the same throughout, and the model frequencies are simply the expected
frequencies from the model being tested. For the saturated model, this statistic will always
be 0 (because the observed and model frequencies are the same so the ratio of observed to
model frequencies will be 1, and ln(1) = 0), but as we’ve seen, in other cases it will provide
a measure of how well the model fits the observed frequencies. To test whether a new model
has changed the likelihood ratio, all we need do is to take the likelihood ratio for a model
and subtract from it the likelihood ratio statistic for the previous model
(provided the models are hierarchically structured):


\[
L\chi^2_{\text{change}} = L\chi^2_{\text{current model}} - L\chi^2_{\text{previous model}} \tag{18.5}
\]
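Equation (18.5) can be tried out on the two-variable cat example. The sketch below assumes that the likelihood ratio statistic from section 18.4.3 has its usual form, twice the sum of each observed frequency multiplied by the log of the ratio of observed to model frequency; the helper function likelihoodRatio() is a made-up name, not part of any package.

```r
# Likelihood ratio statistic: 2 * sum(observed * ln(observed / model))
likelihoodRatio <- function(observed, model) {
  2 * sum(observed * log(observed / model))
}

observed     <- c(28, 48, 10, 114)                 # Table 18.1
independence <- c(14.44, 61.56, 23.56, 100.44)     # expected, no interaction

# Previous model: saturated, so model frequencies equal the observed ones
lrSaturated <- likelihoodRatio(observed, observed)

# Current model: interaction removed
lrIndependence <- likelihoodRatio(observed, independence)

# Equation (18.5): change = current model - previous model
lrChange <- lrIndependence - lrSaturated
dfChange <- 1   # one term (the interaction) was removed
pchisq(lrChange, df = dfChange, lower.tail = FALSE)
```

The saturated statistic is exactly 0 because every ratio is 1, so the change is simply the likelihood ratio of the simpler model (about 24.93 here).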


I’ve tried in this section to give you a flavour of how loglinear analysis works, without 
actually getting too much into the nitty-gritty of the calculations. I’ve tried to show you 
how we can conceptualize a chi-square analysis as a linear model and then relied on what 
I’ve previously told you about ANOVA to hope that you can extrapolate these conceptual 
ideas to understand roughly what’s going on. The curious among you might want to know 
exactly how everything is calculated and to these people I have two things to say: ‘I don’t 
know’ and ‘I know a really good place where you can buy a straitjacket’. If you’re that 
interested then Tabachnick and Fidell (2007) have, as ever, written a wonderfully detailed 
and lucid chapter on the subject, which frankly puts this feeble attempt to shame. Still, 
assuming you’re happy to live in relative ignorance, we’ll now have a look at how to do a 
loglinear analysis. 


18.8. Assumptions in loglinear analysis


Loglinear analysis is an extension of the chi-square test and so has similar assumptions; 
that is, an entity should fall into only one cell of the contingency table (i.e., cells of the 
table must be independent) and the expected frequencies should be large enough for a reliable
analysis. In loglinear analysis with more than two variables it’s all right to have up to
20% of cells with expected frequencies less than 5; however, all cells must have expected 
frequencies greater than 1. If this assumption is broken the result is a radical reduction in 
test power - so dramatic in fact that it may not be worth bothering with the analysis at all. 
Remedies for problems with expected frequencies are: (1) collapse the data across one of 
the variables (preferably the one you least expect to have an effect!); (2) collapse levels of 
one of the variables; (3) collect more data; or (4) accept the loss of power. 
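The expected-frequency rule is simple to check once the expected frequencies are available. The sketch below uses a hypothetical helper, checkExpected(), and the expected frequencies from the cat example purely as illustrative input; in a real analysis the fitted values from whichever model is being tested would be supplied instead.

```r
# Hypothetical helper: summarize expected frequencies against the rule
# "no more than 20% of cells below 5, and every cell above 1"
checkExpected <- function(expected) {
  c(propBelow5 = mean(expected < 5),   # proportion of cells with expected < 5
    minimum    = min(expected))        # smallest expected frequency
}

expected <- c(14.44, 61.56, 23.56, 100.44)   # cat example, no-interaction model
result <- checkExpected(expected)
result

# TRUE if the assumption is met
assumptionMet <- result[["propBelow5"]] <= 0.20 && result[["minimum"]] > 1
assumptionMet
```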


3 It’s worth mentioning that for every model, the computation of expected values differs, and as the designs get 
more complex, the computation gets increasingly tedious and incomprehensible (at least to me); however, you 
don’t need to know the calculations to get a feel for what is going on. 
If you want to collapse data across one of the variables then certain things have to be
considered: 

1 The highest-order interaction should be non-significant. 

2 At least one of the lower-order interaction terms involving the variable to be deleted 
should be non-significant. 

Let’s take the example we’ve been using. Say we wanted to delete the Animal variable. 
Then for this to be valid, the Animal x Training x Dance interaction should be non-significant,
and either the Animal x Training or the Animal x Dance interaction should also be
non-significant. You can also collapse categories within a variable. So, if you had a variable of
‘season’ relating to spring, summer, autumn and winter, and you had very few observations 
in winter, you could consider reducing the variable to three categories: spring, summer, 
autumn/winter perhaps. However, you should really only combine categories that it makes 
theoretical sense to combine. Finally, some people overcome the problem by simply adding 
a constant to all cells of the table, but there really is no point in doing this because it doesn’t 
address the issue of power. 
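Collapsing levels of a variable is straightforward in R. The sketch below uses invented data for the hypothetical ‘season’ variable described above; it merges autumn and winter by giving both levels the same label, which R treats as a single level.

```r
# Invented data: a season factor with very few winter observations
season <- factor(c("spring", "summer", "autumn", "winter",
                   "summer", "spring", "autumn"),
                 levels = c("spring", "summer", "autumn", "winter"))

# Give autumn and winter the same label; duplicate labels are merged
collapsed <- season
levels(collapsed)[levels(collapsed) %in% c("autumn", "winter")] <- "autumn/winter"

table(collapsed)
```

The original factor is left untouched, so the uncollapsed version remains available if combining the categories turns out not to make theoretical sense.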


18.9. Loglinear analysis using R


18.9.1. Initial considerations



Data are entered for loglinear analysis in the same way as for the chi-square test (see
sections 18.6.1 and 18.6.2). The data for the cat and dog example are in the file
CatsandDogs.dat; load and open this file by setting your working directory to the location
of the file and executing:

catsDogs<-read.delim("CatsandDogs.dat", header = TRUE) 
catsDogs 


Notice that the data set has three variables (Animal, Training and Dance) and each one 
contains text representing the different categories of these variables. To begin with, we 
should produce a contingency table of the data. 

The CrossTable() function cannot cope with three variables in a table. There are a couple
of ways to deal with this limitation. One way is to use a subset function (section 3.9.2) to 
create two dataframes, and run CrossTable() on each. This is the most useful thing to do if 
you want row and/or column percentages and so we’ll cover that here. However, you can 
also use the table() and xtabs() functions (see Oliver Twisted). 





OLIVER TWISTED

Please Sir, can I have some more ... tables?

‘Fagin has challenged me to steal a gold-coated table from under a rich gent’s nose while
he eats roast goose from it. I wonder how I can do it?’ ponders Oliver. ‘Perhaps the table()
function will help me, or maybe xtabs() although that’s probably best for tables with adult
content.’ OK, Oliver, I’ll explain them to you on the companion website, but I think you’ll be
disappointed with them as aids to your criminal activities. If you ever need to cross-tabulate
some frequencies though, you’ll be glad you asked.
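As a brief illustration of the xtabs() route, the sketch below builds a three-way table and flattens it for printing with ftable(). Because the data file is not reproduced here, it uses a stand-in dataframe (catsDogsDemo, a made-up name) with one row per animal, reconstructed from the cell counts reported in Outputs 18.5 and 18.6; with the real data you would pass catsDogs instead.

```r
# Stand-in for the catsDogs dataframe, rebuilt from the reported cell counts
cellCounts <- data.frame(
  Animal   = rep(c("Cat", "Dog"), each = 4),
  Training = rep(rep(c("Food as Reward", "Affection as Reward"), each = 2),
                 times = 2),
  Dance    = rep(c("Yes", "No"), times = 4),
  Freq     = c(28, 10, 48, 114, 20, 14, 29, 7)
)
catsDogsDemo <- cellCounts[rep(seq_len(nrow(cellCounts)),
                               times = cellCounts$Freq), 1:3]

# Three-way contingency table of counts
threeWay <- xtabs(~ Animal + Training + Dance, data = catsDogsDemo)

# ftable() prints the three-way table as one flat, readable table
ftable(threeWay)
```

Unlike CrossTable(), this gives only the raw counts; row and column percentages would need to be computed separately.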
To create the separate dataframes for cats and dogs, we execute:

justCats = subset(catsDogs, Animal=="Cat") 
justDogs = subset(catsDogs, Animal=="Dog") 

The first command creates a dataframe called justCats which is based on the whole 
dataframe ( catsDogs ) but includes only cases for which the variable Animal is exactly equal 
to the word ‘Cat’. The second command does much the same but selects only dogs. 

Having created these two new dataframes, we can use the CrossTable() command to
generate contingency tables for each of them by executing: 

CrossTable(justCats$Training, justCats$Dance, sresid = TRUE, prop.t = FALSE, 
prop.c = FALSE, prop.chisq = FALSE, format = "SPSS") 

CrossTable(justDogs$Training, justDogs$Dance, sresid = TRUE, prop.t = FALSE,
prop.c = FALSE, prop.chisq = FALSE, format = "SPSS") 

These commands produce a contingency table of the variables Training and Dance for cats 
(first command) and dogs (second command). These commands probably look quite long, 
but this is only because we have asked to suppress the total proportions (prop.t = FALSE),
the column proportions (prop.c = FALSE) and chi-square proportions (prop.chisq = FALSE),
so that they don’t appear in the output (see R’s Souls’ Tip 18.2). We have also asked for
standardized residuals (sresid = TRUE in combination with format = "SPSS") because these
might come in handy for interpretation.

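Cell Contents
   Count
   Row Percent
   Std Residual

Total Observations in Table: 200

                      | justCats$Dance
justCats$Training     |       No |      Yes | Row Total
----------------------|----------|----------|----------
Affection as Reward   |      114 |       48 |       162
                      |  70.370% |  29.630% |   81.000%
                      |    1.353 |   -1.728 |
----------------------|----------|----------|----------
Food as Reward        |       10 |       28 |        38
                      |  26.316% |  73.684% |   19.000%
                      |   -2.794 |    3.568 |
----------------------|----------|----------|----------
Column Total          |      124 |       76 |       200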

Output 18.5 


The crosstabulation table produced by the CrossTable() function contains the number of 
cases that fall into each combination of categories. The first table (Output 18.5) contains 
the information for cats and is the same information as in Output 18.3, because the data 
are the same (we just added some dogs to the data). Output 18.6 shows the frequencies 
for the dogs; we can summarize the data in a similar way as we did for the cats. In total 49 
dogs danced and of these 20 were trained using food and 29 were trained with affection. 
Further, 21 dogs didn’t dance at all. In summary, a lot more dogs danced than didn’t. Of 
those that had affection as a reward, 80.56% danced compared to 19.44% that didn’t, but 
for those rewarded with food only 58.82% danced compared to 41.18% that didn’t. In 
short, dogs seem more willing to dance than cats (70% compared to 38%), and seem more
motivated by affection than cats (81% danced compared to 30% of cats). 

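Cell Contents
   Count
   Row Percent
   Std Residual

Total Observations in Table: 70

                      | justDogs$Dance
justDogs$Training     |       No |      Yes | Row Total
----------------------|----------|----------|----------
Affection as Reward   |        7 |       29 |        36
                      |  19.444% |  80.556% |   51.429%
                      |   -1.156 |    0.757 |
----------------------|----------|----------|----------
Food as Reward        |       14 |       20 |        34
                      |  41.176% |  58.824% |   48.571%
                      |    1.190 |   -0.779 |
----------------------|----------|----------|----------
Column Total          |       21 |       49 |        70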

Output 18.6 
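The row percentages and standardized residuals in Output 18.6 can be reproduced with base R alone, which is a useful check on what CrossTable() is reporting. This sketch types the dog counts in by hand (the object name dogs is made up) and assumes that the ‘Std Residual’ entries are Pearson residuals, (observed - expected)/sqrt(expected), which is what chisq.test() stores in its residuals component.

```r
# Dog counts from Output 18.6 (rows: Training, columns: Dance)
dogs <- matrix(c(7, 14, 29, 20), nrow = 2,
               dimnames = list(Training = c("Affection as Reward",
                                            "Food as Reward"),
                               Dance = c("No", "Yes")))

# Row percentages: each row sums to 100
rowPercent <- round(100 * prop.table(dogs, margin = 1), 3)
rowPercent

# Pearson residuals: (observed - expected) / sqrt(expected)
stdResid <- round(chisq.test(dogs, correct = FALSE)$residuals, 3)
stdResid
```

Both tables should match the corresponding rows of Output 18.6, for example 80.556% of affection-trained dogs dancing, with a residual of 0.757 in that cell.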


18.9.2. Loglinear analysis as a chi-square test


First we’ll do a loglinear analysis that involves only the cats, and we’ll ignore the dogs. 
We’ll see that loglinear analysis gives the same results as the chi-square test, then we’ll 
generalize the model to dogs and cats, and show how you can do things with the loglinear 
analysis that you can’t do with the chi-square test. 

To do the loglinear analysis, we can use the loglm() function. The easiest way to use this 
function is by entering a contingency table into it; in which case it takes the general form: 

newModel<-loglm(~ predictors, data = contingencyTable, fit = TRUE)

In other words, it creates a model object called newModel based on a contingency table
(contingencyTable), with a specified list of variables and/or interactions (predictors). The
format is very much like that of the lm() function that we have used throughout
the book.

The first stage, therefore, is to create a contingency table to put into the loglm() function;
we can do this using the xtabs() function. This function takes the general form:

newTable<-xtabs(~ classifying variables, data = dataFrame) 

In other words, we create a new contingency table called newTable, which is based on an existing
dataframe, and we simply list the variables by which we want to classify cases. In this case,
we want to look at only the cats, so we’ll use the justCats dataframe that we generated in 
the previous section, and we want to classify cases based on Training and whether they 
danced or not (Dance). Therefore, we could execute: 

catTable<-xtabs(~ Training + Dance, data = justCats) 
This command will create an object called catTable, which takes the justCats dataframe
and classifies cases based on levels of the variables Training and Dance. If you look at 
catTable by executing its name, you’ll see that it is a very simple table of counts: 

                     Dance
Training               No Yes
  Affection as Reward 114  48
  Food as Reward       10  28

We input this object into loglm(). We are going to run two loglinear analyses. First, we’ll
run the saturated model. As explained earlier in this chapter, this model will be able to
reproduce the observed frequencies exactly and hence the chi-square statistic will be zero,
with a p-value equal to 1. In the second model, we will remove the interaction effect leaving
only the effects of Dance and Training.

When creating loglinear models, we use a formula, in the same way that we have
throughout this book when using functions like lm() and glm(). We have seen that usually
we write models as ‘outcome ~ predictor(s)’; however, in loglinear analysis there is no
outcome (dependent) variable because we’re predicting the frequency of cases in different
combinations of the predictors. Therefore, we don’t include anything on the left of the
tilde. In general, we’d write something like ‘~ a + b’ in which a and b are predictor variables;
because we want all of the effects in the saturated model, our model will be
~ Training + Dance + Training:Dance. We can, therefore, create this model by executing (see footnote 4):

catSaturated<-loglm(~ Training + Dance + Training:Dance, data = catTable, 
fit = TRUE) 

In the second model, we remove the interaction effect and have only Training and Dance 
as predictors. In doing so we should be able to predict the proportion of individuals in each 
category by knowing the proportion of cats who were trained each way, and the proportion
of cats who danced. So if 19% of the cats were trained with food, and 38% of the cats
danced, we would expect to know how many cats had food as training and danced, how 
many had food as training and did not dance, how many had affection and danced, and 
how many had affection and did not dance. In fact, we’d expect 38% of the cats that had 
food as training to have danced, and 38% of the cats that had affection as training to have 
danced. We can obtain these expected values by adding fit = TRUE to the loglm() function, 
which tells the command to also calculate the fitted (expected) values. We’ll compare these 
values with the proportions that we have, and we’ll do a significance test to see if they 
differ. (That’s exactly what we do in the chi-square test.) We create this new model in the 
same way as the saturated model, except that we change the name of the model, and we 
omit the interaction term: 

catNoInteraction<-loglm(~ Training + Dance, data = catTable, fit = TRUE) 
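The two models can be tried without the data file by typing the cat counts in directly. The sketch below is a stand-alone illustration, not a replacement for the commands above: it uses made-up object names (catCounts, catTableDemo, saturatedDemo, noInteractionDemo) so as not to overwrite the objects created in the text, and it reads the test statistics from the lrt, pearson and df components of the fitted loglm objects.

```r
library(MASS)  # provides loglm()

# Cat counts as shown in the catTable output
catCounts <- data.frame(
  Training = c("Food as Reward", "Food as Reward",
               "Affection as Reward", "Affection as Reward"),
  Dance    = c("Yes", "No", "Yes", "No"),
  Freq     = c(28, 10, 48, 114)
)
catTableDemo <- xtabs(Freq ~ Training + Dance, data = catCounts)

# Saturated model: reproduces the observed counts exactly
saturatedDemo <- loglm(~ Training + Dance + Training:Dance,
                       data = catTableDemo, fit = TRUE)

# Main effects only: the model underlying the chi-square test
noInteractionDemo <- loglm(~ Training + Dance,
                           data = catTableDemo, fit = TRUE)

saturatedDemo$lrt            # likelihood ratio statistic, saturated model
noInteractionDemo$lrt        # likelihood ratio statistic, no interaction
noInteractionDemo$pearson    # Pearson chi-square, no interaction
fitted(noInteractionDemo)    # expected frequencies under independence
```

The fitted values of the second model should be the familiar expected frequencies (14.44, 23.56, 61.56 and 100.44), and its Pearson statistic is the ordinary chi-square for the 2 x 2 table.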

Finally, we can create a mosaic plot. A mosaic plot is a graphical representation of frequency
data. Essentially, a square is divided up into portions, where the size of each portion
represents the number of cases (or expected frequencies) relative to the total. Figure 18.3 
shows some examples of mosaic plots. In the top left the big square has been divided up 
into four shaded squares of equal size: the fact the squares are equal size tells us that there 
are an equal number of cats who danced and didn’t dance and were trained with food and 
affection. In the top right, the two squares under ‘affection’ are wider than those under 
‘food’, but the heights of the squares on the dance/no dance dimension are equal. This tells
us that there were more cats trained with affection than food (because on this dimension 


4 Given what you have already learnt about specifying models it should be clear that you could specify the full 
model as follows (because Training*Dance will include the main effects automatically):
catSaturated<-loglm(~Training*Dance, data = catTable, fit = TRUE) 
the squares are wider for affection than food), but the same number of cats danced and
didn’t dance (because the squares are the same height). In the bottom left we can see that an 
equal number of cats were trained with food and affection (because the boxes are equally 
wide), but more cats danced than didn’t (the boxes are longer for the ‘yes’ category than 
the ‘no’). Finally, the bottom right shows a situation in which more cats were trained with 
affection than food (we can tell because the boxes under ‘affection’ are wider than for 
‘food’), but also for food training more cats danced than not (the box is longer for ‘yes’) 
and for affection training equal numbers of cats danced and did not (the boxes for this 
category are of equal size). Therefore, by looking at mosaic plots we can get an idea of the 
relative frequency of different categories. To do a mosaic plot in R, we can use the
mosaicplot() function, which takes the general form:

mosaicplot(contingencyTable, shade = TRUE, main = "Title") 

Therefore, we need only input a contingency table; we can then optionally provide a title
by using the main = option, and ask for shading to show which areas of the plot are
significant by including the shade = option.

We can create plots of the expected values from our two models: these are stored in a
variable called fit that is attached to each model, so we can access them using model$fit
(in which model is the name of the model). For the saturated model, the expected values
will be the same as the raw data (because the model is a perfect fit of the data), but for the
second model these expected values will be the same as the expected cell values for the
chi-square test that we computed at the beginning of the chapter (i.e., 14.44, 23.56, 61.56,
and 100.44).

FIGURE 18.3

To create a mosaic plot for the saturated model we can simply execute: 

mosaicplot(catSaturated$fit, shade = TRUE, main = "Cats: Saturated Model") 

This command will create a plot of the expected values from the catSaturated model,
give it the title ‘Cats: Saturated Model’, and shade it to highlight significant areas.
Similarly, the expected values for the second model can be plotted by executing: 

mosaicplot(catNoInteraction$fit, shade = TRUE, main = "Cats: Expected Values") 
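If you don't have the cat data to hand, the whole sequence (table, two models, two plots) can be sketched from the cell counts alone. This is only an illustration under stated assumptions: the table is typed in with matrix() and not produced by xtabs(), and the plots are written to a temporary PDF file (the file name is made up) so that the code also runs without a screen device.

```r
library(MASS)  # provides loglm()

# Rebuild the cat contingency table from its four cell counts.
catTable <- as.table(matrix(c(114, 10, 48, 28), nrow = 2,
  dimnames = list(Training = c("Affection as Reward", "Food as Reward"),
                  Dance = c("No", "Yes"))))

catSaturated <- loglm(~ Training * Dance, data = catTable, fit = TRUE)
catNoInteraction <- loglm(~ Training + Dance, data = catTable, fit = TRUE)

# Saturated model: the fitted values reproduce the data.
print(catSaturated$fit)
# No-interaction model: the fitted values are the chi-square expected counts.
print(round(catNoInteraction$fit, 2))

# Draw both mosaic plots; drop pdf() and dev.off() to draw on screen instead.
pdf(file.path(tempdir(), "catMosaics.pdf"))
mosaicplot(catSaturated$fit, shade = TRUE, main = "Cats: Saturated Model")
mosaicplot(catNoInteraction$fit, shade = TRUE, main = "Cats: Expected Values")
invisible(dev.off())
```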


18.9.3. Output from loglinear analysis as a chi-square test


Output 18.7 shows the output of the saturated model and Output 18.8 the model without 
the interaction term - I’ve cut out some of the boring details that you don’t need to worry 
about. The summary of the saturated model shows that the model is not significant. In fact, 
the chi-square is zero, and the p-value is 1. Remember that these are goodness-of-fit tests, 
which means that they test whether the expected values from the model deviate from the 
observed data. A non-significant result therefore means a good fit. In fact, the statistic is 
0 (and p-value is 1) because this model fits the data perfectly: the expected values are the 
same as the actual data. 

Formula:
~Training + Dance + Training:Dance

Statistics:
                 X^2 df P(> X^2)
Likelihood Ratio   0  0        1
Pearson            0  0        1

Output 18.7 

In the second model, we drop the interaction term. This term allowed the proportion of 
cats that danced to vary across conditions. In other words, it allowed an association between 
the Dance and Training variables. When this association is removed from the model, the 
model does not fit the data well any more: likelihood ratio and Pearson chi-square test are 
very similar, and both are highly significant, which means that the model deviates significantly
from the data. In other words, when we remove the association between Dance and
Training, the model becomes a poor fit to the data. This is what the chi-square test that 
we did at the beginning of the chapter measures: what is the effect of the association or 
interaction between Dance and Training on cell frequencies. Note that the values of the 
likelihood ratio and chi-square test in Output 18.8 are the same as those we computed at 
the start of the chapter. 

Formula:
~Training + Dance

Statistics:
                      X^2 df     P(> X^2)
Likelihood Ratio 24.93159  1 5.940113e-07
Pearson          25.35569  1 4.767434e-07

Output 18.8 
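To see where the two statistics in Output 18.8 come from, we can compute them directly from the observed and expected counts. The sketch below uses the standard formulas for Pearson's chi-square and the likelihood ratio; the object names are made up, and the counts are typed in by hand.

```r
# Observed cat counts (rows: Training, columns: Dance).
observed <- matrix(c(114, 10, 48, 28), nrow = 2,
                   dimnames = list(Training = c("Affection as Reward", "Food as Reward"),
                                   Dance = c("No", "Yes")))

# Expected counts under the model without the interaction term.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square: sum of (observed - expected)^2 / expected.
pearson <- sum((observed - expected)^2 / expected)

# Likelihood ratio: 2 * sum of observed * log(observed / expected).
likelihoodRatio <- 2 * sum(observed * log(observed / expected))

# Degrees of freedom for a 2 x 2 table: (rows - 1) * (columns - 1) = 1.
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
pValues <- pchisq(c(likelihoodRatio, pearson), df = df, lower.tail = FALSE)

print(c(LikelihoodRatio = likelihoodRatio, Pearson = pearson))
print(pValues)
```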
FIGURE 18.4

Mosaic plot summarizing the Cat Saturated Model (plot title: ‘Cats: Saturated Model’;
Training axis categories: ‘Affection as Reward’ and ‘Food as Reward’)


Ultimately, we’re trying to find the most parsimonious model that does not deviate
significantly from the data; in this case this is the saturated model, because the other model is
significant (and hence a poor fit of the data). Therefore, we interpret the saturated model.

Figure 18.4 shows the mosaic plot, which summarizes the saturated model for us. The size
of each rectangle represents the number of cats. The colour and the boundary of each
rectangle tell us about the residuals. A blue rectangle with a solid boundary means the residual is
positive; a red rectangle (on your screen, but dark grey in Figure 18.4) or a dashed boundary
means the residual is negative. A rectangle gets coloured only if the standardized residual
is higher than 2, or lower than -2. Recall that a standardized residual is significant if it is
(approximately) higher than 2, or lower than -2, so the coloured rectangles are significant.

What the graph shows is that more cats than we would have expected (given the null
hypothesis) failed to dance when given affection, as shown by the large rectangle on the
top left with the solid boundary, but that the residual (i.e., the difference between the
number we found and the number we expect, given the null hypothesis) was not significant,
because the rectangle is white. Similarly, fewer cats than we expected danced when given
affection as a reward, as shown by the rectangle on the bottom left. We know it’s fewer,
because it has a dashed boundary, and we know it’s not significant, because it’s white.

When it comes to those cats that were rewarded with food, a different story emerges. 
Fewer cats failed to dance when rewarded with food than we would have expected. We can 
see this from the light grey rectangle (pink on your screen) on the top right. We know that 
it’s fewer, because of the dashed boundary, and it is coloured, so it is significant. 

Similarly, more cats danced when rewarded with food than we would have expected, 
given the null hypothesis. We can see this, because the lower right rectangle has a solid 
boundary, and because it is shaded blue, we know that the residual is positive. 
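The shading described above can be checked numerically. The sketch below assumes that the plot is shaded using Pearson residuals from the no-interaction model (which, as far as I know, is the default residual type in mosaicplot()); the object names are made up for illustration.

```r
# Observed cat counts (rows: Training, columns: Dance).
observed <- matrix(c(114, 10, 48, 28), nrow = 2,
                   dimnames = list(Training = c("Affection as Reward", "Food as Reward"),
                                   Dance = c("No", "Yes")))

# Expected counts if Training and Dance were unrelated.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson residual for each cell: (observed - expected) / sqrt(expected).
pearsonResiduals <- (observed - expected) / sqrt(expected)
print(round(pearsonResiduals, 2))

# Cells whose residual is beyond +/-2 are the ones that get coloured.
coloured <- abs(pearsonResiduals) > 2
print(coloured)
```

The sign of each residual corresponds to the solid (positive) or dashed (negative) boundary, and only the two ‘Food as Reward’ cells pass the threshold of 2, which matches the two coloured rectangles in Figure 18.4.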

Figure 18.5 shows the mosaic plot of the fitted values for the model without the interaction
term: here all the residuals are zero (they are all white boxes with solid boundaries).
The plot shows that it doesn’t matter what sort of training the cat had, the same proportion 
danced in both groups. 
FIGURE 18.5

Mosaic plot of the fitted values for the Cat Model without the interaction term


18.9.4. Loglinear analysis


Usually when we do a loglinear analysis we have more than two variables - we can do it 
with two (as we’ve seen), but there’s no point, because a chi-square test does the job for us 
and it’s easier and less confusing. When we have more than two variables, we have more 
possible effects than before. Recall that if we have two variables, then we have the main
effects (Training + Dance) and the interaction effect (Training x Dance). With three
variables, we have:

1 Main effects: Training + Dance + Animal

2 Two-way interactions: Because we have three variables, there are three two-way
  interactions:

  a Training x Dance
  b Dance x Animal
  c Training x Animal

3 Three-way interaction: Training x Dance x Animal

You can see that as the number of variables increases, the number of effects increases 
even more. With four variables (which I’ll call a, b, c and d because I have no imagination 
whatsoever), we have: 

1 Main effects:
  a + b + c + d

2 Two-way interactions:
  a a x b
  b a x c
  c a x d
  d b x c
  e b x d
  f c x d

3 Three-way interactions:
  a a x b x c
  b a x b x d
  c a x c x d
  d b x c x d

4 Four-way interaction:
  a a x b x c x d
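The lists above follow a simple counting rule: with k variables there are choose(k, j) effects of order j, so the number of effects grows quickly. A small sketch in base R that enumerates them for the four hypothetical variables a, b, c and d (the object names are made up):

```r
variables <- c("a", "b", "c", "d")

# List every effect of each order by taking all combinations of the variables.
for (k in seq_along(variables)) {
  effects <- combn(variables, k, FUN = paste, collapse = " x ")
  cat(k, "-way effects (", length(effects), "): ",
      paste(effects, collapse = ", "), "\n", sep = "")
}

# The same counts without listing the effects.
effectCounts <- choose(length(variables), seq_along(variables))
print(effectCounts)
print(sum(effectCounts))  # total number of effects in the saturated model
```

With three variables the same calculation gives 3 main effects, 3 two-way interactions and 1 three-way interaction.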

The principle is the same however many effects you have: you start with the saturated 
model, and remove effects until the model becomes significant (i.e., significantly deviates 
from the original data). When the model is significant, you go back to the last model that 
was not significant and interpret it (because it will be the best fit you can achieve given the 
available predictor variables). 

First of all we need to generate our contingency table using xtabs(), and we can do this 
by executing: 

CatDogContingencyTable<-xtabs(~ Animal + Training + Dance, data = catsDogs) 

This takes the original dataframe (catsDogs) and creates a contingency table based on
the variables Animal, Training and Dance. The resulting contingency table is stored as 
CatDogContingencyTable, which is what we’ll use in the loglinear analysis; it looks like 
this: 


, , Dance = No

      Training
Animal Affection as Reward Food as Reward
   Cat                 114             10
   Dog                   7             14

, , Dance = Yes

      Training
Animal Affection as Reward Food as Reward
   Cat                  48             28
   Dog                  29             20
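If you want to follow along without the original data file, the same three-way table can be rebuilt from the eight counts printed above. This is a sketch, not the route taken in the chapter: the counts are typed in by hand, and the object name longFormat is made up. Converting the table to a dataframe of frequencies and back with xtabs() shows that the two forms carry the same information.

```r
# Counts in the order R fills an array: Animal varies fastest, then Training, then Dance.
CatDogContingencyTable <- as.table(array(
  c(114, 7, 10, 14,   # Dance = No:  Affection (Cat, Dog), Food (Cat, Dog)
    48, 29, 28, 20),  # Dance = Yes: Affection (Cat, Dog), Food (Cat, Dog)
  dim = c(2, 2, 2),
  dimnames = list(Animal = c("Cat", "Dog"),
                  Training = c("Affection as Reward", "Food as Reward"),
                  Dance = c("No", "Yes"))))
print(CatDogContingencyTable)

# One row per cell, with the count stored in a column called Freq.
longFormat <- as.data.frame(CatDogContingencyTable)
print(longFormat)

# xtabs() with Freq on the left-hand side rebuilds the contingency table.
rebuilt <- xtabs(Freq ~ Animal + Training + Dance, data = longFormat)
```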

We start by estimating the saturated model, which we know will fit the data perfectly 
with a chi-square equal to zero. We’ll call the model caturated because I feel the need for a 
rubbish cat-related pun. We can create this model in the same way as before: 5 

caturated<-loglm(~ Animal*Training*Dance, data = CatDogContingencyTable) 
summary(caturated) 

The first command creates the model called caturated based on all main effects and
interactions in the contingency table called CatDogContingencyTable. The second command
summarizes this model; Output 18.9 shows the main statistics, and as we expect it has a
likelihood ratio of 0, and a p-value of 1, because it is a perfect fit of the data.

5 I’ve chosen to specify the model as ~Animal*Training*Dance because this will automatically include all of the main
effects and lower-order interactions, and is less typing than ~ Animal + Training + Dance + Animal:Training +
Animal:Dance + Dance:Training + Dance:Training:Animal

Formula:
~Animal * Training * Dance

Statistics:
                 X^2 df P(> X^2)
Likelihood Ratio   0  0        1
Pearson            0  0        1

Output 18.9 

Next we’ll fit the model with all of the main effects and two-way interactions. In other 
words, we’ll remove the three-way interaction; because this model tells us the effect of 
removing the three-way interaction we’ll call it threeWay. We could create this model by
respecifying the model with all terms except the three-way interaction: 

threeWay <- loglm(~ Animal + Training + Dance + Animal:Training + Animal:Dance 
+ Dance:Training, data = CatDogContingencyTable) 

This command uses the same format as before to create a model called threeWay. The only 
difference (apart from that we have changed the name of the model) is that the three-way 
interaction isn’t included. This is a lot of typing, so you could also consider using the 
update() function (see R’s Souls’ Tip 7.2). Remember that this function allows us to take 
an existing model and ‘update’ it. In the past we have updated models by adding in new 
variables, but we can also remove them using this function. For example, to remove the 
three-way interaction from the saturated model we would execute: 

threeWay<-update(caturated, .~. -Animal:Training:Dance)

Remember that the .~. simply means ‘keep the same outcome variable and predictors as
before’; so, we’ve specified that we want to take the model called caturated, we want to keep
the same outcomes and predictors as before, but by including ‘-Animal:Training:Dance’ we
ask to remove the three-way interaction (the minus sign means ‘remove’). We can summarize
this model by executing:

summary(threeWay) 

The pertinent parts of the resulting output are in Output 18.10. The model has a likelihood
ratio of 20.30, with 1 df and p < .001. It seems as though this model is a poor fit
to the data.


Formula:
. ~ Animal + Training + Dance + Animal:Training + Animal:Dance +
    Training:Dance

Statistics:
                      X^2 df     P(> X^2)
Likelihood Ratio 20.30491  1 6.603088e-06
Pearson          20.77759  1 5.158318e-06


Output 18.10 
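As a self-contained check on Output 18.10, the sketch below rebuilds the contingency table from its counts (typed in by hand, which is an assumption standing in for the original data file) and fits the saturated model and the model without the three-way interaction using the explicit formula from the text.

```r
library(MASS)  # provides loglm()

CatDogContingencyTable <- as.table(array(
  c(114, 7, 10, 14, 48, 29, 28, 20), dim = c(2, 2, 2),
  dimnames = list(Animal = c("Cat", "Dog"),
                  Training = c("Affection as Reward", "Food as Reward"),
                  Dance = c("No", "Yes"))))

# Saturated model: all main effects and interactions.
caturated <- loglm(~ Animal * Training * Dance, data = CatDogContingencyTable)

# Everything except the three-way interaction.
threeWay <- loglm(~ Animal + Training + Dance + Animal:Training + Animal:Dance +
                    Dance:Training, data = CatDogContingencyTable)

# Likelihood ratio, degrees of freedom and p-value of the reduced model.
print(c(LR = threeWay$lrt, df = threeWay$df))
print(pchisq(threeWay$lrt, df = threeWay$df, lower.tail = FALSE))
```

The components lrt, pearson and df are stored in every loglm object, so the same pattern works for any of the models fitted in this section.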

Comparing models is very easy: we just subtract the likelihood ratios and the degrees of
freedom. But we’re kind of lazy, and so we’ll use the anova() function, which will do this
for us (see section 7.8.4). We can compare the saturated model to the one without the 
three-way interaction by executing: 

anova(caturated, threeWay) 
The resulting Output 18.11 shows the difference between these models. We’re interested
in the part called Delta: delta is the Greek letter Δ, which is the equivalent of D, and is often
used in statistics to mean ‘difference’. 6 The anova() function calculates the difference in the
likelihood ratios for the two models, which is 20.30 - 0 = 20.30, and the difference in df, which
is 1 - 0 = 1. You can see that this is a useful function, because it has done some literally
brain-melting sums for us: I did warn you we were being lazy.

In the column labelled P(> Delta(Dev) we see the p-value of the difference between the 
models. This value is less than .001 and, therefore, highly significant. This significant result 
tells us that removing the three-way interaction has made the model a significantly worse 
fit to the data. In other words, the three-way interaction is a significant factor in making 
the model a good fit. It also means that for interpretation purposes we need to stick with 
the saturated model. We should now stop and conclude that the three-way interaction is 
significant, and interpret the effect. 

LR tests for hierarchical log-linear models

Model 1:
 . ~ Training + Animal + Dance
Model 2:
 ~Animal * Training * Dance

          Deviance df Delta(Dev) Delta(df) P(> Delta(Dev)
Model 1   20.30491  1
Model 2    0.00000  0   20.30491         1          1e-05
Saturated  0.00000  0    0.00000         0          1e+00

Output 18.11 
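The sums that anova() does for us are easy to reproduce. The sketch below takes the deviances and degrees of freedom from Output 18.11 as given and computes the difference and its p-value with pchisq(); the object names are made up for illustration.

```r
# Values read from Output 18.11.
devianceThreeWay <- 20.30491   # model without the three-way interaction
dfThreeWay <- 1
devianceSaturated <- 0         # saturated model
dfSaturated <- 0

# Delta(Dev) and Delta(df): simple subtraction.
deltaDev <- devianceThreeWay - devianceSaturated
deltaDf <- dfThreeWay - dfSaturated

# The difference is tested against a chi-square distribution with Delta(df) df.
pValue <- pchisq(deltaDev, df = deltaDf, lower.tail = FALSE)
print(c(deltaDev = deltaDev, deltaDf = deltaDf, p = pValue))
```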

For illustrative purposes let’s pretend that we don’t need to stop, and carry on. Let’s 
create models that systematically remove the two-way interactions: 

trainingDance<-update(threeWay, .~. -Training:Dance)

animalDance<-update(threeWay, .~. -Animal:Dance)

animalTraining<-update(threeWay, .~. -Animal:Training)

The first command creates a model called trainingDance that takes the threeWay model
and removes the Training x Dance interaction (i.e., it does not include either this
interaction or the three-way interaction). The second does the same but removes the Animal
x Dance interaction. The final command again takes the threeWay model but this time
removes the Training x Animal interaction. We can compare all of these models to the
model without the three-way interaction using the anova() function:

anova(threeWay, trainingDance)
anova(threeWay, animalDance)
anova(threeWay, animalTraining)

Output 18.12 shows the result of the first comparison, which shows us the effect of 
removing the Training x Dance interaction: the likelihood ratio difference (or delta) is 8.6 
(28.9 - 20.3) with 2-1 = 1 degrees of freedom. This difference is significant, at p = .003, 
and therefore we cannot remove the Training x Dance interaction from the model without 
the fit getting worse (in other words, this interaction is significant too). 

          Deviance df Delta(Dev) Delta(df) P(> Delta(Dev)
Model 1   28.91551  2
Model 2   20.30491  1   8.610596         1        0.00334
Saturated  0.00000  0  20.304911         1        0.00001

Output 18.12 

6 Sometimes people say ‘What’s the delta?’ when they mean ‘What’s the difference?’ If you ever meet anyone who 
says this, you’ll know what they mean (and you’ll know that they are a pretentious prig). 
Output 18.13 shows the effect of removing the Animal x Dance effect. Now we get a
likelihood ratio difference of 13.75, with 1 df. The p-value is < .001, and therefore we
cannot remove the Animal x Dance effect without making the fit of the model worse.

          Deviance df Delta(Dev) Delta(df) P(> Delta(Dev)
Model 1   34.05329  2
Model 2   20.30491  1   13.74838         1        0.00021
Saturated  0.00000  0   20.30491         1        0.00001

Output 18.13 

Output 18.14 shows the effect of removing the Animal x Training interaction. The
difference here is 13.76, with 1 df. Again this is highly significant, and therefore this effect
cannot be removed from the model without making the fit worse.

          Deviance df Delta(Dev) Delta(df) P(> Delta(Dev)
Model 1   34.06486  2
Model 2   20.30491  1   13.75995         1        0.00021
Saturated  0.00000  0   20.30491         1        0.00001

Output 18.14 
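The three comparisons in Outputs 18.12 to 18.14 can be laid side by side. The sketch below simply takes the deviances printed in those outputs and repeats the subtraction and the chi-square test for each removed term; the dataframe and column names are made up.

```r
# Baseline: the model without the three-way interaction (Output 18.10).
baselineDeviance <- 20.30491
baselineDf <- 1

# Deviance and df after removing each two-way interaction in turn.
comparisons <- data.frame(
  removed = c("Training x Dance", "Animal x Dance", "Animal x Training"),
  deviance = c(28.91551, 34.05329, 34.06486),
  df = c(2, 2, 2)
)

# Change in deviance and df relative to the baseline, and the p-value of the change.
comparisons$deltaDev <- comparisons$deviance - baselineDeviance
comparisons$deltaDf <- comparisons$df - baselineDf
comparisons$pValue <- pchisq(comparisons$deltaDev, df = comparisons$deltaDf,
                             lower.tail = FALSE)
print(comparisons)
```

Every p-value is below .01, which is why none of the two-way interactions can be dropped without making the fit worse.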

The next step is to try to interpret the three-way interaction (remember 
we looked at the two-way interactions only for illustrative purposes). The 
first useful thing we can do is to plot the frequencies across all of the different
categories. You should plot the frequencies in terms of the percentages.

You should also look at the mosaic plot. We can obtain this plot by using the 
mosaicplot() function and applying it to our contingency table: 

mosaicplot(CatDogContingencyTable, shade = TRUE, main = "Cats and Dogs") 
Executing this command creates the mosaic plot in Figure 18.6. This plot shows what we
already know about cats: they will dance (or do anything else for that matter) when there 
is food involved but if you train them with affection they’re not interested. Dogs on the 
other hand will dance when there’s affection involved (actually more dogs danced than 
didn’t dance regardless of the type of reward, but the effect is more pronounced when 
affection was the training method). In fact, both animals show similar responses to food 
training, it’s just that cats won’t do anything for affection. So cats are sensible creatures 
that only do stupid stuff when there’s something in it for them (i.e., food), whereas dogs 
are just plain stupid. 


18.10. Following up loglinear analysis © 


An alternative way to interpret a three-way interaction is to conduct chi-square analysis at 
different levels of one of your variables. For example, to interpret our Animal x Training 
x Dance interaction, we could perform a chi-square test on Training and Dance but do this 
separately for dogs and cats (in fact the analysis for cats will be the same as the example we 
used for chi-square). You can then compare the results in the different animals. 




SELF-TEST

Use the subset() function to run a chi-square test on
Dance and Training for dogs and cats separately.


Pearson's Chi-squared test

Chi^2 = 3.932462    d.f. = 1    p = 0.04736256

Pearson's Chi-squared test with Yates' continuity correction

Chi^2 = 2.965686    d.f. = 1    p = 0.08504836

Fisher's Exact Test for Count Data

Sample estimate odds ratio: 0.3502677

Alternative hypothesis: true odds ratio is not equal to 1
p = 0.06797422
95% confidence interval: 0.1001861 1.127025

Alternative hypothesis: true odds ratio is less than 1
p = 0.04208209
95% confidence interval: 0 0.9586623

Alternative hypothesis: true odds ratio is greater than 1
p = 0.9880647
95% confidence interval: 0.1208492 Inf


Output 18.15 
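The main numbers in Output 18.15 can be reproduced with base R alone. Note the assumptions: the sketch does not use subset() on the catsDogs dataframe as the self-test asks; it types in the four dog counts from the contingency table and uses chisq.test() and fisher.test() directly, not the function that produced the output shown above. The object names are made up.

```r
# Dog counts (rows: Training, columns: Dance), taken from the contingency table.
dogTable <- matrix(c(7, 14, 29, 20), nrow = 2,
                   dimnames = list(Training = c("Affection as Reward", "Food as Reward"),
                                   Dance = c("No", "Yes")))

# Pearson's chi-square test, without and with Yates's continuity correction.
pearsonTest <- chisq.test(dogTable, correct = FALSE)
yatesTest <- chisq.test(dogTable, correct = TRUE)

# Fisher's exact test, which also reports an odds ratio estimate.
exactTest <- fisher.test(dogTable)

print(pearsonTest)
print(yatesTest)
print(exactTest)
```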
The results and interpretation for cats are in Output 18.3 and for dogs the output is
shown in Output 18.15. For dogs there is still a significant relationship between the types 
of training and whether they danced but it is weaker (the chi-square is 3.93 compared to 
25.4 in the cats). 7 This reflects the fact that dogs are more likely to dance if given affection 
than if given food, the opposite of cats. 


18.11. Effect sizes in loglinear analysis © 


As with Pearson’s chi-square, one of the most elegant ways to report your effects is in terms
of odds ratios. Odds ratios are easiest to understand for 2x2 contingency tables and so if
you have significant higher-order interactions, or your variables have more than two
categories, it is worth trying to break these effects down into logical 2x2 tables and
calculating odds ratios that reflect the nature of the interaction. So, in this example
we could calculate odds ratios for dogs and cats separately. We have the odds ratios for
cats already (section 18.6.7), and for dogs we would get 0.35 as reported in Output 18.15.



This tells us that if a dog was trained with food the odds of their dancing were 0.35
times the odds if they were rewarded with affection (i.e., they were less likely to dance).
Another way to say this is that the odds of their dancing were about 2.9 times lower if
they were trained with food instead of affection (1/0.35 = 2.86 using the estimate in Output
18.15, or 2.90 using the odds ratio of 0.345 computed directly from the cell counts). Compare
this to cats where the odds of dancing were 6.58 times higher if they were trained with food
rather than affection. As you can see, comparing the odds ratios for dogs and cats is an
extremely elegant way to present the three-way interaction term in the model.
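The odds ratios can also be computed directly from the cell counts. The sketch below uses a small helper function (oddsRatio is a made-up name, not part of any package) and the counts from the contingency table. Be aware that these sample odds ratios differ slightly from the values of 6.58 and 0.35 quoted in the text, which appear to be the estimates printed alongside Fisher's exact test.

```r
# Odds of dancing after food divided by odds of dancing after affection.
oddsRatio <- function(danceFood, noDanceFood, danceAffection, noDanceAffection) {
  (danceFood / noDanceFood) / (danceAffection / noDanceAffection)
}

# Cats: 28 danced and 10 did not after food; 48 danced and 114 did not after affection.
catOddsRatio <- oddsRatio(28, 10, 48, 114)

# Dogs: 20 danced and 14 did not after food; 29 danced and 7 did not after affection.
dogOddsRatio <- oddsRatio(20, 14, 29, 7)

print(c(cats = catOddsRatio, dogs = dogOddsRatio))

# Expressed the other way round: how many times lower the odds are for dogs given food.
print(1 / dogOddsRatio)
```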


18.12. Reporting the results of loglinear analysis


When reporting loglinear analysis you need to report the likelihood ratio statistic for the
final model, usually denoted just by χ². For any terms that are significant you should report
the chi-square change. For this example we could report:

• The three-way loglinear analysis produced a final model that retained all effects. The
likelihood ratio of this model was χ²(0) = 0, p = 1. This indicated that the highest-order
interaction (the Animal x Training x Dance interaction) was significant, χ²(1) =
20.30, p < .001. To break down this effect, chi-square tests on the Training
and Dance variables were performed separately for dogs and cats. For cats, there was
a significant association between the type of training and whether or not cats would
dance, χ²(1) = 25.36, p < .001; this was true in dogs also, χ²(1) = 3.93, p < .05. Odds
ratios indicated that the odds of dancing were 6.58 times higher after food than affection in
cats, but only 0.35 times as high in dogs (i.e., in dogs, the odds of dancing were about 2.9
times lower if trained with food compared to affection). Therefore, the analysis seems to reveal a
fundamental difference between dogs and cats: cats are more likely to dance for food
rather than affection, whereas dogs are more likely to dance for affection than food.

7 The chi-square statistic depends on the sample size, so really you need to calculate effect sizes and compare them
to make this kind of statement (unless you had equal numbers of dogs and cats).



CRAMMING SAM’S TIPS 


Loglinear analysis 


• If you want to test the relationship between more than two categorical variables you can do this with loglinear analysis. 

• Loglinear analysis is hierarchical: start with a model containing all main effects and interactions. Starting with the
highest-order interaction, remove terms to see whether their removal significantly affects the fit of the model. If it does
then this term is not removed; it is interpreted and all lower-order effects are ignored.

• Look at the crosstabulation table to interpret any significant effects (the percentage of total for cells is the best thing to look at). 




What have I discovered about statistics? © 


When I wrote the first edition of the SPSS version of this book I had always intended 
to do a chapter on loglinear analysis, but by the time I got to that chapter I had already 
written 300 pages more than I was contracted to do, and had put so much effort into the 
rest of it that, well, the thought of that extra chapter was making me think of large cliffs 
and jumping. When the second edition needed to be written, I wanted to make sure that 
at the very least I did a loglinear chapter. However, when I came to it, I’d already written 
200 pages more than I was supposed to for this new edition, and with deadlines fading 
into the distance, history was repeating itself. It won’t surprise you to know then that 
I was really happy to have written the damn thing! This chapter has taken a very brief 
look at analysing categorical data. What I’ve tried to do is to show you how we approach 
categorical data in much the same way as any other kind of data: we fit a model, we 
calculate the deviation between our model and the observed data, and we use that to 
evaluate the model we’ve fitted. I’ve also tried to show that the model we fit is the same 
one that we’ve come across throughout this book: it’s a linear model (regression). When 
we have only two variables we can use Pearson’s chi-square test or the likelihood ratio 
test to look at whether those two variables are associated. In more complex situations, 
we simply extend these models into something known as a loglinear model. This is a bit 
like ANOVA for categorical data: for every variable we have, we get a main effect but we 
also get interactions between variables. Loglinear analysis simply evaluates all of these 
effects hierarchically to tell us which ones best predict our outcome. 

Fortunately the experience of this loglinear chapter taught me a valuable lesson, which is 
never to agree to write a chapter about something that you know very little about, and if you 
do then definitely don’t leave it until the very end of the writing process when you’re under 
pressure and mentally exhausted. It’s lucky that we learn from our mistakes, isn’t it... ? 
R packages used in this chapter

gmodels
MASS

R functions used in this chapter

anova()
c()
cbind()
CrossTable()
factor()
glm()
loglm()
lm()
mosaicplot()
rep()
subset()
summary()
table()
update()
xtabs()


Key terms that I’ve discovered 

Chi-square test
Contingency table
Fisher’s exact test
Loglinear analysis
McNemar’s test
Mosaic plot
Odds ratio
Phi
Saturated model
Yates’s continuity correction


Smart Alex’s tasks ® 


• Task 1: Certain editors at Sage like to think they’re a bit of a whiz at football (soccer 
if you prefer). To see whether they are better than Sussex lecturers and postgraduates
we invited various employees of Sage to join in our football matches (oh, sorry, I
mean we invited them down for important meetings about books). Every player was 
allowed to play in only one match. Over many matches, we counted the number of 
players who scored goals. The data are in the file SageEditorsCan’tPlayFootball.dat. 
Do a chi-square test to see whether more publishers or academics scored goals. We 
predict that Sussex people will score more than Sage people. © 



• Task 2: In 2008 I had a sabbatical in the Netherlands (I have a real soft spot for 
Holland). However, living there for three months did enable me to notice certain 
cultural differences between Holland and England. The Dutch are famous for travelling
by bike; they do it much more than the English. However, I noticed that many
more Dutch people cycle while steering with only one hand. I pointed this out to one 
of my friends, Birgit Mayer, and she said that I was being a crazy English fool and 
that Dutch people did not cycle one-handed. Several weeks of me pointing at
one-handed cyclists and her pointing at two-handed cyclists ensued. To put it to the test
I counted the number of Dutch and English cyclists who ride with one or two hands 
on the handlebars (Handlebars.dat). Can you work out which one of us is right? © 


• Task 3: I was interested in whether horoscopes are just a figment of people’s minds. 
Therefore, I got 2201 people, made a note of their star sign (this variable, obviously, 
has 12 categories: Capricorn, Aquarius, Pisces, Aries, Taurus, Gemini, Cancer, Leo, 
Virgo, Libra, Scorpio and Sagittarius) and whether they believed in horoscopes (this
variable has two categories: believer or unbeliever). I then sent them a horoscope in 
the post of what would happen over the next month: everybody, regardless of their 
star sign, received the same horoscope, which read ‘August is an exciting month for 
you. You will make friends with a tramp in the first week of the month and cook him 
a cheese omelette. Curiosity is your greatest virtue, and in the second week, you’ll 
discover knowledge of a subject that you previously thought was boring, statistics 
perhaps. You might purchase a book around this time that guides you towards this 
knowledge. Your new wisdom leads to a change in career around the third week, when 
you ditch your current job and become an accountant. By the final week you find 
yourself free from the constraints of having friends, your boy/girlfriend has left you
for a Russian ballet dancer with a glass eye, and you now spend your weekends doing
loglinear analysis by hand with a pigeon called Hephzibah for company.’ At the end of
August I interviewed all of these people and I classified the horoscope as having come 
true, or not, based on how closely their lives had matched the fictitious horoscope. 
The data are in the file Horoscope.dat. Conduct a loglinear analysis to see whether 
there is a relationship between the person’s star sign, whether they believe in horoscopes
and whether the horoscope came true. ©



• Task 4: On my statistics course students have weekly classes in a computer laboratory.
Postgraduate tutors run these classes but I often pop in to help out. I’ve noticed
in these sessions that many students are studying Facebook rather more than they are 
studying the very interesting statistics assignments that I have set them. I wanted to 
see the impact that this behaviour had on their exam performance. I collected data 
from all 260 students on my course. First I checked their Attendance and classified 
them as having attended either more or less than 50% of their lab classes. Next, I classified
them as being either someone who looked at Facebook during their lab class,
or someone who never did. Lastly, after the Research Methods in Psychology exam,
I classified them as having either passed or failed (Exam). The data are in Facebook.dat.
Do a loglinear analysis on the data to see if there is an association between studying
Facebook and failing your exam. ©

Answers can be found on the companion website. 


Further reading 


Hutcheson, G., & Sofroniou, N. (1999). The multivariate social scientist. London: Sage. 
Tabachnick, B. G. & Fidell, L. S. (2007). Using multivariate statistics (4th ed.). Boston: Allyn & 
Bacon. (Chapter 16 is a fantastic account of loglinear analysis.) 


Interesting real research 


Beckham, A. S. (1929). Is the Negro happy? A psychological analysis. Journal of Abnormal and Social 
Psychology, 24, 186-190. 
FIGURE 19.1

Having a therapy session in 2007


19.1. What will this chapter tell me? © 


Over the last couple of chapters we saw that I had gone from a child having dreams and 
aspirations of being a rock star, to becoming a living (barely) statistical test. A more dramatic
demonstration of my complete failure to achieve my life’s ambitions I can scarcely
imagine. Having devoted far too much of my life to statistics, it was time to unlock the 
latent rock star once more. The second edition of the SPSS version of this book had left 
me in desperate need for some therapy and, therefore, at the age of 29 I decided to start 
playing the drums (there’s a joke in there somewhere about it being the perfect instrument 
for a failed musician, but really they’re much harder to play than people think). A couple 
of years later I had a call from an old friend of mine, Doug, who used to be in a band that
my old band Scansion used to play with a lot: ‘Remember the last time I saw you we talked 
about you coming and having a jam with us?’ I had absolutely no recollection whatsoever 
of him saying this, so I responded ‘Yes’. ‘Well, how about it then?’ he said. ‘OK,’ I said, 
‘you arrange it and I’ll bring my guitar.’ ‘No, you whelk,’ he said, ‘we want you to drum 
and maybe you could learn some of the songs on the CD I gave you last year?’ I’d played 
his band’s CD and I liked it, but there was no way on this earth that I could play the drums 
as well as their drummer. ‘Sure, no problem’, I lied. I spent the next two weeks playing 
along to this CD as if my life depended on it and when the rehearsal came, much as I’d love 
to report that I drummed like a lord, I didn’t. I did, however, nearly have a heart attack 
and herniate everything in my body that it’s possible to herniate (really, the music is pretty 
fast!). Still, we had another rehearsal, and then another and, well, many years down the 
line we’re still having them. The only difference is that now I can play the songs at a speed 
that makes their old recordings seem as though a sedated snail was on the drums
(www.myspace.com/fracturepattern). The point is that it’s never too late to learn something new.
This is just as well because, as a man who clearly doesn’t learn from his mistakes, I agreed 
to write a chapter on multilevel linear models, a subject about which I know absolutely 
nothing. I’m writing it last, when I feel mentally exhausted and stressed. Hopefully at some 
point between now and the end of writing the chapter I will learn something. With a bit 
of luck you will too. 


19.2. Hierarchical data


In all of the analyses in this book so far we have treated data as though they 
are organized at a single level. However, in the real world, data are often 
hierarchical. This just means that some variables are clustered or nested within 
other variables. For example, when I’m not writing statistics books I spend 
most of my time researching how anxiety develops in children below the age 
of 10. This typically involves my running experiments in schools. When I run 
research in a school, I test children who have been assigned to different classes, 
and who are taught by different teachers. The classroom that a child is in could 
conceivably affect my results. Let’s imagine I test in two different classrooms. 
The first class is taught by Mr. Nervous. Mr. Nervous is very anxious and often 
when he supervises children he tells them to be careful, or that things that they do are 
dangerous, or that they might hurt themselves. The second class is taught by Little Miss 
Daredevil. 1 She is very carefree and she believes that children in her class should have 
the freedom to explore new experiences. Therefore, she is always telling them not to be 
scared of things and to explore new situations. One day I go into the school to test the 
children. I take in a big animal carrier, which I tell them has an animal inside. I measure 
whether they will put their hand in the carrier to stroke the animal. Children taught by 
Mr. Nervous have grown up in an environment where their teacher reinforces caution, 
whereas children taught by Miss Daredevil have been encouraged to embrace new
experiences. Therefore, we might expect Mr. Nervous’s children to be more reluctant to put
their hand in the box because of the classroom experiences that they have had. The
classroom is, therefore, known as a contextual variable. In reality, as an experimenter I would
be interested in a much more complicated situation. For example, I might tell some of
the children that the animal is a bloodthirsty beast, whereas I tell others that the animal is
friendly. Now obviously I’m expecting the information I give the children to affect their
enthusiasm for stroking the animal. However, it’s also possible that their classroom has
an effect. Therefore, my manipulation of the information that I give the children also has
to be placed within the context of the classroom to which the child belongs. My threat
information is likely to have more impact on Mr. Nervous’s children than it will on Miss
Daredevil’s children. One consequence of this is that children within Mr. Nervous’s class
will be more similar to each other than they are to children in Miss Daredevil’s class and
vice versa.

1 Those of you who don’t spot the Mr. Men references here, check out http://www.mrmen.com. Mr. Nervous
used to be called Mr. Jelly and was a pink jelly-shaped blob, which in my humble opinion was better than his
current incarnation.

FIGURE 19.2 An example of a two-level hierarchical data structure: children (level 1) are organized within classrooms (level 2)

Figure 19.2 illustrates this scenario more generally. In a big data set, we might have
collected data from lots of children. This is the bottom of the hierarchy and is known as a
level 1 variable. So, children (or cases) are our level 1 variable. However, these children are 
organized by classroom (children are said to be nested within classes). The class to which 
a child belongs is a level up from the participant in the hierarchy and is said to be a level 
2 variable. 

The situation that I have just described is the simplest hierarchy that you can have 
because there are just two levels. However, you can have other layers to your hierarchy. 
The easiest way to explain this is to stick with our example of my testing children in
different classes and then to point out the obvious fact that classrooms are themselves nested
within schools. Therefore, if I ran a study incorporating lots of different schools, as well 
as different classrooms within those schools, then I would have to add another level to the 
hierarchy. We can apply the same logic as before, in that children in particular schools will 
be more similar to each other than to children in different schools. This is because schools 
tend to reflect their social demographic (which can differ from school to school) and they 
may differ in their policies also. Figure 19.3 shows this scenario. There are now three
levels in the hierarchy: the child (level 1), the class to which the child belongs (level 2) and
the school within which that class exists (level 3). In this situation we have two contextual 
variables: school and classroom. 

Hierarchical data structures need not apply only to between-participant situations. We 
can also think of data as being nested within people. In this situation the case, or person, is 
not at the bottom of the hierarchy (level 1), but is further up. A good example is memory. 
Imagine that after giving children threat information about my caged animal I asked them a 
week later to recall everything they could about the animal. For each child there are many 
facts that they could recall. Let’s say that I originally gave them 15 pieces of information; 
some children might recall all 15 pieces of information, but others might remember only
two or three bits of information. The bits of information, or memories, are nested within
the person and their recall depends on the person. The probability of a given memory
being recalled depends on what other memories are available, and the recall of one
memory may have knock-on effects for what other memories are recalled. Therefore, memories
are not independent units. As such, the person acts as a context within which memories are
recalled (Wright, 1998).

FIGURE 19.3 An example of a three-level hierarchical data structure

Figure 19.4 shows the structure of the situation that I have just described. The child is 
our level 2 variable, and within each child there are several memories (our level 1
variable). Of course we can also have levels of the hierarchy above the child. So, we could
still, for example, factor in the context of the class from which they came (as I have done 
in Figure 19.4) as a level 3 variable. Indeed, we could even include the school again as a 
level 4 variable. 


FIGURE 19.4 An example of a three-level hierarchical data structure, where the level 1 variable is a repeated measure (memories recalled)


19.2.1. The intraclass correlation


You might well wonder why it matters that data are hierarchical (or not). The main
problem is that the contextual variables in the hierarchy introduce dependency in the data.
In plain English, this means that residuals will be correlated. I have alluded to this fact 
already when I noted that children in Mr. Nervous’s class would be more similar to each 
other than to children in Miss Daredevil’s class. In some sense, having the same teacher 
makes children more similar to each other. This similarity is a problem because in nearly 
every test we have covered in this book we assume that cases are independent. In other 
words, there is absolutely no correlation between residual scores of one child and another. 
However, when entities are sampled from similar contexts, this independence is unlikely 
to be true. For example, Charlotte and Emily’s responses to the animal in the carrier have 
both been influenced by Mr. Nervous’s cautious manner, so their behaviour will be similar. 
Likewise, Kiki and Jip’s responses to the animal in the box have both been influenced by 
Miss Daredevil’s carefree manner, so their behaviour will be similar too. We have seen 
before that in ANOVA, for example, a lack of independence between cases is a huge
problem that really affects the resulting test statistic - and not in a good way! (See section 10.3.)

By thinking about contextual variables and factoring them into the analysis we can
overcome this problem of non-independent observations. One way that we can do this is to use
the intraclass correlation (ICC). We came across this measure in section 17.9.3 as a
measure of inter-rater reliability, but it can also be used as a measure of dependency between
scores. We’ll skip the formalities of calculating the ICC (but see Oliver Twisted if you’re
keen to know), and we’ll just give a conceptual grasp of what it represents. In our
two-level example of children within classes, the ICC represents the proportion of the total
variability in the outcome that is attributable to the classes. It follows that if a class has had
a big effect on the children within it then the variability within the class will be small (the
children will behave similarly). As such, variability in the outcome within classes is
minimized, and variability in the outcome between classes is maximized; therefore, the ICC is
large. Conversely, if the class has little effect on the children then the outcome will vary a
lot within classes, which will make differences between classes relatively small. Therefore,
the ICC is small too. Thus, a large ICC tells us that variability within levels of a contextual
variable (in this case the class to which a child belongs) is small relative to variability
between levels of that variable (comparing classes). As such, the ICC is a good gauge of whether a
contextual variable has an effect on the outcome.
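
To make the idea concrete, here is a minimal sketch in Python (not the calculation described in the article mentioned in the Oliver Twisted box, and not R code from this book). It assumes equal-sized classes and uses a common one-way ANOVA estimate of the ICC built from the between-class and within-class mean squares. The function name `icc_oneway` and all of the scores are invented for illustration.

```python
from statistics import mean

def icc_oneway(groups):
    """One-way ANOVA estimate of the ICC, assuming equal-sized groups."""
    k = len(groups[0])                      # scores per group
    if any(len(g) != k for g in groups):
        raise ValueError("this sketch assumes equal group sizes")
    n_groups = len(groups)
    grand = mean(x for g in groups for x in g)
    # Between-group mean square: spread of group means around the grand mean
    ms_between = k * sum((mean(g) - grand) ** 2 for g in groups) / (n_groups - 1)
    # Within-group mean square: spread of scores around their own group mean
    ms_within = sum((x - mean(g)) ** 2 for g in groups for x in g) / (n_groups * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical scores for two classes of four children each
class_matters = [[2, 3, 2, 3], [8, 9, 8, 9]]       # children within a class are alike
class_barely_matters = [[2, 9, 3, 8], [5, 12, 6, 11]]  # lots of spread within each class

icc_high = icc_oneway(class_matters)
icc_low = icc_oneway(class_barely_matters)
print(round(icc_high, 3), round(icc_low, 3))  # prints 0.982 0.103
```

When children within a class score alike, almost all of the variability lies between classes and the estimate is close to 1; when scores are spread widely within each class, the estimate is close to 0. This particular estimator can even go slightly negative when classes differ less than the within-class spread would suggest.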



OLIVER TWISTED

Please Sir, can I have some more ... ICC?

‘I have a dependency on gruel’, whines Oliver. ‘Maybe I could measure
this dependency if I knew more about the ICC.’ Well, you’re so high
on gruel, Oliver, that you have rather missed the point. Still, I did write
an article on the ICC once upon a time (Field, 2005a) and it’s
reproduced in the additional web material for your delight and amusement.


19.2.2. Benefits of multilevel models


Multilevel linear models have numerous uses. To convince you that trawling through this 
chapter is going to reward you with statistical possibilities beyond your wildest dreams, 
here are just a few (slightly overstated) benefits of multilevel models: 
• Cast aside the assumption of homogeneity of regression slopes. We saw in Chapter
11 that when we use analysis of covariance we have to assume that the relationship 
between our covariate and our outcome is the same across the different groups that 
make up our predictor variable. However, this doesn’t always happen. Luckily, in 
multilevel models we can explicitly model this variability in regression slopes, thus 
overcoming this inconvenient problem. 

• Say ‘bye bye’ to the assumption of independence. We saw in Chapter 10 that when 
we use independent ANOVA we have to assume that the different cases of data are 
independent. If this is not true, little lizards climb out of your mattress while you’re 
asleep and eat you. Again, multilevel models are specifically designed to allow you 
to model these relationships between cases. Also, in Chapter 7 we saw that multiple 
regression relies on having independent observations. However, there are situations 
in which you might want to measure someone on more than one occasion (i.e., over 
time). Ordinary regression turns itself into cheese and hides in the fridge at the
prospect of cases of data that are related. Multilevel models eat these data for breakfast,
with a piece of regression-flavoured cheese. 

• Laugh in the face of missing data. I’ve spent a lot of this book extolling the virtues 
of balanced designs and not having missing data. Regression, ANOVA, ANCOVA 
and most of the other tests we have covered do strange things when data are missing 
or the design is not balanced. This can be a real pain. Missing data are a particular 
problem within clinical trials because it is common to attempt to collect follow-up 
data, often many months after treatment has ended when patients might be difficult 
to track down. Of course, there are ways to correct for and impute missing data, but
these techniques are often quite complicated (Yang, Li, & Shoptaw, 2008). Therefore,
when using repeated-measures designs, if a single time point is missing the
whole case usually needs to be deleted; missing data lead to more data being deleted.
Multilevel models do not require complete data sets and so when data are missing
for one time point they do not need to be imputed, nor does the whole case need
to be deleted. Instead parameters can be estimated successfully with the available
data, which offers a relatively easy solution to dealing with missing data. It is
important to stress that no statistical procedure can overcome data that are missing. Good
methods, designs and research execution should be used to minimize missing values, 
and reasons for missing values should always be explored. It is just that when using 
traditional statistical procedures for repeated-measures data additional procedures to 
account for missing data are usually necessary and can be problematic. 

I think you’ll agree that multilevel models are pretty funky. ‘Is there anything they can’t
do?’ I hear you cry. Well, no, not really.


19.3. Theory of multilevel linear models


The underlying theory of multilevel models is very complicated indeed - far too
complicated for my little peanut of a brain to comprehend. Fortunately, the advent of computers
and software like R makes it possible for feeble-minded individuals such as myself to take
advantage of this wonderful tool without actually needing to know the maths. Better still,
this means I can get away with not explaining the maths (and really, I’m not kidding, I don’t
understand any of it). What I will do, though, is try to give you a flavour of what
multilevel models are and what they do by describing the key concepts within the framework of
linear models that has permeated this whole book. I also want to remind you that if you
have worked through Chapters 13 and 14 then you have already done a multilevel model
and used the lme() function that we discuss in this chapter, because we used it to analyse
repeated-measures designs. These repeated-measures designs can be thought of as a
two-level hierarchy in which scores (level 1) are nested within participants (level 2).


19.3.1. An example


Throughout the first part of the chapter we will use an example to illustrate some of the 
concepts in multilevel models. Cosmetic surgery is on the increase at the moment. In the 
USA, there was a 1600% increase in cosmetic surgical and non-surgical treatments between 
1992 and 2002, and in 2004, 65,000 people in the UK underwent privately and publicly 
funded operations (Kellett, Clarke, & McGill, 2008). With the increasing popularity of this
surgery, many people are starting to question the motives of those who want to go under
the knife. There are two main reasons to have cosmetic surgery: (1) to help a physical
problem such as having breast reduction surgery to relieve backache; and (2) to change your
external appearance, for example by having a face lift. Related to this second point, there
is even some case for arguing that cosmetic surgery could be performed as a psychological
intervention: to improve self-esteem (Cook, Rosser, & Salmon, 2006; Kellett et al., 2008).
The main example for this chapter looks at the effects of cosmetic surgery on quality of life. 
The variables in the data file are (Cosmetic Surgery.dat): 

• Post_QoL: This is a measure of quality of life after the cosmetic surgery. This is our 
outcome variable. 

• Base_QoL: We need to adjust our outcome for quality of life before the surgery. 

• Surgery: This variable is a dummy variable that specifies whether the person has 
undergone cosmetic surgery (1) or whether they are on the waiting list (0), which 
acts as our control group. 

• Surgery_Text: This variable is the same as above but specifies group membership as 
text (we will use this variable when we create graphs but not for the main analysis). 

• Clinic: This variable specifies which of 10 clinics the person attended to have their 
surgery. 

• Age: This variable tells us the person’s age in years. 

• BDI: It is becoming increasingly apparent that people volunteering for cosmetic
surgery (especially when the surgery is purely for vanity) might have very different
personality profiles than the general public (Cook, Rosser, Toone, James, & Salmon,
2006). In particular, these people might have low self-esteem or be depressed. When 
looking at quality of life it is important to assess natural levels of depression, and this 
variable used the Beck Depression Inventory (BDI) to do just that. 

• Reason: This dummy variable specifies whether the person had/is waiting to have 
surgery purely to change their appearance (0), or because of a physical reason (1). 

• Reason_Text: This variable is the same as above but contains text to define each 
group rather than a number. 

• Gender: This variable simply specifies whether the person was a man (1) or a woman (0). 

When conducting hierarchical models we generally work up from a very simple model 
to more complicated models, and we will take that approach in this chapter. In doing so 
I hope to illustrate multilevel modelling by attaching it to frameworks that you already 
understand, such as ANOVA and ANCOVA. 
FIGURE 19.5 Diagram to show the hierarchical structure of the cosmetic surgery data set. People are clustered within clinics. Note that for each person there would be a series of variables measured: surgery, BDI, age, gender, reason and pre-surgery quality of life

Figure 19.5 shows the hierarchical structure of the data. Essentially, people being treated
in the same surgeries are not independent of each other because they will have had surgery
from the same surgeon. Surgeons will vary in how good they are, and quality of life will to
some extent depend on how well the surgery went (if they did a nice neat job then
quality of life should be higher than if they left you with unpleasant scars). Therefore, people
within clinics will be more similar to each other than people in different clinics. As such,
the person undergoing surgery is the level 1 variable, but there is a level 2 variable, a
variable higher in the hierarchy, which is the clinic attended.
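
The nesting described above can be sketched as a simple data structure. This is only an illustration in Python, not how the data file is read in R: the column names Clinic, Surgery and Post_QoL come from the variable list earlier in this section, but every value below is invented.

```python
from collections import defaultdict

# Hypothetical rows: (Clinic, Surgery, Post_QoL); the values are invented
rows = [
    (1, 0, 52.0), (1, 1, 60.0), (1, 1, 58.0),
    (2, 0, 41.0), (2, 1, 47.0),
    (3, 0, 66.0), (3, 0, 64.0), (3, 1, 71.0),
]

# Each level 2 unit (a clinic) contains several level 1 units (people)
people_by_clinic = defaultdict(list)
for clinic, surgery, post_qol in rows:
    people_by_clinic[clinic].append({"Surgery": surgery, "Post_QoL": post_qol})

for clinic, people in sorted(people_by_clinic.items()):
    mean_qol = sum(p["Post_QoL"] for p in people) / len(people)
    print(f"Clinic {clinic}: {len(people)} people, mean Post_QoL = {mean_qol:.1f}")
```

Each person appears under exactly one clinic, which is what it means for people to be nested within clinics.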


19.3.2. Fixed and random coefficients


Throughout this book we have discussed effects and variables, and these concepts should 
be very familiar to you by now. However, we have viewed these effects and variables in a 
relatively simple way: we have not distinguished between whether something is fixed or 
random. 

What we mean by ‘fixed’ and ‘random’ can be a bit confusing because the terms are used 
in a variety of contexts. You hear people talk about fixed effects and random effects. An 
effect in an experiment is said to be a fixed effect if all possible treatment conditions that 
a researcher is interested in are present in the experiment. An effect is said to be random if 
the experiment contains only a random sample of possible treatment conditions. This
distinction is important because fixed effects can be generalized only to the situations in your
experiment, whereas random effects can be generalized beyond the treatment conditions in 
the experiment (provided that the treatment conditions are representative). For example, in 
our Viagra example from Chapter 10, the effect is fixed if we say that we are interested only
in the three conditions that we had (placebo, low dose and high dose) and we can generalize 
our findings only to the situation of a placebo, low dose and high dose. However, if we were 
to say that the three doses were only a sample of possible doses (maybe we could have tried 
a very high dose), then it is a random effect and we can generalize beyond just placebos, low 
doses and high doses. All of the effects in this book so far we have treated as fixed effects. 
The vast majority of academic research that you read will treat variables as fixed effects. 

People also talk about fixed variables and random variables. A fixed variable is one that is 
not supposed to change over time (e.g., for most people their gender is a fixed variable - it 
never changes), whereas a random one varies over time (e.g., your weight is likely to
fluctuate over time).

In the context of multilevel models we need to make a distinction between fixed
coefficients and random coefficients. In the regressions, ANOVAs and ANCOVAs throughout this
book we have assumed that the regression parameters are fixed. We have seen numerous
times that a linear model is characterized by two things: the intercept, $b_0$, and the slope, $b_1$:

$$Y_i = b_0 + b_1 X_{1i} + \varepsilon_i$$

Note that the outcome (Y), the predictor (X) and the error (ε) all vary as a function of i,
which normally represents a particular case of data. In other words, it represents the level
1 variable. If, for example, we wanted to predict Sam’s score, we could replace the i with
her name:

$$Y_{\text{Sam}} = b_0 + b_1 X_{1,\text{Sam}} + \varepsilon_{\text{Sam}}$$


This is just some basic revision. Now, when we do a regression like this we assume that 
the bs are fixed and we estimate them from the data. In other words, we’re assuming that 
the model holds true across the entire sample and that for every case of data in the sample 
we can predict a score using the same values of the gradient and intercept. However, we 
can also conceptualize these parameters as being random. 2 If we say that a parameter is 
random then we assume not that it is a fixed value, but that its value can vary. Up until 
now we have thought of regression models as having fixed intercepts and fixed slopes, but 
this opens up three new possibilities for us that are shown in Figure 19.6. This figure uses 
the data from our ANCOVA example in Chapter 11 and shows the relationship between 
a person’s libido and that of their partner overall (the dashed line) and separately for the 
three groups in the study (a placebo group, a group that had a low dose of Viagra and a 
group that had a high dose). 


19.3.2.1. The random intercept model


The simplest way to introduce random parameters into the model is to assume that the 
intercepts vary across contexts (or groups) - because the intercepts vary, we call them
random intercepts. For our libido data this is like assuming that the relationship between libido
and partner’s libido is the same in the placebo, low- and high-dose groups (i.e., the slope is 
the same), but that the models for each group are in different locations (i.e., the intercepts 
are different). This is shown in the top panel of Figure 19.6, in which the models within 
the different contexts (colours) have the same shape (slope) but are located in different 
geometric space (they have different intercepts). 


2 In a sense ‘random’ isn’t an intuitive term for us non-statisticians because it implies that values are plucked out
of thin air (randomly selected). However, this is not the case - they are carefully estimated just as fixed parameters
are.

FIGURE 19.6 Data sets showing an overall model (dashed line) and the models for separate contexts within the data (i.e., groups of cases). Surviving panel title: Random Intercept, Random Slope


19.3.2.2. Random slope model

We can also assume that the slopes vary across contexts - i.e., we assume random slopes. 
For our libido data this is like assuming that the relationship between libido and partner’s 
libido is different in the placebo, low- and high-dose groups (i.e., the slopes are different), 
but that the models for each group are fixed at the same geometric location (i.e., the
intercepts are the same). This is what happens when we violate the assumption of homogeneity
of regression slopes in ANCOVA. Homogeneity of regression slopes is the assumption that
regression slopes are the same across contexts. If this assumption is not tenable then we
can use a multilevel model to explicitly estimate that variability in slopes. This is shown in
the middle panel of Figure 19.6 in which the models within the different contexts (colours)
converge on a single intercept but have different slopes. It’s worth noting that it would
be unusual in reality to assume random slopes without also assuming random intercepts
because variability in the nature of the relationship (slopes) would normally create
variability in the overall level of the outcome variable (intercepts). Therefore, if you assume that
slopes are random you would normally also assume that intercepts are random. 


19.3.2.3. The random intercept and slope model


The most realistic situation is to assume that both intercepts and slopes vary around the 
overall model. This is shown in the bottom panel of Figure 19.6 in which the models 
within the different contexts (colours) have different slopes but are also located in different 
geometric space and so have different intercepts. 
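
The three situations can be illustrated with a small Python sketch. It fits a separate straight line to each context (group) and compares the resulting intercepts and slopes. This is for intuition only: fitting each group separately is not how a multilevel model estimates its parameters, and the groups A, B and C, the helper `fit_line` and all of the numbers are invented (the lines are noise-free so that the pattern is easy to see).

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical contexts, each defined by an (intercept, slope) pair
x = [0, 1, 2, 3]
contexts = {
    "random intercept": {"A": (1, 2), "B": (3, 2), "C": (5, 2)},
    "random slope": {"A": (3, 1), "B": (3, 2), "C": (3, 3)},
    "random intercept and slope": {"A": (1, 3), "B": (3, 2), "C": (5, 1)},
}

fits = {}
for scenario, groups in contexts.items():
    # Generate each group's scores from its line, then recover the line by OLS
    fits[scenario] = {
        name: fit_line(x, [b0 + b1 * xi for xi in x])
        for name, (b0, b1) in groups.items()
    }
    print(scenario, fits[scenario])
```

In the first scenario only the intercepts differ between groups, in the second only the slopes differ, and in the third both do, which mirrors the top, middle and bottom panels described above.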


19.4. The multilevel model


We have seen conceptually what a random intercept, random slope and random intercept 
and slope model looks like. Now let’s look at how we actually represent the models. To 
keep things concrete, let’s use our example. For the sake of simplicity, let’s imagine first 
that we wanted to predict someone’s quality of life (QoL) after cosmetic surgery. We can 
represent this as a linear model as follows: 

$$\text{QoL After Surgery}_i = b_0 + b_1 \text{Surgery}_i + \varepsilon_i \tag{19.1}$$

We have seen equations like this many times and it represents a linear model: regression, 
a t-test (in this case) and ANOVA. In this example, we had a contextual variable, which 
was the clinic in which the cosmetic surgery was conducted. We might expect the effect of 
surgery on quality of life to vary as a function of which clinic the surgery was conducted at 
because surgeons will differ in their skill. This variable is a level 2 variable. As such we could 
allow the model that represents the effect of surgery on quality of life to vary across the 
different contexts (clinics). We can do this by allowing the intercepts to vary across clinics, 
or by allowing the slopes to vary across clinics or by allowing both to vary across clinics. 

To begin with, let’s say we want to include a random intercept for quality of life. All
we do is add a component to the intercept that measures the variability in intercepts, $u_{0j}$.
Therefore, the intercept changes from $b_0$ to become $(b_0 + u_{0j})$. This term estimates the
intercept of the overall model fitted to the data, $b_0$, and the variability of intercepts around that
overall model, $u_{0j}$. The overall model becomes: 3

$$Y_{ij} = (b_0 + u_{0j}) + b_1 X_{ij} + \varepsilon_{ij} \tag{19.2}$$

3 Some people use gamma (y), not b, to represent the parameters, but I prefer b because it makes the link to the 
other linear models that we have used in this book clearer. 
The js in the equation reflect levels of the variable over which the intercept varies (in
this case the clinic) - the level 2 variable. Another way that we could write this is to take
out the error terms so that it looks like an ordinary regression equation except that the
intercept has changed from a fixed, $b_0$, to a random one, $b_{0j}$, which is defined in a separate
equation:

$$
\begin{aligned}
Y_{ij} &= b_{0j} + b_1 X_{ij} + \varepsilon_{ij} \\
b_{0j} &= b_0 + u_{0j}
\end{aligned}
\tag{19.3}
$$

Therefore, if we want to know the estimated intercept for clinic 7, we simply replace the j
with ‘clinic 7’ in the second equation:

$$b_{0,\text{Clinic 7}} = b_0 + u_{0,\text{Clinic 7}}$$

If we want to include random slopes for the effect of surgery on quality of life, then all we
do is add a component to the slope of the overall model that measures the variability in
slopes, $u_{1j}$. Therefore, the gradient changes from $b_1$ to become $(b_1 + u_{1j})$. This term estimates
the slope of the overall model fitted to the data, $b_1$, and the variability of slopes in different
contexts around that overall model, $u_{1j}$. The overall model becomes (compare to the random
intercept model above):

$$Y_{ij} = b_0 + (b_1 + u_{1j}) X_{ij} + \varepsilon_{ij} \tag{19.4}$$

Again we can take the error terms out into a separate equation to make the link to a
familiar linear model even clearer. It now looks like an ordinary regression equation except that
the slope has changed from a fixed, $b_1$, to a random one, $b_{1j}$, which is defined in a separate
equation:

$$
\begin{aligned}
Y_{ij} &= b_0 + b_{1j} X_{ij} + \varepsilon_{ij} \\
b_{1j} &= b_1 + u_{1j}
\end{aligned}
\tag{19.5}
$$

If we want to model a situation with random slopes and intercepts, then we combine the
two models above. We still estimate the intercept and slope of the overall model ($b_0$ and $b_1$)
but we also include the two terms that estimate the variability in intercepts, $u_{0j}$, and slopes,
$u_{1j}$. The overall model becomes (compare to the two models above):

$$Y_{ij} = (b_0 + u_{0j}) + (b_1 + u_{1j}) X_{ij} + \varepsilon_{ij} \tag{19.6}$$

We can link this more directly to a simple linear model if we take some of these extra terms
out into separate equations. We could write this model as a basic linear model, except
we’ve replaced our fixed intercept and slope ($b_0$ and $b_1$) with their random counterparts
($b_{0j}$ and $b_{1j}$):

$$
\begin{aligned}
Y_{ij} &= b_{0j} + b_{1j} X_{ij} + \varepsilon_{ij} \\
b_{0j} &= b_0 + u_{0j} \\
b_{1j} &= b_1 + u_{1j}
\end{aligned}
\tag{19.7}
$$


The take-home point is that we’re not doing anything terribly different from the rest of the 
book: it’s basically just a posh regression. 
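
As a rough numerical illustration of equations (19.2), (19.4) and (19.6), the Python sketch below computes a predicted score (ignoring the error term) with and without the random components. The function `predict` and every number in it are hypothetical; nothing here is estimated from the cosmetic surgery data.

```python
def predict(x, b0, b1, u0j=0.0, u1j=0.0):
    """Predicted outcome for a case in context j (error term omitted)."""
    return (b0 + u0j) + (b1 + u1j) * x

b0, b1 = 40.0, 5.0                  # hypothetical overall intercept and slope
u0_clinic, u1_clinic = 3.0, -1.5    # hypothetical deviations for one clinic

fixed_only = predict(1, b0, b1)                              # ordinary regression
random_intercept = predict(1, b0, b1, u0j=u0_clinic)         # as in (19.2)
random_slope = predict(1, b0, b1, u1j=u1_clinic)             # as in (19.4)
both = predict(1, b0, b1, u0j=u0_clinic, u1j=u1_clinic)      # as in (19.6)

print(fixed_only, random_intercept, random_slope, both)  # prints 45.0 48.0 43.5 46.5
```

Setting both deviations to zero gives back the ordinary regression prediction, which is the sense in which the multilevel model is the familiar linear model with extra terms.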
Now imagine we wanted to add in another predictor, for example quality of life before
surgery. Knowing what we do about multiple regression, we shouldn’t be invading the
personal space of the idea that we can simply add this variable in with an associated beta:

$$\text{QoL After Surgery}_i = b_0 + b_1\text{Surgery}_i + b_2\text{QoL Before Surgery}_i + \varepsilon_i \tag{19.8}$$

This is all just revision of ideas from earlier in the book. Remember also that the $i$
represents the level 1 variable, in this case the people we tested. Therefore, we can predict a
given person’s quality of life after surgery by replacing the $i$ with their name:

$$\text{QoL After}_{\text{Sam}} = b_0 + b_1\text{Surgery}_{\text{Sam}} + b_2\text{QoL Before}_{\text{Sam}} + \varepsilon_{\text{Sam}}$$

Now, if we want to allow the intercept of the effect of surgery on quality of life after
surgery to vary across contexts then we simply replace $b_0$ with $b_{0j}$. If we want to allow the
slope of the effect of surgery on quality of life after surgery to vary across contexts then
we replace $b_1$ with $b_{1j}$. So, even with a random intercept and slope, our model stays much
the same:

$$\begin{aligned}
\text{QoL After}_{ij} &= b_{0j} + b_{1j}\text{Surgery}_{ij} + b_2\text{QoL Before}_{ij} + \varepsilon_{ij} \\
b_{0j} &= b_0 + u_{0j} \\
b_{1j} &= b_1 + u_{1j}
\end{aligned} \tag{19.9}$$


Remember that the $j$ in the equation relates to the level 2 contextual variable (clinic in this
case). So, if we wanted to predict someone’s score we wouldn’t just do it from their name,
but also from the clinic they attended. Imagine our guinea pig Sam had her surgery done at
clinic 7, then we could replace the $i$s and $j$s as follows:

$$\begin{aligned}
\text{QoL After Surgery}_{\text{Sam, Clinic7}} ={}& b_{0,\text{Clinic7}} + b_{1,\text{Clinic7}}\text{Surgery}_{\text{Sam, Clinic7}} \\
&+ b_2\text{QoL Before Surgery}_{\text{Sam, Clinic7}} + \varepsilon_{\text{Sam, Clinic7}}
\end{aligned}$$


I want to sum up by just reiterating that all we’re really doing in a multilevel model is a 
fancy regression in which we allow either the intercepts or slopes, or both, to vary across 
different contexts. All that really changes is that for every parameter that we allow to be 
random, we get an estimate of the variability of that parameter as well as the parameter 
itself. So, there isn’t anything terribly complicated; we can add new predictors to the model 
and for each one decide whether its regression parameter is fixed or random. 


19.4.1. Assessing the fit and comparing multilevel models


As in logistic regression (Chapter 8) the overall fit of a multilevel model is tested using a
chi-square likelihood ratio test (see section 18.4.3) and R reports the −2 log-likelihood
(−2LL, see section 8.3.1). Essentially, the smaller the value of the −2LL, the better.
R also produces two adjusted versions of this value, both of which were
described briefly in section 7.6.3. Both of these can be interpreted in the same way as the
−2LL, but they have been corrected for various things:
• Akaike’s information criterion (AIC): This is basically a goodness-of-fit measure that is
corrected for model complexity. That just means that it takes into account how many 
parameters have been estimated. 

• Schwarz’s Bayesian criterion (BIC): This statistic is comparable to the AIC, although it 
is slightly more conservative (it corrects more harshly for the number of parameters 
being estimated). It should be used when sample sizes are large and the number of 
parameters is small. 

Neither the AIC nor the BIC is intrinsically interpretable (it’s not meaningful to talk about
their values being large or small per se); however, they are useful as a way of comparing
models. The values of the AIC and BIC can be compared to their equivalent values in other
models. In all cases smaller values mean better-fitting models.
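To see what ‘corrected for model complexity’ means in practice, the sketch below computes the −2LL, AIC and BIC by hand for two ordinary regression models. It assumes the standard definitions (AIC = −2LL + 2k and BIC = −2LL + k·log(n), where k is the number of estimated parameters and n the number of cases), which are not spelled out in this chapter, and it uses made-up data and a hypothetical helper function, ic_by_hand().

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 2 + 0.5 * x1 + rnorm(n)         # x2 is unrelated to y by construction

small <- lm(y ~ x1)
large <- lm(y ~ x1 + x2)

# Hypothetical helper: -2LL, AIC and BIC from the log-likelihood
ic_by_hand <- function(model) {
  ll <- logLik(model)
  k  <- attr(ll, "df")                # number of estimated parameters
  n  <- nobs(model)                   # number of cases
  c(minus2LL = -2 * as.numeric(ll),
    AIC      = -2 * as.numeric(ll) + 2 * k,
    BIC      = -2 * as.numeric(ll) + log(n) * k)
}

print(rbind(small = ic_by_hand(small), large = ic_by_hand(large)))
```

The −2LL can only go down (or stay the same) when a predictor is added, whereas the AIC and BIC add a penalty for the extra parameter. The hand-computed values should agree with R's built-in AIC() and BIC() functions.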

Many writers recommend building up multilevel models starting with a ‘basic’ model in
which all parameters are fixed and then adding in random coefficients as appropriate and
exploring confounding variables (Raudenbush & Bryk, 2002; Twisk, 2006). One advantage
of doing this is that you can compare the fit of the model as you make parameters random,
or as you add in variables. To compare models we simply subtract the −2 log-likelihood
of the new model from the value for the old:

$$\begin{aligned}
\chi^2_{\text{change}} &= (-2\text{Log-Likelihood}_{\text{Old}}) - (-2\text{Log-Likelihood}_{\text{New}}) \\
df_{\text{change}} &= \text{Number of Parameters}_{\text{Old}} - \text{Number of Parameters}_{\text{New}}
\end{aligned} \tag{19.10}$$

This equation is the same as equations (18.5) and (8.6). There are two caveats to this equation:
(1) it works only if full maximum-likelihood estimation is used (and not restricted maximum
likelihood; see R’s Souls’ Tip 19.1); and (2) the new model must contain all of the effects
of the older model.
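Equation (19.10) can be sketched in R as follows. This is an illustration under stated assumptions rather than the analysis from this chapter: the data are simulated, the object names (simData, oldModel, newModel) are invented, and the degrees of freedom are taken as the positive difference in the number of parameters. The models are fitted with lme() from the nlme package using method = "ML", in line with the first caveat above.

```r
library(nlme)

set.seed(123)
n_clinics <- 10; n_per <- 30
clinic <- factor(rep(seq_len(n_clinics), each = n_per))
u0     <- rnorm(n_clinics, 0, 4)      # clinic-level intercept deviations
x      <- rnorm(n_clinics * n_per, 50, 10)
y      <- 20 + u0[as.integer(clinic)] + 0.6 * x +
          rnorm(n_clinics * n_per, 0, 5)
simData <- data.frame(clinic, x, y)

# method = "ML" requests full maximum likelihood (lme() defaults to REML)
oldModel <- lme(y ~ 1, random = ~ 1 | clinic, data = simData, method = "ML")
newModel <- lme(y ~ x, random = ~ 1 | clinic, data = simData, method = "ML")

# Equation (19.10): change in -2LL and in the number of parameters
chiChange <- (-2 * as.numeric(logLik(oldModel))) -
             (-2 * as.numeric(logLik(newModel)))
dfChange  <- abs(attr(logLik(oldModel), "df") - attr(logLik(newModel), "df"))
pValue    <- pchisq(chiChange, df = dfChange, lower.tail = FALSE)

print(c(chiChange = chiChange, dfChange = dfChange, p = pValue))
print(anova(oldModel, newModel))      # reports the same likelihood ratio
```

Note that the new model contains everything in the old model plus one extra fixed effect, which satisfies the second caveat.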


19.4.2. Types of covariance structures


If you have any random effects or repeated measures in your multilevel model then you 
have to decide upon the covariance structure of your data. If you have random effects 
and repeated measures then you can specify different covariance structures for each. 
The covariance structure simply specifies the form of the variance-covariance matrix 
(a matrix in which the diagonal elements are variances and the off-diagonal elements 
are covariances). There are various forms that this matrix could take and we have to 
tell R what form we think it does take. Of course we might not know what form it 
takes (most of the time we’ll be taking an educated guess), so it is sometimes useful to 
run the model with different covariance structures defined and use the goodness-of-fit 
indices (the AIC and BIC) to see whether changing the covariance structure improves 
the fit of the model (remember that a smaller value of these statistics means a
better-fitting model).

The covariance structure is important because R uses it as a starting point to estimate the 
model parameters. As such, you will get different results depending on which covariance 
structure you choose. If you specify a covariance structure that is too simple then you are 
more likely to make a Type I error (finding a parameter is significant when in reality it is 
not), but if you specify one that is too complex then you run the risk of a Type II error 
(finding parameters to be non-significant when in reality they are). R can implement many 
different covariance structures. We will look at four of the commonest covariance structures
to give you a feel for what they are and when they should be used. In each case I use
a representation of the variance-covariance matrix to illustrate. With all of these matrices
you could imagine that the rows and columns represent four different clinics in our cosmetic
surgery data:


Variance components:

$$\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

This covariance structure is very simple and assumes that all random effects are
independent (this is why all of the covariances in the matrix are 0). Variances of
random effects are assumed to be the same (hence why they are 1 in the matrix) and sum
to the variance of the outcome variable. This covariance structure is sometimes called
the independence model.

Diagonal:

$$\begin{pmatrix} \sigma_1^2 & 0 & 0 & 0 \\ 0 & \sigma_2^2 & 0 & 0 \\ 0 & 0 & \sigma_3^2 & 0 \\ 0 & 0 & 0 & \sigma_4^2 \end{pmatrix}$$

This variance structure is like variance components except that variances are assumed
to be heterogeneous (this is why the diagonal of the matrix is made up of different
variance terms). This structure again assumes that variances are independent and,
therefore, that all of the covariances are 0.

AR(1):

$$\begin{pmatrix} 1 & \rho & \rho^2 & \rho^3 \\ \rho & 1 & \rho & \rho^2 \\ \rho^2 & \rho & 1 & \rho \\ \rho^3 & \rho^2 & \rho & 1 \end{pmatrix}$$

This stands for first-order autoregressive structure. In layman’s terms this means that
the relationship between variances changes in a systematic way. If you imagine the rows
and columns of the matrix to be points in time, then it assumes that the correlation
between repeated measurements is highest at adjacent time points. So, in the first
column, the correlation between time points 1 and 2 is $\rho$; let’s assume that this value is
.3. As we move to time point 3, the correlation between time point 1 and 3 is $\rho^2$, or .09.
In other words, it has decreased: scores at time point 1 are more related to scores at
time 2 than they are to scores at time 3. At time 4, the correlation goes down again to
$\rho^3$ or .027. So, the correlations between time points next to each other are assumed to
be $\rho$, scores two intervals apart are assumed to have correlations of $\rho^2$, and scores three
intervals apart are assumed to have correlations of $\rho^3$. So the correlation between scores
gets smaller over time. Variances are assumed to be homogeneous, but there is a version
of this covariance structure where variance can be heterogeneous. This structure is
often used for repeated-measures data (especially when measurements are taken over time
such as in growth models).

Unstructured:

$$\begin{pmatrix} \sigma_1^2 & \sigma_{21} & \sigma_{31} & \sigma_{41} \\ \sigma_{21} & \sigma_2^2 & \sigma_{32} & \sigma_{42} \\ \sigma_{31} & \sigma_{32} & \sigma_3^2 & \sigma_{43} \\ \sigma_{41} & \sigma_{42} & \sigma_{43} & \sigma_4^2 \end{pmatrix}$$

This covariance structure is completely general. Covariances are assumed to be
completely unpredictable: they do not conform to a systematic pattern.
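If it helps to see these structures as actual numbers, the following base R sketch builds a 4 × 4 example of each. The object names and the specific variances are arbitrary choices for illustration; only the value of .3 for the AR(1) correlation comes from the description above.

```r
k <- 4                                   # four clinics or four time points

# Variance components: equal variances, no covariances
varComp <- diag(1, k)

# Diagonal: heterogeneous variances (arbitrary values), no covariances
diagonalStr <- diag(c(1.0, 2.5, 0.8, 1.7))

# AR(1): the correlation between rows i and j is rho^|i - j|
rho <- 0.3
ar1 <- rho ^ abs(outer(1:k, 1:k, "-"))

# Unstructured: every variance and covariance is free to differ. Here a
# symmetric matrix is built from arbitrary random numbers as an example.
set.seed(7)
A <- matrix(rnorm(k * k), k, k)
unstructured <- crossprod(A)

print(round(ar1, 3))
print(round(unstructured, 2))
```

The first row of ar1 shows the correlations shrinking as time points get further apart, while unstructured shows no pattern at all.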
CRAMMING SAM’S TIPS


Multilevel models 


• Multilevel models should be used to analyse data that have a hierarchical structure. For example, you might measure
depression after psychotherapy. In your sample, patients will see different therapists within different clinics. This is a three-level
hierarchy with depression scores from patients (level 1), nested within therapists (level 2) who are themselves nested within
clinics (level 3).

• Hierarchical models are just like regression, except that you can allow parameters to vary (this is called a random effect). In 
ordinary regression, parameters generally are a fixed value estimated from the sample (a fixed effect). 

• If we estimate a linear model within each context (the therapist or clinic, to use the example above) rather than the sample as
a whole, then we can assume that the intercepts of these models vary (a random intercepts model), or that the slopes of these
models differ (a random slopes model) or that both vary.

• We can compare different models (assuming that they differ in only one additional parameter) by looking at the difference in
the −2LL. Usually we would do this when we have changed only one parameter (added one new thing to the model).

• For any model we have to assume a covariance structure. For random intercepts models the default of variance components 
is fine, but when slopes are random an unstructured covariance structure is often assumed. When data are measured over 
time an autoregressive structure (AR(1)) is often assumed. 


19.5. Some practical issues


19.5.1. Assumptions


Multilevel linear models are an extension of regression, so all of the assumptions for
regression apply to multilevel models (see section 7.7.2). There is a caveat, though, which is that
problems with the assumptions of independence and independent errors can sometimes be solved by a
multilevel model because the purpose of this model is to factor in the correlations between
cases caused by higher-level variables. As such, if a lack of independence is being caused
by a level 2 or level 3 variable then a multilevel model should make this problem go away
(although not always). You should therefore still check the usual assumptions in the usual way.

There are two additional assumptions in multilevel models that relate to the random 
coefficients. These coefficients are assumed to be normally distributed around the overall 
model. So, in a random intercepts model the intercepts in the different contexts are assumed 
to be normally distributed around the overall model. Similarly, in a random slopes model, 
the slopes of the models in different contexts are assumed to be normally distributed. 

Also it’s worth mentioning that multicollinearity can be a particular problem in
multilevel models if you have interactions that cross levels in the data hierarchy (cross-level
interactions). However, centring predictors can help matters enormously (Kreft & de
Leeuw, 1998), and we will see how to centre predictors in section 19.5.3.


19.5.2. Sample size and power


As you might well imagine, the situation with power and sample size is very complex 
indeed. One complexity is that we are trying to make decisions about our power to detect 
both fixed and random effects coefficients. Kreft and de Leeuw (1998) do a tremendous 
job of making sense of things for us. Essentially, the take-home message is the more data, 
the better. As more levels are introduced into the model, more parameters need to be
estimated and the larger the sample sizes need to be. Kreft and de Leeuw conclude that if you
are looking for cross-level interactions then you should aim to have more than 20 contexts 
(groups) in the higher-level variable, and that group sizes ‘should not be too small’. They 
conclude by saying that there are so many factors involved in multilevel analysis that it is 
impossible to produce any meaningful rules of thumb. 

Twisk (2006) agrees that the number of contexts relative to individuals within those 
contexts is important. He also points out that standard sample size and power calculations 
can be used but then ‘corrected’ for the multilevel component of the analysis (by factoring, 
among other things, the intraclass correlation). However, there are two corrections that he 
discusses that yield very different sample sizes! He recommends using sample size
calculations with caution.


19.5.3. Centring variables


Centring refers to the process of transforming a variable into deviations 
around a fixed point. This fixed point can be any value that you choose, 
but typically we use the grand mean. We have already come across a form 
of centring way back in Chapter 1, when we discovered how to compute
z-scores. When we calculate a z-score we take each score and subtract from
it the mean of all scores (this centres the values at 0), and then divide by the
standard deviation (this changes the units of measurement to standard
deviations). When we centre a variable around the mean we simply subtract the
mean from all of the scores: this centres the variable around 0.

There are two forms of centring that are typically used in multilevel modelling: grand 
mean centring and group mean centring. Grand mean centring means that for a given
variable we take each score and subtract from it the mean of all scores (for that variable).
Group mean centring means that for a given variable we take each score and subtract 
from it the mean of the scores (for that variable) within a given group. In both cases it 
is usually only level 1 predictors that are centred (in our cosmetic surgery example this 
would be predictors such as age, BDI and pre-surgery quality of life). If group mean 
centring is used then a level 1 variable is typically centred around means of a level 2 
variable (in our cosmetic surgery data this would mean that, for example, the age of a 
person would be centred around the mean age for the clinic at which the person had 
their surgery). 
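The two forms of centring can be sketched in a couple of lines of base R. The dataframe below is a toy example invented for illustration (the name toy and its values are assumptions, not the cosmetic surgery data); ave() returns, for each person, the mean of the group they belong to.

```r
# Toy data: 15 people at 3 clinics with made-up ages
set.seed(10)
toy <- data.frame(
  Clinic = factor(rep(1:3, each = 5)),
  Age    = round(c(rnorm(5, 25, 3), rnorm(5, 40, 3), rnorm(5, 55, 3)))
)

# Grand mean centring: subtract the mean of all scores
toy$Age_grand <- toy$Age - mean(toy$Age)

# Group mean centring: subtract the mean age of the person's own clinic
toy$Age_group <- toy$Age - ave(toy$Age, toy$Clinic)

print(toy)
```

After grand mean centring the whole variable has a mean of 0; after group mean centring the variable has a mean of 0 within every clinic, so differences between clinics in average age are removed from the predictor.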

Centring can be used in ordinary multiple regression too, and because this form of 
regression is already familiar to you I’d like to begin by looking at the effects of centring in 
regression. In multiple regression the intercept represents the value of the outcome when 
all of the predictors take a value of 0. There are some predictors for which a value of 0 
makes little sense. For example, if you were using heart rate as a predictor variable then a 
value of 0 would be meaningless (no one will have a heart rate of 0 unless they are dead). 
As such, the intercept in this case has no real-world use: why would you want to know the
value of the outcome when heart rate was 0 given that no living person would even have
a heart rate that low? Centring heart rate around its mean changes the meaning of the
intercept. The intercept becomes the value of the outcome when heart rate is its average
value. In more general terms, if all predictors are centred around their mean then the
intercept is the value of the outcome when all predictors are the value of their mean. Centring
can, therefore, be a useful tool for interpretation when a value of 0 for the predictor is 
meaningless. 
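A quick way to convince yourself of this is to fit the same regression with and without centring. The sketch below uses simulated data; the variable names and all values are made up for illustration.

```r
set.seed(3)
heartRate <- rnorm(80, mean = 70, sd = 8)            # made-up predictor
outcome   <- 10 + 0.4 * heartRate + rnorm(80, 0, 3)  # made-up outcome

raw     <- lm(outcome ~ heartRate)
centred <- lm(outcome ~ I(heartRate - mean(heartRate)))

print(coef(raw))       # intercept: predicted outcome at a heart rate of 0
print(coef(centred))   # intercept: predicted outcome at the average heart rate
print(mean(outcome))   # matches the intercept of the centred model
```

The slope is identical in the two models and so are the predicted values; only the intercept, and therefore its interpretation, changes.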

The effect of centring in multilevel models, however, is much more complicated. There 
are some excellent reviews that look in detail at the effects of centring on multilevel models 
(Enders & Tofighi, 2007; Kreft & de Leeuw, 1998; Kreft, de Leeuw, & Aiken, 1995), and
here I will just give a very basic precis of what they say. Essentially if you fit a multilevel 
model using the raw score predictors and then fit the same model but with grand mean 
centred predictors then the resulting models are equivalent. By this, I mean that they will 
fit the data equally well, have the same predicted values, and the residuals will be the same. 
The parameters themselves (the bs) will, of course, be different but there will be a direct 
relationship between the parameters from the two models (i.e., they can be directly
transformed into each other). Therefore, grand mean centring doesn’t change the model, but it
would change your interpretation of the parameters (you can’t interpret them as though
they are raw scores). When group mean centring is used the picture is much more
complicated. In this situation the raw score model is not equivalent to the centred model in either
the fixed part or the random part. One exception is when only the intercept is random 
(which arguably is an unusual situation), and the group means are reintroduced into the 
model as level 2 variables (Kreft & de Leeuw, 1998). 

The decision about whether to centre or not is quite complicated and you really need 
to make the decision yourself in a given analysis. Centring can be a useful way to combat 
multicollinearity between predictor variables. It’s also helpful when predictors do not have 
a meaningful zero point. Finally, multilevel models with centred predictors tend to be 
more stable, and estimates from these models can be treated as more or less independent 
of each other, which might be desirable. If group mean centring is used then the group 
means should be reintroduced as a level 2 variable unless you want to look at the effect 
of your ‘group’ or level 2 variable uncorrected for the mean effect of the centred level 1 
predictor, such as when fitting a model when time is your main explanatory variable (Kreft 
& de Leeuw, 1998). 

The question arises of whether grand mean or group mean centring is ‘better’. People
doing statistics often fixate on there being a ‘best’ way to do things, but the ‘best’ method
often depends on what it is that you’re actually trying to do. Centring is a good example.
Some people make a decision about whether to use group or grand mean centring based on
some statistical criterion; however, there is no statistically correct choice between not
centring, group mean centring and grand mean centring (Kreft et al., 1995). Instead, Enders
and Tofighi (2007) recommend making decisions based on the substantive research
question. In short, they make four recommendations when analysing data with a two-level
hierarchy: (1) group mean centring should be used if the primary interest is in an
association between variables measured at level 1 (i.e., the aforementioned relationship between
surgery and quality of life after surgery); (2) grand mean centring is appropriate when the
primary interest is in the level 2 variable but you want to control for the level 1 covariate
(i.e., you want to look at the effect of clinic on quality of life after surgery while controlling
for the type of surgery); (3) both types of centring can be used to look at the differential
influence of a variable at level 1 and 2 (i.e., is the effect of surgery on quality of life
post-surgery different at the clinic level to the client level?); and (4) group mean centring is
preferable for examining cross-level interactions (e.g., the interactive effect of clinic and
surgery on quality of life after surgery).





OLIVER TWISTED 

Please Sir, can I have 
some more ... centring? 


‘Recentgin’, babbles Oliver as he stumbles drunk out of Mrs
Moonshine’s alcohol emporium. ‘I’ve had some recent gin.’ I
think you mean centring, Oliver, not recentgin. If you want to
know how to centre your variables using R, then the additional 
material for this chapter on the companion website will tell you. 
19.6. Multilevel modelling in R


Multilevel modelling can be done with specialist software such as MLwiN and HLM. 
There are several excellent books that compare R with various other packages (Tabachnick 
& Fidell, 2001; Twisk, 2006). R is more versatile than packages such as SPSS in that it can 
do multilevel modelling when the outcome variable is categorical. However, the packages 
that do multilevel models in R do not currently produce bootstrap estimates of the model
parameters, and these can be a very useful way to circumvent pesky distributional
assumptions (see section 5.8.4).

We saw in section 19.4.1 that it is useful to build up models starting with a ‘basic’ model 
in which all parameters are fixed and then add random coefficients as appropriate before 
exploring confounding variables. We will take this approach in our example. 


19.6.1. Packages for multilevel modelling in R


There are several packages that can be used for multilevel models. Two of the most used 
are: nlme (Pinheiro, Bates, DebRoy, Sarkar, & R Development Core Team, 2010) and lme4 
(Bates & Maechler, 2010). I am going to focus on the package nlme (non linear mixed 
effect) because, unlike lme4, it enables you to model the covariance structure, which will be 
useful when we come to look at growth models towards the end of the chapter. 

For the examples in this chapter you will need the packages car (to recode variables), nlme 
(for the multilevel analysis), ggplot2 (for graphs), and reshape (to restructure the data). If you do 
not have these packages installed, you can install them by executing the following commands: 

install.packages("car"); install.packages("ggplot2"); install.packages("nlme");
install.packages("reshape")

You then need to load these packages by executing the commands: 

library(car); library(ggplot2); library(nlme); library(reshape) 


19.6.2. Entering the data


Data entry depends a bit on the type of multilevel model that you wish to run: the data 
layout is slightly different when the same variables are measured at several points in time. 
However, we will look at the case of repeated-measures data in a second example. In this 
first example, the situation we have is very much like multiple regression in that data from 
each person who had surgery are not measured over multiple time points. Figure 19.7 
shows the data layout. Each row represents a case of data (in this case a person who had 
surgery). Their scores on the various variables are simply entered in different columns. So, 
for example, the first person was 31 years old, had a BDI score of 12, was in the waiting 
list control group (Surgery = 0) at clinic 1, was female (Gender = 0) and was waiting for 
surgery to change her appearance (Reason = 0). 

To access these data we need to create a dataframe, which I have called surgeryData,
that contains the data from the file Cosmetic Surgery.dat. This file stores the data as
tab-delimited text, so we can import it into the dataframe using the following command (I’m
assuming as always that you have set the working directory to be where the file is stored):

surgeryData = read.delim("Cosmetic Surgery.dat", header = TRUE)
FIGURE 19.7 Data layout for multilevel modelling with no repeated measure
19.6.3. Picturing the data


Before we begin the analysis it’s a good idea to have a look at the data. Our main example looks
at Surgery and baseline quality of life (Base_QoL) as predictors of quality of life after surgery
(Post_QoL). Remember that the surgery was conducted at one of 10 clinics. Therefore, to begin
with we could simply look at the relationship between baseline quality of life and post-surgery
quality of life separately for the two surgery conditions (cosmetic surgery vs. waiting list). We
might also want to graph this separately for the 10 clinics. We can use what we learnt about
ggplot2 in Chapter 4 to produce this plot; the resulting graph is shown in Figure 19.8.



SELF-TEST 

Using what you know about ggplot2, produce
the graph described above. Display the levels of 
Surgery_Text in colours, and use Clinic to produce 
different graphs within a grid. 


19.6.4. Ignoring the data structure: ANOVA


First of all, let’s ground the example in something very familiar to us: ANOVA. Let’s say for 
the time being that we were interested only in the effect that surgery has on post-operative 
quality of life. We could analyse this with a simple one-way independent ANOVA (or 
indeed a t-test), and the model is described by equation (19.1). 
FIGURE 19.8 Graph of the relationship between baseline and post-surgery quality of life for people who had cosmetic surgery
compared to those on the waiting list at 10 different clinics (plot title: Quality of Life Pre-Post Surgery at 10 Clinics).



SELF-TEST 

Using what you know about ANOVA, conduct a one-way
ANOVA using Surgery as the predictor and
Post_QoL as the outcome. 



In reality we wouldn’t do an ANOVA; I’m just using it as a way of showing you that
multilevel models are not big and scary, but are simply extensions of what we have done before.
Output 19.1 shows the results of the ANOVA that you should get if you did the self-test. We
find a non-significant effect of surgery on quality of life, F(1, 274) = 0.33, p > .05.

Df Sum Sq Mean Sq F value Pr(>F) 

Surgery 1 28.6 28.620 0.3302 0.566 

Residuals 274 23747.9 86.671 

Output 19.1 

We have also seen that we can think of ANOVA as a general linear model in which 
an outcome (in this case Post_QoL) is predicted from group membership (in this case 
Surgery). Therefore, we could fit the same model but using the lm() function that we first
encountered in Chapter 7. 

surgeryLinearModel<-lm(Post_QoL ~ Surgery, data = surgeryData) 
summary(surgeryLinearModel) 

We have used the function lm() (linear model) to create an object called
surgeryLinearModel. The commands in brackets tell lm() what model we want to fit; as we have seen
elsewhere in this book, the ~ means ‘predicted from’. So, we have specified that we
want Post_QoL predicted from Surgery. In other words, we have simply written out the
linear model in equation (19.1) but without the bs. The rest of the options simply tell
lm() to fit the model on the dataframe that we just created (data = surgeryData). Finally
summary(surgeryLinearModel) prints the model parameters to the R console.

Output 19.2 shows the main table for the model. Compare this table with Output 19.1
and you’ll see that there is basically no difference: we get a non-significant effect of surgery
with an F of 0.33, and a p of .57. The point I want you to absorb here is that if we ignore
the hierarchical structure of the data then what we are left with is something very familiar:
an ANOVA/regression. The numbers are more or less exactly the same; all that has changed
is that we have used different commands to get to the same end point.


Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)               59.2710     0.8134  72.869   <2e-16
Surgery[T.Waiting List]    0.6449     1.1222   0.575    0.566

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.31 on 274 degrees of freedom
Multiple R-squared: 0.001204, Adjusted R-squared: -0.002442
F-statistic: 0.3302 on 1 and 274 DF, p-value: 0.566


Output 19.2 
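If you don't have the data to hand, the equivalence between the two approaches can be checked on any two-group data set. The sketch below uses simulated scores (the group sizes, means and object names are all invented for illustration) and pulls the F-ratio out of both the aov() and lm() results.

```r
set.seed(99)
group <- factor(rep(c("Cosmetic Surgery", "Waiting List"), times = c(60, 50)))
score <- rnorm(110, mean = 60, sd = 9)   # made-up outcome, no true group effect

anovaFit <- aov(score ~ group)
lmFit    <- lm(score ~ group)

F_aov <- summary(anovaFit)[[1]][["F value"]][1]
F_lm  <- unname(summary(lmFit)$fstatistic["value"])
t_lm  <- coef(summary(lmFit))[2, "t value"]

print(c(F_aov = F_aov, F_lm = F_lm, t_squared = t_lm^2))
```

All three numbers are the same: with one predictor that has two groups, the F-ratio from the ANOVA, the F-statistic from the linear model, and the square of the t-value for the group coefficient are different routes to one test.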


19.6.5. Ignoring the data structure: ANCOVA


We have seen that there is no effect of cosmetic surgery on quality of life, but we did not 
take into account the quality of life before surgery. Let’s, therefore, extend the example 
a little to look at the effect of the surgery on quality of life while taking into account the 
quality of life scores before surgery. Our model is now described by equation (19.8). You 
could do this analysis with an ANCOVA, using the aov() function or as a linear model using 
the lm() function. As in the previous section we’ll run the analysis both ways, just to
illustrate that we’re doing the same thing when we run a hierarchical model.




SELF-TEST 

Using what you know about ANCOVA, conduct a
one-way ANCOVA using Surgery as the predictor, 
Post_QoL as the outcome and Base_QoL as the 
covariate. 
Output 19.3 shows the results of the ANCOVA that you should get if you did the
self-test. The top output shows the Type I sums of squares, whereas the bottom is the same
model but with Type III sums of squares (they differ slightly for baseline quality of life
because we have an unbalanced design). With baseline quality of life included we find a
significant effect of surgery on quality of life, F(1, 273) = 4.04, p < .05. Baseline quality of
life also predicted quality of life after surgery, F(1, 273) = 214.89, p < .001.

             Df  Sum Sq Mean Sq  F value  Pr(>F)
Base_QoL      1 10291.4 10291.4 211.4321 < 2e-16 ***
Surgery       1   196.8   196.8   4.0435 0.04533 *
Residuals   273 13288.3    48.7

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model:
Post_QoL ~ Base_QoL + Surgery

          Df Sum of Sq   RSS    AIC  F value   Pr(F)
<none>                 13288 1075.3
Base_QoL   1   10459.6 23748 1233.5 214.8876 < 2e-16
Surgery    1     196.8 13485 1077.3   4.0435 0.04533

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Output 19.3 

We can also think of ANCOVA within the general linear model framework: the outcome
(in this case Post_QoL) is predicted from group membership (in this case Surgery) and the
covariate (Base_QoL). Just as we did before the covariate was added, we can fit the model
using the lm() function by simply adding the baseline quality of life variable to the equation.

surgeryLinearModel<-lm(Post_QoL ~ Surgery + Base_QoL, data = surgeryData) 
summary(surgeryLinearModel) 

As before, the lm() function creates an object called surgeryLinearModel. We have
specified that Post_QoL is the outcome variable and that it is predicted from (the ~ symbol)
Surgery and baseline quality of life (Base_QoL). Again, notice how the Post_QoL ~ Surgery
+ Base_QoL is basically just equation (19.8) without the bs. The rest of the function
specifies the data (data = surgeryData) and prints a summary of the model
(summary(surgeryLinearModel)).

Output 19.4 shows the main table for the model. Compare this table with Output 19.3
and you’ll see that again there is no difference: we get a significant effect of surgery with
a t of −2.011, p < .05, and a significant effect of baseline quality of life with a t of 14.66,
p < .001. We can also see that the regression coefficient for surgery is −1.70.

Hopefully this exercise, as well as being good revision, has convinced you that we’re just 
doing a regression here, something you have been doing throughout this book. Multilevel 
models are not radically different, and if you think about it as just an extension of what you 
already know, then it’s really relatively easy to understand. So, having shown you that we 
can do basic analyses through the linear models function, let’s now use its power to factor 
in the hierarchical structure of the data. 

Call:
lm(formula = Post_QoL ~ Surgery + Base_QoL, data = surgeryData)

Residuals:
     Min       1Q   Median       3Q      Max
-13.4142  -5.1326  -0.6495   4.0540  23.5005

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.14702    2.90767   6.241 1.65e-09 ***
Surgery     -1.69723    0.84404  -2.011   0.0453 *
Base_QoL     0.66504    0.04537  14.659  < 2e-16 ***

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.977 on 273 degrees of freedom
Multiple R-squared: 0.4411, Adjusted R-squared: 0.437
F-statistic: 107.7 on 2 and 273 DF, p-value: < 2.2e-16

Output 19.4 

To sum up, we have seen that when we factor in the pre-surgery quality of life scores, 
which themselves significantly predict post-surgery quality of life scores, surgery seems to 
positively affect quality of life. However, at this stage we have ignored the fact that our 
data have a hierarchical structure. Essentially we have violated the independence assumption
because scores from people who had their surgery at the same clinic are likely to be
related to each other (and certainly more related than with people at different clinics). We
have seen that violating the assumption of independence can have some quite drastic
consequences (see section 10.3). However, rather than just panic and gibber about our F-ratio
being inaccurate, we can model this covariation within clinics explicitly by including the 
hierarchical data structure in our analysis. 


19.6.6. Assessing the need for a multilevel model


The first step in a multilevel analysis such as this is to assess whether you need to do it
at all. If there is not significant variation across contexts then doing a multilevel
model is simply a perverse exercise in mental flagellation. If there is little evidence of
variation across contexts then save yourself a lot of pain and just do a regression/ANOVA/
whatever variant of the general linear model that you feel like doing.

Ascertaining whether there is variation over your contexts is fairly straightforward. First, 
we need to fit a baseline model in which we include only the intercept; next, we fit a model 
that allows intercepts to vary over contexts; finally we compare these two models to see 
whether the fit has improved as a result of allowing intercepts to vary. If it has, we jump 
on the runaway train to multilevel insanity; if it has not, we do a little dance of joy into the 
loving arms of a simpler life. 
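
The three steps just described can be tried out on made-up data before turning to the surgery example. The block below is only a minimal sketch under stated assumptions: the data frame simData and its variables clinic and score are hypothetical, simulated so that scores within a clinic share a clinic-level effect, and the nlme package is assumed to be installed.

```r
library(nlme)  # provides gls() and lme()

# Hypothetical data: 10 clinics, 25 people each, with a clinic-level shift
set.seed(42)
nClinics   <- 10
nPerClinic <- 25
clinic <- factor(rep(seq_len(nClinics), each = nPerClinic))
clinicEffect <- rnorm(nClinics, mean = 0, sd = 6)
score <- 60 + clinicEffect[as.integer(clinic)] +
  rnorm(nClinics * nPerClinic, mean = 0, sd = 7)
simData <- data.frame(clinic = clinic, score = score)

# Step 1: baseline model with only a fixed intercept
fixedOnly <- gls(score ~ 1, data = simData, method = "ML")

# Step 2: same model, but intercepts may vary across clinics
randomInt <- lme(score ~ 1, data = simData, random = ~1 | clinic, method = "ML")

# Step 3: compare the two fits
comparison <- anova(fixedOnly, randomInt)
print(comparison)
```

Because the simulated clinics really do differ, the comparison would be expected to favour the random-intercept model; with real data the outcome is of course an empirical question.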

In our surgery example then, we first need to get R to fit a baseline model that includes
only the intercept. This is done using the gls() function (generalized least squares).4

interceptOnly <- gls(Post_QoL ~ 1, data = surgeryData, method = "ML")
summary(interceptOnly) 

The format of the gls() function is very much like the lm() function that we have encountered
before. In this example, we have asked R to create an object called interceptOnly, and
we have specified that Post_QoL is the outcome variable and that it is predicted from (~)
only the intercept (the ‘1’ in the function translates as ‘intercept’). The rest of the function
specifies the data (data = surgeryData) and how to estimate the model (method = "ML").
The method option is very important. If you do not include it (i.e., use the default option),
then R will use restricted maximum-likelihood methods (these can also be applied explicitly
by writing method = "REML"). However, we have chosen to use maximum-likelihood
estimation (method = "ML"). There are pros and cons to both (see R’s Souls’ Tip 19.1) but
if you want to compare models as you build them up, you should use maximum-likelihood
estimation. The final command (summary(interceptOnly)) is optional and prints the summary
of the model shown in Output 19.5.

4 You might wonder why we don’t simply use the lm() command. To compare models we need them to be
computed in the same way. Multilevel models are estimated using maximum likelihood methods and generalized
least squares use this method too, so we can compare these models. However, lm() uses ordinary least squares
methods and so can’t be compared to a model estimated using maximum likelihood.



R’s Souls’ Tip 19.1
Estimation


R gives you the choice of two methods for estimating the parameters in the analysis: maximum likelihood (ML), 
which we have encountered before, and restricted maximum likelihood (REML). The conventional wisdom seems 
to be that ML produces more accurate estimates of fixed regression parameters, whereas REML produces more 
accurate estimates of random variances (Twisk, 2006). As such, the choice of estimation procedure depends 
on whether your hypotheses are focused on the fixed regression parameters or on estimating variances of the 
random effects. However, in many situations the choice of ML or REML will make only small differences to the 
parameter estimates. Also, if you want to compare models you must use ML. 


Generalized least squares fit by maximum likelihood
  Model: Post_QoL ~ 1
  Data: surgeryData
       AIC      BIC    logLik
  2017.124 2024.365 -1006.562

Coefficients:
               Value Std.Error  t-value p-value
(Intercept) 59.60978 0.5596972 106.5036       0

Standardized residuals:
       Min         Q1        Med         Q3        Max
-2.1127754 -0.7875625 -0.1734394  0.7962286  3.0803354

Residual standard error: 9.281527
Degrees of freedom: 276 total; 275 residual

Output 19.5 

Next, we need to fit the same model, but this time allowing the intercepts to vary across
contexts; in this case we want them to vary across clinics. We do this by using the lme()
(linear mixed effects) function. In fact, this is the function that you’ll use throughout the
rest of the chapter. The format of this function is much the same as lm() and gls(); the
only difference is that we need to specify the random part of the model using the option
random = x|y, in which x is an equation specifying the random parts of the model and y is
the contextual variable or variables across which we want to model variance. In the current
example, we are trying to model intercepts that vary across clinics; therefore, we add
the instruction random = ~1|Clinic. Remember that we use ‘1’ to denote the intercept,
and that Clinic is the variable that contains information about the clinic that a given person
attended. The resulting command is:

randomInterceptOnly <- lme(Post_QoL ~ 1, data = surgeryData, random = ~1|Clinic, method = "ML")
summary(randomInterceptOnly)

As before, executing this command creates a model (this time I’ve called it
randomInterceptOnly) that predicts post-surgery quality of life from only the intercept
(Post_QoL ~ 1), but also allows intercepts to vary across clinics (random = ~1|Clinic). We
have again asked for maximum-likelihood estimation (method = "ML"). You can use
summary(randomInterceptOnly) to view the model summary (Output 19.6).

Linear mixed-effects model fit by maximum likelihood 
Data: surgeryData 

AIC BIC logLik 

1911.473 1922.334 -952.7364 

Random effects: 

Formula: ~1 | Clinic 

(Intercept) Residual 
StdDev: 5.909691 7.238677 

Fixed effects: Post_QoL ~ 1 

Value Std.Error DF t-value p-value 
(Intercept) 60.08377 1.923283 266 31.24022 0 

Standardized Within-Group Residuals: 

Min Q1 Med Q3 Max 

-1.8828507 -0.7606631 -0.1378732 0.7075242 2.8607949 

Number of Observations: 276 
Number of Groups: 10 

Output 19.6 

To see whether allowing the intercepts to vary improves the model we can do several 
things. First, we can compare the fit of the model using indices such as AIC and BIC. If you 
compare Output 19.5 with Output 19.6 you’ll see that the BIC when only the intercept 
was included is 2024.37 but decreases to 1922.33 when intercepts are allowed to vary. 
Remember that smaller values of BIC indicate a better fit of the data, so this gives us an
indication that by allowing intercepts to vary the model fit has improved (BIC has decreased).
This is all very well, but it does not give us an objective answer to whether the improvement 
in fit is ‘significant’ or big enough for us to continue down the multilevel path. 
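
As a rough check on where these fit indices come from, the standard definitions AIC = -2LL + 2k and BIC = -2LL + k log(n) can be applied to the log-likelihoods printed in Outputs 19.5 and 19.6. The sketch below assumes k = 2 estimated parameters for the intercept-only model and k = 3 for the random-intercept model, with n = 276 observations; the object names are illustrative only.

```r
# Values copied from Outputs 19.5 and 19.6
n <- 276
logLikValues <- c(interceptOnly = -1006.562, randomInterceptOnly = -952.7364)
nParams      <- c(interceptOnly = 2,         randomInterceptOnly = 3)

# Standard definitions of the two information criteria
aic <- -2 * logLikValues + 2 * nParams
bic <- -2 * logLikValues + log(n) * nParams

round(rbind(AIC = aic, BIC = bic), 2)
```

The results should agree with the AIC and BIC columns in the two outputs, which shows that both indices are simply the -2LL plus a penalty for the number of parameters.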

The second thing we can do is test the change in the -2LL (equation (19.10)). We saw
earlier that to be able to do this (1) full maximum-likelihood estimation must be used (and
we have used this method); and (2) the new model must contain all of the effects of the older
model (this is true also: the models are identical except that we added a parameter reflecting
the variability in intercepts across clinics). The log-likelihood is given in the outputs for
each model and we could simply multiply these values by -2 to get the -2LL. We can
also get R to do this for us for each model, and this has an advantage in that it will also tell
us the degrees of freedom on which each log-likelihood is based. We can use the function
logLik() for each model and then type ‘*-2’ to multiply by -2. To obtain the -2LL for
our two models, we would therefore type:

logLik(interceptOnly)*-2
logLik(randomInterceptOnly)*-2

The resulting output should reveal that the model with only the intercept has a -2LL of
2013.12, based on two degrees of freedom, whereas the model with random intercepts has
a -2LL of 1905.47, based on three degrees of freedom. Therefore:

$$\chi^2_{\text{change}} = 2013.12 - 1905.47 = 107.65$$

$$df_{\text{change}} = 3 - 2 = 1$$


If we look at the critical values for the chi-square statistic with 1 degree of freedom in 
the Appendix, they are 3.84 (p < .05) and 6.63 (p < .01); therefore, this change is highly 
significant. 
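
Rather than looking up critical values in a table, the same check can be done in base R. This is a small sketch using the figures from the calculation above; the variable names are just illustrative.

```r
# Change in -2LL and in degrees of freedom between the two models
chiChange <- 2013.12 - 1905.47
dfChange  <- 3 - 2

# Critical values of chi-square for p = .05 and p = .01
criticalValues <- qchisq(c(0.95, 0.99), df = dfChange)
round(criticalValues, 2)

# Exact p-value for the observed change (upper tail)
pValue <- pchisq(chiChange, df = dfChange, lower.tail = FALSE)
pValue
```

The observed change is far beyond both critical values, so the p-value is very small.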

A simpler way to do much the same thing is to use the anova() function (section 7.8.4). 
For these types of models, this function compares the change in -2LL without you having
to compute it and produces an exact significance for this change. The same caveats
as before apply: you should have used maximum likelihood and the models should be 
nested (that is, models higher up the chain need to contain all of the effects that were 
in models earlier in the chain). We can use this function as follows to compare our two 
models: 

anova(interceptOnly, randomInterceptOnly)

The resulting Output 19.7 shows the fit indices for each model but, most importantly,
shows the value of the change in the -2LL, the likelihood ratio, that we computed above (we
can feel fairly smug that the value of 107.65 matches our earlier calculations). It also shows
the degrees of freedom for each model (so that we can verify that the change in degrees of
freedom is 1 as we previously calculated). Finally, it shows a p-value, which is highly
significant and verifies our earlier conclusion that it is important that we model the variability in
intercepts because when we do the fit of our model is significantly improved. The change
in the -2LL has a chi-square distribution, so we can report this statistic in the normal way,
$\chi^2(1) = 107.65$, p < .0001. We can conclude then that the intercepts vary significantly
across the different clinics. Multilevel madness must ensue.

Model df     AIC     BIC   logLik   Test  L.Ratio p-value
    1  2 2017.12 2024.36 -1006.56
    2  3 1911.47 1922.33  -952.73 1 vs 2 107.6517  <.0001

Output 19.7 


19.6.7. Adding in fixed effects


We have seen that intercepts vary significantly across clinics. The model that we currently 
have is this: 

$$\text{QoL After Surgery}_{ij} = b_{0j} + \varepsilon_{ij}$$
$$b_{0j} = b_0 + u_{0j}$$

However, we originally had hypotheses about how surgery and baseline quality of life will
affect quality of life after surgery. Now that we have a baseline model with random intercepts,
we can start to build up the final model by adding these predictors. Let’s first add
in the Surgery variable, which defines whether a person had surgery or was on the waiting
list. Our model now becomes:

$$\text{QoL After Surgery}_{ij} = b_{0j} + b_1\,\text{Surgery}_{ij} + \varepsilon_{ij}$$


To add this predictor we again create a new object in R (which I have called
randomInterceptSurgery) using the lme() function. This function is exactly the same as before
except that we have replaced Post_QoL ~ 1 with Post_QoL ~ Surgery. As such, the model
now predicts quality of life after surgery from the variable Surgery and the intercept.5

randomInterceptSurgery <- lme(Post_QoL ~ Surgery, data = surgeryData, random = ~1|Clinic, method = "ML")
summary(randomInterceptSurgery)

The resulting Output 19.8 shows this model. Note that the BIC has actually increased
from 1922.33 in the previous model to 1924.62 in this one, which suggests that adding
Surgery has not improved the fit of the model (the log-likelihood has changed only
slightly, from -952.74 to -951.07). This interpretation is also borne out by the fixed effect of
Surgery, which is not significant, b = 1.66, t(265) = 1.83, p = .068.

Although Surgery does not appear to be a significant predictor, we had a final model that 
also included baseline quality of life, so we should add that fixed effect too. Our model is 
now described by: 


$$\text{QoL After Surgery}_{ij} = b_{0j} + b_1\,\text{Surgery}_{ij} + b_2\,\text{QoL Before Surgery}_{ij} + \varepsilon_{ij}$$


To add this predictor we again use lme() to create a new object (which I have called
randomInterceptSurgeryQoL). This function is exactly the same as before except that we have
replaced Post_QoL ~ Surgery with Post_QoL ~ Surgery + Base_QoL. Hopefully, you can see
that each time we specify a model we simply write out the equation that describes the
model, but without the bs. As such, the model now predicts quality of life after surgery
from the variables Surgery, Base_QoL and the intercept, with intercepts varying across
clinics.

randomInterceptSurgeryQoL <- lme(Post_QoL ~ Surgery + Base_QoL, data = surgeryData, random = ~1|Clinic, method = "ML")
summary(randomInterceptSurgeryQoL)

Linear mixed-effects model fit by maximum likelihood
 Data: surgeryData
       AIC      BIC    logLik
  1910.137 1924.619 -951.0686

Random effects:
 Formula: ~1 | Clinic
        (Intercept) Residual
StdDev:    6.099513  7.18542

Fixed effects: Post_QoL ~ Surgery
               Value Std.Error  DF  t-value p-value
(Intercept) 59.30517 2.0299632 265 29.21490   0.000
Surgery      1.66583 0.9091314 265  1.83233   0.068
 Correlation:
        (Intr)
Surgery -0.21

Standardized Within-Group Residuals:
       Min         Q1        Med         Q3        Max
-1.8904290 -0.7191399 -0.1420998  0.7177762  2.8644538

Number of Observations: 276
Number of Groups: 10

Output 19.8

5 You might wonder where the ‘1’ representing the intercept has gone. The intercept is implied within the model
so we don’t need to specify it, although we could be explicit and write Post_QoL ~ 1 + Surgery. The end result
would be the same. Similarly, we could write Post_QoL ~ 0 + Surgery if we wanted to remove the intercept from
the model.



R’s Souls’ Tip 19.2
Missing data


We saw in R’s Souls’ Tip 7.1 that missing data create problems for linear models (i.e., regression). Unlike ordinary 
regression, multilevel models do not require balanced data sets, so you can have missing data, but you still need
to tell R what to do with missing cases: just like ordinary regression, if you try to do a multilevel model with missing 
values in the dataframe you will get an error because lme() does not know what to do with these values. As with 
ordinary regression, you should add na.action = na.exclude to the lme() function to let it know that it can ignore 
any NAs it finds in the dataframe. In the surgery data we don’t have missing values so we haven’t had to use this 
instruction, but we could easily insert it. For example, the command for our model with Surgery and Base_QoL 
would become: 


randomInterceptSurgeryQoL <- lme(Post_QoL ~ Surgery + Base_QoL, data = surgeryData,
random = ~1|Clinic, method = "ML", na.action = na.exclude)


Linear mixed-effects model fit by maximum likelihood
 Data: surgeryData
      AIC      BIC   logLik
  1847.49 1865.592 -918.745

Random effects:
 Formula: ~1 | Clinic
        (Intercept) Residual
StdDev:    3.039264 6.518986

Fixed effects: Post_QoL ~ Surgery + Base_QoL
                Value Std.Error  DF   t-value p-value
(Intercept) 29.563601  3.471879 264  8.515160  0.0000
Surgery     -0.312999  0.843145 264 -0.371228  0.7108
Base_QoL     0.478630  0.052774 264  9.069465  0.0000
 Correlation:
         (Intr) Surgry
Surgery   0.102
Base_QoL -0.947 -0.222

Standardized Within-Group Residuals:
       Min         Q1        Med         Q3        Max
-1.8872666 -0.7537675 -0.0954987  0.5657241  3.0020852

Number of Observations: 276
Number of Groups: 10

Output 19.9 

Output 19.9 shows the summary of the final model. Note that the BIC and AIC have
both decreased since the previous model (Output 19.8); for example, the BIC has reduced
from 1924.62 to 1865.59. This implies a better fitting model. We can have a look at how
the fit of the models has improved using the anova() function that we used before (the
following will compare all three models that we have so far fitted):

anova(randomInterceptOnly, randomInterceptSurgery, randomInterceptSurgeryQoL)

Output 19.10 shows the resulting analysis. We start with just varying intercepts (model
1), and we can see that by including surgery as a fixed effect (model 2) we got no significant
improvement (p > .05). In fact the change in the -2LL is only 3.34, which has a
significance of p = .068 (note that this is the same significance as for the t that tests the regression
parameter of the fixed effect of surgery in Output 19.8). In model 3, we added the effect
of baseline quality of life, and this had a dramatic impact. The -2LL changed by 64.65 and
this change was highly significant, $\chi^2(1) = 64.65$, p < .0001. As we have already noted, the
AIC and BIC also decrease from model 2 to model 3, showing that the fit is improving.

Model df      AIC      BIC    logLik   Test  L.Ratio p-value
    1  3 1911.473 1922.334 -952.7364
    2  4 1910.137 1924.619 -951.0686 1 vs 2  3.33564  0.0678
    3  5 1847.490 1865.592 -918.7450 2 vs 3 64.64721  <.0001

Output 19.10 

Given that including baseline quality of life has improved the fit of our model so much, 
let’s briefly take stock of the model (shown in Output 19.9). The regression parameter for 
the effect of Surgery is -0.31, which is not significant, t(264) = -0.37, p > .05. However,
baseline quality of life has a regression parameter of 0.48, which is highly significant, t(264)
= 9.07, p < .001. The standard deviation of the intercepts is 3.04 (we can square this value
to get the variance of intercepts across clinics, which in this case is 9.24). 


19.6.8. Introducing random slopes


We have seen that including a random intercept is important for this model (it changes the
log-likelihood significantly). Figure 19.9 suggests that different clinics have different slopes;
therefore, we could now look at whether adding a random slope will benefit the model. The
model is now described by equation (19.9), which we saw earlier on; it can be specified in R
with only minor modifications to the lme() function. All we are doing is adding another
random term to the model, so, whereas before we specified the random part of the model as
random = ~1|Clinic, we now need to change this to random = ~Surgery|Clinic. This change
tells R that the model now allows the effect of Surgery (i.e., the slope) to vary across clinics.
As when we specified the main model, the intercept is implied in this, so the change will give
us both random intercepts over clinics and random slopes for the variable Surgery.6

addRandomSlope <- lme(Post_QoL ~ Surgery + Base_QoL, data = surgeryData, random = ~Surgery|Clinic, method = "ML")
summary(addRandomSlope)
anova(randomInterceptSurgeryQoL, addRandomSlope)

The code above creates a new object called addRandomSlope, which is the same model as
before but with a random slope added for the effect of Surgery. We then ask for a summary
of this new model as we have done before, and compare this new model to the previous
one using the anova() function, which we have used before. Output 19.11 shows the
results of this model comparison. By allowing the effect of Surgery to vary across clinics we
have reduced the BIC from 1865.59 to 1837.97, and the -2LL has changed significantly,7
$\chi^2(2) = 38.87$, p < .0001. In short, adding random slopes to the model has significantly
improved its fit, which means there is significant variability in the effect of surgery across
clinics.

Across this model and the previous ones, we can conclude from the -2LL as we built up
the models that the intercepts, $\chi^2(1) = 107.65$, p < .0001, and slopes, $\chi^2(2) = 38.87$, p <
.0001, for the relationship between surgery and quality of life (when controlling for baseline
quality of life) vary significantly across the different clinics.

Model df      AIC      BIC    logLik   Test  L.Ratio p-value
    1  5 1847.490 1865.592 -918.7450
    2  7 1812.624 1837.966 -899.3119 1 vs 2 38.86626  <.0001

Output 19.11 
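
The jump from 5 to 7 degrees of freedom in Output 19.11 can be reproduced by counting parameters. The sketch below assumes an unstructured covariance matrix for the random effects, in which q random effects need q(q + 1)/2 variance and covariance parameters; the helper function randomParams() is hypothetical and written only for this illustration.

```r
# Variances and covariances needed for q correlated random effects
randomParams <- function(q) q * (q + 1) / 2

fixedParams    <- 3  # intercept, Surgery, Base_QoL
residualParams <- 1  # residual variance

# Random intercept only: one variance
dfRandomIntercept <- fixedParams + residualParams + randomParams(1)

# Random intercept and slope: two variances plus their covariance
dfRandomSlope <- fixedParams + residualParams + randomParams(2)

c(randomIntercept = dfRandomIntercept, randomSlope = dfRandomSlope)
```

Adding the random slope therefore adds two parameters (the slope variance and the intercept-slope covariance), which is why the change is tested on 2 degrees of freedom.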

Output 19.12 shows the summary of the model that contains both random slopes and
intercepts. The regression parameter for the effect of Surgery is now b = -0.65, which is
still not significant, t(264) = -0.31, p > .05. Baseline quality of life is still highly significant,
b = 0.31, t(264) = 5.80, p < .001. The standard deviation of the intercepts is 6.13, and the
effect of Surgery has a standard deviation of 6.20 (which we can square to find the variance,
which is 38.41). The slopes and intercepts are highly correlated also (r = -.97).

Linear mixed-effects model fit by maximum likelihood
 Data: surgeryData
       AIC      BIC    logLik
  1812.624 1837.967 -899.3119

Random effects:
 Formula: ~Surgery | Clinic
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev   Corr
(Intercept) 6.132655 (Intr)
Surgery     6.197489 -0.965
Residual    5.912335

Fixed effects: Post_QoL ~ Surgery + Base_QoL
               Value Std.Error  DF   t-value p-value
(Intercept) 40.10253  3.892945 264 10.301334  0.0000
Surgery     -0.65453  2.110917 264 -0.310069  0.7568
Base_QoL     0.31022  0.053506 264  5.797812  0.0000
 Correlation:
         (Intr) Surgery
Surgery  -0.430
Base_QoL -0.855 -0.063

Standardized Within-Group Residuals:
       Min         Q1        Med         Q3        Max
-2.4114778 -0.6628574 -0.1138411  0.6833110  2.8334730

Number of Observations: 276
Number of Groups: 10

Output 19.12

6 It would be pretty unusual to want random slopes but not intercepts, because variability in the effect of a variable
across contexts will tend to create variability in the intercepts across contexts too. However, should you want to
have a model with random slopes but not intercepts you simply change the random effect to ~0 + Surgery|Clinic;
the 0 gets rid of the intercept.

7 The observant among you will notice that even though we have added only one new term to the model (the
random slope of surgery across clinics), the degrees of freedom have increased by 2 rather than 1. That’s odd isn’t
it? Actually, it’s not and it happens because by including random slopes we actually add two parameters to the
model: the estimate of the variance of the effect of surgery across clinics, and also an estimate of the covariance
between slopes and intercepts (i.e., the extent to which intercepts and slopes are dependent on each other).


FIGURE 19.9 Predicted values from the model (surgery predicting quality of life after controlling for baseline quality of life) plotted

19.6.9. Adding an interaction term to the model


We can now build up the model by adding in another variable. One of the variables we
measured was the reason for the person having cosmetic surgery: was it to resolve a physical
problem or was it purely for vanity? We can add this variable to the model, and also
look at whether it interacts with surgery in predicting quality of life. Our model will simply
expand to incorporate these new terms, and each term will have a regression coefficient
(which we select to be fixed). Therefore, our final model can be described as in the equation
below (note that all that has changed is that there are two new predictors):

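$$
\begin{aligned}
\text{QoL After Surgery}_{ij} &= b_{0j} + b_{1j}\,\text{Surgery}_{ij} + b_2\,\text{QoL Before Surgery}_{ij} + b_3\,\text{Reason}_{ij} \\
&\quad + b_4(\text{Reason} \times \text{Surgery})_{ij} + \varepsilon_{ij} \\
b_{0j} &= b_0 + u_{0j} \\
b_{1j} &= b_1 + u_{1j}
\end{aligned}
\tag{19.1}
$$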

To set up this model in R is very easy; it just requires some minor changes to the code. 
First, we’ll add in the effect of Reason. To do this we create a new model, which I have 
called addReason, and we set it up in exactly the same way as before, except that we add 
Reason to the model. So, our model changes from Post_QoL~Surgery + Base_QoL to 
Post_QoL~Surgery + Base_QoL + Reason. It’s as simple as that. In fact, it can be even
simpler if you use the update() function (see R’s Souls’ Tip 19.3).

addReason <- lme(Post_QoL ~ Surgery + Base_QoL + Reason, data = surgeryData,
random = ~Surgery|Clinic, method = "ML")

As we’ve seen before, to add an interaction term we use a colon (i.e., Surgery:Reason). I
have called this model finalModel and we specify it exactly the same as the previous model 
except that we add in the interaction term. Again, we could simplify this process by using 
the update function (see R’s Souls’ Tip 19.3). 

finalModel <- lme(Post_QoL ~ Surgery + Base_QoL + Reason + Surgery:Reason, data =
surgeryData, random = ~Surgery|Clinic, method = "ML")



R’s Souls’ Tip 19.3
The update() function


Throughout the first example in this chapter I have built the models up piece by piece because I think it’s useful 
to see how the code relates to the equation that describes the model. However, as we have seen before (see R’s 
Souls' Tip 7.2) the update() function is a quicker way to add new things to old models. Let’s start with the random 
slopes example from the main text. Our model is as follows: 


addRandomSlope <- lme(Post_QoL ~ Surgery + Base_QoL, data = surgeryData, random =
~Surgery|Clinic, method = "ML")

In the text, we added the variable Reason to this model, using the longhand method: 

addReason <- lme(Post_QoL ~ Surgery + Base_QoL + Reason, data = surgeryData, random =
~Surgery|Clinic, method = "ML")


Using the update() function we can do the same thing with much less text:

addReason <- update(addRandomSlope, .~. + Reason)


This function, like the longhand one, creates a new model called addReason, and it does this by updating an existing
model. The first part of the parenthesis tells R that we want to update the model called addRandomSlope; the .~. +
Reason tells R to keep the outcome variable and all of the previous predictors and to add Reason as a predictor.

Similarly, having created the model addReason, we can update this model to include the Surgery x Reason 
interaction (as we did in the main text). We can again use the update function as follows: 

finalModel <- update(addReason, .~. + Reason:Surgery)

This command creates a new model called finalModel. Note that we have specified addReason in the update()
function because we want to update the model that includes Reason as a predictor, and have added the
interaction term by typing .~. + Reason:Surgery.
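
To see that update() really does give the same model as the longhand version, here is a minimal sketch that can be run without the surgery data. It assumes nothing beyond base R: it uses lm() and the built-in mtcars data purely for illustration, and the model names are made up for this example.

```r
# A starting model with one predictor (built-in mtcars data)
baseModel <- lm(mpg ~ wt, data = mtcars)

# Longhand: retype the whole formula with the new predictor
longhand <- lm(mpg ~ wt + hp, data = mtcars)

# Shorthand: keep outcome (.) and existing predictors (.), then add hp
shorthand <- update(baseModel, . ~ . + hp)

formula(shorthand)
all.equal(coef(longhand), coef(shorthand))
```

The dots on either side of the ~ stand for "whatever was there before", so only the new term has to be typed; the same idea carries over to models fitted with lme().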
We need to see whether adding these new terms has improved the fit of the model and
again we can simply use the anova() function. We have two new models (addReason and
finalModel) that we want to compare to our previous model (addRandomSlope). We can do
this comparison in a single function, remembering to order our models in the same order
that they were built (so each time we have added only a single parameter):

anova(addRandomSlope, addReason, finalModel) 

Output 19.13 presents the resulting output, which shows that adding Reason to the
model reduces the -2LL by 3.80, which is not quite a significant change, p = .0513;
however, adding the Surgery × Reason interaction reduces the -2LL by 5.78, which is a
significant change, p < .05.

               Model df     AIC     BIC  logLik   Test L.Ratio p-value
addRandomSlope     1  7 1812.62 1837.96 -899.31
addReason          2  8 1810.82 1839.78 -897.41 1 vs 2  3.7989  0.0513
finalModel         3  9 1807.04 1839.62 -894.52 2 vs 3  5.7795  0.0162

Output 19.13 

Output 19.14 shows the summary of the final model. Quality of life before surgery
significantly predicted quality of life after surgery, t(262) = 5.75, p < .001; surgery still did
not significantly predict quality of life, t(262) = -1.46, p = .15; but the reason for surgery,
t(262) = -3.08, p < .01, and the interaction of the reason for surgery and surgery, t(262)
= 2.48, p < .05, both did significantly predict quality of life. The table of estimates also
gives us the regression coefficients. However, if we want to get confidence intervals for
these parameters we need to use the function intervals(), within which we simply specify
the model for which we would like confidence intervals, and the level of the confidence
intervals as a proportion (i.e., 0.99 produces 99% confidence intervals). For example:

intervals(finalModel, 0.90)
intervals(finalModel, 0.95)
intervals(finalModel, 0.99)

produce 90%, 95% and 99% intervals. If you insert only the model name into the function 
you will get 95% CIs by default. 

Output 19.15 shows the 95% confidence intervals for our final model. We can see, for
example, that Surgery had a b = -3.19, with a 95% confidence interval of -7.45 (lower)
and 1.08 (upper). This interval crosses zero and so is not significant at p < .05, which
of course we knew already from the main summary. These confidence intervals are very
useful for establishing whether the variance of the intercepts and slopes is significant. For
example, we can see that the standard deviation for intercepts was 5.48 with a 95%
confidence interval from 3.31 to 9.07; for the slope of Surgery we get 5.42 (3.13, 9.37). Because
neither confidence interval crosses zero we can see that the variability in both slopes
and intercepts was significant, p < .05.

Linear mixed-effects model fit by maximum likelihood
 Data: surgeryData
       AIC      BIC    logLik
  1807.045 1839.629 -894.5226

Random effects:
 Formula: ~Surgery | Clinic
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev   Corr
(Intercept) 5.482366 (Intr)
Surgery     5.417501 -0.946
Residual    5.818910

Fixed effects: Post_QoL ~ Surgery + Base_QoL + Reason + Reason:Surgery
                  Value Std.Error  DF   t-value p-value
(Intercept)    42.51782  3.875318 262 10.971440  0.0000
Surgery        -3.18768  2.185369 262 -1.458645  0.1459
Base_QoL        0.30536  0.053125 262  5.747833  0.0000
Reason         -3.51515  1.140934 262 -3.080938  0.0023
Surgery:Reason  4.22129  1.700269 262  2.482717  0.0137
 Correlation:
               (Intr) Surgry Bas_QL Reason
Surgery        -0.356
Base_QoL       -0.865 -0.078
Reason         -0.233  0.306  0.065
Surgery:Reason  0.096 -0.505  0.024 -0.661

Standardized Within-Group Residuals:
       Min         Q1        Med         Q3        Max
-2.2331485 -0.6972193 -0.1541074  0.6326387  3.1641797

Number of Observations: 276
Number of Groups: 10

Output 19.14 


Approximate 95% confidence intervals

 Fixed effects:
                    lower       est.      upper
(Intercept)    34.9565211 42.5178190 50.0791170
Surgery        -7.4516428 -3.1876767  1.0762895
Base_QoL        0.2017008  0.3053561  0.4090114
Reason         -5.7412731 -3.5151479 -1.2890227
Surgery:Reason  0.9038206  4.2212885  7.5387563
attr(,"label")
[1] "Fixed effects:"

 Random Effects:
  Level: Clinic
                              lower       est.      upper
sd((Intercept))           3.3138275  5.4823658  9.0699757
sd(Surgery)               3.1331192  5.4175011  9.3674439
cor((Intercept),Surgery) -0.9937813 -0.9455545 -0.5986153

 Within-group standard error:
   lower     est.    upper
5.331222 5.818910 6.351211

Output 19.15 

As this is our final model, let’s now give some thought to interpretation. The effect of the 
reason for surgery is easy to interpret. Given that we coded this predictor as 1 = physical 
reason and 0 = change appearance, the negative coefficient tells us that as reason increases 
(i.e., as a person goes from having surgery to change their appearance to having it for a 
physical reason) quality of life decreases. However, this effect in isolation isn’t that
interesting because it includes both people who had surgery and the waiting list controls. More
interesting is the interaction term, because this takes account of whether or not the person
had surgery. To break down this interaction we could rerun the analysis separately for the
two ‘reason groups’. Obviously we would remove the interaction term and the main effect
of Reason from this analysis (because we are analysing the physical reason group separately
from the group who wanted to change their appearance). As such, you need to fit the 
model in the previous section, but separately for those who had cosmetic surgery and those 
who had surgery for a physical reason. 

Fortunately, this is fairly easy to do because lme() has a subset option that enables you 
to select a variable that specifies which rows of the data file that you want to use. First we 
need to create a variable that returns ‘True’ if the person had surgery for physical reasons 
and ‘False’ for those who had cosmetic surgery: 

physicalSubset<-surgeryData$Reason==1

This command simply creates a variable called physicalSubset that will be set to TRUE only if
the variable Reason is equal to 1. In our data set people who have a ‘1’ for Reason are those
who had surgery for a physical reason, so we are asking R to set physicalSubset to TRUE if
the person had surgery for a physical reason. (Remember that surgeryData$Reason means
‘the variable called Reason in the data set called surgeryData’, and that in R ‘==’ means
‘equal to’.) We also create another variable called cosmeticSubset that returns TRUE for
any people who had cosmetic surgery. We create this variable in much the same way (but
now we specify that the variable Reason must be equal to 0, because in our data set 0
represents people who had cosmetic surgery):

cosmeticSubset<-surgeryData$Reason==0 
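To see what these logical variables actually contain, here is a minimal sketch on a tiny made-up data frame. The object names toyData, physicalToy and cosmeticToy, and the values in them, are invented for illustration and are not part of the surgery data set.

```r
# Toy stand-in for surgeryData (values invented purely for illustration)
toyData <- data.frame(
  Clinic = c(1, 1, 2, 2, 3, 3),
  Reason = c(1, 0, 1, 1, 0, 1)   # 1 = physical reason, 0 = change appearance
)

# '==' compares every value of Reason with 1 (or 0) and returns TRUE or FALSE
physicalToy <- toyData$Reason == 1
cosmeticToy <- toyData$Reason == 0

physicalToy              # one TRUE/FALSE per row of the data
sum(physicalToy)         # TRUE counts as 1, so this is the number of rows kept
toyData[physicalToy, ]   # the rows that a subset of physicalToy would retain
```

Every row is TRUE in exactly one of the two vectors, which is why the two subsets between them cover the whole data set.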

Next, we create two new models that contain Base_QoL and Surgery as predictors and 
have random slopes and intercepts. The first one is for people who had surgery for a 
physical reason, so we set the subset option to be physicalSubset, which specifies all of the 
people who had surgery for this reason: 

physicalModel<-lme(Post_QoL ~ Surgery + Base_QoL, data = surgeryData, random =
~Surgery|Clinic, subset = physicalSubset, method = "ML")

This creates a model (called physicalModel). Note that I have used subset = physicalSubset, 
which uses the variable physicalSubset to determine whether or not to include a particular 
case of data. By doing this our resulting model will include only those who had surgery 
for a physical reason. Similarly, we can use the variable cosmeticSubset to create another 
model that is fit using only those who had cosmetic surgery: 

cosmeticModel<-lme(Post_QoL ~ Surgery + Base_QoL, data = surgeryData, random =
~Surgery|Clinic, subset = cosmeticSubset, method = "ML")

We can get a summary of these models in the usual way: 

summary(physicalModel)
summary(cosmeticModel) 

Linear mixed-effects model fit by maximum likelihood
 Data: surgeryData
  Subset: physicalSubset
      AIC      BIC    logLik
  1172.560 1194.832 -579.2798

Random effects:
 Formula: ~Surgery | Clinic
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev   Corr
(Intercept) 5.773827 (Intr)
Surgery     5.804865 -0.948
Residual    5.798764

Fixed effects: Post_QoL ~ Surgery + Base_QoL
               Value Std.Error  DF  t-value p-value
(Intercept) 38.02079  4.705980 166 8.079250  0.0000
Surgery      1.19655  2.099769 166 0.569848  0.5696
Base_QoL     0.31771  0.069471 166 4.573271  0.0000
 Correlation:
         (Intr) Surgry
Surgery  -0.306
Base_QoL -0.908 -0.078

Standardized Within-Group Residuals:
       Min         Q1        Med         Q3        Max
-2.2447342 -0.6505340 -0.1264188  0.6111506  2.9472101

Number of Observations: 178
Number of Groups: 10

Output 19.16

Linear mixed-effects model fit by maximum likelihood
 Data: surgeryData
  Subset: cosmeticSubset
       AIC      BIC    logLik
  650.9469 669.0417 -318.4734

Random effects:
 Formula: ~Surgery | Clinic
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev   Corr
(Intercept) 5.006026 (Intr)
Surgery     5.292027 -0.969
Residual    5.738551

Fixed effects: Post_QoL ~ Surgery + Base_QoL
               Value Std.Error DF   t-value p-value
(Intercept) 41.78605  5.573849 87  7.496802  0.0000
Surgery     -4.30702  2.275002 87 -1.893193  0.0617
Base_QoL     0.33849  0.080274 87  4.216720  0.0001
 Correlation:
         (Intr) Surgry
Surgery  -0.252
Base_QoL -0.937 -0.058

Standardized Within-Group Residuals:
       Min         Q1        Med         Q3        Max
-1.8945645 -0.6616222 -0.1461451  0.6460834  2.6741347

Number of Observations: 98
Number of Groups: 9

Output 19.17


The model for those who had surgery for a physical reason is shown in Output 19.16,
whereas the model for those who had cosmetic surgery is in Output 19.17. The latter shows that
for those operated on only to change their appearance, surgery almost significantly predicted
quality of life after surgery, b = -4.31, t(87) = -1.89, p = .06. The negative gradient
shows that for these people quality of life after surgery was lower compared to the control
group. For those who had surgery to solve a physical problem, surgery did not
significantly predict quality of life, b = 1.20, t(166) = 0.57, p = .57. However, the slope
was positive, indicating that people who had surgery scored higher on quality of life than
those on the waiting list (although not significantly so!). The interaction effect, therefore,
reflects the difference in slopes for surgery as a predictor of quality of life in those who had
surgery for physical problems (slight positive slope) and those who had surgery purely for
vanity (a negative slope).

We could sum up these results by saying that quality of life after surgery, after controlling
for quality of life before surgery, was lower for those who had surgery to change their
appearance than those who had surgery for a physical reason. This makes sense because for 
those having surgery to correct a physical problem, the surgery has probably brought relief 
and so their quality of life will improve. However, for those having surgery for vanity they 
might well discover that having a different appearance wasn’t actually at the root of their 
unhappiness, so their quality of life is lower. 



CRAMMING SAM’S TIPS 


Multilevel models R output 


• The -2LL and its significance can be used to compare models that are the same in all but one parameter. The AIC and BIC
can also be compared across models (but not significance tested).

• The fixed effects tell you whether your predictors significantly predict the outcome. If the significance value is less than .05
then the effect is significant.

• Interpret the nature of the effect using the regression coefficient and its confidence interval. The direction of these coefficients
tells us whether the relationship between each predictor and the outcome is positive or negative. To get the confidence
intervals you need to use the intervals() function.

• The standard deviation of random effects can tell us how much intercepts and slopes varied over our level 1 variable. The 
significance of these estimates can be ascertained from their confidence intervals, obtained using the intervals() function. 


19.7. Growth models


Growth models are extremely important in many areas of science, including psychology, 
medicine, physics, chemistry and economics. In a growth model the aim is to look at the 
rate of change of a variable over time: for example, we could look at white blood cell 
counts, attitudes, radioactive decay or profits. In all cases we’re trying to see which model 
best describes the change over time. 


19.7.1. Growth curves (polynomials)


Figure 19.10 gives some examples of possible growth curves. This diagram shows three
polynomials representing a linear trend (the dark blue line) otherwise known as a first-order
polynomial, a quadratic trend (the light blue line) otherwise known as a second-order
polynomial, and a cubic trend (the black line) otherwise known as a third-order polynomial.
Notice first that the linear trend is a straight line, but as the order of the polynomial
increases the lines get more and more curved, indicating more rapid growth over time. Also, as
the order increases, the change in the curve is quite dramatic (so dramatic that I adjusted the
scale of the graph to fit all three curves on the same diagram). This observation highlights the fact
that any growth curve higher than a quadratic (or possibly cubic) trend is very unrealistic
in real data. By fitting a growth model to the data we can see which trend best describes
the growth of an outcome variable over time (though no one will believe that a significant
fifth-order polynomial is telling us anything meaningful about the real world!).

FIGURE 19.10 Illustration of a first-order (linear, blue), second-order (quadratic, light blue) and
third-order (cubic, black) polynomial

The growth curves that we have described might seem familiar to you: they
are the same as the trends that we described for ordered means in section 10.4.5.
What we are discussing now is really no different. There are just two important
things to remember when fitting growth curves: (1) you can fit polynomials up
to one less than the number of time points that you have; and (2) a polynomial
is defined by a simple power function. On the first point, this means that with
three time points you can fit a linear and quadratic growth curve (or a first- and
second-order polynomial), but you cannot fit any higher-order growth curves.
Similarly, if you have six time points you can fit up to a fifth-order polynomial.
This is the same basic idea as having one less contrast than the number of groups
in ANOVA (see section 10.4).

On the second point, we have to define growth curves manually in multilevel models in
R: there is not a convenient option that we can select to do it for us. However, this is quite
easy to do. If time is our predictor variable, then a linear trend is tested by including this
variable alone. A quadratic or second-order polynomial is tested by including a predictor
that is time², a cubic or third-order polynomial is tested by including a predictor that is
time³ and so on. So any polynomial is tested by including a variable that is the predictor to
the power of the order of polynomial that you want to test: for a fifth-order polynomial we
need a predictor of time⁵ and for an nth-order polynomial we would have to include timeⁿ as
a predictor. Hopefully you get the general idea.
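As a rough sketch of what ‘the predictor to the power of the order’ looks like in R, the lines below build squared and cubed versions of a time variable. The data frame toyGrowth and the column names Time2 and Time3 are made up for illustration; the formula is only constructed, not fitted, so no data are needed for it.

```r
# Four time points coded 0 to 3 (a made-up example)
toyGrowth <- data.frame(Time = 0:3)

# Higher-order terms are just the predictor raised to a power
toyGrowth$Time2 <- toyGrowth$Time^2   # quadratic (second-order) term
toyGrowth$Time3 <- toyGrowth$Time^3   # cubic (third-order) term
toyGrowth

# Inside a model formula, ^ has a special meaning, so the power is wrapped
# in I() to tell R to treat it as ordinary arithmetic
trendFormula <- Life_Satisfaction ~ Time + I(Time^2) + I(Time^3)
attr(terms(trendFormula), "term.labels")
```

With four time points the cubic term is the highest that can be fitted, in line with the ‘one less than the number of time points’ rule.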
19.7.2. An example: the honeymoon period



I once saw a brilliant talk given by Professor Daniel Kahneman, who won the 2002
Nobel Prize for Economics. In this talk Kahneman brought together an enormous
amount of research on life satisfaction (he explored questions such as whether people
are happier if they are richer). There was one graph in this talk that particularly
grabbed my attention. It showed that leading up to marriage people reported greater
life satisfaction, but by about two years after marriage this life satisfaction decreased
back to its baseline level. This graph perfectly illustrated what people talk about as
the ‘honeymoon period’: a new relationship/marriage is great at first (no matter how
ill suited you may be) but after six months or so the cracks start to appear and
everything turns to elephant dung. Kahneman argued that people adapt to marriage; it
does not make them happier in the long run (Kahneman & Krueger, 2006).⁸ This got
me thinking about relationships not involving marriage (is it marriage that makes you
happy, or just being in a long-term relationship?). Therefore, in a completely fictitious
parallel world where I don’t research child anxiety, but instead concern myself
with people’s life satisfaction, I collected some data. I organized a massive speed-dating
event (see Chapter 14). At the start of the night I measured everyone’s life
satisfaction (Satisfaction_Baseline) on a 10-point scale (0 = completely dissatisfied,
10 = completely satisfied) and their gender (Gender). After the speed dating I noted
all of the people who had found dates. If they ended up in a relationship with the
person that they met on the speed-dating night then I stalked these people over the
next 18 months of that relationship. As such, I had measures of their life satisfaction
at 6 months (Satisfaction_6_Months), 12 months (Satisfaction_12_Months) and 18
months (Satisfaction_18_Months), after they entered the relationship. None of the
people measured were in the same relationship (i.e., I measured only life satisfaction
from one of the people in the couple).⁹ Also, as is often the case with longitudinal
data, I didn’t have scores for all people at all time points because not everyone was
available at the follow-up sessions. One of the benefits of a multilevel approach is
that these missing data do not pose a particular problem. The data are in the file
Honeymoon Period.dat.

Load these data into R by executing the following command: 


satisfactionData = read.delim("Honeymoon Period.dat", header = TRUE)


Figure 19.11 shows the data. Each dot is a data point and the line shows the average
life satisfaction over time. Basically, from baseline, life satisfaction rises slightly at time 1
(6 months) but then starts to decrease over the next 12 months. There are two things to
note about the data. First, time 0 is before the people enter into their new relationship, yet
already there is a lot of variability in their responses (reflecting the fact that people will
vary in their satisfaction due to other reasons such as finances, personality and so on). This
suggests that intercepts for life satisfaction differ across people. Second, there is also a lot
of variability in life satisfaction after the relationship has started (time 1) and at all
subsequent time points, which suggests that the slope of the relationship between time and life
satisfaction might vary across people also. If we think of the time points as a level 1
variable that is nested within people (a level 2 variable) then we can easily model this variability
in intercepts and slopes within people. We have a situation similar to Figure 19.4 (except
with two levels instead of three, although we could add in the location of the speed-dating
event as a level 3 variable if we had that information).

FIGURE 19.11 Life satisfaction over time

8 The romantics among you might be relieved to know that others have used the same data to argue the complete
opposite: that married people are happier than non-married people in the long term (Easterlin, 2003).

9 However, I could have measured both people in the couple because, using a multilevel model, I could have
treated people as being nested within ‘couples’ to take account of the dependency in their data.


19.7.3. Restructuring the data


The first problem with having data measured over time is that to do a multilevel model the 
data need to be in a different format than what we are used to. For a repeated-measures 
design we normally set up the data with each row representing a person: in this case, the 
repeated-measures variable of time will be represented by four different columns (see
section 3.9.4). We saw in Chapter 3 that this is called the ‘wide’ format. If we were going to
run an ordinary repeated-measures ANOVA, this data layout would be fine; however, for a 
multilevel model we need the variable Time to be represented by a single column. We refer 
to this format as the ‘long’ format. As such we need to restructure the data. 



SELF-TEST

Thinking back to Chapter 3, use the melt() function
to restructure the data into long format. If you get
stuck, section 3.9.4 shows you how. Call your new
dataframe restructuredData.
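To make the wide and long layouts concrete, here is a small sketch that uses base R’s reshape() function rather than the melt() function the self-test asks for (this swap is my own choice so that the example needs no extra packages). The data frame wideToy and its scores are invented; the column names follow those described for the honeymoon data, and the long-format names Life_Satisfaction and Time are assumed to match the ones used in the models in this section.

```r
# A made-up wide-format data set: one row per person, one column per time point
wideToy <- data.frame(
  Person = 1:3,
  Gender = c(0, 1, 1),
  Satisfaction_Baseline  = c(6, 7, 4),
  Satisfaction_6_Months  = c(6, 7, 6),
  Satisfaction_12_Months = c(5, 8, NA),   # NA = missing follow-up score
  Satisfaction_18_Months = c(2, 4, 2)
)

# Stack the four satisfaction columns into one column, indexed by Time (0 to 3)
longToy <- reshape(
  wideToy,
  direction = "long",
  varying   = list(c("Satisfaction_Baseline", "Satisfaction_6_Months",
                     "Satisfaction_12_Months", "Satisfaction_18_Months")),
  v.names   = "Life_Satisfaction",
  timevar   = "Time",
  times     = 0:3,
  idvar     = "Person"
)

# Sort so that each person's four rows sit together
longToy <- longToy[order(longToy$Person, longToy$Time), ]
longToy
```

Each person now contributes four rows (one per time point), and the missing score simply appears as NA in its own row.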


19.7.4. Setting up the basic model


Now that we have our data set up, we can run the analysis. Essentially, we can set up this
analysis in a very similar way to the previous example. There is only one important
difference: because we are working with time series data we have to model the covariance
structure (see section 19.4.2). The most common way to do this is to assume a first-order
autoregressive covariance structure; to remind you, this means that data points close in
time are assumed to be more highly correlated than data points distant in time. In all other
respects we set up the model in the same way as in the previous example.

First we fit a baseline model in which we include only the intercept. As in the previous 
example, this is done using the gls() function: 

intercept <-gls(Life_Satisfaction ~ 1, data = restructuredData, method = 
"ML", na.action = na.exclude) 


Note that we have asked R to create an object called intercept, and we have specified that
Life_Satisfaction is the outcome variable and that it is predicted from only the intercept
(the ‘~1’ in the function). The rest of the function specifies the data (data = restructuredData)
and how to estimate the model (method = "ML"). There is one important difference
compared to the previous example. We have included a new option na.action = na.exclude.
This is because the current data file (unlike the last example) has missing data and these
missing values are specified as ‘NA’ in the data file. This option tells R what to do when
it encounters an ‘NA’, and we have set it to exclude these cases (see R’s Souls’ Tip 19.2).
Without this option the model would return an error.

Next, we need to fit the same model, but this time allowing the intercepts to vary across 
contexts (in this case we want them to vary across people). As in the previous example, we 
use the lme() function: 

randomIntercept <-lme(Life_Satisfaction ~ 1, data = restructuredData,
random = ~1|Person, method = "ML", na.action = na.exclude, control =
list(opt="optim"))


The format of this command is the same as the previous example. We create a model (called
randomIntercept) that predicts life satisfaction from only the intercept (Life_Satisfaction ~ 1),
but also allows intercepts to vary across people (random = ~1|Person). Remember that the
variable Person in the data set is a numeric variable that indicates whether data come from
the same person. We have again asked for maximum-likelihood estimation (method = "ML")
and to exclude data points that are missing (na.action = na.exclude). However, note that
there is a new option that we have not encountered before: control = list(opt="optim"). This
option changes the optimizer that R uses to estimate the model. Normally the default
optimizer is fine, but for these data some of the models cannot be computed using the default,
so I have changed it to one that succeeds (see R’s Souls’ Tip 19.4).



R’s Souls’ Tip 19.4

Advanced options for lme()

You can use the control option in lme() to change the default options for the estimation procedure. A couple of
the parameters you might want to change if your models won’t converge are:

• maxIter: This sets the maximum number of iterations that R will use to reach a solution. The default is 50,
but if your model fails to converge then you can increase this value, for example to 100, using control =
list(maxIter = 100).

• opt: This sets the optimizer that is used. The default (from R 2.2.0) is called nlminb, but there is an
alternative optimizer called optim. If your model fails to converge it can be useful to try using a different optimizer.
For example, some of the models in the honeymoon period example do not converge using the defaults,
which is why we changed from the default optimizer by using control = list(opt = "optim").

You can change both parameters using the same option by simply including both in the list() that you specify. For
example, to increase the number of iterations to 2000 and to change the optimizer from nlminb to optim you would
use control = list(maxIter = 2000, opt = "optim"). For a full list of advanced options execute ?lmeControl.
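If you want to check what a set of control options looks like before handing it to lme(), one possibility is to build it with the lmeControl() function from the nlme package and inspect the result. This is only a sketch; the object name ctrl is made up.

```r
library(nlme)

# Build a full set of control options, overriding two of the defaults
ctrl <- lmeControl(maxIter = 2000, opt = "optim")

ctrl$maxIter   # the maximum number of iterations now stored
ctrl$opt       # the optimizer now stored
```

The object could then be supplied as control = ctrl in a call to lme(), which should have the same effect as writing the list() out by hand.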
19.7.5. Adding in time as a fixed effect


In a growth curve analysis, we are primarily interested in one fixed effect: time. This
variable in our data set is the index variable (Time), which specifies whether the life satisfaction
score was recorded at baseline (0), 6 months (1), 12 months (2) or 18 months (3). In the
previous example we built up our models individually using the lme() function; however,
for this example our models have a lot of options (method = "ML", na.action = na.exclude,
control = list(opt="optim")), which we must specify in each new model. This typing would
be tedious, so we will use the update() function to retain everything from a previous model
(including options such as the method, how to deal with missing cases, and the optimization
method) but add things to it (R’s Souls’ Tip 19.3). We can quickly update the previous
model (randomIntercept) to include Time as a predictor by executing:

timeRI<-update(randomIntercept, .~. + Time)

This command creates a new object in R called timeRI. The first argument in the parentheses tells
R which model to update (in this case we have updated randomIntercept). The remainder
tells R how to update this model: .~. simply means ‘keep the previous outcome and all of
the previous predictors’ and + Time means ‘add Time as a predictor’. If you want to have
a look at the new model you can use summary(timeRI).
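The following sketch illustrates the update() idea on simulated data, so that it can be run without the honeymoon file. Everything here is invented for illustration: the data frame toyLong, the model names toyRI and toyTimeRI, and the numbers used to generate the scores.

```r
library(nlme)

# Simulate 30 people measured at 4 time points, with person-specific intercepts
set.seed(42)
nPeople <- 30
toyLong <- data.frame(
  Person = rep(1:nPeople, each = 4),
  Time   = rep(0:3, times = nPeople)
)
personEffect <- rnorm(nPeople, sd = 1.5)
toyLong$Life_Satisfaction <- 7 - 0.9 * toyLong$Time +
  personEffect[toyLong$Person] + rnorm(nrow(toyLong), sd = 1)

# Random-intercept model with no predictors
toyRI <- lme(Life_Satisfaction ~ 1, data = toyLong,
             random = ~1 | Person, method = "ML")

# Keep everything from toyRI (data, random part, method) and add Time
toyTimeRI <- update(toyRI, . ~ . + Time)

names(fixef(toyTimeRI))   # the fixed effects now in the model
toyTimeRI$method          # the estimation method carried over from toyRI
```

Note that only the fixed part of the model was mentioned in update(); the random part and the estimation method were inherited from the model being updated.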


19.7.6. Introducing random slopes


We can add a random slope to the model very simply using the update() function and
respecifying the random part of the model. At the moment, the random part of the model
is specified as random = ~1|Person, which means that intercepts (~1) vary across people
(Person). If we want slopes to vary across people as well, then we’re saying that the effect
of Time is different in different people. This is a standard growth model scenario: the rate
of development or growth over time differs within entities (in this case people, but it could
be companies, mice, states, hospitals, schools, geographical areas, etc.). As such, we want
to update the random part of the model to be random = ~Time|Person, which means that
intercepts and the effect of time (~Time) vary across people (Person). We use the update()
function to create a new model (called timeRS) which is identical to the previous model
(timeRI) but updates the random part of the model to be random = ~Time|Person:

timeRS<-update(timeRI, random = ~Time|Person)


19.7.7. Modelling the covariance structure


Now we have a basic random effects model, we can introduce a term that models the
covariance structure of errors (see section 19.4.2). We do this by using the option correlation =
x, in which x is one of several pre-defined covariance structures. The covariance
structures that you are most likely to use are (for a full list execute ?corClasses):

• corAR1(): This is a first-order autoregressive covariance structure (see Jane Superbrain
Box 19.1). It should be used when time points are equally spaced (as is the case in the
current example).

• corCAR1(): This is similar to the above but for use with a continuous time covariate.
Basically you should use this covariance structure if your time points are not equally spaced.

• corARMA(): Another autoregressive error structure, but this one allows the correlation
structure to involve a moving average of error variance (see Jane Superbrain Box 19.1).

We can add a covariance structure to the model using the update() function to create a new
model (called ARModel) which is identical to the previous model (timeRS) but adds in a
first-order autoregressive covariance structure:

ARModel<-update(timeRS, correlation = corAR1(0, form = ~Time|Person))

Note that we have used correlation = corAR1(0, form = ~Time|Person). We could have used
the default setting, which would be to use correlation = corAR1(), but this would include
only the random intercept (it would be the same as specifying correlation = corAR1(0, form
= ~1|Person)).
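To get a feel for what a first-order autoregressive structure assumes, the sketch below asks nlme for the correlation matrix implied by corAR1() across four equally spaced time points. The starting value of 0.5 is made up purely to make the pattern easy to see (the model above starts from 0 and estimates the value from the data), and the object names ar1 and ar1Matrix are invented.

```r
library(nlme)

# An AR(1) structure with an illustrative correlation of 0.5 between
# adjacent time points
ar1 <- corAR1(0.5)

# The correlation matrix this structure implies for four time points
ar1Matrix <- corMatrix(ar1, covariate = 1:4)
round(ar1Matrix, 3)
```

Scores one time point apart are assumed to correlate most strongly, and the assumed correlation shrinks as the gap between time points grows.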



JANE SUPERBRAIN 19.1

Autoregressive and moving average models

Autoregressive models (AR) are very difficult to understand. I do not really understand
them, and I doubt I ever will. Imagine that we have a time series ($Y_t$) and we adjust
this by subtracting the mean ($y_t = Y_t - \bar{Y}$). The t in these
equations just represents different points in time. As far
as I can gather, if we have a first-order autoregressive
model, AR(1), then we predict these adjusted values from
the adjusted values at the previous time point (i.e., $t - 1$).
We can use a standard linear model to do this:

$$y_t = -a_1 y_{t-1} + e_t$$

The $a_1$ is known as the lag or autoregressive coefficient,
and $e_t$ is the residual or error at time t. You can hopefully
see that this is a simple linear model in which values at
one time are predicted from values at a previous time
(the word autoregressive reflects the fact that values are
predicted from themselves).

A second-order autoregressive model, AR(2), is much
the same except that we’re interested not just in the previous
time point, but in the previous two time points. The
model simply expands to include this extra time point:

$$y_t = -a_1 y_{t-1} - a_2 y_{t-2} + e_t$$

You should be able to extend this basic logic to understand
third- and fourth-order autoregressive models. In these
models, residuals are assumed not to correlate; in other
words, errors at one time point are not believed to correlate
with errors at another time point. The data themselves
correlate at different time points, but the errors don’t.

However, we might want to assume that the residuals
also correlate over time. In other words, our data at time
t can be predicted not just from the data at the previous
time point but also the error at previous time points. This
is known as a moving average (MA) model. Like AR models,
a first-order moving average model factors in residuals
at the current time point ($e_t$) and also residuals from
the previous time point ($e_{t-1}$). A second-order MA model
would factor in residuals for the current time point, the
previous time point, and two points back in time ($e_{t-2}$),
and so on. So, for example, if we had an AR(1) and MA(1)
our model becomes:

$$y_t = -a_1 y_{t-1} + e_t + c_1 e_{t-1}$$

Note that this is the same as the AR(1) model except that
there is an extra term representing the error from the previous
time point, $e_{t-1}$, which is multiplied by $c_1$, a coefficient
representing the first-order moving average. Models that
combine autoregressive and moving average models are
known as ARMA models. ARMA models have two parameters:
p defines the order of the autoregressive part of the
model, q specifies the order of the moving average part
of the model. In both cases 1 = first-order, 2 = second-order
and so on. Therefore, ARMA(p = 2, q = 1) would be a
second-order autoregressive model with a first-order moving
average, ARMA(p = 2, q = 2) would be a second-order
autoregressive model with a second-order moving average.
Hopefully you get the general gist because my brain is
literally about to explode all over my screen.
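If equations make your brain explode too, it can help to build a first-order autoregressive series by hand. The sketch below simulates one in base R. The coefficient phi is my own name for the multiplier on the previous value (it corresponds to the negative of the autoregressive coefficient in the box’s notation), and its value of 0.6, the series length and the other object names are all made up.

```r
set.seed(1)

nTime <- 200          # number of time points to simulate
phi   <- 0.6          # multiplier on the previous value (an invented value)
e     <- rnorm(nTime) # independent errors, one per time point

# Build the series one time point at a time: each value is predicted from
# the value before it, plus that time point's own error
y <- numeric(nTime)
y[1] <- e[1]
for (t in 2:nTime) {
  y[t] <- phi * y[t - 1] + e[t]
}

# Correlation between each value and the value one time point earlier
cor(y[-1], y[-nTime])
```

The errors themselves were generated independently, yet neighbouring values of the series end up correlated, which is exactly the distinction the box draws between the data correlating over time and the errors not doing so.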
19.7.8. Comparing models


So far we have created five models: (1) a baseline model predicting life satisfaction from
only the intercept (intercept); (2) a model with random intercepts across people
(randomIntercept); (3) a model with time as a predictor of life satisfaction and random
intercepts across people (timeRI); (4) a model with time as a predictor, a random effect of time
over people and random intercepts (timeRS); and (5) a model with time as a predictor,
random effects of time across people, a random effect of intercepts across people, and a
first-order autoregressive covariance structure (ARModel). Let’s now look at how these
models fit the data. Each time we have added only one new component to the model so
we can compare them with the log-likelihood as we did in the previous example. We can
compare all five models by executing:

anova(intercept, randomIntercept, timeRI, timeRS, ARModel)



                Model df      AIC      BIC    logLik   Test L.Ratio p-value
intercept           1  2 2064.053 2072.217 -1030.026
randomIntercept     2  3 1991.396 2003.642  -992.698 1 vs 2  74.657  <.0001
timeRI              3  4 1871.728 1888.057  -931.864 2 vs 3 121.667  <.0001
timeRS              4  6 1874.626 1899.120  -931.313 3 vs 4   1.102  0.5763
ARModel             5  7 1872.891 1901.466  -929.445 4 vs 5   3.736  0.0533

Output 19.18


The resulting output, in Output 19.18, shows that adding a random intercept significantly
improved the fit of the model, χ²(1) = 74.66, p < .0001. Similarly, adding the fixed
effect of time to the model significantly improved the fit compared to the previous model,
χ²(1) = 121.67, p < .0001. However, adding a random slope for the effect of time across
participants did not significantly improve the model, χ²(2) = 1.10, p = .576. Finally, adding
a first-order autoregressive covariance structure came close to significantly improving the
model, χ²(1) = 3.74, p = .053. Note that for each model the degrees of freedom change by
1 because we have added only a single parameter;¹⁰ this change in degrees of freedom is
used for the log-likelihood test.
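The likelihood ratios and p-values in Output 19.18 can be reproduced, up to rounding, from the log-likelihoods and degrees of freedom printed in that output. The sketch below does the arithmetic by hand; the object names are made up, and the numbers are simply copied from the output.

```r
# Log-likelihoods and model degrees of freedom copied from Output 19.18
logLiks <- c(intercept = -1030.026, randomIntercept = -992.698,
             timeRI = -931.864, timeRS = -931.313, ARModel = -929.445)
modelDf <- c(2, 3, 4, 6, 7)

# Each likelihood ratio is twice the gain in log-likelihood over the previous model
lRatio   <- 2 * diff(logLiks)
dfChange <- diff(modelDf)

# Compare each ratio to a chi-square distribution with dfChange degrees of freedom
pValue <- pchisq(lRatio, df = dfChange, lower.tail = FALSE)

round(data.frame(dfChange, lRatio, pValue), 4)
```

Because the printed log-likelihoods are themselves rounded, the results can differ from the output in the last decimal place.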

We can take a quick look at the final model, and the confidence intervals for the param¬ 
eter estimates within it by using: 

summary(ARModel); intervals(ARModel)

Output 19.19 shows the resulting model summary and Output 19.20 shows the 95%
confidence intervals. The effect of time, b = -0.87 (-1.03, -0.71), t(322) = -10.97, p < .001,
was highly significant, indicating that life satisfaction significantly changed over the
18-month period (see Figure 19.11). In addition, the standard deviation of intercepts was
1.62 (1.31, 2.02), and for the effect of time across people (slopes) was 0.05 (0.00, 41.83).
Neither of the confidence intervals crosses zero, implying that this variance in slopes and
intercepts was significant. Note that in the case of the slopes, this finding contradicts the
results of the log-likelihood statistic, which implied that adding random slopes did not
significantly improve the model (Output 19.18). The approximate confidence interval for
slopes is very wide and not symmetrical, which implies that we might be wise to give more
weight to the log-likelihood.


10 The exception is the model where we add random slopes. See earlier for an explanation of why the change in 
degrees of freedom is 2 in this case. 
Linear mixed-effects model fit by maximum likelihood
 Data: restructuredData
       AIC      BIC    logLik
  1872.891 1901.466 -929.4453

Random effects:
 Formula: ~Time | Person
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev     Corr
(Intercept) 1.62767553 (Intr)
Time        0.04782877 -0.062
Residual    1.74812486

Correlation Structure: AR(1)
 Formula: ~Time | Person
 Parameter estimate(s):
      Phi
0.2147812
Fixed effects: Life_Satisfaction ~ Time
                Value  Std.Error  DF   t-value p-value
(Intercept)  7.131470 0.21260192 322  33.54377       0
Time        -0.870087 0.07929275 322 -10.97310       0
 Correlation:
     (Intr)
Time -0.527

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max
-2.08400991 -0.62083911  0.06392492  0.59512953  2.49161500

Number of Observations: 438
Number of Groups: 115

Output 19.19 


Approximate 95% confidence intervals

 Fixed effects:
                lower       est.      upper
(Intercept)  6.714162  7.1314700  7.5487782
Time        -1.025728 -0.8700874 -0.7144467
attr(,"label")
[1] "Fixed effects:"

 Random Effects:
  Level: Person
                              lower        est.      upper
sd((Intercept))        1.314720e+00  1.62767553  2.0151263
sd(Time)               5.468419e-05  0.04782877 41.8327717
cor((Intercept),Time) -6.937502e-01 -0.06192455  0.6237635

 Correlation structure:
          lower      est.    upper
Phi 0.002025179 0.2147812 0.408935
attr(,"label")
[1] "Correlation structure:"

 Within-group standard error:
   lower     est.    upper
1.542913 1.748125 1.980630

Output 19.20 


19.7.9. Adding higher-order polynomials


We have seen that the main effect of time is significant. This main effect is the linear
trend of time. However, Figure 19.11 seems to show a more curvilinear change
over time (satisfaction first increases from baseline to 6 months before declining after 6
months). To capture this trend we would need to add a quadratic or perhaps even cubic
trend (refer back to Figure 19.10). There are several ways in R to look for trends. One
way to add quadratic trends is to do it manually. We saw in Figure 19.10 that a quadratic
trend equates to time² and that a cubic trend is simply time³. We could, therefore, simply
create new predictor variables in our dataframe that are time multiplied by itself (time²)
or time multiplied by itself twice (time³). We could then enter these new variables as
predictors into the model.

Fortunately, rather than computing new variables, R can create these new predictors
‘on the fly’. To create the quadratic term we simply specify I(Time^2) as a new predictor.
‘Time^2’ is R’s way of writing ‘time²’ (the ^ means ‘to the power of’); because arithmetic
operators such as +, *, - and ^ can be used to define the form of a model (e.g.,
satisfaction ~ gender + age + age*gender) we need to enclose ‘Time^2’ within the I() function
so that R knows to treat it as an arithmetic operator rather than part of the model
specification. The last model we looked at was called ARModel, and included the main
effect of Time as a predictor. We can use update() to create a new model (timeQuadratic)
that adds the quadratic term to this model:

timeQuadratic <- update(ARModel, .~. + I(Time^2))
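
To see why I() matters, here is a minimal sketch using lm() on simulated data (the variable names and numbers are invented for illustration; they are not the life satisfaction data). Without I(), R reads ^ as a formula operator, so Time^2 collapses back to Time and no quadratic term enters the model.

```r
# Simulated scores at four time points (hypothetical data, coded 0 to 3)
set.seed(42)
Time <- rep(0:3, times = 25)
y <- 6 + 1.5 * Time - 0.5 * Time^2 + rnorm(length(Time))

# Without I(): '^' is a formula operator, so Time^2 reduces to Time
noI <- lm(y ~ Time + Time^2)

# With I(): '^' is treated as arithmetic, so a squared predictor is added
withI <- lm(y ~ Time + I(Time^2))

names(coef(noI))    # no quadratic term appears
names(coef(withI))  # the quadratic term appears as I(Time^2)
```

The same logic applies inside lme(): the fixed-effects formula follows the same formula rules as lm().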

We can create a model (timeCubic) that adds a cubic term in exactly the same way as
for the quadratic trend. This time, we update the quadratic model (timeQuadratic) so that
it includes time³, which is done the same as for the quadratic trend except that we specify
time cubed rather than squared, ‘I(Time^3)’. We can compare these two new models with
the model that included only the linear trend of time (ARModel) using the anova() function
and ask for a summary of the final model using the summary() and intervals() functions.

timeCubic <- update(timeQuadratic, .~. + I(Time^3))
anova(ARModel, timeQuadratic, timeCubic) 
summary(timeCubic) 
intervals(timeCubic) 

Output 19.21 shows the model comparison. It is clear from this that adding the
quadratic term to the model significantly improves the fit, χ²(1) = 57.35, p < .0001;
however, adding in the cubic trend does not, χ²(1) = 3.38, p = .066.

              Model df     AIC      BIC   logLik   Test L.Ratio p-value
ARModel           1  7 1872.89 1901.466 -929.445
timeQuadratic     2  8 1817.54 1850.202 -900.772 1 vs 2  57.347  <.0001
timeCubic         3  9 1816.16 1852.901 -899.081 2 vs 3   3.382  0.0659


Output 19.21 
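
The likelihood ratio tests in Output 19.21 can be reproduced by hand, which helps to show what anova() is doing. The sketch below simply copies the log-likelihoods and degrees of freedom from the output (the object names are my own); the results should agree with the output up to rounding.

```r
# Log-likelihoods and model degrees of freedom copied from Output 19.21
logLiks <- c(ARModel = -929.445, timeQuadratic = -900.772, timeCubic = -899.081)
dfs     <- c(ARModel = 7, timeQuadratic = 8, timeCubic = 9)

# Likelihood ratio: twice the change in log-likelihood between successive models
lRatio <- 2 * diff(logLiks)

# Degrees of freedom for each test: the change in the number of parameters
dfChange <- diff(dfs)

# p-value from the upper tail of the chi-square distribution
pValues <- pchisq(lRatio, df = dfChange, lower.tail = FALSE)

round(lRatio, 3)
round(pValues, 4)
```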
Looking at the summary of the final model, the fixed effects (Output 19.22) and the confidence
intervals (Output 19.23) tell us that the linear, b = 1.55 (0.61, 2.48), t(320) = 3.24,
p < .01, and quadratic, b = -1.33 (-2.15, -0.50), t(320) = -3.15, p < .01, trends both significantly
described the pattern of the data over time; however, the cubic trend was not significant,
b = 0.17 (-0.01, 0.35), t(320) = 1.84, p > .05. This confirms what we already know from
comparing the fit of successive models. The trend in the data is best described by a
second-order polynomial, or a quadratic trend. This reflects the initial increase in life satisfaction
6 months after finding a new partner but a subsequent reduction in life satisfaction at 12
and 18 months after the start of the relationship (Figure 19.11). It’s worth remembering
that this quadratic trend is only an approximation: if it were completely accurate then we
would predict from the model that couples who had been together for 10 years would have
negative life satisfaction, which is impossible given the scale we used to measure it.

Linear mixed-effects model fit by maximum likelihood
 Data: restructuredData
       AIC      BIC    logLik
  1816.162 1852.902 -899.0808

Random effects:
 Formula: ~Time | Person
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev    Corr
(Intercept) 1.8826725 (Intr)
Time        0.4051351 -0.346
Residual    1.4572374

Correlation Structure: AR(1)
 Formula: ~Time | Person
 Parameter estimate(s):
      Phi
0.1326346
Fixed effects: Life_Satisfaction ~ Time + I(Time^2) + I(Time^3)
                Value Std.Error  DF   t-value p-value
(Intercept)  6.634783 0.2230273 320 29.748744  0.0000
Time         1.546635 0.4772221 320  3.240913  0.0013
I(Time^2)   -1.326426 0.4209411 320 -3.151098  0.0018
I(Time^3)    0.171096 0.0929297 320  1.841131  0.0665
 Correlation:
          (Intr) Time   I(T^2)
Time      -0.278
I(Time^2)  0.139 -0.951
I(Time^3) -0.098  0.896 -0.987

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max
-2.58597365 -0.54411056 -0.04373592  0.50525444  2.78413461

Number of Observations: 438
Number of Groups: 115

Output 19.22 

The outputs for the final model also tell us about the random parameters in the model.
First of all, the standard deviation of the random intercepts was 1.88 (1.49, 2.39). The
fact that the 95% confidence interval doesn’t cross zero suggests that we were correct to
assume that life satisfaction at baseline varied significantly across people. Also, the slope
of time varied significantly across people, SD = 0.41 (0.17, 0.96). The confidence
interval again does not cross zero, suggesting that the change in life satisfaction over time
varied significantly across people too. Finally, the correlation between the slopes and
intercepts, -0.35 (-0.67, 0.10), suggests that as intercepts increased, the slope decreased
(although the confidence interval crosses zero so this trend is not significant).

Approximate 95% confidence intervals

 Fixed effects:
                  lower       est.      upper
(Intercept)  6.19800573  6.6347826  7.0715595
Time         0.61204293  1.5466350  2.4812271
I(Time^2)   -2.15079781 -1.3264264 -0.5020551
I(Time^3)   -0.01089790  0.1710958  0.3530895
attr(,"label")
[1] "Fixed effects:"

 Random Effects:
  Level: Person
                           lower       est.      upper
sd((Intercept))        1.4852030  1.8826725 2.38651276
sd(Time)               0.1705194  0.4051351 0.96255585
cor((Intercept),Time) -0.6738687 -0.3461486 0.09538264

 Correlation structure:
         lower      est.     upper
Phi -0.1856231 0.1326346 0.4257069
attr(,"label")
[1] "Correlation structure:"

 Within-group standard error:
   lower     est.    upper
1.173241 1.457237 1.809978

Output 19.23 

Another way to test for trends over time is by converting Time to power polynomials. This
is achieved with a simple function, poly(). Within this function you specify the variable that
you want to be converted, and the number of polynomials you want (up to the number of
time points that you measured minus 1). For example, poly(Time, 1) will create a linear trend,
poly(Time, 2) creates a linear and quadratic trend and poly(Time, 3) creates a linear, quadratic and
cubic trend. In our current example we had four time points so a cubic trend is the highest-order
polynomial that we can have; if we, for example, specified poly(Time, 4) we would get an error.
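
A small sketch of what poly() returns, and of the error you get when you ask for too many polynomials. The Time vector here is a stand-in that I have assumed is coded 0 to 3; it is not the real dataframe.

```r
# Four measurement occasions (assumed coding: 0, 1, 2, 3), five hypothetical people
Time <- rep(0:3, times = 5)

# Three polynomials: one column each for the linear, quadratic and cubic trends
trends <- poly(Time, 3)
dim(trends)

# A fourth-order polynomial cannot be built from only four distinct time points,
# so poly() stops with an error; tryCatch() captures it instead of halting
attempt <- tryCatch(poly(Time, 4), error = function(e) e)
inherits(attempt, "error")
conditionMessage(attempt)
```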

Linear mixed-effects model fit by maximum likelihood
 Data: restructuredData
       AIC      BIC    logLik
  1816.162 1852.902 -899.0808

Random effects:
 Formula: ~Time | Person
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev    Corr
(Intercept) 1.8826725 (Intr)
Time        0.4051351 -0.346
Residual    1.4572374

Correlation Structure: AR(1)
 Formula: ~Time | Person
 Parameter estimate(s):
      Phi
0.1326346
Fixed effects: Life_Satisfaction ~ poly(Time, 3)
                    Value Std.Error  DF   t-value p-value
(Intercept)      5.938943  0.182953 320  32.46157  0.0000
poly(Time, 3)1 -20.615766  1.759420 320 -11.71736  0.0000
poly(Time, 3)2 -11.682913  1.418904 320  -8.23376  0.0000
poly(Time, 3)3   2.439191  1.324833 320   1.84113  0.0665
 Correlation:
               (Intr) p(T,3)1 p(T,3)2
poly(Time, 3)1 -0.009
poly(Time, 3)2 -0.016  0.027
poly(Time, 3)3  0.004 -0.035  0.014

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max
-2.58597365 -0.54411056 -0.04373592  0.50525444  2.78413461

Number of Observations: 438
Number of Groups: 115

Output 19.24 

The advantage of this method of creating polynomials is that the resulting predictors are
orthogonal (i.e., independent). By using Time, Time² and Time³ we create predictors that
are highly correlated, but by using poly() we create predictors that are completely
uncorrelated. This means that we can evaluate one trend without it being affected by another.
If we wanted to add our trends using this method we can again use the update() function
to respecify the model with AR(1) covariance structure (ARModel). Remember that this
model already has time as a predictor, and we need to get rid of this predictor and replace
it with our polynomials. To do this we just respecify the outcome and predictors within the
update function and this will overwrite the previous predictors:

polyModel <- update(ARModel, .~ poly(Time, 3))

Remember that ‘~’ means ‘predicted from’; the ‘.’ means ‘use the same thing as in the
existing model’. As such, .~ poly(Time, 3) translates as ‘use the same outcome as in the
existing model but predict it from poly(Time, 3)’. In other words, the previous predictor
of Time that was specified for the ARModel will be replaced by the linear, quadratic and
cubic polynomials for Time that are created by the poly() function. All other parts of the
ARModel (i.e., the random effects and covariance structure) remain unchanged.
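
The difference between adding to the existing predictors and replacing them can be seen with a quick sketch. I have used lm() and simulated data here (both are my own stand-ins, not the lme() models from this chapter) because update() treats the formula in the same way for both kinds of model.

```r
# Hypothetical data: an outcome measured at four time points
set.seed(1)
Time <- rep(0:3, times = 25)
y <- 7 - 0.9 * Time + rnorm(length(Time))
baseModel <- lm(y ~ Time)

# '. ~ .' keeps the outcome and the existing predictors, then adds a term
addTerm <- update(baseModel, . ~ . + I(Time^2))

# '. ~' keeps only the outcome; everything on the right-hand side is replaced
replaceTerms <- update(baseModel, . ~ poly(Time, 3))

formula(addTerm)
formula(replaceTerms)
```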

Output 19.24 shows the model summary (summary(polyModel)) for the model including
the polynomials. We won’t dwell on this output other than to say that if you compare
it to Output 19.22 it shows the same profile of results: the linear (poly(Time, 3)1) and
quadratic (poly(Time, 3)2) trends are significant, whereas the cubic (poly(Time, 3)3) was just
non-significant (p = .067). The regression coefficients are different (because these contrasts
are orthogonal), but basically the same pattern of results emerges.
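
The claim that raw powers of time are highly correlated while the poly() predictors are not can be checked directly. This sketch assumes a balanced design with Time coded 0 to 3 for 115 people; the real data have some missing observations, so the numbers there will differ slightly.

```r
# Assumed balanced design: 115 people, each measured at times 0, 1, 2 and 3
Time <- rep(0:3, times = 115)

# Raw powers of time are strongly correlated with one another
rawPowers <- cbind(Time = Time, Time2 = Time^2, Time3 = Time^3)
round(cor(rawPowers), 3)

# Orthogonal polynomials from poly() are uncorrelated (zero up to rounding error)
orthPolys <- poly(Time, 3)
round(cor(orthPolys), 3)
```

This is consistent with the correlations between the fixed effects in Output 19.22 (around ±.9 or higher among the time terms) compared with those in Output 19.24 (all close to zero).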




SELF-TEST 

We have used the update() function in this second
example. To get some practice at specifying
multilevel models, try building each of the models in
this example but specifying each one in full.
Further analysis


It’s worth pointing out that I’ve kept this growth curve analysis simple to give you the 
basic tools. In the example I allowed only the linear term to have a random intercept and 
slopes, but given that we discovered that a second-order polynomial described the change 
in responses, we could redo the analysis and allow random intercepts and slopes for the 
second-order polynomial also. To do this we would just have to specify these terms in the
random part of the model. If we were to do this it would make sense to add the random 
components one at a time and test whether they have a significant impact on the model by 
comparing the log-likelihood values or other fit indices. 

Also, the polynomials I have described are not the only ones that can be used. You could 
test for a logarithmic trend over time, or even an exponential one. 



CRAMMING SAM’S TIPS 


Growth models 


• Growth models are multilevel models in which changes in an outcome over time are modelled using potential growth patterns.

• These growth patterns can be linear, quadratic, cubic, logarithmic, exponential, or anything you like really. 

• The hierarchy in the data is that time points are nested within people (or other entities). As such, it’s a way of analysing 
repeated-measures data that have a hierarchical structure. 

• The anova() function can be used to compare the overall fit of hierarchical models. The resulting change in the log-likelihood 
and the significance of this change can be used to ascertain if the fit has been improved (a significant change equates to a 
significant improvement). The AIC and BIC can also be compared across models (but not significance tested). 

• The fixed effects tell you whether the growth functions that you have entered into the model significantly predict the outcome. 
If the p-value is less than .05 then the effect is significant. 

• The intervals() function can be used to get confidence intervals for model parameters. These intervals can tell us how much
intercepts and slopes varied over our level 1 variable, and whether this variance is significant (if the interval does not cross 
zero, it is significant). 

• An autoregressive covariance structure, AR(1), is often assumed in time course data such as that in growth models. 



Labcoat Leni’s Real Research 19.1 


A fertile gesture


Miller, G., Tybur, J. M., & Jordan, B. D. (2007). Evolution and Human Behavior, 28, 375-381.


Most female mammals experience a phase of ‘estrus’ during which they are more sexually receptive,
proceptive, selective and attractive. As such, the evolutionary benefit to this phase is believed to be to attract mates
of superior genetic stock. However, some people have argued that this important phase became uniquely lost 
or hidden in human females. Testing these evolutionary ideas is exceptionally difficult, but Geoffrey Miller and 
his colleagues came up with an incredibly elegant piece of research that did just that. They reasoned that if the 
‘hidden-estrus’ theory is incorrect then men should find women most attractive during the fertile phase of their 
menstrual cycle compared to the pre-fertile (menstrual) and post-fertile (luteal) phase. 




To measure how attractive men found women in an ecologically valid way, they came up with the ingenious idea 
of collecting data from women working at lap-dancing clubs. These women maximize their tips from male visitors 
by attracting more dances. In effect the men ‘try out’ several dancers before choosing a dancer for a prolonged 
dance. For each dance the male pays a 'tip'. Therefore, the greater the number of men choosing a particular 
woman, the more her earnings will be. As such, each dancer’s earnings are a good index of how attractive the 
male customers have found her. Miller and his colleagues argued, therefore, that if women do have an estrus 
phase then they will be more attractive during this phase and therefore earn more money. This study is a brilliant 
example of using a real-world phenomenon to address an important scientific question in an ecologically valid way. 

The data for this study are in the file Miller et al. (2007).dat. The researchers collected data via a website
from several dancers (ID), who provided data for multiple lap-dancing shifts (so for each person there are
several rows of data). They also measured what phase of the menstrual cycle the women were in at a given shift
(Cyclephase), and whether they were using hormonal contraceptives (Contraceptive) because this would
affect their cycle. The outcome was their earnings on a given shift in dollars (Tips).

A multilevel model can be used here because the data are unbalanced: the women differed in the number of 
shifts they provided data for (the range was 9 to 29 shifts); multilevel models can handle this problem. 

Labcoat Leni wants you to carry out a multilevel model to see whether Tips can be predicted 
from Cyclephase, Contraceptive and their interaction. Is the ‘estrus-hidden’ hypothesis supported? 
Answers are in the additional material on the companion website (or look at page 378 in the original 
article). 



19.8. How to report a multilevel model


Specific advice on reporting multilevel models is hard to come by. Also, the models themselves 
can take on so many forms that giving standard advice is hard. If you have built up your model 
from one with only fixed parameters to one with a random intercept, and then random slope, 
it is advisable to report all stages of this process (or at the very least report the fixed-effects- 
only model and the final model). For any model you need to say something about the random 
effects. For the final model of the cosmetic surgery example you could write something like: 

The relationship between surgery and quality of life showed significant variance in
intercepts across participants, SD = 5.48 (95% CI: 3.31, 9.07), χ²(1) = 107.65, p <
.0001. In addition, the slopes varied across participants, SD = 5.42 (3.13, 9.37),
χ²(2) = 38.87, p < .0001, and the slopes and intercepts were negatively and significantly
correlated, cor = -.95 (-.99, -.60).

For the model itself, you have two choices. The first is to report the results in the text,
with the b-values, ts and degrees of freedom for the fixed effects, and then report the
parameters for the random effects in the text as well. The second is to produce a table of
parameters as you would for regression. For example, we might report our cosmetic surgery
example as follows:

Quality of life before surgery significantly predicted quality of life after surgery, b
= 0.31, t(262) = 5.75, p < .001, surgery did not significantly predict quality of life,
b = -3.19, t(262) = -1.46, p = .15, but the reason for surgery, b = -3.52, t(262) = -3.08,
p < .01, and the interaction of the reason for surgery and surgery, b = 4.22, t(262) =
2.48, p < .05, both did significantly predict quality of life. This interaction was broken
down by conducting separate multilevel models on the ‘physical reason’ and
‘attractiveness reason’. The models specified were the same as the main model but excluded
the main effect and interaction term involving the reason for surgery. These analyses
showed that for those operated on only to change their appearance, surgery almost
significantly predicted quality of life after surgery, b = -4.31, t(87) = -1.89, p = .06:
quality of life was lower after surgery compared to the control group. However, for
those who had surgery to solve a physical problem, surgery did not significantly
predict quality of life, b = 1.20, t(166) = 0.57, p = .57. The interaction effect, therefore,
reflects the difference in slopes for surgery as a predictor of quality of life in those who
had surgery for physical problems (slight positive slope) and those who had surgery
purely for vanity (a negative slope).

Alternatively we could present parameter information in a table: 


                       b    SE b   95% CI

Baseline QoL        0.31    0.05   0.20, 0.41
Surgery            -3.19    2.19   -7.45, 1.08
Reason             -3.52    1.14   -5.74, -1.29
Surgery × Reason    4.22    1.70   0.90, 7.54



What have I discovered about statistics?


Writing this chapter was quite a steep learning curve for me. I’ve been meaning to learn 
about multilevel modelling for ages, and now I finally feel like I know something. This is 
pretty amazing considering that the bulk of the reading and writing was done between
11 pm and 3 am over many nights. However, despite now feeling as though I understand
them, I don’t, and if you feel like you now understand them then you’re wrong. This
sounds harsh, but sadly multilevel modelling is very complicated and we have scratched
only the surface of what there is to know. Multilevel models often fail to converge, with no
apology or explanation, and trying to fathom out what’s happening can feel like hammering
nails into your head.

Needless to say, I didn’t mention any of this at the start of the chapter because I wanted 
you to read it. Instead, I lulled you into a false sense of security by looking gently at how 
data can be hierarchical and how this hierarchical structure can be important. Most of 
the tests in this book simply ignore the hierarchy. We also saw that hierarchical models 
are just basically a fancy regression in which you can estimate the variability in the slopes 
and intercepts within entities. We saw that you should start with a model that ignores the 
hierarchy and then add in random intercepts and slopes to see if they improve the fit of 
the model. Having submerged ourselves in the warm bath of standard multilevel models,
we moved on to the icy lake of growth curves. We saw that there are ways to model
trends in the data over time (and that these trends can also have variable intercepts and
slopes). We also discovered that these trends have long confusing names like fourth-order
polynomial. We asked ourselves why they couldn’t have a sensible name, like Kate. In
fact, we decided to ourselves that we’d secretly call a linear trend Kate, a quadratic trend
Benjamin, a cubic trend Zoe, and a fourth-order trend Doug. ‘That will show the statisticians’
we thought to ourselves, and felt a little bit self-satisfied too.

We also saw that after years of denial, my love of making a racket got the better of me. 
This brings my life story up to date. Admittedly I left out some of the more colourful bits, 
but only because I couldn’t find an extremely tenuous way to link them to statistics. We 
saw that over my life I managed to completely fail to achieve any of my childhood dreams. 
It’s OK, I have other ambitions now (a bit smaller scale than ‘rock star’) and I’m looking 
forward to failing to achieve them too. The question that remains is whether there is life 
after Discovering Statistics. What effect does writing a statistics book have on your life? 
R packages used in this chapter

car
ggplot2
nlme
reshape

R functions used in this chapter

anova()
aov()
ggplot()
gls()
I()
intervals()
lm()
lme()
logLik()
melt()
poly()
summary()
update()



Key terms that I’ve discovered 

AIC
AR(1)
BIC
Centring
Diagonal
Fixed coefficient
Fixed effect
Fixed intercept
Fixed slope
Fixed variable
Grand mean centring
Group mean centring
Growth curve
Multilevel linear model
Polynomial
Random coefficient
Random effect
Random intercept
Random slope
Random variable
Unstructured
Variance components


Smart Alex’s tasks 



• Task 1: Using the cosmetic surgery example, run the analysis described in section
19.6.9 but also including BDI, Age and Gender as fixed effect predictors. What
differences does including these predictors make?

• Task 2: Using our growth model example in this chapter, analyse the data but include
Gender as an additional covariate. Does this change your conclusions?

• Task 3: Getting kids to exercise (Hill, Abraham, & Wright, 2007): The purpose of
this research was to examine whether providing children with a leaflet based on the
‘theory of planned behaviour’ increases children’s exercise. There were four different
interventions (Intervention): a control group, a leaflet, a leaflet and quiz, and a
leaflet and plan. A total of 503 children from 22 different classrooms were sampled
(Classroom). It was not practical to have children in the same classrooms in different
conditions, therefore the 22 classrooms were randomly assigned to the four different
conditions. Children were asked ‘On average over the last three weeks, I have
exercised energetically for at least 30 minutes ____ times per week’ after the
intervention (Post_Exercise). Run a multilevel model analysis on these data (Hill et al.
(2007).dat) to see whether the intervention affected the children’s exercise levels (the
hierarchy in the data is: children within classrooms within interventions).

• Task 4: Repeat the above analysis but include the pre-intervention exercise scores
(Pre_Exercise) as a covariate. What difference does this make to the results?

Answers can be found on the companion website. 



Further reading 


Kreft, I., & de Leeuw, J. (1998). Introducing multilevel modeling. London: Sage. (This is a fantastic 
book that is easy to get into but has a lot of depth too.) 

Tabachnick, B. G., & Fidell, L. S. (2001). Using multivariate statistics (4th ed.). Boston: Allyn & 
Bacon. (Chapter 15 is a fantastic account of multilevel linear models that goes a bit more in depth 
than I do.) 

Twisk, J. W. R. (2006). Applied multilevel analysis: A practical guide. Cambridge: Cambridge
University Press. (An absolutely superb introduction to multilevel modelling. This book is
exceptionally clearly written and is aimed at novices. Without question, this is the best beginner’s guide
that I have read.)


Interesting real research 


Cook, S. A., Rosser, R., & Salmon, P. (2006). Is cosmetic surgery an effective psychotherapeutic 
intervention? A systematic review of the evidence. Journal of Plastic, Reconstructive & Aesthetic 
Surgery, 59, 1133-1151. 

Miller, G., Tybur, J. M., & Jordan, B. D. (2007). Ovulatory cycle effects on tip earnings by lap dancers: 
Economic evidence for human estrus? Evolution and Human Behavior, 28, 375-381. 






Epilogue: life after 
discovering statistics 



Here’s some questions that the writer sent 
Can an observer be a participant? 

Have I seen too much? 

Does it count if it doesn’t touch? 

If the view is all I can ascertain, 

Pure understanding is out of range 

(Fugazi, ‘Ex Spectator’, The Argument, 2001) 

When I wrote the SPSS version of this book my main ambition was to write a statistics book 
that I would enjoy reading. Pretty selfish, I know. I thought that if I had a reference book 
that had a few examples that amused me then it would make life a lot easier when I needed 
to look something up. I honestly didn’t think anyone would buy the thing (well, apart 
from my mum and dad) and I anticipated a glut of feedback along the lines of ‘the whole 
of Chapter X is completely wrong and you’re an arrant fool’, or ‘you should be ashamed 
of how many trees have died in the name of this rubbish, you brainless idiot’. In fact, even 
the publishers didn’t think it would sell (they have only revealed this subsequently, I might 
add). There are several other things that I didn’t expect to happen: 

1 Nice emails: I didn’t expect to receive hundreds of extremely nice emails from people 
who liked the book. To this day it still absolutely amazes me that anyone reads it, let 
alone takes the time to write me a nice email, and knowing that the book has helped 
people always puts a huge smile on my face. When the nice comments are followed 
by four pages of statistics questions the smile fades a bit ... 

2 Everybody thinks that I’m a statistician: I should have seen this one coming really, 
but since writing a statistics textbook everyone assumes that I’m a statistician. I’m 
not, I’m a psychologist. Consequently, I constantly disappoint people by not being 
able to answer their statistics questions. In fact, this book is the sum total of my 
knowledge about statistics; there is nothing else (statistics-wise) in my brain that 
isn’t in this book. Actually, that’s a lie: there is more in this book about statistics than 
in my brain. For example, in the logistic regression chapter there is an example on 
multinomial logistic regression. To write this new section I read a lot about multinomial
logistic regression because I’d never used it. I wrote that section about three
years ago, and I’ve now forgotten everything that I wrote. Should I ever need to do
a multinomial logistic regression I will read the chapter in this book and think to
myself ‘wow, it really sounds as though I know what I’m talking about’. 

3 Craziness on a grand scale: The nicest thing about life after discovering statistics is the
effort that people go to to demonstrate that they are even stranger than me. All of these 
people have made life after Discovering statistics a profoundly enjoyable experience. 

• Catistics: I’ve had quite a few photographs of people’s cats (and dogs) reading my 
book (check them out, or post some new ones, on my Facebook page at
http://www.facebook.com/ProfAndyField). There has been many a week where one of these in
my inbox has turned what was going to be a steaming turd of a day into a fragrant 
romp through fields of tulips. How can you not get a big stupid grin on your face 
when you see these? 

• Facebook: Two particularly strange people from Exeter (UK) whom I have never 
met set up an ‘Andy Field appreciation society’ on Facebook. I don’t go there 
much because it scares me a bit. But secretly I think it is quite cool. It’s almost like 
being the rock star that I always wanted to be, except that when people join a rock 
star’s appreciation society they mean it, but people join mine because it’s funny. 
Nevertheless, beggars can’t be choosers and I’m happy to overlook a technicality 
such as the truth if it means that I can believe that I’m popular. 

• Films: Possibly the strangest thing to have happened is Julie-Renee Kabriel and
her bonkers friends from Washburn University producing a video homage to 
‘Discovering Stats’ (http://www.youtube.com/watch?v=oLsrt594Xxc). I was in 
equal parts crippled with laughter and utterly bemused watching this video. My 
parents liked it too. (Oddly enough, it’s to the tune of ‘Sweet Home Alabama’ by 
Lynyrd Skynyrd; I once gave a talk at Aberdeen University (UK) after which I got 
taken to a bar and ended up (quite unexpectedly) playing drums to that song with a 
makeshift band of complete strangers.) 

• Invitation to an autopsy: I got invited to an autopsy. Really! Some (very nice) forensic
scientists in Leicester loved this book so much that they felt that I needed to be
rewarded for my efforts. They felt that the most appropriate reward would be to offer 
to take me to see a dead body being carved up (or to spend a day visiting crime scenes). 
In a strange way, I can see their logic. I haven’t been because I’m slightly scared that 
it’s a cruel trick and that it will turn out to be my body on the slab. However, in the 
interests of having a good story for the next edition I might just go ... 

• Befriended by Satan: I got an email from the then manager of a black metal band, 
Abgott, who, while using my book for her studies, was impressed to see that I 
like black metal bands. My band was playing the next week in 
London and, never one to miss an opportunity, I invited her to 
come along. She not only turned up, but brought some of the 
band and some free CDs. They renamed me ‘The Evil Statistic’. 

Buy their albums, buy their albums, buy their albums ... 

Life after Discovering statistics never ceases to amuse me. I never 
dreamed for a second that I’d be writing numerous editions and an 
adaptation of it for R. I would recommend writing a statistics book to 
anyone: it changes your life. You get a constant warm fuzzy feeling from 
being told that you’ve helped people, strangers send you photos of their 
pets, they make films about you, they give you CDs, you get an appreciation 
society, you can go to see corpses being cut up, join a black metal 
band (well, maybe not, but if my drumming improves and their drummer’s 
arms and legs fall off, who knows?) and have people constantly 
overestimate your intelligence. Long may the craziness continue. 





Troubleshooting R 

• Have you installed and loaded the necessary package? 
• Check there are no spaces in the variable names in the original data file. 
• Have you told R what to do with missing cases (e.g., use na.rm = TRUE, or na.action = na.exclude)? 
• Try sourcing the functions from his web page (Chapter 5). 
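
The missing-cases check is the one that catches most people out, so here is a minimal sketch of what the two options do. The scores and the little data set are made up for illustration; only na.rm and na.action = na.exclude come from the checklist itself.

```r
# Made-up scores with one missing case (NA)
scores <- c(4, 8, NA, 6)

mean(scores)                 # NA: R will not guess what to do with the missing case
mean(scores, na.rm = TRUE)   # 6: the NA is removed before the mean is computed

# For models, na.exclude drops incomplete cases when fitting, but pads
# residuals and fitted values with NA so they still line up with the data
dat <- data.frame(x = c(1, 2, 3, 4, 5), y = c(2.1, NA, 6.2, 7.9, 10.1))
fit <- lm(y ~ x, data = dat, na.action = na.exclude)
length(residuals(fit))       # 5, with NA in position 2
```
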
Glossary 



0: the amount of a clue that Sage have 
about how much effort I put into 
writing this book. 

-2LL: the log-likelihood multiplied 
by minus 2. This version of the 
likelihood is used in logistic 
regression. 

α-level: the probability of making a Type 
I error (usually this value is .05). 

A life: what you don’t have when writing 
statistics textbooks. 

Adjusted mean: in the context of 
analysis of covariance this is the 
value of the group mean adjusted for 
the effect of the covariate. 

Adjusted predicted value: a measure 
of the influence of a particular case 
of data. It is the predicted value of a 
case from a model estimated without 
that case included in the data. The 
value is calculated by re-estimating 
the model without the case in 
question, then using this new model 
to predict the value of the excluded 
case. If a case does not exert a large 
influence over the model then its 
predicted value should be similar 
regardless of whether the model was 
estimated including or excluding that 
case. The difference between the 
predicted value of a case from the 
model when that case was included 
and the predicted value from the 
model when it was excluded is the 
DFFit. 

Adjusted R²: a measure of the loss 
of predictive power or shrinkage in 
regression. The adjusted R² tells us 
how much variance in the outcome 
would be accounted for if the 
model had been derived from the 
population from which the sample 
was taken. 


AIC (Akaike’s information criterion): 
a goodness-of-fit measure that is 
corrected for model complexity. 
That just means that it takes into 
account how many parameters have 
been estimated. It is not intrinsically 
interpretable, but can be compared 
in different models to see how 
changing the model affects the fit. A 
small value represents a better fit of 
the data. 

Alpha factoring: a method of factor 
analysis. 

Alternative hypothesis: the prediction 
that there will be an effect (i.e., that 
your experimental manipulation will 
have some effect or that certain 
variables will relate to each other). 

Analysis of covariance: a statistical 
procedure that uses the F-ratio to 
test the overall fit of a linear model, 
controlling for the effect that one 
or more covariates have on the 
outcome variable. In experimental 
research this linear model tends to 
be defined in terms of group means, 
and the resulting ANOVA is therefore 
an overall test of whether group 
means differ after the variance in the 
outcome variable explained by any 
covariates has been removed. 

Analysis of variance: a statistical 
procedure that uses the F-ratio to 
test the overall fit of a linear model. 
In experimental research this linear 
model tends to be defined in terms 
of group means, and the resulting 
ANOVA is therefore an overall test of 
whether group means differ. 

ANCOVA: acronym for analysis of 
covariance. 

ANOVA: acronym for analysis of 
variance. 


AR(1): this stands for first-order 
autoregressive structure. It is 
a covariance structure used in 
multilevel models in which the 
relationship between scores changes 
in a systematic way. It is assumed 
that the correlation between scores 
gets smaller over time and variances 
are assumed to be homogeneous. 
This structure is often used for 
repeated-measures data (especially 
when measurements are taken over 
time such as in growth models). 

Autocorrelation: when the residuals 
of two observations in a regression 
model are correlated. 

bᵢ: unstandardized regression 
coefficient. Indicates the strength 
of relationship between a given 
predictor, i, and an outcome in 
the units of measurement of the 
predictor. It is the change in the 
outcome associated with a unit 
change in the predictor. 

βᵢ: standardized regression coefficient. 
Indicates the strength of relationship 
between a given predictor, i, and 
an outcome in a standardized form. 
It is the change in the outcome (in 
standard deviations) associated with 
a one standard deviation change in 
the predictor. 

β-level: the probability of making a Type 
II error (Cohen, 1992, suggests a 
maximum value of .2). 

Bar chart: a graph in which a summary 
statistic (usually the mean) is 
plotted on the y-axis against a 
categorical variable on the x-axis 
(this categorical variable could 
represent, for example, groups of 
people, different times or different 
experimental conditions). A bar 
shows the value of the mean for 
each category. Different-coloured 
bars may be used to represent levels 
of a second categorical variable. 

Bartlett’s test of sphericity: 
unsurprisingly this is a test of the 
assumption of sphericity. This test 
examines whether a variance- 
covariance matrix is proportional 
to an identity matrix. Therefore, 
it effectively tests whether the 
diagonal elements of the variance- 
covariance matrix are equal (i.e., 
group variances are the same), 
and that the off-diagonal elements 
are approximately zero (i.e., the 
dependent variables are not 
correlated). Jeremy Miles, who does 
a lot of multivariate stuff, claims 
he’s never ever seen a matrix that 
reached non-significance using this 
test and, come to think of it, I've 
never seen one either (although I do 
less multivariate stuff) so you’ve got 
to wonder about its practical utility. 

Beer-goggles effect: the phenomenon 
that people of the opposite gender 
(or the same, depending on your 
sexual orientation) appear much 
more attractive after a few alcoholic 
drinks. 

Between-group design: another name 
for independent design. 

Between-subject design: another 
name for independent design. 

BIC (Schwarz’s Bayesian criterion): 
a goodness-of-fit statistic 
comparable to the AIC, although 
it is slightly more conservative (it 
corrects more harshly for the number 
of parameters being estimated). 
It should be used when sample 
sizes are large and the number 
of parameters is small. It is not 
intrinsically interpretable, but can be 
compared in different models to see 
how changing the model affects the 
fit. A small value represents a better 
fit of the data. 
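
To see how -2LL, AIC and BIC hang together, here is a minimal sketch using simulated data (the variables x1, x2 and y are invented for the example). It assumes the usual definitions, AIC = -2LL + 2k and BIC = -2LL + k log(n), where k is the number of estimated parameters, and checks them against R's built-in AIC() and BIC().

```r
# Simulated (made-up) data: a binary outcome and two predictors
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)                        # pure noise, unrelated to the outcome
y  <- rbinom(n, size = 1, prob = plogis(0.5 + 1.2 * x1))

m1 <- glm(y ~ x1,      family = binomial())
m2 <- glm(y ~ x1 + x2, family = binomial())

k1       <- length(coef(m1))          # number of estimated parameters
minus2LL <- -2 * as.numeric(logLik(m1))

# Both criteria start from -2LL and add a penalty for complexity
aic_by_hand <- minus2LL + 2 * k1
bic_by_hand <- minus2LL + log(n) * k1

all.equal(aic_by_hand, AIC(m1))       # TRUE
all.equal(bic_by_hand, BIC(m1))       # TRUE

# Compare models: smaller values indicate a better fit after the penalty
AIC(m1, m2)
BIC(m1, m2)
```

Because log(200) is bigger than 2, the BIC punishes the extra (useless) predictor in m2 more harshly than the AIC does.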

Bimodal: a description of a distribution 
of observations that has two modes. 

Binary logistic regression: logistic 
regression in which the outcome 
variable has exactly two categories. 

Binary variable: a categorical variable 
that has only two mutually exclusive 
categories (e.g., being dead or 
alive). 

Biserial correlation: a standardized 
measure of the strength of 
relationship between two variables 
when one of the two variables is 
dichotomous. The biserial correlation 
coefficient is used when one variable 
is a continuous dichotomy (e.g., has 
an underlying continuum between 
the categories). 

Bivariate correlation: a correlation 
between two variables. 

Bonferroni correction: a correction 
applied to the α-level to control the 
overall Type I error rate when multiple 
significance tests are carried out. 
Each test conducted should use a 
criterion of significance of the α-level 
(normally .05) divided by the number 
of tests conducted. This is a simple 
but effective correction, but tends to 
be too strict when lots of tests are 
performed. 
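
A minimal sketch of the correction in R, using five made-up p-values. It shows the two equivalent ways of applying it: shrinking the criterion, or inflating the p-values with the base R function p.adjust().

```r
alpha   <- 0.05
p_raw   <- c(0.003, 0.012, 0.021, 0.040, 0.300)    # made-up p-values
n_tests <- length(p_raw)

# Option 1: compare each raw p-value with alpha divided by the number of tests
alpha_per_test   <- alpha / n_tests                # 0.01
sig_by_criterion <- p_raw < alpha_per_test

# Option 2 (same decisions): inflate the p-values and compare with alpha
p_bonf            <- p.adjust(p_raw, method = "bonferroni")
sig_by_adjustment <- p_bonf < alpha

identical(sig_by_criterion, sig_by_adjustment)     # TRUE
```

Four of these five tests would be significant at .05 uncorrected, but only one survives the correction, which illustrates how strict it becomes as the number of tests grows.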

Bootstrap: a technique from which the 
sampling distribution of a statistic 
is estimated by taking repeated 
samples (with replacement) from the 
data set (so in effect, treating the 
data as a population from which 
smaller samples are taken). The 
statistic of interest (e.g., the mean 
or the b coefficient) is calculated 
for each sample, from which the 
sampling distribution of the statistic 
is estimated. The standard error of 
the statistic is estimated as 
the standard deviation of the 
sampling distribution created 
from the bootstrap samples. From 
this, confidence intervals and 
significance tests can be computed. 
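
The idea is easier to see in code than in words. This is a minimal sketch using base R only and ten made-up scores; the number of bootstrap samples (2000) and the percentile method for the confidence interval are choices made for this illustration, not the only options.

```r
set.seed(123)
scores <- c(12, 15, 9, 20, 14, 11, 18, 16, 13, 22)   # made-up data
n_boot <- 2000

# Resample the data with replacement and recompute the statistic each time
boot_means <- replicate(n_boot, mean(sample(scores, replace = TRUE)))

boot_se <- sd(boot_means)                            # bootstrap standard error
boot_ci <- quantile(boot_means, c(0.025, 0.975))     # percentile 95% interval

boot_se
boot_ci
```
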

Boredom effect: refers to the 
possibility that performance in tasks 
may be influenced (the assumption 
is a negative influence) by boredom 
or lack of concentration if there are 
many tasks or if the task goes on for 
a long period of time. In short, what 
you are experiencing reading this 
glossary is a boredom effect. 

Box’s test: a test of the assumption of 
homogeneity of covariance matrices. 
This test should be non-significant if 
the matrices are roughly the same. 
Box’s test is very susceptible to 
deviations from multivariate normality 
and so can be non-significant, not 
because the variance-covariance 
matrices are similar across groups, 
but because the assumption of 
multivariate normality is not tenable. 
Hence, it is vital to have some 
idea of whether the data meet the 
multivariate normality assumption 
(which is extremely difficult) before 
interpreting the result of Box’s test. 

Boxplot (a.k.a. box-whisker 
diagram): a graphical 
representation of some important 
characteristics of a set of 
observations. At the centre of 
the plot is the median, which is 
surrounded by a box, the top and 
bottom of which are the limits 
within which the middle 50% of 
observations fall (the interquartile 
range). Sticking out of the top and 
bottom of the box are two whiskers, 
which extend to the most and least 
extreme scores respectively. 

Box-whisker plot: see boxplot. 

Categorical variable: any variable 
made up of categories of 
objects/entities. The UK degree 
classifications are a good example 
because degrees are classified as 
1, 2:1, 2:2, 3, pass or fail. Therefore, 
graduates form a categorical 
variable because they will fall 
into only one of these categories 
(hopefully the category of students 
receiving a first!). 

Central limit theorem: this theorem 
states that when samples are large 
(above about 30) the sampling 
distribution will take the shape of 
a normal distribution regardless of 
the shape of the population from 
which the sample was drawn. For 
small samples the t-distribution 
better approximates the shape of the 
sampling distribution. We also know 
from this theorem that the standard 
deviation of the sampling distribution 
(i.e., the standard error of the sample 
mean) will be equal to the standard 
deviation of the sample (s) divided 
by the square root of the sample 
size (N). 
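
A small simulation makes the theorem less abstract. This is a sketch under made-up conditions: a skewed (exponential) population, samples of 50, and 5000 repeated samples; none of these numbers comes from the text.

```r
set.seed(1)
# A clearly non-normal (skewed) made-up population
population <- rexp(100000, rate = 1)

n <- 50
# Build the sampling distribution of the mean from many samples of size n
sample_means <- replicate(5000, mean(sample(population, n)))

# Standard error estimated from ONE sample: s / sqrt(N)
one_sample <- sample(population, n)
se_formula <- sd(one_sample) / sqrt(length(one_sample))

sd(sample_means)   # empirical standard deviation of the sampling distribution
se_formula         # estimate from a single sample; should be in the same ballpark
```

Plotting sample_means (e.g., with hist()) should show a roughly bell-shaped distribution even though the population itself is heavily skewed.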

Central tendency: a generic term 
describing the centre of a frequency 
distribution of observations as 
measured by the mean, mode and 
median. 

Centring: the process of transforming 
a variable into deviations around a 
fixed point. This fixed point can be 
any value that is chosen, but typically 
a mean is used. To centre a variable 
the mean is subtracted from each 
score. See grand mean centring, 
group mean centring. 
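
A minimal sketch of both kinds of centring in base R, using five made-up scores and an invented grouping variable:

```r
scores  <- c(3, 7, 8, 10, 12)            # made-up scores
centred <- scores - mean(scores)         # grand mean centring
centred                                  # -5 -1  0  2  4
mean(centred)                            # 0

# scale() does the same job; scale = FALSE stops it also dividing by the SD
as.numeric(scale(scores, center = TRUE, scale = FALSE))

# Group mean centring: subtract each group's own mean instead
group         <- factor(c("a", "a", "b", "b", "b"))
group_centred <- scores - ave(scores, group)
group_centred                            # -2  2 -2  0  2
```
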

Chartjunk: superfluous material 
that distracts from the data being 
displayed on a graph. 

Chi-square distribution: a probability 
distribution of the sum of squares 
of several normally distributed 
variables. It tends to be used to 1) 
test hypotheses about categorical 
data, and 2) test the fit of models to 
the observed data. 

Chi-square test: although this term 
can apply to any test statistic having 
a chi-square distribution, it generally 
refers to Pearson’s chi-square 
test of the independence of two 
categorical variables. Essentially 
it tests whether two categorical 
variables forming a contingency 
table are associated. 

Cocaine: the drug of choice at Sage. 
They inject it into their eyeballs, you 
know. 

Coefficient of determination: the 
proportion of variance in one variable 
explained by a second variable. It is 
the Pearson correlation coefficient 
squared. 

Common variance: variance shared by 
two or more variables. 

Communality: the proportion of a 
variable's variance that is common 
variance. This term is used primarily 
in factor analysis. A variable that 
has no unique variance (or random 
variance) would have a communality 
of 1, whereas a variable that shares 
none of its variance with any other 
variable would have a communality 
of 0. 

Complete separation: a situation in 
logistic regression when the outcome 
variable can be perfectly predicted 
by one predictor or a combination 
of predictors. Suffice it to say this 
situation makes your computer 
have the equivalent of a nervous 
breakdown: it’ll start gibbering, 
weeping and saying it doesn’t know 
what to do. 

Component matrix: general term for 
the structure matrix in R principal 
components analysis. 

Compound symmetry: a condition that 
holds true when both the variances 
across conditions are equal (this 
is the same as the homogeneity 
of variance assumption) and the 
covariances between pairs of 
conditions are also equal. 

Confidence interval: for a given 
statistic calculated for a sample of 
observations (e.g., the mean), the 
confidence interval is a range of 
values around that statistic that are 
believed to contain, with a certain 
probability (e.g., 95%), the true value 
of that statistic (i.e., the population 
value). 

Confirmatory factor analysis (CFA): 
a version of factor analysis in which 
specific hypotheses about structure 
and relations between the latent 
variables that underlie the data are 
tested. 

Confounding variable: a variable 
(that we may or may not have 
measured) other than the predictor 
variables in which we’re interested 
that potentially affects an outcome 
variable. 

Console window: The main window 
in R. This window contains the 
command line, which can be used 
to type and execute commands, but 
it is also the window in which text 
output from executed commands is 
displayed. 

Content validity: evidence that the 
content of a test corresponds to 
the content of the construct it was 
designed to cover. 

Contingency table: a table 
representing the cross-classification 
of two or more categorical variables. 
of two or more categorical variables. 
The levels of each variable are 
arranged in a grid, and the number 
of observations falling into each 
category is noted in the cells of the 
table. For example, if we took the 
categorical variables of glossary 
(with two categories: whether an 
author was made to write a glossary 
or not), and mental state (with 
three categories: normal, sobbing 
uncontrollably and utterly psychotic), 
we could construct a table as 
below. This instantly tells us that 
127 authors who were made to 
write a glossary ended up as 
utterly psychotic, compared to 
only 2 who did not write 
a glossary. 
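
The glossary-by-mental-state table can be typed into R and tested in a few lines. This is a minimal sketch: the counts are those of the example table in this glossary, but the object and label names are invented for the illustration.

```r
# Counts from the glossary example, entered row by row
glossary_tab <- matrix(
  c(  5, 423,
     23,  46,
    127,   2),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    mental_state = c("Normal", "Sobbing uncontrollably", "Utterly psychotic"),
    glossary     = c("Made to write glossary", "No glossary")
  )
)

addmargins(glossary_tab)     # adds the row and column totals (626 overall)

# Pearson's chi-square test of independence between the two variables
test <- chisq.test(glossary_tab)
test$expected                # counts expected if the variables were unrelated
test
```
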


Continuous variable: a variable that 
can be measured to any level of 
precision. (Time is a continuous 
variable, because there is in principle 
no limit on how finely it could be 
measured.) 

Cook’s distance: a measure of the 
overall influence of a case on a 
model. Cook and Weisberg (1982) 
have suggested that values greater 
than 1 may be cause for concern. 
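
Several of the influence measures in this glossary (Cook's distance, DFBeta, DFFit and the covariance ratio) are available in base R. The sketch below uses made-up data with one deliberately odd case. Note two assumptions: R's dffits() returns a standardized version of DFFit, so the raw version is computed by hand, and the sign convention used here (predicted value with the case included minus predicted value without it) is one of the two possible choices.

```r
# Made-up data with one unusual case added at the end
set.seed(7)
x   <- c(1:10, 20)
y   <- c(2 + 0.5 * (1:10) + rnorm(10, sd = 0.3), 2)
dat <- data.frame(x, y)

fit <- lm(y ~ x, data = dat)

cooks <- cooks.distance(fit)   # overall influence of each case
dfb   <- dfbeta(fit)           # change in each b when a case is deleted
cvr   <- covratio(fit)         # covariance ratio for each case

which(cooks > 1)               # cases exceeding the rule of thumb of 1

# Raw DFFit by hand for the last case: re-estimate the model without it,
# predict its value (the adjusted predicted value), then take the difference
i             <- nrow(dat)
fit_without   <- lm(y ~ x, data = dat[-i, ])
adjusted_pred <- predict(fit_without, newdata = dat[i, ])
dffit_raw     <- fitted(fit)[i] - adjusted_pred
dffit_raw
```
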

Correlation coefficient: a measure 
of the strength of association or 
relationship between two variables. 
See Pearson's correlation coefficient, 
Spearman’s correlation coefficient, 
Kendall's tau. 

Correlational research: a form of 
research in which you observe what 
naturally goes on in the world without 
directly interfering with it. This term 
implies that data will be analysed so 
as to look at relationships between 
naturally occurring variables rather 
than making statements about cause 
and effect. Compare with cross- 
sectional research and experimental 
research. 

Counterbalancing: a process of 
systematically varying the order in 
which experimental conditions are 
conducted. In the simplest case of 
there being two conditions (A and 
B), counterbalancing simply implies 
that half of the participants complete 
condition A followed by condition B, 
whereas the remainder do condition 
B followed by condition A. The aim 
is to remove systematic bias caused 
by practice effects or boredom 
effects. 

Covariance: a measure of the ‘average’ 
relationship between two variables. 
It is the average cross-product 
deviation (i.e., the cross-product 
divided by one less than the number 
of observations). 
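
As a minimal sketch, the definition can be followed step by step in R with two short made-up variables and then checked against the built-in cov() function:

```r
x <- c(5, 4, 4, 6, 8)          # made-up scores on one variable
y <- c(8, 9, 10, 13, 15)       # made-up scores on a second variable

# Cross-product deviation for each observation
cross_products <- (x - mean(x)) * (y - mean(y))

# Average them, dividing by one less than the number of observations
covariance <- sum(cross_products) / (length(x) - 1)

covariance                     # 4.25
cov(x, y)                      # same value from the built-in function
```
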

Contingency table of glossary writing by mental state (see Contingency table): 

Mental state              Author made to write glossary   No glossary   Total 
Normal                                                5           423     428 
Sobbing uncontrollably                               23            46      69 
Utterly psychotic                                   127             2     129 
Total                                               155           471     626 

Covariance ratio (CVR): a measure 
of whether a case influences the 
variance of the parameters in a 
regression model. When this ratio 
is close to 1 the case is having very 
little influence on the variances of 
the model parameters. Belsley et al. 
(1980) recommend the following: 
if the CVR of a case is greater than 
1 + [3(k + 1)/n] then deleting that 
case will damage the precision of 
some of the model’s parameters, 
but if it is less than 1 - [3(k + 1)/n] 
then deleting the case will improve 
the precision of some of the model’s 
parameters (k is the number of 
predictors and n is the sample size). 

Covariate: a variable that has a 
relationship with (in terms of 
covariance), or has the potential to 
be related to, the outcome variable 
we’ve measured. 

Cox and Snell’s R²_CS: a version of 
the coefficient of determination 
for logistic regression. It is based 
on the log-likelihood of a model 
(LL(new)) and the log-likelihood of 
the original model (LL(baseline)), 
and the sample size, n. However, 
it is notorious for not reaching 
its maximum value of 1 (see 
Nagelkerke’s R²_N). 

CRAN (Comprehensive R Archive 
Network): a virtual warehouse that 
stores the R software, packages 
associated with it, documentation 
and code. 

Criterion validity: evidence that scores 
from an instrument correspond 
with or predict concurrent external 
measures conceptually related to the 
measured construct. 

Cronbach’s α: a measure of the 
reliability of a scale defined by 

$$\alpha = \frac{N^{2}\,\overline{\mathrm{Cov}}}{\sum s^{2}_{\mathrm{item}} + \sum \mathrm{Cov}_{\mathrm{item}}}$$

in which the top half of the equation 
is simply the number of items (N) 
squared multiplied by the average 
covariance between items (the 
average of the off-diagonal elements 
in the variance-covariance matrix). 
The bottom half is the sum of all the 
elements in the variance-covariance 
matrix. 
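
The definition translates almost word for word into R. This is a minimal sketch on a simulated four-item questionnaire (the data and object names are invented); it also checks the result against the other common way of writing the formula, α = [N/(N - 1)][1 - Σs²/Σ(all elements)], which is algebraically the same thing.

```r
set.seed(11)
# Made-up questionnaire: 100 people, 4 items driven by one common trait
trait <- rnorm(100)
items <- sapply(1:4, function(i) trait + rnorm(100))

V <- cov(items)                        # variance-covariance matrix of the items
N <- ncol(items)                       # number of items

mean_cov <- mean(V[upper.tri(V)])      # average off-diagonal element
alpha    <- (N^2 * mean_cov) / sum(V)  # bottom half: sum of ALL elements of V

# The other common form of the equation gives the same number
alpha_alt <- (N / (N - 1)) * (1 - sum(diag(V)) / sum(V))
all.equal(alpha, alpha_alt)            # TRUE
```
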

Cross-product deviations: a measure 
of the ‘total’ relationship between two 
variables. It is the deviation of one 
variable from its mean multiplied by 
the other variable's deviation from 
its mean. 

Cross-sectional research: a form 
of research in which you observe 
what naturally goes on in the world 
without directly interfering with it. 
This term specifically implies that 
data come from people at different 
age points with different people 
representing each age point. See 
also correlational research. 

Cross-validation: assessing the 
accuracy of a model across 
different samples. This is an 
important step in generalization. In 
a regression model there are two 
main methods of cross-validation: 
adjusted R² or data splitting, in 
which the data are split randomly 
into two halves, and a regression 
model is estimated for each half 
and then compared. 

Crying: what you feel like doing after 
writing statistics textbooks. 

Cubic trend: if you connected the 
means in ordered conditions with 
a line then a cubic trend is shown 
by two changes in the direction of 
this line. You must have at least four 
ordered conditions. 

Dataframe: an object containing 
variables. It differs from a matrix in 
that the variables can be of differing 
types (e.g., you can have string 
variables and numeric variables in 
the same dataframe but not in the 
same matrix). 
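
A minimal sketch of the difference, using two made-up people: the same mixed data are stored once as a matrix and once as a dataframe.

```r
# A matrix holds one type only: mixing text and numbers turns everything into text
m <- cbind(name = c("Ben", "Martin"), age = c(24, 31))
class(m[, "age"])        # "character"

# A dataframe keeps each variable's own type
df <- data.frame(name = c("Ben", "Martin"), age = c(24, 31))
class(df$age)            # "numeric"
mean(df$age)             # 27.5
```
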

Date variable: variables made up of 
dates. The data can take forms such 
as dd-mmm-yyyy (e.g., 21-Jun-1973), 
dd-mmm-yy (e.g., 21-Jun-73), 
mm/dd/yy (e.g., 06/21/73) or 
dd.mm.yyyy (e.g., 21.06.1973). 

Degrees of freedom: an impossible 
thing to define in a few pages let 
alone a few lines. Essentially it is 
the number of ‘entities’ that are free 
to vary when estimating some kind 
of statistical parameter. In a more 
practical sense, it has a bearing 
on significance tests for many 
commonly used test statistics (such 
as the F-ratio, t-test, chi-square 
statistic) and determines the exact 
form of the probability distribution for 
these test statistics. The explanation 
involving rugby players in Chapter 2 
is far more interesting... 

Deleted residual: a measure of the 
influence of a particular case of 
data, it is the difference between the 
adjusted predicted value for a case 
and the original observed value for 
that case. 

Density plot: similar to a histogram 
except that, rather than having 
a summary bar representing the 
frequency of scores, it shows each 
individual score as a dot. They can 
be useful for looking at the shape of 
a distribution of scores. 

Dependent t-test: a test using the 
t-statistic that establishes whether 
two means collected from the same 
sample (or related observations) 
differ significantly. 

Dependent variable: another 
name for outcome variable. This 
name is usually associated with 
experimental methodology (which 
is the only time it really makes 
sense) and is so called because 
it is the variable that is not 
manipulated by the experimenter 
and so its value depends on 
the variables that have been 
manipulated. To be honest I just 
use the term outcome variable all 
the time - it makes more sense (to 
me) and is less confusing. 

Deviance: the difference between the 
observed value of a variable and the 
value of that variable predicted by a 
statistical model. 

DFA: acronym for discriminant function 
analysis (see discriminant analysis). 

DFBeta: a measure of the influence 
of a case on the values of b in a 
regression model. If we estimated 
a regression parameter bᵢ and 
then deleted a particular case and 
re-estimated the same regression 
parameter bᵢ, then the difference 
between these two estimates would 
be the DFBeta for the case that was 
deleted. By looking at the values 
of the DFBetas, it is possible to 
identify cases that have a large 
influence on the parameters of the 
regression model; however, the size 
of DFBeta will depend on the units 
of measurement of the regression 
parameter. 

DFFit: a measure of the influence of a 
case. It is the difference between 
the adjusted predicted value and 
the original predicted value of a 
particular case. If a case is not 
influential then its DFFit should be 
zero - hence, we expect 
non-influential cases to have small 
DFFit values. However, we have 
the problem that this statistic 
depends on the units of 
measurement of the outcome and 
so a DFFit of 0.5 will be very small 
if the outcome ranges from 1 to 
100, but very large if the outcome 
varies from 0 to 1. 

Diagonal: a covariance structure used 
in multilevel models. In this variance 
structure variances are assumed 
to be heterogeneous and all of the 
covariances are 0. 

Dichotomous: description of a 
variable that consists of only two 
categories (e.g., the variable gender 
is dichotomous because it consists 
of only two categories: male and 
female). 

Direct oblimin: a method of oblique 
rotation. 

Discrete variable: a variable that can 
only take on certain values (usually 
whole numbers) on the scale. 

Discriminant analysis: also known as 
discriminant function analysis. This 
analysis identifies and describes 
the discriminant function variates 
of a set of variables and is useful 
as a follow-up test to MANOVA 
as a means of seeing how these 
variates allow groups of cases to be 
discriminated. 

Discriminant function variate: a linear 
combination of variables created 
such that the differences between 
group means on the transformed 
variable are maximized. It takes the 
general form: 

$$\text{variate}_{1i} = b_1 X_{1i} + b_2 X_{2i} + \dots + b_n X_{ni}$$

Discriminant score: a score for an 
individual case on a particular 
discriminant function variate obtained 
by replacing that case’s scores on 
the measured variables into the 
equation that defines the variate in 
question. 

Dummy variables: a way of recoding 
a categorical variable with more 
than two categories into a series 
of variables all of which are 
dichotomous and can take on values 
of only 0 or 1. There are seven basic 
steps to create such variables: (1) 
count the number of groups you 
want to recode and subtract 1; (2) 
create as many new variables as 
the value you calculated in step 1 
(these are your dummy variables); 
(3) choose one of your groups as 
a baseline (i.e., a group against 
which all other groups should be 
compared, such as a control group); 
(4) assign that baseline group values 
of 0 for all of your dummy variables; 
(5) for your first dummy variable, 
assign the value 1 to the first group 
that you want to compare against 
the baseline group (assign all other 
groups 0 for this variable); (6) for 
the second dummy variable assign 
the value 1 to the second group 
that you want to compare against 
the baseline group (assign all other 
groups 0 for this variable); (7) repeat 
this process until you run out of 
dummy variables. 
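
The seven steps can be sketched in R as follows. The grouping variable and its three category labels are made up for the example, with "control" chosen as the baseline. R's default treatment contrasts produce the same coding automatically, which the last two lines show.

```r
# Made-up grouping variable with three categories; "control" is listed first
group <- factor(c("control", "low", "high", "control", "high", "low"),
                levels = c("control", "low", "high"))

n_dummies <- nlevels(group) - 1           # step 1: number of groups minus 1

# Steps 2 to 7 by hand: the baseline ("control") scores 0 on both dummies
dummy_low  <- as.numeric(group == "low")
dummy_high <- as.numeric(group == "high")
data.frame(group, dummy_low, dummy_high)

# R's default treatment contrasts build the same coding automatically
contrasts(group)
model.matrix(~ group)
```
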

Durbin-Watson test: tests for serial 
correlations between errors in 
regression models. Specifically, it 
tests whether adjacent residuals 
are correlated, which is useful 
in assessing the assumption 
of independent errors. The test 
statistic can vary between 0 and 
4, with a value of 2 meaning that 
the residuals are uncorrelated. 
A value greater than 2 indicates 
a negative correlation between 
adjacent residuals, whereas a 
value below 2 indicates a positive 
correlation. The size of the 
Durbin-Watson statistic depends upon the 
number of predictors in the model 
and the number of observations. 
For accuracy, look up the exact 
acceptable values in Durbin and 
Watson’s (1951) original paper. As 
a very conservative rule of thumb, 
values less than 1 or greater than 
3 are definitely cause for concern; 
however, values closer to 2 may still 
be problematic depending on the 
sample and model. 
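
The statistic itself is simple enough to compute by hand, which shows why 2 is the magic number. This is a minimal sketch on simulated data with independent errors (the data are made up, and the by-hand formula is the standard one: the sum of squared differences between adjacent residuals divided by the sum of squared residuals). Packaged versions with significance tests exist, for example durbinWatsonTest() in the car package.

```r
set.seed(3)
x   <- 1:50
y   <- 1 + 0.2 * x + rnorm(50)         # made-up data with independent errors
fit <- lm(y ~ x)

e  <- residuals(fit)
dw <- sum(diff(e)^2) / sum(e^2)        # Durbin-Watson statistic
dw                                     # between 0 and 4; should be fairly close to 2 here

# Roughly, dw is about 2 * (1 - r), where r is the correlation between
# adjacent residuals: r = 0 gives 2, positive r pushes the value below 2
r_lag1 <- cor(e[-1], e[-length(e)])
2 * (1 - r_lag1)
```
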

Ecological validity: evidence that 
the results of a study, experiment 
or test can be applied, and allow 
inferences, to real-world conditions. 

Eel: long, snakelike, scaleless fish that 
lacks pelvic fins. From the order 
Anguilliformes or Apodes, they 
should probably not be inserted into 
your anus to cure constipation (or for 
any other reason). 

Editor window: The editor window in 
R is a basic text editor that enables 
you to collect together commands 
into a file rather than executing them 
individually through the command 
line in the console window. 

Effect size: an objective and (usually) 
standardized measure of the 
magnitude of an observed effect. 
Measures include Cohen’s d, 
Glass’s g and Pearson’s correlation 
coefficient, r. 

Error bar chart: a graphical 
representation of the mean of a 
set of observations that includes 
the 95% confidence interval of 
the mean. The mean is usually 
represented as a circle, square or 
rectangle at the value of the mean 
(or a bar extending to the value of 
the mean). The confidence interval is 
represented by a line protruding from 
the mean (upwards, downwards 
or both) to a short horizontal line 
representing the limits of the 
confidence interval. Error bars can 
be drawn using the standard error 
or standard deviation instead of the 
95% confidence interval. 

Error SSCP (E): the error sum of 
squares and cross-product matrix. 
This is a sum of squares and 
cross-product matrix for the error 
in a predictive linear model fitted to 
multivariate data. It represents the 
unsystematic variance and is the 
multivariate equivalent of the residual 
sum of squares. 

Eta squared (η²): an effect size 
measure that is the ratio of the 
model sum of squares to the total 
sum of squares. So, in essence, 
the coefficient of determination by 
another name. It doesn’t have an 
awful lot going for it: not only is it 
biased, but it typically measures 
the overall effect of an ANOVA 
and effect sizes are more easily 
interpreted when they reflect specific 
comparisons (e.g., the difference 
between two means). 
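
A minimal sketch of the calculation in R, using a small made-up three-group data set. It pulls the sums of squares out of a one-way ANOVA and confirms that eta squared is the same number as R² from the equivalent linear model.

```r
# Made-up three-group experiment, five scores per group
dat <- data.frame(
  group = factor(rep(c("placebo", "low", "high"), each = 5)),
  score = c(3, 2, 1, 1, 4,   5, 2, 4, 2, 3,   7, 4, 5, 3, 6)
)

fit <- aov(score ~ group, data = dat)
ss  <- summary(fit)[[1]][["Sum Sq"]]    # model SS first, residual SS second

eta_sq <- ss[1] / sum(ss)               # model SS divided by total SS
eta_sq                                  # about 0.46

# Same value as R-squared from the equivalent linear model
summary(lm(score ~ group, data = dat))$r.squared
```
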

Experimental hypothesis: synonym 
for alternative hypothesis. 

Experimental research: a form 
of research in which one or 
more variables is systematically 
manipulated to see their effect 
(alone or in combination) on an 
outcome variable. This term implies 
that data will be able to be used 
to make statements about cause 
and effect. Compare with cross- 
sectional research and correlational 
research. 

Experimentwise error rate: the 
probability of making a Type I error 
in an experiment involving one or 
more statistical comparisons when 
the null hypothesis is true in each 
case. 

Extraction: a term used for the process 
of deciding whether a factor in factor 
analysis is statistically important 
enough to ‘extract’ from the data 
and interpret. The decision is based 
on the magnitude of the eigenvalue 
associated with the factor. See 
Kaiser’s criterion, scree plot. 

F_max: see Hartley’s F_max. 



F-ratio: a test statistic with a known 
probability distribution (the 
F-distribution). It is the ratio of the 
average variability in the data that 
a given model can explain to the 
average variability unexplained 
by that same model. It is used to 
test the overall fit of the model in 
simple regression and multiple 
regression, and to test for overall 
differences between group means in 
experiments. 

Factor: another name for an 
independent variable or predictor 
that's typically used when describing 
experimental designs. However, to 
add to the confusion, it is also used 
synonymously with latent variable in 
factor analysis. 

Factor analysis: a multivariate 
technique for identifying whether 
the correlations between a set of 
observed variables stem from their 
relationship to one or more latent 
variables in the data, each of which 
takes the form of a linear model. 

Factor matrix: general term for the 
structure matrix in factor analysis. 

Factor loading: the regression 
coefficient of a variable for the linear 
model that describes a latent variable 
or factor in factor analysis. 

Factor scores: a single score from 
an individual entity representing 
their performance on some latent 
variable. The score can be crudely 
conceptualized as follows: take 
an entity’s score on each of the 
variables that make up the factor 
and multiply it by the corresponding 
factor loading for the variable, then 
add these values up (or average 
them). 
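
The crude recipe above can be sketched in a few lines of Python. The scores and loadings are made-up numbers for one hypothetical entity and one factor, and real factor-score methods are more refined than this weighted sum.

```python
# Hypothetical data: one entity's scores on three variables and the
# (equally hypothetical) loadings of those variables on a single factor.
scores = [4.0, 3.0, 5.0]
loadings = [0.8, 0.6, 0.7]

# Multiply each score by its loading, then add up (or average) the products.
weighted = [score * loading for score, loading in zip(scores, loadings)]
factor_score_sum = sum(weighted)
factor_score_mean = factor_score_sum / len(weighted)

print(factor_score_sum, factor_score_mean)
```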

Factor transformation matrix, Λ:
a matrix used in factor analysis. It
can be thought of as containing the 
angles through which factors are 
rotated in factor rotation. 

Factorial ANOVA: an analysis of 
variance involving two or more 
independent variables or predictors. 

Falsification: the act of disproving a 
hypothesis or theory. 

Familywise error rate: the probability 
of making a Type I error in any family 
of tests when the null hypothesis 
is true in each case. The ‘family of 
tests’ can be loosely defined as a 
set of tests conducted on the same 
data set and addressing the same 
empirical question. 
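
To get a feel for how quickly this error rate grows, here is a minimal Python sketch. It assumes the tests are independent and that each uses the same alpha level; under that assumption the rate is 1 - (1 - alpha)^n. The function name is made up for illustration.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one Type I error across n_tests
    independent tests, each run at the given alpha (an assumption)."""
    return 1 - (1 - alpha) ** n_tests

# The rate climbs as more tests are added to the family.
for n_tests in (1, 3, 10):
    print(n_tests, round(familywise_error_rate(0.05, n_tests), 3))
```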

Fisher’s exact test: Fisher’s exact 
test (Fisher, 1922) is not so much
a test as a way of computing the
exact probability of a statistic. It was 
designed originally to overcome the 
problem that with small samples 
the sampling distribution of the chi- 
square statistic deviates substantially 
from a chi-square distribution. It 
should be used with small samples. 

Fit: how sexually attractive you find a 
statistical test. Alternatively, it's the 
degree to which a statistical model is 
an accurate representation of some 
observed data. (Incidentally, it’s just 
plain wrong to find statistical tests 
sexually attractive.) 

Fixed coefficient: a coefficient or 
model parameter that is fixed; that 
is, it cannot vary over situations or 
contexts (cf. random coefficient). 

Fixed effect: An effect in an experiment 
is said to be a fixed effect if all 
possible treatment conditions that 
a researcher is interested in are 
present in the experiment. Fixed 
effects can be generalized only to 
the situations in the experiment.
For example, the effect is fixed if
we say that we are interested only 
in the conditions that we had in 
our experiment (e.g., placebo, low 
dose and high dose) and we can 
generalize our findings only to the 
situation of a placebo, low dose and 
high dose. 

Fixed intercept: A term used in
multilevel modelling to denote when
the intercept in the model is fixed. 
That is, it is not free to vary across 
different groups or contexts (cf. 
random intercept). 

Fixed slope: A term used in multilevel 
modelling to denote when the slope 
of the model is fixed. That is, it is not 
free to vary across different groups 
or contexts (cf. random slope). 

Fixed variable: A fixed variable is one 
that is not supposed to change over 
time (e.g., for most people their 
gender is a fixed variable - it never 
changes). 

Frequency distribution: a graph 
plotting values of observations 
on the horizontal axis, and the 
frequency with which each value 
occurs in the data set on the vertical 
axis (a.k.a. histogram). 

Friedman’s ANOVA: a non-parametric 
test of whether more than two related 
groups differ. It is the non-parametric 
version of one-way repeated- 
measures ANOVA. 

Generalization: the ability of a 
statistical model to say something
beyond the set of observations that
spawned it. If a model generalizes 
it is assumed that predictions from 
that model can be applied not just to 
the sample on which it is based, but 
to a wider population from which the 
sample came. 

Glossary: a collection of grossly 
inaccurate definitions (written late 
at night when you really ought to be 
asleep) of things that you thought 
you understood until some evil book 
publisher forced you to try to define 
them. 

Goodness of fit: an index of how well a 
model fits the data from which it was 
generated. It’s usually based on how 
well the data predicted by the model 
correspond to the data that were 
actually collected. 

Grand mean: the mean of an entire set 
of observations. 

Grand mean centring: grand mean 
centring means the transformation of 
a variable by taking each score and 
subtracting the mean of all scores 
(for that variable) from it (cf. group 
mean centring). 

Grand variance: the variance within an 
entire set of observations. 

Graphics window: the window in which 
graphics or graphs appear (this 
window is labelled Quartz in MacOS). 

Greenhouse-Geisser correction: 
an estimate of the departure from 
sphericity. The maximum value is 
1 (the data completely meet the 
assumption of sphericity) and 
the minimum is the lower bound. 
Values below 1 indicate departures 
from sphericity and are used to 
correct the degrees of freedom 
associated with the corresponding 
F-ratios by multiplying them by 
the value of the estimate. Some 
say the Greenhouse-Geisser 
correction is too conservative (strict) 
and recommend the Huynh-Feldt
correction instead. 
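
The correction itself is just a multiplication, as this small Python sketch shows. The estimate of 0.75 and the degrees of freedom are hypothetical values, and the function name is made up for illustration.

```python
def corrected_df(df_model, df_residual, epsilon):
    """Multiply both degrees of freedom by the sphericity estimate."""
    return df_model * epsilon, df_residual * epsilon

# Hypothetical example: 3 conditions and 10 participants give
# df = (2, 18); an estimate of 0.75 shrinks both.
print(corrected_df(2, 18, 0.75))
```

An estimate of 1 leaves the degrees of freedom untouched, which matches the case in which the data completely meet the assumption.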

Group mean centring: group mean 
centring is to transform a variable by 
taking each score and subtracting 
from it the mean of the scores (for 
that variable) for the group to which 
that score belongs (cf. grand mean 
centring). 
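
The difference between grand mean centring and group mean centring is easiest to see side by side. This is a minimal Python sketch with made-up scores and group labels.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical scores and the group each score belongs to.
scores = [2.0, 4.0, 6.0, 10.0, 12.0, 14.0]
groups = ["a", "a", "a", "b", "b", "b"]

# Grand mean centring: subtract the mean of all scores.
grand_mean = mean(scores)
grand_centred = [x - grand_mean for x in scores]

# Group mean centring: subtract the mean of the score's own group.
by_group = defaultdict(list)
for x, g in zip(scores, groups):
    by_group[g].append(x)
group_means = {g: mean(values) for g, values in by_group.items()}
group_centred = [x - group_means[g] for x, g in zip(scores, groups)]

print(grand_centred)
print(group_centred)
```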

Growth curve: a curve that summarizes 
the change in some outcome over 
time. See polynomial. 

Harmonic mean: a weighted version 
of the mean that takes account of 
the relationship between variance 
and sample size. It is calculated
by summing the reciprocal of all
observations, then dividing by
the number of observations. The
reciprocal of the end product is the
harmonic mean:

$H = \dfrac{1}{\frac{1}{n}\sum_{i=1}^{n}\frac{1}{x_i}}$
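
The verbal recipe can be followed step by step in Python. The values are made up, and the standard library's `statistics.harmonic_mean` is used only as a cross-check.

```python
from statistics import harmonic_mean

values = [2.0, 4.0, 4.0]  # hypothetical observations

# Step 1: sum the reciprocals of all observations.
reciprocal_sum = sum(1 / x for x in values)
# Step 2: divide by the number of observations.
mean_reciprocal = reciprocal_sum / len(values)
# Step 3: take the reciprocal of the end product.
h = 1 / mean_reciprocal

print(h, harmonic_mean(values))
```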


Hartley’s F_max: also known as the
variance ratio, is the ratio of the 
variances between the group with 
the biggest variance and the group 
with the smallest variance. This ratio 
is compared to critical values in a 
table published by Hartley as a test 
of homogeneity of variance. Some 
general rules are that with sample 
sizes (n) of 10 per group, an F_max less
than 10 is more or less always going 
to be non-significant, with 15-20 per 
group the ratio needs to be less than 
about 5, and with samples of 30-60 
the ratio should be below about 2 
or 3. 
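
Computing the ratio is simple, as this Python sketch with three made-up groups shows. It stops at the ratio itself; comparing it with Hartley's critical values is not attempted here.

```python
from statistics import variance

# Hypothetical scores for three groups.
groups = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [1.0, 1.0, 2.0, 3.0, 3.0],
}

# Sample variance of each group, then biggest divided by smallest.
variances = {name: variance(values) for name, values in groups.items()}
f_max = max(variances.values()) / min(variances.values())

print(variances, f_max)
```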

Hat values: another name for leverage. 

HE⁻¹: this is a matrix that is functionally
equivalent to the hypothesis 
SSCP divided by the error SSCP 
in MANOVA. Conceptually it 
represents the ratio of systematic 
to unsystematic variance, so is a 
multivariate analogue of the F-ratio. 

Helmert contrast: an orthogonal
planned contrast that compares 
the mean of each condition (except 
the last) to the overall mean of all 
subsequent conditions combined. 

Heterogeneity of variance: the
opposite of homogeneity of variance.
This term means that the variance of 
one variable varies (i.e., is different) 
across levels of another variable. 

Heteroscedasticity: the opposite of 
homoscedasticity. This occurs when 
the residuals at each level of the 
predictor variable(s) have unequal
variances. Put another way, at each 
point along any predictor variable, 
the spread of residuals is different. 

Hierarchical regression: a method of 
multiple regression in which the order 
in which predictors are entered into 
the regression model is determined 
by the researcher based on previous 
research: variables already known to 
be predictors are entered first, new 
variables are entered subsequently. 

Histogram: a frequency distribution. 

Homogeneity of covariance 
matrices: an assumption of 
some multivariate tests such as
MANOVA. It is an extension of the
homogeneity of variance assumption 
in univariate analyses. However, as 
well as assuming that variances for 
each dependent variable are the 
same across groups, it assumes 
that relationships (covariances)
between these dependent variables 
are roughly equal. It is tested by 
comparing the population variance- 
covariance matrices of the different 
groups in the analysis. 

Homogeneity of regression slopes: 
an assumption of analysis of 
covariance. This is the assumption 
that the relationship between the 
covariate and outcome variable is 
constant across different treatment 
levels. So, if we had three treatment 
conditions, if there's a positive 
relationship between the covariate 
and the outcome in one group, we 
assume that there is a similar-sized 
positive relationship between the 
covariate and outcome in the other 
two groups too. 

Homogeneity of variance: the
assumption that the variance of
one variable is stable (i.e., relatively 
similar) at all levels of another 
variable. 

Homoscedasticity: an assumption in 
regression analysis that the residuals 
at each level of the predictor 
variable(s) have similar variances.
Put another way, at each point along 
any predictor variable, the spread of 
residuals should be fairly constant. 

Hosmer and Lemeshow’s R²_L:
a version of the coefficient of
determination for logistic regression.
It is a fairly literal translation in that it
is the −2LL for the model divided by
the original −2LL - in other words,
it’s the ratio of what the model can 
explain compared to what there was 
to explain in the first place! 

Hotelling-Lawley trace (T²): a test
statistic in MANOVA. It is the sum of 
the eigenvalues for each discriminant 
function variate of the data and so 
is conceptually the same as the 
F-ratio in ANOVA. It is the sum of the
ratio of systematic and unsystematic
variance (SS_M/SS_R) for each of the
variates. 

Huynh-Feldt correction: an estimate
of the departure from sphericity.
The maximum value is 1 (the data
completely meet the assumption
of sphericity). Values below this
indicate departures from sphericity
and are used to correct the degrees
of freedom associated with the
corresponding F-ratios by multiplying
them by the value of the estimate.
It is less conservative than the
Greenhouse-Geisser estimate, but
some say it is too liberal.

Hypothesis: a prediction about the 
state of the world (see experimental 
hypothesis and null hypothesis). 

Hypothesis SSCP (H): the hypothesis 
sum of squares and cross-product 
matrix. This is a sum of squares and 
cross-product matrix for a predictive 
linear model fitted to multivariate 
data. It represents the systematic 
variance and is the multivariate 
equivalent of the model sum of 
squares. 

Identity matrix: a square matrix (i.e., 
with the same number of rows and 
columns) in which the diagonal 
elements are equal to 1, and the 
off-diagonal elements are equal to 0.

Independence: the assumption that
one data point does not influence 
another. When data come from 
people, it basically means that the 
behaviour of one person does not 
influence the behaviour of another. 

Independent ANOVA: analysis of 
variance conducted on any design 
in which all independent variables or 
predictors have been manipulated 
using different participants (i.e., all 
data come from different entities). 

Independent design: an experimental 
design in which different treatment 
conditions utilize different organisms 
(e.g., in psychology, this would mean 
using different people in different 
treatment conditions) and so the 
resulting data are independent 
(a.k.a. between-group or between- 
subject designs). 

Independent errors: for any two 
observations in regression the 
residuals should be uncorrelated (or 
independent). 

Independent factorial design: an
experimental design incorporating
two or more predictors (or 
independent variables), all of which 
have been manipulated using 
different participants (or whatever 
entities are being tested).

Independent t-test: a test using
the t-statistic that establishes 
whether two means collected 
from independent samples differ 
significantly. 

Independent variable: another name 
for a predictor variable. This name is 
usually associated with experimental 
methodology (which is the only time 
it makes sense) and is so called 
because it is the variable that is 
manipulated by the experimenter 
and so its value does not depend 
on any other variables (just on 
the experimenter). I just use the 
term predictor variable all the time 
because the meaning of the term 
is not constrained to a particular 
methodology. 

Interaction effect: the combined effect 
of two or more predictor variables on 
an outcome variable. 

Interaction graph: a graph showing the 
means of two or more independent 
variables in which means of one 
variable are shown at different levels 
of the other variable. Usually the
means are connected with lines, 
or are displayed as bars. These 
graphs are used to help understand 
interaction effects. 

Interquartile range: the limits within 
which the middle 50% of an ordered 
set of observations falls. It is the 
difference between the value of the 
upper quartile and lower quartile. 

Interval data: data measured on a 
scale along the whole of which 
intervals are equal. For example, 
people’s ratings of this book on 
Amazon.com can range from 1 to 
5; for these data to be interval it 
should be true that the increase 
in appreciation for this book 
represented by a change from 3 
to 4 along the scale should be the 
same as the change in appreciation 
represented by a change from 1 to 
2, or 4 to 5. 

Interval variable: a variable consisting 
of interval data. 

Intraclass correlation (ICC): a
correlation coefficient that assesses
the consistency between measures 
of the same class (i.e., measures 
of the same thing). (Cf. Pearson’s 
correlation coefficient, which 
measures the relationship between 
variables of a different class.) Two 
common uses are in comparing 
paired data (such as twins) on the 
same measure, and assessing the 
consistency between judges’ ratings
of a set of objects. The calculation
of these correlations depends on 
whether a measure of consistency 
(in which the order of scores from 
a source is considered, but not 
the actual value around which the 
scores are anchored) or absolute 
agreement (in which both the order 
of scores and the relative values are 
considered) is required, and whether 
the scores represent averages of 
many measures or just a single 
measure. This measure is also 
used in multilevel linear models to 
measure the dependency in data 
within the same context. 

Jonckheere-Terpstra test: this
statistic tests for an ordered pattern
of medians across independent 
groups. Essentially it does the same 
thing as the Kruskal-Wallis test 
(i.e., test for a difference between 
the medians of the groups), but 
it incorporates information about 
whether the order of the groups is 
meaningful. As such, you should 
use this test when you expect 
the groups you’re comparing to 
produce a meaningful order of 
medians. 

Kaiser-Meyer-Olkin (KMO) measure
of sampling adequacy: the KMO
can be calculated for individual and
multiple variables and represents 
the ratio of the squared correlation 
between variables to the squared 
partial correlation between variables. 
It varies between 0 and 1: a value 
of 0 indicates that the sum of partial 
correlations is large relative to the 
sum of correlations, indicating 
diffusion in the pattern of correlations 
(hence, factor analysis is likely to be 
inappropriate); a value close to 1 
indicates that patterns of correlations 
are relatively compact and so factor 
analysis should yield distinct and 
reliable factors. Values between .5 
and .7 are mediocre, values between 
.7 and .8 are good, values between 
.8 and .9 are great and values above 
.9 are superb (see Hutcheson &
Sofroniou, 1999). 

Kaiser’s criterion: a method of 
extraction in factor analysis based 
on the idea of retaining factors with 
associated eigenvalues greater 
than 1. This method appears to 
be accurate when the number of 
variables in the analysis is less than 
30 and the resulting communalities 
(after extraction) are all greater than 
.7, or when the sample size exceeds
250 and the average communality is
greater than or equal to .6. 
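
Applying the criterion amounts to a simple filter, sketched here in Python. The eigenvalues are made up; in a real analysis they would come from the factor analysis itself.

```python
# Hypothetical eigenvalues for five factors, largest first.
eigenvalues = [3.2, 1.4, 0.9, 0.3, 0.2]

# Kaiser's criterion: retain factors whose eigenvalue is greater than 1.
retained = [(factor, value)
            for factor, value in enumerate(eigenvalues, start=1)
            if value > 1]

print(retained)
```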

Kendall’s tau: a non-parametric 
correlation coefficient similar to 
Spearman's correlation coefficient, 
but which should be used in 
preference for a small data set with a 
large number of tied ranks. 

Kruskal-Wallis test: non-parametric 
test of whether more than two 
independent groups differ. It is the 
non-parametric version of one-way 
independent ANOVA. 

Kurtosis: this measures the degree to 
which scores cluster in the tails of a 
frequency distribution. A distribution 
with positive kurtosis (leptokurtic,
kurtosis > 0) has too many scores in 
the tails and is too peaked, whereas 
a distribution with negative kurtosis 
(platykurtic, kurtosis < 0) has too few 
scores in the tails and is quite flat. 

Latent variable: a variable that 
cannot be directly measured, but 
is assumed to be related to several 
variables that can be measured. 

Leptokurtic: see kurtosis. 

Levels of measurement: the
relationship between what is being
measured and the numbers obtained 
on a scale. 

Levene’s test: tests the hypothesis that 
the variances in different groups are 
equal (i.e., the difference between 
the variances is zero). It basically 
does a one-way ANOVA on the 
deviations (i.e., the absolute value 
of the difference between each 
score and the mean of its group). A 
significant result indicates that the 
variances are significantly different 
- therefore, the assumption of 
homogeneity of variances has been 
violated. When sample sizes are
large, small differences in group 
variances can produce a significant 
Levene’s test and so the variance 
ratio is a useful double-check. 
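
The idea of 'a one-way ANOVA on the deviations' can be sketched in plain Python. This is an illustrative version under simple assumptions (deviations taken from group means, no p-value computed), and the function names are made up; it is not a replacement for a proper statistical routine.

```python
from statistics import mean

def one_way_f(groups):
    """F-ratio for a one-way independent ANOVA on a list of groups."""
    all_values = [x for g in groups for x in g]
    grand = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

def levene_f(groups):
    """Run the ANOVA on absolute deviations from each group's mean."""
    deviations = [[abs(x - mean(g)) for x in g] for g in groups]
    return one_way_f(deviations)

# Hypothetical groups: same spread in the first pair, different in the second.
equal_spread = [[1.0, 2.0, 3.0], [11.0, 12.0, 13.0]]
unequal_spread = [[1.0, 2.0, 3.0], [0.0, 10.0, 20.0]]
print(levene_f(equal_spread), levene_f(unequal_spread))
```

When the groups have the same spread the deviations have the same mean, so the F-ratio is zero however far apart the group means are.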

Leverage: leverage statistics (or hat 
values) gauge the influence of the 
observed value of the outcome 
variable over the predicted values. 
The average leverage value is 
(k+1)/n in which k is the number of 
predictors in the model and n is the 
number of participants. Leverage 
values can lie between 0 (the case 
has no influence whatsoever) and 
1 (the case has complete influence 
over prediction). If no cases exert 
undue influence over the model then 
we would expect all of the leverage
values to be close to the average
value. Hoaglin and Welsch (1978)
recommend investigating cases
with values greater than twice the
average (2(k + 1)/n) and Stevens
(2002) recommends using three
times the average (3(k + 1)/n) as
a cut-off point for identifying cases
having undue influence.
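
The two cut-offs are easy to compute, as in this Python sketch. The hat values, the number of predictors and the sample size are made up, and the function name is hypothetical.

```python
def leverage_cutoffs(k, n):
    """Average leverage (k + 1)/n plus the twice- and thrice-average cut-offs."""
    average = (k + 1) / n
    return average, 2 * average, 3 * average

# Hypothetical model with 3 predictors and 100 participants.
average, twice, thrice = leverage_cutoffs(k=3, n=100)

# Hypothetical hat values for four cases.
hat_values = [0.03, 0.05, 0.09, 0.13]
flagged_twice = [h for h in hat_values if h > twice]
flagged_thrice = [h for h in hat_values if h > thrice]

print(average, flagged_twice, flagged_thrice)
```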

Likelihood: the probability of obtaining 
a set of observations given the 
parameters of a model fitted to those 
observations. 

Linear model: a model that is based 
upon a straight line. 

Line chart: a graph in which a summary 
statistic (usually the mean) is 
plotted on the y-axis against a 
categorical variable on the x-axis 
(this categorical variable could 
represent, for example, groups of 
people, different times or different 
experimental conditions). The value 
of the mean for each category is 
shown by a symbol, and means 
across categories are connected 
by a line. Different-coloured lines 
may be used to represent levels of a 
second categorical variable. 

Logistic regression: a version of 
multiple regression in which the 
outcome is a categorical variable. If 
the categorical variable has exactly 
two categories the analysis is called 
binary logistic regression, and when 
the outcome has more than two 
categories it is called multinomial 
logistic regression. 

Log-likelihood: a measure of error, or 
unexplained variation, in categorical 
models. It is based on summing 
the probabilities associated with the 
predicted and actual outcomes and 
is analogous to the residual sum 
of squares in multiple regression in 
that it is an indicator of how much 
unexplained information there is after 
the model has been fitted. Large 
values of the log-likelihood statistic 
indicate poorly fitting statistical 
models, because the larger the 
value of the log-likelihood, the more 
unexplained observations there are. 
The log-likelihood is the logarithm of 
the likelihood. 

Loglinear analysis: a procedure used 
as an extension of the chi-square 
test to analyse situations in which 
we have more than two categorical 
variables and we want to test 
for relationships between these 
variables. Essentially, a linear model 
is fitted to the data that predicts 
expected frequencies (i.e., the
number of cases expected in a given
category). In this respect it is much 
the same as analysis of variance but 
for entirely categorical data. 

Long format data (a.k.a. ‘molten’ 
data): data that are arranged 
such that levels of independent or 
predictor variables are differentiated 
by different rows in a dataframe. As 
such, outcome variable scores are 
contained in a single column of data 
with rows containing information 
about the attributes of those scores. 
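
As a rough illustration, here is a tiny Python sketch that turns 'wide' records (one row per entity) into long format. The column names and scores are made up, and plain dictionaries stand in for a dataframe.

```python
# Hypothetical wide data: one row per participant, one column per condition.
wide = [
    {"id": 1, "placebo": 3, "drug": 5},
    {"id": 2, "placebo": 2, "drug": 6},
]

# Long format: one row per score; 'condition' records which level of the
# predictor the score belongs to, and all scores sit in a single column.
long = [
    {"id": row["id"], "condition": condition, "score": row[condition]}
    for row in wide
    for condition in ("placebo", "drug")
]

for row in long:
    print(row)
```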

Lower bound: the name given to 
the lowest possible value of the 
Greenhouse-Geisser estimate of 
sphericity. Its value is 1/(k-1), in 
which k is the number of treatment 
conditions. 

Lower quartile: the value that cuts off 
the lowest 25% of the data. If the 
data are ordered and then divided 
into two halves at the median, then 
the lower quartile is the median of 
the lower half of the scores. 
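
The 'median of the lower half' recipe can be written out in Python as below. Two things are assumptions made for this sketch: the upper quartile is computed symmetrically from the upper half, and with an odd number of scores the middle score is left out of both halves. The function name is made up.

```python
from statistics import median

def quartiles(data):
    """Lower and upper quartiles as medians of the two halves."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    # With an odd number of scores, skip the middle one (an assumption).
    upper_half = ordered[half:] if len(ordered) % 2 == 0 else ordered[half + 1:]
    return median(lower_half), median(upper_half)

scores = [12, 1, 7, 3, 10, 4, 8, 6]  # hypothetical observations
lower, upper = quartiles(scores)
print(lower, upper, upper - lower)  # the last value is the interquartile range
```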

M-estimator: a robust measure of 
location. One example is the median. 
In some cases it is a measure of 
location computed after outliers have 
been removed: unlike the trimmed 
mean, the amount of trimming used 
to remove outliers is determined 
empirically. 

Main effect: the unique effect of a 
predictor variable (or independent 
variable) on an outcome variable. The 
term is usually used in the context of 
ANOVA. 

Mann-Whitney test: a non-parametric
test that looks for differences 
between two independent samples. 
That is, it tests whether the 
populations from which two samples 
are drawn have the same location.
It is functionally the same as
Wilcoxon's rank-sum test, and both 
tests are non-parametric equivalents 
of the independent t-test. 

MANOVA: acronym for multivariate 
analysis of variance. 

Matrix: a collection of items (usually 
numbers) arranged in columns and 
rows. The values within a matrix are 
typically referred to as components 
or elements. 

Mauchly’s test: a test of the
assumption of sphericity. If this test
is significant then the assumption 
of sphericity has not been met 
and an appropriate correction 
must be applied to the degrees of 
freedom of the F-ratio in repeated-
measures ANOVA. The test works by
comparing the variance-covariance
matrix of the data to an identity
matrix; if the variance-covariance
matrix is a scalar multiple of an 
identity matrix then sphericity is met. 

Maximum-likelihood estimation: 
a way of estimating statistical 
parameters by choosing the 
parameters that make the data 
most likely to have happened. 
Imagine for a set of parameters that 
we calculated the probability (or 
likelihood) of getting the observed 
data; if this probability was high then 
these particular parameters yield a 
good fit of the data, but conversely 
if the probability was low, these 
parameters are a bad fit of our data. 
Maximum-likelihood estimation 
chooses the parameters that 
maximize the probability. 
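
A toy version of this idea, sketched in Python: suppose a coin lands heads 7 times in 10 tosses (made-up data), and we try a grid of candidate values for the probability of heads. The grid search is only for illustration; real software uses cleverer optimization.

```python
heads, tosses = 7, 10  # hypothetical observed data

def likelihood(p):
    """Likelihood of the observed tosses if P(heads) = p
    (the constant binomial coefficient is omitted)."""
    return p ** heads * (1 - p) ** (tosses - heads)

# Candidate parameter values 0.01, 0.02, ..., 0.99.
candidates = [i / 100 for i in range(1, 100)]

# Maximum-likelihood estimate: the candidate that makes the data most likely.
best_p = max(candidates, key=likelihood)
print(best_p)
```

The winning candidate is the observed proportion of heads, which is what we would expect for this simple model.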

McNemar’s test: This tests differences 
between two related groups (see 
Wilcoxon signed-rank test and 
sign test), when nominal data 
have been used. It’s typically used 
when we're looking for changes in 
people's scores and it compares the 
proportion of people who changed 
their response in one direction (i.e., 
scores increased) to those who 
changed in the opposite direction 
(scores decreased). So, this test 
needs to be used when we’ve got 
two related dichotomous variables. 

Mean: a simple statistical model of the 
centre of a distribution of scores. A 
hypothetical estimate of the ‘typical’ 
score. 

Mean squares: a measure of average 
variability. For every sum of squares 
(which measures the total variability)
it is possible to create mean squares 
by dividing by the number of things 
used to calculate the sum of squares 
(or some function of it). 

Measurement error: the discrepancy 
between the numbers used to 
represent the thing that we’re 
measuring and the actual value 
of the thing we’re measuring (i.e., 
the value we would get if we could 
measure it directly). 

Median: the middle score of a set of 
ordered observations. When there 
is an even number of observations 
the median is the average of the two 
scores that fall either side of what 
would be the middle value. 

Meta-analysis: this is a statistical 
procedure for assimilating research 
findings. It is based on the simple 
idea that we can take effect sizes
from individual studies that research
the same question, quantify the 
observed effect in a standard 
way (using effect sizes) and then 
combine these effects to get a more 
accurate idea of the true effect in the 
population. 

Mixed ANOVA: analysis of variance 
used for a mixed design. 

Mixed design: an experimental design 
incorporating two or more predictors 
(or independent variables) at least 
one of which has been manipulated 
using different participants (or 
whatever entities are being tested) 
and at least one of which has 
been manipulated using the same 
participants (or entities). Also known 
as a split-plot design because Fisher 
developed ANOVA for analysing 
agricultural data involving ‘plots’ of 
land containing crops. 

Mode: the most frequently occurring 
score in a set of data. 

Model sum of squares: a measure 
of the total amount of variability for 
which a model can account. It is the 
difference between the total sum 
of squares and the residual sum of 
squares. 
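
The subtraction can be illustrated with a few made-up numbers in Python. The predicted values below are from a hypothetical straight-line model fitted to the observed scores; they are hard-coded rather than estimated here.

```python
from statistics import mean

# Hypothetical observed outcomes and the values a fitted model predicts.
observed = [2.0, 4.0, 5.0, 4.0, 5.0]
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]

grand_mean = mean(observed)

# Total variability: squared deviations of the observations from the mean.
ss_total = sum((y - grand_mean) ** 2 for y in observed)
# Unexplained variability: squared deviations from the model's predictions.
ss_residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
# Model sum of squares: what is left after removing the unexplained part.
ss_model = ss_total - ss_residual

print(ss_total, ss_residual, ss_model)
```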

Molten data: see long format data. 

Monte Carlo method: a term applied 
to the process of using data 
simulations to solve statistical 
problems. Its name comes from the 
use of Monte Carlo roulette tables to 
generate ‘random’ numbers in the 
pre-computer age. Karl Pearson, for 
example, purchased copies of Le 
Monaco, a weekly Paris periodical 
that published data from the Monte 
Carlo casinos' roulette wheels. He 
used these data as pseudo-random 
numbers in his statistical research. 

Mosaic plot: A graphical display 
showing the relationship between 
two or more categorical variables. 

Multicollinearity: a situation in which 
two or more variables are very 
closely linearly related. 

Multilevel linear model: a linear 
model (just like regression,
ANCOVA, ANOVA, etc.) in which the
hierarchical structure of the data is 
explicitly considered. In this analysis 
regression parameters can be fixed 
(as in regression and ANOVA) but 
also random (i.e., free to vary across 
different contexts at a higher level 
of the hierarchy). This means that 
for each regression parameter there 
is a fixed component but also an 
estimate of how much the parameter
varies across contexts (see fixed
coefficient, random coefficient). 

Multimodal: description of a distribution 
of observations that has more than 
two modes. 

Multinomial logistic regression:
logistic regression in which the
outcome variable has more than two 
categories. 

Multiple R²: the multiple correlation
coefficient squared. It is the 
proportion of variance shared by the 
observed values of an outcome and 
the values of the outcome predicted 
by a multiple regression model. 

Multiple regression: an extension 
of simple regression in which an 
outcome is predicted by a linear 
combination of two or more predictor 
variables. The form of the model is: 

$Y_i = (b_0 + b_1 X_{1i} + b_2 X_{2i} + \dots + b_n X_{ni}) + \varepsilon_i$

in which the outcome is denoted as
Y, and each predictor is denoted as
X. Each predictor has a regression
coefficient b associated with it, and
b_0 is the value of the outcome when
all predictors are zero.
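
The model part of the equation (everything except the error term) is just an intercept plus a weighted sum, as this Python sketch shows. The coefficients and predictor values are made up, and the function name is hypothetical.

```python
def predict(b0, coefficients, predictors):
    """Predicted outcome: intercept plus each coefficient times its predictor."""
    return b0 + sum(b * x for b, x in zip(coefficients, predictors))

# Hypothetical model with two predictors.
b0 = 1.0
coefficients = [0.5, 2.0]

print(predict(b0, coefficients, [4.0, 3.0]))
# When all predictors are zero the prediction is the intercept.
print(predict(b0, coefficients, [0.0, 0.0]))
```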

Multivariate: means ‘many variables’
and is usually used when referring
to analyses in which there is more
than one outcome variable (e.g.,
MANOVA, principal components
analysis, etc.).

Multivariate analysis of variance:
family of tests that extend the basic
analysis of variance to situations 
in which more than one outcome 
variable has been measured. 

Multivariate normality: an extension 
of a normal distribution to multiple 
variables. It is a probability 
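distribution of a set of variables
$v' = [v_1, v_2, \dots, v_n]$ given by:

$f(v_1, \dots, v_n) = (2\pi)^{-n/2} |\Sigma|^{-1/2} \exp\left\{-\tfrac{1}{2}(v - \mu)' \Sigma^{-1} (v - \mu)\right\}$

in which $\mu$ is the vector of means of
the variables, and $\Sigma$ is the variance-
covariance matrix. If that made any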
sense to you then you’re cleverer 
than I am. 

Nagelkerke’s R²: a version of the
coefficient of determination for
logistic regression. It is a variation
on Cox and Snell’s R²_CS, which
overcomes the problem that this 
statistic has of not being able to 
reach its maximum value. 

Negative skew: see skew. 

Nominal variable: where numbers 
merely represent names. For
example, the numbers on sports
players’ shirts: a player with the
number 1 on her back is not 
necessarily worse than a player with 
a 2 on her back. The numbers have 
no meaning other than denoting the 
type of player (i.e., full back, centre 
forward, etc.). 

Non-parametric tests: a family of 
statistical procedures that do not 
rely on the restrictive assumptions of 
parametric tests. In particular, they 
do not assume that the sampling 
distribution is normally distributed. 

Normal distribution: a probability 
distribution of a random variable that 
is known to have certain properties.
It is perfectly symmetrical (has a
skew of 0), and has a kurtosis of 0. 

Normally distributed data (as an 
assumption): when generalizing 
the findings of parametric tests there 
is typically an assumption made that 
something is normally distributed; 
in some cases it is the sampling 
distribution, in others the errors in 
the model. If this assumption is not
true then robust tests should be 
applied. 

Null hypothesis: the reverse of 
the experimental hypothesis, that
your prediction is wrong and the 
predicted effect doesn’t exist. 

Numeric variables: variables involving 
numbers. 

Object: anything created in R. It 
could be a variable, a collection of 
variables, a statistical model, etc. 
Objects can be single values (such 
as the mean of a set of scores) 
or collections of information; for 
example, when you run an analysis, 
you create an object that contains 
the output of that analysis, which 
means that this object contains 
many different values and variables. 

Oblique rotation: a method of rotation 
in factor analysis that allows the 
underlying factors to be correlated. 

Odds: the probability of an event 
occurring divided by the probability 
of that event not occurring. 

Odds ratio: the ratio of the odds of 
an event occurring in one group 
compared to another. So, for 
example, if the odds of dying after 
writing a glossary are 4, and the 
odds of dying after not writing a 
glossary are 0.25, then the odds 
ratio is 4/0.25 = 16. This means 
that the odds of dying if you write 
a glossary are 16 times higher 
than if you don’t. An odds ratio of
1 would indicate that the odds of
a particular outcome are equal in 
both groups. 
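
The glossary-writing example can be reproduced in Python. The odds of 4 and 0.25 come from the example above; the probabilities of 0.8 and 0.2 are simply the values that correspond to those odds, and the function name is made up.

```python
def odds(p):
    """Probability of the event divided by the probability of no event."""
    return p / (1 - p)

# Probabilities that correspond to odds of 4 and 0.25 respectively.
odds_writing = odds(0.8)
odds_not_writing = odds(0.2)

odds_ratio = odds_writing / odds_not_writing
print(round(odds_writing, 2), round(odds_not_writing, 2), round(odds_ratio, 2))
```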

Omega squared: an effect size 
measure associated with ANOVA 
that is less biased than eta squared. It
is a (sometimes hideous) function 
of the model sum of squares and 
the residual sum of squares and 
isn't actually much use because it 
measures the overall effect of the 
ANOVA and so can’t be interpreted 
in a meaningful way. In all other 
respects it’s great though. 

One-tailed test: a test of a directional 
hypothesis. For example, the 
hypothesis ‘the longer I write this 
glossary, the more I want to place 
my editor’s genitals in a starved 
crocodile’s mouth’ requires a one- 
tailed test because I’ve stated the 
direction of the relationship (see also 
two-tailed test). 

Ordinal variable: data that tell us not 
only that things have occurred, 
but also the order in which they 
occurred. These data tell us nothing 
about the differences between 
values. For example, gold, silver 
and bronze medals are ordinal: they 
tell us that the gold medallist was 
better than the silver medallist, but 
they don’t tell us how much better 
(was gold a lot better than silver, or 
were gold and silver very closely 
competed?). 

Orthogonal: means perpendicular (at 
right angles) to something. It tends 
to be equated to independence in 
statistics because of the connotation 
that perpendicular linear models in 
geometric space are completely 
independent (one is not influenced 
by the other). 

Orthogonal rotation: a method of 
rotation in factor analysis that keeps 
the underlying factors independent 
(i.e., not correlated). 

Outcome variable: a variable whose 
values we are trying to predict from 
one or more predictor variables. 

Outlier: an observation very different 
from most others. Outliers can bias 
statistics such as the mean. 

Package: a collection of functions that, 
once the package is installed and 
loaded, can be used in R. 

Pairwise comparisons: comparisons 
of pairs of means. 

Parametric test: a test that requires 
data from one of the large catalogue 
of distributions that statisticians have 
described. Normally this term is 
used for parametric tests based on 
the normal distribution, which require 
four basic assumptions that must 
be met for the test to be accurate: 
a normally distributed sampling 
distribution (see normal distribution), 
homogeneity of variance, interval or 
ratio data, and independence. 

Part correlation: another name for a 
semi-partial correlation. 

Partial correlation: a measure of the 
relationship between two variables 
while ‘controlling’ the effect on both 
of one or more additional variables. 

Partial eta squared (partial $\eta^2$): a 
version of eta squared that is the 
proportion of variance that a variable 
explains when excluding other 
variables in the analysis. Eta squared 
is the proportion of total variance 
explained by a variable, whereas 
partial eta squared is the proportion 
of variance that a variable explains 
that is not explained by other 
variables. 

Partial out: to partial out the effect of 
a variable is to remove the variance 
that the variable shares with other 
variables in the analysis before 
looking at their relationships (see 
partial correlation). 

Pattern matrix: a matrix in factor 
analysis containing the regression 
coefficients for each variable on 
each factor in the data. See also 
structure matrix. 

Pearson’s correlation coefficient: 
or Pearson’s product-moment 
correlation coefficient to give it its full 
name, is a standardized measure of 
the strength of relationship between 
two variables. It can take any value 
from -1 (as one variable changes, 
the other changes in the opposite 
direction by the same amount), 
through 0 (as one variable changes 
the other doesn't change at all), to 
+1 (as one variable changes, the 
other changes in the same direction 
by the same amount). 

Perfect collinearity: exists when at 
least one predictor in a regression 
model is a perfect linear combination 
of the others (the simplest example 
being two predictors that are 
perfectly correlated - they have a 
correlation coefficient of 1). 

Phi: a measure of the strength of 
association between two categorical 
variables. Phi is used with 2 × 2 
contingency tables (tables which 
have two categorical variables 
and each variable has only two 
categories). Phi is a variant of the 
chi-square test, $\chi^2$: it is given by 
$\phi = \sqrt{\chi^2 / n}$, in which $n$ is the total 
number of observations. 
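
As a rough sketch of the phi formula, the code below computes the Pearson chi-square for a 2 × 2 table (without any continuity correction, which is an assumption made here) and converts it to phi. The cell labels a, b, c, d and the function names are hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2 x 2 table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def phi_from_chi_square(chi_square, n):
    """Phi is the square root of chi-square divided by the number of observations."""
    return math.sqrt(chi_square / n)


# Made-up counts: every case falls on the diagonal, so association is perfect.
chi_sq = chi_square_2x2(10, 0, 0, 10)
print(phi_from_chi_square(chi_sq, 20))
```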

Pillai-Bartlett trace (V): a test statistic 
in MANOVA. It is the sum of the 
proportion of explained variance on 
the discriminant function variates of 
the data. As such, it is similar to the 
ratio of $SS_M/SS_T$. 

Planned comparisons: another name 
for planned contrasts. 

Planned contrasts: a set of 
comparisons between group means 
that are constructed before any data 
are collected. These are theory-led 
comparisons and are based on 
the idea of partitioning the variance 
created by the overall effect of group 
differences into gradually smaller 
portions of variance. These tests 
have more power than post hoc 
tests. 

Platykurtic: see kurtosis. 

Point-biserial correlation: a 
standardized measure of the 
strength of relationship between 
two variables when one of the two 
variables is dichotomous. The point- 
biserial correlation coefficient is used 
when the dichotomy is a discrete, or 
true, dichotomy (i.e., one for which 
there is no underlying continuum 
between the categories). An example 
of this is pregnancy: you can be 
either pregnant or not, there is no 
in-between state. 

Polychotomous logistic regression: 
another name for multinomial logistic 
regression. 

Polynomial: a posh name for a growth 
curve or trend over time. If time is 
our predictor variable, then any 
polynomial is tested by including a 
variable that is the predictor to the 
power of the order of polynomial 
that we want to test: a linear trend is 
tested by time alone, a quadratic or 
second-order polynomial is tested 
by including a predictor that is time$^2$, 
for a fifth-order polynomial we need 
a predictor of time$^5$ and for an nth- 
order polynomial we would have to 
include time$^n$ as a predictor. 

Polynomial contrast: a contrast that 
tests for trends in the data. In its 
most basic form it looks for a linear 
trend (i.e., that the group means 
increase proportionately). 

Population: in statistical terms this 
usually refers to the collection of 
units (be they people, plankton, 
plants, cities, suicidal authors, etc.) 
to which we want to generalize a set 
of findings or a statistical model. 

Positive skew: see skew. 

Post hoc tests: a set of comparisons 
between group means that were 
not thought of before data were 
collected. Typically these tests 
involve comparing the means of all 
combinations of pairs of groups. To 
compensate for the number of tests 
conducted, each test uses a strict 
criterion for significance. As such, 
they tend to have less power than 
planned contrasts. They are usually 
used for exploratory work for which 
no firm hypotheses were available on 
which to base planned contrasts. 

Power: the ability of a test to detect an 
effect of a particular size (a value of 
.8 is a good level to aim for). 

Practice effect: refers to the possibility 
that participants’ performance in a 
task may be influenced (positively 
or negatively) if they repeat the 
task because of familiarity with the 
experimental situation and/or the 
measures being used. 

Predictor variable: a variable that 
is used to try to predict values 
of another variable known as an 
outcome variable. 

Principal components analysis 
(PCA): a multivariate technique for 
identifying the linear components of 
a set of variables. 

Probability distribution: a curve 
describing an idealized frequency 
distribution of a particular variable 
from which it is possible to ascertain 
the probability with which specific 
values of that variable will occur. For 
categorical variables it is simply a 
formula yielding the probability with 
which each category occurs. 

Promax: a method of oblique rotation 
that is computationally faster than 
direct oblimin and so useful for large 
data sets. 

Q-Q plot: Short for quantile-quantile 
plot. A graph plotting the quantiles of 
a variable against the quantiles of a 
particular distribution (often a normal 
distribution). If values fall on the 
diagonal of the plot then the variable 
shares the same distribution as the 
one specified. Deviations from the 
diagonal show deviations from the 
distribution of interest. 

Quadratic trend: if the means in 
ordered conditions are connected 
with a line then a quadratic trend 
is shown by one change in the 
direction of this line (e.g., the line 
is curved in one place); the line is, 
therefore, U-shaped. There must be 
at least three ordered conditions. 

Qualitative methods: extrapolating 
evidence for a theory from what 
people say or write (contrast with 
quantitative methods). 

Quantiles: values that split a data 
set into equal portions. Quartiles, 
for example, are a special case of 
quantiles that split the data into four 
equal parts. Similarly, percentiles 
are points that split the data into 100 
equal parts and noniles are points 
that split the data into nine equal 
parts (you get the general idea). 

Quantitative methods: inferring 
evidence for a theory through 
measurement of variables that 
produce numeric outcomes (contrast 
with qualitative methods). 

Quartic trend: if the means in ordered 
conditions are connected with a 
line then a quartic trend is shown 
by three changes in the direction of 
this line. There must be at least five 
ordered conditions. 

Quartiles: a generic term for the three 
values that cut an ordered data 
set into four equal parts. The three 
quartiles are known as the lower 
quartile, the second quartile (or 
median) and the upper quartile. 

Quartimax: a method of orthogonal 
rotation. It attempts to maximize 
the spread of factor loadings for a 
variable across all factors. This often 
results in lots of variables loading 
highly on a single factor. 

Quartz window: the name of the 
window in which graphics and 
graphs appear if you use MacOS. 

Random coefficient: a coefficient or 
model parameter that is free to vary 
over situations or contexts (cf. fixed 
coefficient). 

Random effect: an effect is said to be 
random if the experiment contains 
only a sample of possible treatment 
conditions. Random effects can be 
generalized beyond the treatment 
conditions in the experiment. For 
example, the effect is random if 
we say that the conditions in our 
experiment (e.g., placebo, low dose 
and high dose) are only a sample 
of possible conditions (maybe we 
could have tried a very high dose). 
We can generalize this random effect 
beyond just placebos, low doses 
and high doses. 

Random intercept: A term used in 
multilevel modelling to denote when 
the intercept in the model is free 
to vary across different groups or 
contexts (cf. fixed intercept). 

Random slope: A term used in 
multilevel modelling to denote when 
the slope of the model is free to vary 
across different groups or contexts 
(cf. fixed slope). 

Random variable: a random variable is 
one that varies over time (e.g., your 
weight is likely to fluctuate over time). 

Random variance: variance that is 
unique to a particular variable but 
not reliably so. 

Randomization: the process of doing 
things in an unsystematic or random 
way. In the context of experimental 
research the word usually applies 
to the random assignment of 
participants to different treatment 
conditions. 

Range: the range of scores is the value of 
the smallest score subtracted from 
the highest score. It is a measure 
of the dispersion of a set of scores. 
See also variance, standard 
deviation, and interquartile range. 

Ranking: the process of transforming 
raw scores into numbers that 
represent their position in an ordered 
list of those scores, i.e., the raw 
scores are ordered from lowest 
to highest and the lowest score 
is assigned a rank of 1, the next 
highest score is assigned a rank of 
2, and so on. 
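
A minimal sketch of ranking follows. The entry above does not say what to do with tied scores; giving ties the average of the ranks they occupy is a common convention and is an assumption here, as is the function name.

```python
def rank_scores(scores):
    """Return the rank of each score (1 = lowest); ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        # Find the end of the run of tied scores starting at this position.
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


print(rank_scores([10, 30, 20]))
```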

Ratio variable: an interval variable but 
with the additional property that 
ratios are meaningful. For example, 
people’s ratings of this book on 
Amazon.com can range from 1 
to 5; for these data to be ratio not 
only must they have the properties 
of interval variables, but in addition 
a rating of 4 should genuinely 
represent someone who enjoyed this 
book twice as much as someone 
who rated it as 2. Likewise, someone 
who rated it as 1 should be half as 
impressed as someone who rated 
it as 2. 

Regression coefficient: see $b_i$ and $\beta_i$. 

Regression model: see multiple 
regression and simple regression. 

Regression line: a line on a scatterplot 
representing the regression model 
of the relationship between the two 
variables plotted. 

Related design: another name for a 
repeated-measures design. 

Related factorial design: an 
experimental design incorporating 
two or more predictors (or 
independent variables), all of which 
have been manipulated using the 
same participants (or whatever 
entities are being tested). 

Reliability: the ability of a measure to 
produce consistent results when the 
same entities are measured under 
different conditions. 

Repeated-measures ANOVA: an 
analysis of variance conducted on 
any design in which the independent 
variable (predictor) or variables 
(predictors) have all been measured 
using the same participants in all 
conditions. 

Repeated-measures design: an 
experimental design in which 
different treatment conditions 
utilize the same organisms (i.e., 
in psychology, this would mean 
the same people take part in all 
experimental conditions) and so 
the resulting data are related (a.k.a. 
related design or within-subject 
designs). 

Residual: The difference between the 
value a model predicts and the value 
observed in the data on which the 
model is based. When the residual 
is calculated for each observation in 
a data set the resulting collection is 
referred to as the residuals. 

Residual sum of squares: a measure 
of the variability that cannot be 
explained by the model fitted to the 
data. It is the total squared deviance 
between the observations, and 
the value of those observations 
predicted by whatever model is fitted 
to the data. 

Residuals: see residual. 

Robust test: A term applied to a family 
of procedures to estimate statistics 
that are reliable even when the 
normal assumptions of the statistic 
are not met. 

Rotation: a process in factor analysis 
for improving the interpretability 
of factors. In essence, an attempt 
is made to transform the factors 
that emerge from the analysis in 
such a way as to maximize factor 
loadings that are already large, and 
minimize factor loadings that are 
already small. There are two general 
approaches: orthogonal rotation and 
oblique rotation. 

Roy’s largest root: a test statistic in 
MANOVA. It is the eigenvalue for 
the first discriminant function variate 
of a set of observations. So, it is 
the same as the Hotelling-Lawley 
trace, but for the first variate only. 
It represents the proportion of 
explained variance to unexplained 
variance ($SS_M/SS_R$) for the first 
discriminant function. 

Sample: a smaller (but hopefully 
representative) collection of units 
from a population used to determine 
truths about that population (e.g., 
how a given population behaves in 
certain conditions). 

Sampling distribution: the probability 
distribution of a statistic. We can 
think of this as follows: if we take 
a sample from a population and 
calculate some statistic (e.g., the 
mean), the value of this statistic will 
depend somewhat on the sample 
we took. As such, the statistic will 
vary slightly from sample to sample. 
If, hypothetically, we took lots and 
lots of samples from the population 
and calculated the statistic of 
interest we could create a frequency 
distribution of the values we get. 
The resulting distribution is what the 
sampling distribution represents: the 
distribution of possible values of a 
given statistic that we could expect 
to get from a given population. 

Sampling variation: the extent to which 
a statistic (the mean, median, t, F, 
etc.) varies in samples taken from 
the same population. 

Saturated model: a model that 
perfectly fits the data and, therefore, 
has no error. It contains all possible 
main effects and interactions 
between variables. 

Scatterplot: a graph that plots 
values of one variable against the 
corresponding value of another 
variable (and the corresponding 
value of a third variable can also be 
included on a 3-D scatterplot). 

Scree plot: a graph plotting each factor 
in a factor analysis (X-axis) against 
its associated eigenvalue (Y-axis). 
It shows the relative importance of 
each factor. This graph has a very 
characteristic shape (there is a sharp 
descent in the curve followed by a 
tailing off) and the point of inflexion 
of this curve is often used as a 
means of extraction. With a sample 
of more than 200 participants, this 
provides a fairly reliable criterion for 
extraction (Stevens, 2002). 

Second quartile: another name for the 
median. 

Semi-partial correlation: a measure 
of the relationship between two 
variables while ‘controlling’ the effect 
that one or more additional variables 
has on one of those variables. If we 
call our variables x and y, it gives us 
a measure of the variance in y that x 
alone shares. 

Shapiro-Wilk test: a test of whether a 
distribution of scores is significantly 
different from a normal distribution. 
A significant value indicates a 
deviation from normality, but this 
test is notoriously affected by large 
samples in which small deviations 
from normality yield significant 
results. 

Shrinkage: the loss of predictive 
power of a regression model if the 
model had been derived from the 
population from which the sample 
was taken, rather than the sample 
itself. 

Simple effects analysis: this 
analysis looks at the effect of one 
independent variable (categorical 
predictor variable) at individual levels 
of another independent variable. 

Simple regression: a linear model in 
which one variable or outcome is 
predicted from a single predictor 
variable. The model takes the form: 

$Y_i = (b_0 + b_1 X_i) + \varepsilon_i$ 

in which $Y$ is the outcome variable, 
$X$ is the predictor, $b_1$ is the regression 
coefficient associated with the 
predictor and $b_0$ is the value of the 
outcome when the predictor is zero. 
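
To make the roles of the intercept and the slope concrete, here is a small sketch that estimates both by ordinary least squares. Least squares is the usual estimation method but is an assumption here, since the entry gives only the form of the model; the function names and the data are made up for illustration.

```python
def fit_simple_regression(x, y):
    """Least-squares estimates (b0, b1) for the model y = b0 + b1 * x + error."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: cross-product deviations divided by the sum of squares of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: predicted outcome when the predictor is zero.
    b0 = mean_y - b1 * mean_x
    return b0, b1


def residuals(x, y, b0, b1):
    """Observed outcome minus the value the model predicts, for each case."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


b0, b1 = fit_simple_regression([0, 1, 2, 3], [1, 3, 5, 7])
print(b0, b1)
```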

Singularity: a term used to describe 
variables that are perfectly correlated 
(i.e., the correlation coefficient is 1 
or -1). 

Skew: a measure of the symmetry of a 
frequency distribution. Symmetrical 
distributions have a skew of 0. When 
the frequent scores are clustered at 
the lower end of the distribution and 
the tail points towards the higher or 
more positive scores, the value of 
skew is positive. Conversely, when 
the frequent scores are clustered 
at the higher end of the distribution 
and the tail points towards the lower 
or more negative scores, the value of 
skew is negative. 

Spearman’s correlation coefficient: 
a standardized measure of the 
strength of relationship between two 
variables that does not rely on the 
assumptions of a parametric test. It 
is Pearson’s correlation coefficient 
performed on data that have been 
converted into ranked scores. 

Sphericity: a less restrictive form 
of compound symmetry, which 
assumes that the variances of the 
differences between data taken 
from the same participant (or other 
entity being tested) are equal. This 
assumption is most commonly 
found in repeated-measures ANOVA 
but applies only where there are 
more than two points of data from 
the same participant (see also 
Greenhouse-Geisser correction, 
Huynh-Feldt correction). 

Split-half reliability: a measure of 
reliability obtained by splitting items 
on a measure into two halves (in 
some random fashion) and obtaining 
a score from each half of the scale. 
The correlation between the two 
scores, corrected to take account of 
the fact the correlations are based 
on only half of the items, is used as 
a measure of reliability. There are two 
popular ways to do this. Spearman 
(1910) and Brown (1910) developed 
a formula that takes no account of 
the standard deviation of items: 

$r_{sh} = \frac{2 r_{12}}{1 + r_{12}}$ 

in which $r_{12}$ is the correlation between 
the two halves of the scale. Flanagan 
(1937) and Rulon (1939), however, 
proposed a measure that does 
account for item variance: 

$r_{sh} = \frac{4 r_{12} \times s_1 \times s_2}{s_T^2}$ 

in which $s_1$ and $s_2$ are the standard 
deviations of each half of the scale, 
and $s_T^2$ is the variance of the whole 
test. See Cortina (1993) for more 
detail. 
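
Both corrections are simple enough to sketch directly. In the code below, r12 is the correlation between the two halves, s1 and s2 are the standard deviations of the halves, and var_total is the variance of the whole test; the function names and the example numbers are assumptions for illustration.

```python
def spearman_brown(r12):
    """Split-half reliability that ignores the standard deviations of the halves."""
    return 2 * r12 / (1 + r12)


def flanagan_rulon(r12, s1, s2, var_total):
    """Split-half reliability that takes the variability of each half into account."""
    return 4 * r12 * s1 * s2 / var_total


# Made-up example: halves correlate .5 and each has a standard deviation of 2.
# The whole-test variance is then 4 + 4 + 2 * 0.5 * 2 * 2 = 12.
print(spearman_brown(0.5))
print(flanagan_rulon(0.5, 2, 2, 12))
```

When the two halves have equal standard deviations, as in this example, the two formulas give the same answer; they part company as the halves differ in spread.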

Square matrix: a matrix that has an 
equal number of columns and rows. 

Standard deviation: an estimate of the 
average variability (spread) of a set 
of data measured in the same units 
of measurement as the original data. 
It is the square root of the variance. 

Standard error: the standard deviation 
of the sampling distribution of a 
statistic. For a given statistic (e.g., 
the mean) it tells us how much 
variability there is in this statistic 
across samples from the same 
population. Large values, therefore, 
indicate that a statistic from a given 
sample may not be an accurate 
reflection of the population from 
which the sample came. 

Standard error of differences: if we 
were to take several pairs of samples 
from a population and calculate 
their means, then we could also 
calculate the difference between 
their means. If we plotted these 
differences between sample means 
as a frequency distribution, we would 
have the sampling distribution of 
differences. The standard deviation 
of this sampling distribution is the 
standard error of differences. As such 
it is a measure of the variability of 
differences between sample means. 

Standard error of the mean (SE): the 
full name of the standard error. 

Standardization: the process of 
converting a variable into a standard 
unit of measurement. The unit of 
measurement typically used is 
standard deviation units (see also 
z-scores). Standardization allows us 
to compare data when different units 
of measurement have been used (we 
could compare weight measured 
in kilograms to height measured in 
inches). 
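
A minimal sketch of standardization into standard deviation units, using Python's built-in statistics module. The function name is hypothetical, and the sample standard deviation (dividing by N minus 1) is assumed.

```python
import statistics


def standardize(values):
    """Convert raw scores to z-scores: (score - mean) / standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (N - 1 denominator)
    return [(v - mean) / sd for v in values]


# Made-up data: weights in kilograms and heights in inches end up on a common scale.
weights_kg = [60, 70, 80]
heights_in = [64, 68, 72]
print(standardize(weights_kg))
print(standardize(heights_in))
```

After conversion both variables have a mean of 0 and a standard deviation of 1, which is what makes them comparable.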

Standardized: see standardization. 

Standardized DFBeta: a standardized 
version of DFBeta. These 
standardized values are easier to 
use than DFBeta because universal 
cut-off points can be applied. 
Stevens (2002) suggests looking at 
cases with absolute values greater 
than 2. 

Standardized DFFit: a standardized 
version of DFFit. 

Standardized residuals: the residuals 
of a model expressed in standard 
deviation units. Standardized 
residuals with an absolute value 
greater than 3.29 (actually, we 
usually just use 3) are cause for 
concern because in an average 
sample a value this high is unlikely 
to happen by chance; if more 
than 1% of our observations have 
standardized residuals with an 
absolute value greater than 2.58 
(we usually just say 2.5), there 
is evidence that the level of error 
within our model is unacceptable 
(the model is a fairly poor fit of the 
sample data); and if more than 5% 
of observations have standardized 
residuals with an absolute 
value greater than 1.96 (or 2 for 
convenience), then there is also 
evidence that the model is a poor 
representation of the actual data. 

Stepwise regression: a method 
of multiple regression in which 
variables are entered into the model 
based on a statistical criterion (the 
semi-partial correlation with the 
outcome variable). Once a new 
variable is entered into the model, all 
variables in the model are assessed 
to see whether they should be 
removed. 

String variables: variables involving 
words (i.e., letter strings). Such 
variables could include responses to 
open-ended questions such as ‘how 
much do you like writing glossary 
entries?’; the response might be 
‘about as much as I like placing my 
gonads on hot coals’. 

Structure matrix: a matrix in factor 
analysis containing the correlation 
coefficients for each variable on 
each factor in the data. When 
orthogonal rotation is used this is 
the same as the pattern matrix, but 
when oblique rotation is used these 
matrices are different. 

Studentized deleted residual: a 
measure of the influence of a 
particular case of data. This is a 
standardized version of the deleted 
residual. 

Studentized residuals: a variation on 
standardized residuals. Studentized 
residuals are the unstandardized 
residual divided by an estimate of its 
standard deviation that varies point 
by point. These residuals have the 
same properties as the standardized 
residuals but usually provide a more 
precise estimate of the error variance 
of a specific case. 

Sum of squared errors: another name 
for the sum of squares. 

Sum of squares (SS): an estimate 
of total variability (spread) of a set 
of data. First the deviance for each 
score is calculated, and then this 
value is squared. The SS is the sum 
of these squared deviances. 

Sum of squares and cross-products 
matrix (SSCP matrix): a square 
matrix in which the diagonal 
elements represent the sum of 
squares for a particular variable, and 
the off-diagonal elements represent 
the cross-products between pairs 
of variables. The SSCP matrix is 
basically the same as the variance- 
covariance matrix, except the SSCP 
matrix expresses variability and 
between-variable relationships as 
total values, whereas the variance- 
covariance matrix expresses them as 
average values. 

Suppressor effects: when a predictor 
has a significant effect but only when 
another variable is held constant. 

Systematic variation: variation due 
to some genuine effect (be that 
the effect of an experimenter doing 
something to all of the participants 
in one sample but not in other 
samples, or natural variation 
between sets of variables). We can 
think of this as variation that can be 
explained by the model that we’ve 
fitted to the data. 

t-statistic: Student’s t is a test statistic 
with a known probability distribution 
(the t-distribution). In the context 
of regression it is used to test 
whether a regression coefficient b 
is significantly different from zero; in 
the context of experimental work it is 
used to test whether the differences 
between two means are significantly 
different from zero. See also 
dependent t-test and independent 
t-test. 

Tertium quid: the possibility that an 
apparent relationship between two 
variables is actually caused by the 
effect of a third variable on them 
both (often called the third-variable 
problem). 

Test-retest reliability: the ability of 
a measure to produce consistent 
results when the same entities are 
tested at two different points in time. 

Test statistic: a statistic for which we 
know how frequently different values 
occur. The observed value of such 
a statistic is typically used to test 
hypotheses. 

Theory: although it can be defined 
more formally, a theory is a 
hypothesized general principle or 
set of principles that explain known 
findings about a topic and from 
which new hypotheses can be 
generated. 

Tolerance: tolerance statistics 
measure multicollinearity and are 
simply the reciprocal of the variance 
inflation factor (1/VIF). Values below 
0.1 indicate serious problems, 
although Menard (1995) suggests 
that values below 0.2 are worthy of 
concern. 

Total SSCP (T): the total sum of 
squares and cross-product matrix. 
This is a sum of squares and cross- 
product matrix for an entire set of 
observations. It is the multivariate 
equivalent of the total sum of 
squares. 

Total sum of squares: a measure 
of the total variability within a set 
of observations. It is the total 
squared deviance between each 
observation and the overall mean of 
all observations. 


Transformation: the process of 
applying a mathematical function 
to all observations in a data set, 
usually to correct some distributional 
abnormality such as skew or 
kurtosis. 

Treatment contrast: a contrast in which 
each category is compared to a 
user-defined baseline category. 

Trimmed mean: a statistic used in 
many robust tests. Imagine we had 
20 scores representing the annual 
income of students (in thousands, 
rounded to the nearest thousand): 
2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 
4, 4, 4, 4, 6, 35. The mean income 
is 5 (£5000). This value is biased by 
an outlier. A trimmed mean is simply 
a mean based on the distribution 
of scores after some percentage of 
scores has been removed from each 
extreme of the distribution. So, a 
10% trimmed mean will remove 10% 
of scores from the top and bottom of 
ordered scores before the mean is 
calculated. With 20 scores, removing 
10% of scores involves removing the 
top and bottom 2 scores. This gives 
us: 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 
4, 4, 4, the mean of which is 3.44. 
The mean depends on a symmetrical 
distribution to be accurate, but a 
trimmed mean produces accurate 
results even when the distribution 
is not symmetrical. There are 
more complex examples of robust 
methods such as the bootstrap. 
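
The trimming procedure described in this entry can be sketched in a few lines, using the 20 income scores from the example. The function name is hypothetical, and rounding the number of trimmed scores down when the percentage does not divide evenly is an assumption.

```python
def trimmed_mean(scores, proportion):
    """Mean after removing the given proportion of scores from each end."""
    ordered = sorted(scores)
    # Number of scores to drop from each extreme (rounded down; the small
    # constant guards against floating-point error in the multiplication).
    k = int(len(ordered) * proportion + 1e-9)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


incomes = [2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 6, 35]
print(sum(incomes) / len(incomes))          # ordinary mean
print(round(trimmed_mean(incomes, 0.10), 2))  # 10% trimmed mean
```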

Two-tailed test: a test of a non- 
directional hypothesis. For 
example, the hypothesis ‘writing 
this glossary has some effect on 
what I want to do with my editor’s 
genitals’ requires a two-tailed test 
because it doesn’t suggest the 
direction of the relationship. See 
also one-tailed test. 

Type I error: occurs when we believe 
that there is a genuine effect in our 
population, when in fact there isn’t. 

Type II error: occurs when we 
believe that there is no effect in the 
population when, in reality, there is. 

Unique variance: variance that is 
specific to a particular variable (i.e., 
is not shared with other variables). 
We tend to use the term ‘unique 
variance’ to refer to variance that 
can be reliably attributed to only 
one measure, otherwise it is called 
random variance. 

Univariate: means ‘one variable’ and is 
usually used to refer to situations in 
which only one outcome variable has 
been measured (i.e., ANOVA, t-tests, 
Mann-Whitney tests, etc.). 

Unstructured: a covariance structure 
used in multilevel models. This 
covariance structure is completely 
general. Covariances are assumed 
to be completely unpredictable: 
they do not conform to a systematic 
pattern. 

Unstandardized residuals: the 
residuals of a model expressed 
in the units in which the original 
outcome variable was measured. 

Unsystematic variation: this is 
variation that isn't due to the effect in 
which we’re interested (so could be 
due to natural differences between 
people in different samples such 
as differences in intelligence or 
motivation). We can think of this as 
variation that can’t be explained by 
whatever model we’ve fitted to the 
data. 

Upper quartile: the value that cuts 
off the highest 25% of ordered 
scores. If the scores are ordered 
and then divided into two halves at 
the median, then the upper quartile 
is the median of the top half of the 
scores. 

Validity: evidence that a study allows 
correct inferences about the 
question it was aimed to answer or 
that a test measures what it set out 
to measure conceptually (see also 
content validity, criterion validity). 

Variable view: there are two ways to 
view the contents of the data editor 
window. The variable view allows you 
to define properties of the variables 
for which you wish to enter data. See 
also data view. 

Variables: anything that can be 
measured and can differ across 
entities or across time. 

Variance: an estimate of average 
variability (spread) of a set of data. It 
is the sum of squares divided by the 
number of values on which the sum 
of squares is based minus 1. 
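
The chain from sum of squares to variance to standard deviation can be sketched as follows. The function names and the five example scores are assumptions made for illustration.

```python
import math


def sum_of_squares(values):
    """Total of the squared deviations of each score from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def variance(values):
    """Sum of squares divided by the number of values minus 1."""
    return sum_of_squares(values) / (len(values) - 1)


def standard_deviation(values):
    """Square root of the variance, in the original units of measurement."""
    return math.sqrt(variance(values))


scores = [1, 2, 3, 4, 5]
print(sum_of_squares(scores), variance(scores), standard_deviation(scores))
```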

Variance components: a covariance 
structure used in multilevel models. 
This covariance structure is very 
simple and assumes that all 
random effects are independent 
and variances of random effects are 
assumed to be the same and sum to 
the variance of the outcome variable. 

Variance-covariance matrix: a square 
matrix (i.e., same number of columns 
and rows) representing the variables 
measured. The diagonals represent 
the variances within each variable, 
whereas the off-diagonals represent 
the covariances between pairs of 
variables. 

Variance inflation factor (VIF): a 
measure of multicollinearity. The 
VIF indicates whether a predictor 
has a strong linear relationship with 
the other predictor(s). Myers (1990) 
suggests that a value of 10 is a good 
value at which to worry. Bowerman 
and O’Connell (1990) suggest that 
if the average VIF is greater than 1, 
then multicollinearity may be biasing 
the regression model. 
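
The entry above does not give a formula, so the following is only a sketch based on the usual definition, VIF = 1/(1 - R²), where R² comes from regressing one predictor on the others. With just two predictors, R² is simply their squared correlation, which keeps the example short. The data and function names are invented for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sum_products = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sum_products / sqrt(ss_x * ss_y)

def vif_two_predictors(x1, x2):
    """VIF for either predictor when the model has exactly two predictors."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)

predictor_1 = [1, 2, 3, 4, 5]
predictor_2 = [2, 1, 4, 3, 5]
print(vif_two_predictors(predictor_1, predictor_2))  # r = 0.8 here
```

With more than two predictors you would need the R² from a full regression of each predictor on all the others, which is beyond this sketch.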

Variance ratio: see Hartley’s Fmax. 

Variance sum law: states that the 
variance of a difference between two 
independent variables is equal to the 
sum of their variances. 
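
One way to convince yourself of the variance sum law is to enumerate every equally likely pairing of two independent variables and compare variances. The sketch below does this in Python for a die and a coin (both invented for illustration), using population variances so that the equality is exact rather than approximate.

```python
from itertools import product
from statistics import pvariance

die = [1, 2, 3, 4, 5, 6]   # each face equally likely
coin = [0, 1]              # tails = 0, heads = 1

# Independence means every (die, coin) pair is equally likely.
differences = [d - c for d, c in product(die, coin)]

var_of_difference = pvariance(differences)
sum_of_variances = pvariance(die) + pvariance(coin)
print(var_of_difference, sum_of_variances)
```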

Varimax: a method of orthogonal 
rotation. It attempts to maximize the 
dispersion of factor loadings within 
factors. Therefore, it tries to load a 
smaller number of variables highly 
on each factor, resulting in more 
interpretable clusters of factors. 

VIF: see variance inflation factor. 

Wald statistic: a test statistic with a 
known probability distribution (a 
chi-square distribution) that is used 
to test whether the b coefficient for 
a predictor in a logistic regression 
model is significantly different from 
zero. It is analogous to the t-statistic 
in a regression model in that it is 
simply the b coefficient divided by 
its standard error. The Wald statistic 
is inaccurate when the regression 
coefficient (b) is large, because the 
standard error tends to become 
inflated, resulting in the Wald statistic 
being underestimated. 
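
As a sketch of the calculation described above, the Python below divides a coefficient by its standard error. The values of b and its standard error are made up. It also assumes the common practice of comparing the ratio to a standard normal distribution, which is equivalent to comparing its square to a chi-square distribution with 1 degree of freedom.

```python
from statistics import NormalDist

def wald_test(b, standard_error):
    """Return the Wald ratio, its square, and a two-sided p-value."""
    z = b / standard_error
    chi_square = z ** 2  # same test expressed on the chi-square scale (1 df)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, chi_square, p_value

# Hypothetical logistic regression output: b = 1.2, SE = 0.4
z, chi_square, p_value = wald_test(1.2, 0.4)
print(z, chi_square, p_value)
```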

Weights: a number by which something 
(usually a variable in statistics) is 
multiplied. The weight assigned 
to a variable determines the 
influence that variable has within 
a mathematical equation: large 
weights give the variable a lot of 
influence. 

Welch’s F: a version of the F-ratio 
designed to be accurate when 
the assumption of homogeneity of 
variance has been violated. Not to 
be confused with the squelch test 
which is where you shake your head 
around after writing statistics books 
to see if you still have a brain. 

Welch’s t-test: a modification of the 
independent t-test that does not 
assume equal population variances. 
Therefore, it can be used as an 
adjustment to correct for violation of 
the assumption of homogeneity of 
variance. 
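
The entry above does not spell out the arithmetic, so this is a sketch using the standard Welch formulas: the difference between means divided by the square root of the sum of each variance over its sample size, with degrees of freedom from the Welch-Satterthwaite approximation. The two groups are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_1, group_2):
    """Return Welch's t and its approximate degrees of freedom."""
    n1, n2 = len(group_1), len(group_2)
    # Each group's variance is kept separate (no pooling).
    v1, v2 = variance(group_1) / n1, variance(group_2) / n2
    t = (mean(group_1) - mean(group_2)) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

t, df = welch_t([10, 12, 14, 16, 18], [11, 12, 13])
print(t, df)
```

Note that the degrees of freedom are usually not a whole number, which is one visible sign that the adjustment has been applied.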

Wide format data: data that are 
arranged such that levels of 
independent or predictor variables 
are differentiated by different 
columns in a dataframe. As such, 
outcome variable scores are 
contained in multiple columns of 
data, each column representing a 
level of an independent variable. 
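
To make the layout concrete, here is a small Python sketch (not R, and not from the book) that holds two cases in wide format and restructures them into one row per score. The variable names `pre` and `post`, the function name and the scores are all hypothetical.

```python
# Wide format: one row per entity, one column per level of the predictor.
wide_data = [
    {"id": 1, "pre": 5, "post": 8},
    {"id": 2, "pre": 6, "post": 7},
]

def wide_to_long(rows, id_key, level_keys, level_name, score_name):
    """Turn each level's column into its own row."""
    long_rows = []
    for row in rows:
        for level in level_keys:
            long_rows.append(
                {id_key: row[id_key], level_name: level, score_name: row[level]}
            )
    return long_rows

long_data = wide_to_long(wide_data, "id", ["pre", "post"], "time", "score")
for row in long_data:
    print(row)
```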

Wilcoxon’s rank-sum test: a 
non-parametric test that looks 
for differences between two 
independent samples. That is, it 
tests whether the populations from 
which two samples are drawn have 
the same location. It is functionally 
the same as the Mann-Whitney test, 
and both tests are non-parametric 
equivalents of the independent t-test. 
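
The core of the test is ranking the pooled scores and summing the ranks in each group. The Python sketch below shows that step only (no p-value), with tied scores given the average of the ranks they occupy. The data and function name are made up, and the last line uses the standard conversion from a rank sum to the Mann-Whitney U.

```python
def rank_sums(group_a, group_b):
    """Sum of pooled ranks for each group; ties get the average rank."""
    pooled = sorted(group_a + group_b)
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Positions i+1 .. j share this value, so use their mean rank.
        rank_of[pooled[i]] = (i + 1 + j) / 2
        i = j
    sum_a = sum(rank_of[value] for value in group_a)
    sum_b = sum(rank_of[value] for value in group_b)
    return sum_a, sum_b

sum_a, sum_b = rank_sums([3, 5, 8], [1, 5, 9, 10])
u_a = sum_a - 3 * (3 + 1) / 2  # Mann-Whitney U for the first group
print(sum_a, sum_b, u_a)
```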

Wilcoxon signed-rank test: a 
non-parametric test that looks for 
differences between two related 
samples. It is the non-parametric 
equivalent of the related t-test. 

Wilks’s lambda (Λ): a test statistic in 
MANOVA. It is the product of the 
unexplained variance on each of the 
discriminant function variates, so it 
represents the ratio of error variance 
to total variance (SSR/SST) for each 
variate. 


Within-subject design: another name 
for a repeated-measures design. 

Workspace: the collection of objects, 
models, dataframes and other things 
that you have created during an R 
session. 

Working directory: a directory that R 
uses as the default location to open, 
save and ‘look for’ files. You should 
set the working directory to be the 
folder in which you have stored your 
data files, any scripts associated 
with the analysis or your workspace. 
Basically, anything to do with a 
session. 

Writer’s block: something I suffered 
from a lot while writing this 
edition. It’s when you can’t think 
of any decent examples and so 
end up talking about sperm the 
whole time. Seriously, look at this 
book, it’s all sperm this, sperm 
that, quail sperm, human sperm. 
Frankly, I’m amazed donkey sperm 
didn’t get in there somewhere. Oh, 
it just did. 

Yates’s continuity correction: an 
adjustment made to the chi-square 
test when the contingency table 
is 2 rows by 2 columns (i.e., there 
are two categorical variables, 
both of which consist of only two 
categories). In large samples the 
adjustment makes little difference 
and is slightly dubious anyway (see 
Howell, 2006). 
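
The entry above does not state the adjustment itself. In its usual form, 0.5 is subtracted from the absolute difference between each observed and expected frequency before squaring. The Python sketch below assumes that form and uses an invented 2 × 2 table.

```python
def chi_square_2x2(table, yates=False):
    """Pearson chi-square for a 2 x 2 table, optionally Yates-corrected."""
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            deviation = abs(table[i][j] - expected)
            if yates:
                deviation -= 0.5  # the continuity correction
            statistic += deviation ** 2 / expected
    return statistic

observed = [[10, 20], [30, 40]]
uncorrected = chi_square_2x2(observed)
corrected = chi_square_2x2(observed, yates=True)
print(uncorrected, corrected)
```

The corrected value is always the smaller of the two, which is why the correction makes the test more conservative.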

z-score: the value of an observation 
expressed in standard deviation 
units. It is calculated by taking the 
observation, subtracting from it 
the mean of all observations, and 
dividing the result by the standard 
deviation of all observations. 
By converting a distribution of 
observations into z-scores a new 
distribution is created that has a 
mean of 0 and a standard deviation 
of 1. 
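
A quick check of the claim above, as a Python sketch with made-up scores. It assumes the sample standard deviation (N - 1 in the denominator) is used both to compute the z-scores and to check them; the entry does not say which version is intended.

```python
from statistics import mean, stdev

def z_scores(observations):
    """Subtract the mean from each observation, then divide by the SD."""
    m = mean(observations)
    s = stdev(observations)  # assumption: sample SD (N - 1)
    return [(x - m) / s for x in observations]

scores = [2, 4, 4, 4, 5, 5, 7, 9]
standardized = z_scores(scores)
print(mean(standardized), stdev(standardized))  # very close to 0 and 1
```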
A.1 Table of the standard normal distribution 



z      Larger    Smaller   y
       Portion   Portion
.00    .50000    .50000    .3989
.01    .50399    .49601    .3989
.02    .50798    .49202    .3989
.03    .51197    .48803    .3988
.04    .51595    .48405    .3986
.05    .51994    .48006    .3984
.06    .52392    .47608    .3982
.07    .52790    .47210    .3980
.08    .53188    .46812    .3977
.09    .53586    .46414    .3973
.10    .53983    .46017    .3970
.11    .54380    .45620    .3965
.12    .54776    .45224    .3961
.13    .55172    .44828    .3956
.14    .55567    .44433    .3951
.15    .55962    .44038    .3945
.16    .56356    .43644    .3939
.17    .56749    .43251    .3932
.18    .57142    .42858    .3925
.19    .57535    .42465    .3918
.20    .57926    .42074    .3910
.21    .58317    .41683    .3902
.22    .58706    .41294    .3894
.23    .59095    .40905    .3885
.24    .59483    .40517    .3876
.25    .59871    .40129    .3867
.26    .60257    .39743    .3857
.27    .60642    .39358    .3847
.28    .61026    .38974    .3836
.29    .61409    .38591    .3825
.30    .61791    .38209    .3814
.31    .62172    .37828    .3802
.32    .62552    .37448    .3790
.33    .62930    .37070    .3778
.34    .63307    .36693    .3765
.35    .63683    .36317    .3752
.36    .64058    .35942    .3739
.37    .64431    .35569    .3725
.38    .64803    .35197    .3712
.39    .65173    .34827    .3697
.40    .65542    .34458    .3683
.41    .65910    .34090    .3668
.42    .66276    .33724    .3653
.43    .66640    .33360    .3637
.44    .67003    .32997    .3621
.45    .67364    .32636    .3605
.46    .67724    .32276    .3589
.47    .68082    .31918    .3572
.48    .68439    .31561    .3555
.49    .68793    .31207    .3538
.50    .69146    .30854    .3521
.51    .69497    .30503    .3503
.52    .69847    .30153    .3485
.53    .70194    .29806    .3467
.54    .70540    .29460    .3448
.55    .70884    .29116    .3429
.56    .71226    .28774    .3410
.57    .71566    .28434    .3391
.58    .71904    .28096    .3372
.59    .72240    .27760    .3352
.60    .72575    .27425    .3332
.61    .72907    .27093    .3312
.62    .73237    .26763    .3292
.63    .73565    .26435    .3271
.64    .73891    .26109    .3251
.65    .74215    .25785    .3230
.66    .74537    .25463    .3209
.67    .74857    .25143    .3187
.68    .75175    .24825    .3166
.69    .75490    .24510    .3144
.70    .75804    .24196    .3123
.71    .76115    .23885    .3101
.72    .76424    .23576    .3079
.73    .76730    .23270    .3056
.74    .77035    .22965    .3034
.75    .77337    .22663    .3011
.76    .77637    .22363    .2989
.77    .77935    .22065    .2966
.78    .78230    .21770    .2943
.79    .78524    .21476    .2920
.80    .78814    .21186    .2897
.81    .79103    .20897    .2874
.82    .79389    .20611    .2850
.83    .79673    .20327    .2827
.84    .79955    .20045    .2803
.85    .80234    .19766    .2780
.86    .80511    .19489    .2756
.87    .80785    .19215    .2732
.88    .81057    .18943    .2709
.89    .81327    .18673    .2685
.90    .81594    .18406    .2661
.91    .81859    .18141    .2637
.92    .82121    .17879    .2613
.93    .82381    .17619    .2589
.94    .82639    .17361    .2565
.95    .82894    .17106    .2541
.96    .83147    .16853    .2516
.97    .83398    .16602    .2492
.98    .83646    .16354    .2468
.99    .83891    .16109    .2444
1.00   .84134    .15866    .2420
1.01   .84375    .15625    .2396
1.02   .84614    .15386    .2371
1.03   .84849    .15151    .2347
1.04   .85083    .14917    .2323
1.05   .85314    .14686    .2299
1.06   .85543    .14457    .2275
1.07   .85769    .14231    .2251
1.08   .85993    .14007    .2227
1.09   .86214    .13786    .2203
1.10   .86433    .13567    .2179
1.11   .86650    .13350    .2155
1.12   .86864    .13136    .2131
1.13   .87076    .12924    .2107
1.14   .87286    .12714    .2083
1.15   .87493    .12507    .2059
1.16   .87698    .12302    .2036
1.17   .87900    .12100    .2012
1.18   .88100    .11900    .1989
1.19   .88298    .11702    .1965
1.20   .88493    .11507    .1942
1.21   .88686    .11314    .1919
1.22   .88877    .11123    .1895
1.23   .89065    .10935    .1872
1.24   .89251    .10749    .1849
1.25   .89435    .10565    .1826
1.26   .89617    .10383    .1804
1.27   .89796    .10204    .1781
1.28   .89973    .10027    .1758
1.29   .90147    .09853    .1736
1.30   .90320    .09680    .1714
1.31   .90490    .09510    .1691
1.32   .90658    .09342    .1669
1.33   .90824    .09176    .1647
1.34   .90988    .09012    .1626
1.35   .91149    .08851    .1604
1.36   .91309    .08691    .1582
1.37   .91466    .08534    .1561
1.38   .91621    .08379    .1539
1.39   .91774    .08226    .1518
1.40   .91924    .08076    .1497
1.41   .92073    .07927    .1476
1.42   .92220    .07780    .1456
1.43   .92364    .07636    .1435
1.44   .92507    .07493    .1415
1.45   .92647    .07353    .1394
1.46   .92785    .07215    .1374
1.47   .92922    .07078    .1354
1.48   .93056    .06944    .1334
1.49   .93189    .06811    .1315
1.50   .93319    .06681    .1295
1.51   .93448    .06552    .1276
1.52   .93574    .06426    .1257
1.53   .93699    .06301    .1238
1.54   .93822    .06178    .1219
1.55   .93943    .06057    .1200
1.56   .94062    .05938    .1182
1.57   .94179    .05821    .1163
1.58   .94295    .05705    .1145
1.59   .94408    .05592    .1127
1.60   .94520    .05480    .1109
1.61   .94630    .05370    .1092
1.62   .94738    .05262    .1074
1.63   .94845    .05155    .1057
1.64   .94950    .05050    .1040
1.65   .95053    .04947    .1023
1.66   .95154    .04846    .1006
1.67   .95254    .04746    .0989
1.68   .95352    .04648    .0973
1.69   .95449    .04551    .0957
1.70   .95543    .04457    .0940
1.71   .95637    .04363    .0925
1.72   .95728    .04272    .0909
1.73   .95818    .04182    .0893
1.74   .95907    .04093    .0878
1.75   .95994    .04006    .0863
1.76   .96080    .03920    .0848
1.77   .96164    .03836    .0833
1.78   .96246    .03754    .0818
1.79   .96327    .03673    .0804
1.80   .96407    .03593    .0790
1.81   .96485    .03515    .0775
1.82   .96562    .03438    .0761
1.83   .96638    .03362    .0748
1.84   .96712    .03288    .0734
1.85   .96784    .03216    .0721
1.86   .96856    .03144    .0707
1.87   .96926    .03074    .0694
1.88   .96995    .03005    .0681
1.89   .97062    .02938    .0669
1.90   .97128    .02872    .0656
1.91   .97193    .02807    .0644
1.92   .97257    .02743    .0632
1.93   .97320    .02680    .0620
1.94   .97381    .02619    .0608
1.95   .97441    .02559    .0596
1.96   .97500    .02500    .0584
1.97   .97558    .02442    .0573
1.98   .97615    .02385    .0562
1.99   .97670    .02330    .0551
2.00   .97725    .02275    .0540
2.01   .97778    .02222    .0529
2.02   .97831    .02169    .0519
2.03   .97882    .02118    .0508
2.04   .97932    .02068    .0498
2.05   .97982    .02018    .0488
2.06   .98030    .01970    .0478
2.07   .98077    .01923    .0468
2.08   .98124    .01876    .0459
2.09   .98169    .01831    .0449
2.10   .98214    .01786    .0440
2.11   .98257    .01743    .0431
2.12   .98300    .01700    .0422
2.13   .98341    .01659    .0413
2.14   .98382    .01618    .0404
2.15   .98422    .01578    .0396
2.16   .98461    .01539    .0387
2.17   .98500    .01500    .0379
2.18   .98537    .01463    .0371
2.19   .98574    .01426    .0363
2.20   .98610    .01390    .0355
2.21   .98645    .01355    .0347
2.22   .98679    .01321    .0339
2.23   .98713    .01287    .0332
2.24   .98745    .01255    .0325
2.25   .98778    .01222    .0317
2.26   .98809    .01191    .0310
2.27   .98840    .01160    .0303
2.28   .98870    .01130    .0297
2.29   .98899    .01101    .0290
2.30   .98928    .01072    .0283
2.31   .98956    .01044    .0277
2.32   .98983    .01017    .0270
2.33   .99010    .00990    .0264
2.34   .99036    .00964    .0258
2.35   .99061    .00939    .0252
2.36   .99086    .00914    .0246
2.37   .99111    .00889    .0241
2.38   .99134    .00866    .0235
2.39   .99158    .00842    .0229
2.40   .99180    .00820    .0224
2.41   .99202    .00798    .0219
2.42   .99224    .00776    .0213
2.43   .99245    .00755    .0208
2.44   .99266    .00734    .0203
2.45   .99286    .00714    .0198
2.46   .99305    .00695    .0194
2.47   .99324    .00676    .0189
2.48   .99343    .00657    .0184
2.49   .99361    .00639    .0180
2.50   .99379    .00621    .0175
2.51   .99396    .00604    .0171
2.52   .99413    .00587    .0167
2.53   .99430    .00570    .0163
2.54   .99446    .00554    .0158
2.55   .99461    .00539    .0154
2.56   .99477    .00523    .0151
2.57   .99492    .00508    .0147
2.58   .99506    .00494    .0143
2.59   .99520    .00480    .0139
2.60   .99534    .00466    .0136
2.61   .99547    .00453    .0132
2.62   .99560    .00440    .0129
2.63   .99573    .00427    .0126
2.64   .99585    .00415    .0122
2.65   .99598    .00402    .0119
2.66   .99609    .00391    .0116
2.67   .99621    .00379    .0113
2.68   .99632    .00368    .0110
2.69   .99643    .00357    .0107
2.70   .99653    .00347    .0104
2.71   .99664    .00336    .0101
2.72   .99674    .00326    .0099
2.73   .99683    .00317    .0096
2.74   .99693    .00307    .0093
2.75   .99702    .00298    .0091
2.76   .99711    .00289    .0088
2.77   .99720    .00280    .0086
2.78   .99728    .00272    .0084
2.79   .99736    .00264    .0081
2.80   .99744    .00256    .0079
2.81   .99752    .00248    .0077
2.82   .99760    .00240    .0075
2.83   .99767    .00233    .0073
2.84   .99774    .00226    .0071
2.85   .99781    .00219    .0069
2.86   .99788    .00212    .0067
2.87   .99795    .00205    .0065
2.88   .99801    .00199    .0063
2.89   .99807    .00193    .0061
2.90   .99813    .00187    .0060
2.91   .99819    .00181    .0058
2.92   .99825    .00175    .0056
2.93   .99831    .00169    .0055
2.94   .99836    .00164    .0053
2.95   .99841    .00159    .0051
2.96   .99846    .00154    .0050
2.97   .99851    .00149    .0048
2.98   .99856    .00144    .0047
2.99   .99861    .00139    .0046
3.00   .99865    .00135    .0044
3.25   .99942    .00058    .0020
3.50   .99977    .00023    .0009
4.00   .99997    .00003    .0001


All values calculated by author using SPSS. 
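
The table values were produced with SPSS, but any row can be reproduced from the standard normal distribution. The Python sketch below (the function name is made up) treats the larger portion as the area below a non-negative z, the smaller portion as the area above it, and y as the height of the curve at z.

```python
from statistics import NormalDist

def z_table_row(z):
    """Larger portion, smaller portion and ordinate y for z >= 0."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    larger = standard_normal.cdf(z)
    smaller = 1 - larger
    y = standard_normal.pdf(z)
    return round(larger, 5), round(smaller, 5), round(y, 4)

print(z_table_row(1.00))  # compare with the row for z = 1.00
print(z_table_row(1.96))  # compare with the row for z = 1.96
```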
A.2 Critical values of the t-distribution 


         Two-Tailed Test   One-Tailed Test
df       0.05     0.01     0.05     0.01
1        12.71    63.66    6.31     31.82
2        4.30     9.92     2.92     6.96
3        3.18     5.84     2.35     4.54
4        2.78     4.60     2.13     3.75
5        2.57     4.03     2.02     3.36
6        2.45     3.71     1.94     3.14
7        2.36     3.50     1.89     3.00
8        2.31     3.36     1.86     2.90
9        2.26     3.25     1.83     2.82
10       2.23     3.17     1.81     2.76
11       2.20     3.11     1.80     2.72
12       2.18     3.05     1.78     2.68
13       2.16     3.01     1.77     2.65
14       2.14     2.98     1.76     2.62
15       2.13     2.95     1.75     2.60
16       2.12     2.92     1.75     2.58
17       2.11     2.90     1.74     2.57
18       2.10     2.88     1.73     2.55
19       2.09     2.86     1.73     2.54
20       2.09     2.85     1.72     2.53
21       2.08     2.83     1.72     2.52
22       2.07     2.82     1.72     2.51
23       2.07     2.81     1.71     2.50
24       2.06     2.80     1.71     2.49
25       2.06     2.79     1.71     2.49
26       2.06     2.78     1.71     2.48
27       2.05     2.77     1.70     2.47
28       2.05     2.76     1.70     2.47
29       2.05     2.76     1.70     2.46
30       2.04     2.75     1.70     2.46
35       2.03     2.72     1.69     2.44
40       2.02     2.70     1.68     2.42
45       2.01     2.69     1.68     2.41
50       2.01     2.68     1.68     2.40
60       2.00     2.66     1.67     2.39
70       1.99     2.65     1.67     2.38
80       1.99     2.64     1.66     2.37
90       1.99     2.63     1.66     2.37
100      1.98     2.63     1.66     2.36
∞ (z)    1.96     2.58     1.64     2.33


All values computed by the author using SPSS. 
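
Two rows of this table can be checked without special software, which makes a handy sanity test. With 1 degree of freedom the t-distribution is a Cauchy distribution, whose quantiles have a closed form involving the tangent function, and the bottom row is just the standard normal distribution. The Python sketch below uses those two facts; the function names are invented, and the other rows would need a proper t quantile function (for example `qt()` in R).

```python
from math import pi, tan
from statistics import NormalDist

def tail_area(alpha, two_tailed):
    """Probability in the upper tail beyond the critical value."""
    return alpha / 2 if two_tailed else alpha

def t_critical_df1(alpha, two_tailed=True):
    """Critical t for df = 1 (a Cauchy distribution)."""
    return tan(pi * (0.5 - tail_area(alpha, two_tailed)))

def z_critical(alpha, two_tailed=True):
    """Critical value for the bottom row (standard normal)."""
    return NormalDist().inv_cdf(1 - tail_area(alpha, two_tailed))

for alpha in (0.05, 0.01):
    print(alpha,
          round(t_critical_df1(alpha), 2), round(t_critical_df1(alpha, False), 2),
          round(z_critical(alpha), 2), round(z_critical(alpha, False), 2))
```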
A.3 Critical values of the F-distribution 



                          df (Numerator)
df (Denominator)  p       1        2        3        4        5        6        7        8        9        10
1                 0.05    161.45   199.50   215.71   224.58   230.16   233.99   236.77   238.88   240.54   241.88
                  0.01    4052.18  4999.50  5403.35  5624.58  5763.65  5858.99  5928.36  5981.07  6022.47  6055.85
2                 0.05    18.51    19.00    19.16    19.25    19.30    19.33    19.35    19.37    19.38    19.40
                  0.01    98.50    99.00    99.17    99.25    99.30    99.33    99.36    99.37    99.39    99.40
3                 0.05    10.13    9.55     9.28     9.12     9.01     8.94     8.89     8.85     8.81     8.79
                  0.01    34.12    30.82    29.46    28.71    28.24    27.91    27.67    27.49    27.35    27.23
4                 0.05    7.71     6.94     6.59     6.39     6.26     6.16     6.09     6.04     6.00     5.96
                  0.01    21.20    18.00    16.69    15.98    15.52    15.21    14.98    14.80    14.66    14.55
5                 0.05    6.61     5.79     5.41     5.19     5.05     4.95     4.88     4.82     4.77     4.74
                  0.01    16.26    13.27    12.06    11.39    10.97    10.67    10.46    10.29    10.16    10.05
6                 0.05    5.99     5.14     4.76     4.53     4.39     4.28     4.21     4.15     4.10     4.06
                  0.01    13.75    10.92    9.78     9.15     8.75     8.47     8.26     8.10     7.98     7.87
7                 0.05    5.59     4.74     4.35     4.12     3.97     3.87     3.79     3.73     3.68     3.64
                  0.01    12.25    9.55     8.45     7.85     7.46     7.19     6.99     6.84     6.72     6.62
8                 0.05    5.32     4.46     4.07     3.84     3.69     3.58     3.50     3.44     3.39     3.35
                  0.01    11.26    8.65     7.59     7.01     6.63     6.37     6.18     6.03     5.91     5.81
9                 0.05    5.12     4.26     3.86     3.63     3.48     3.37     3.29     3.23     3.18     3.14
                  0.01    10.56    8.02     6.99     6.42     6.06     5.80     5.61     5.47     5.35     5.26
10                0.05    4.96     4.10     3.71     3.48     3.33     3.22     3.14     3.07     3.02     2.98
                  0.01    10.04    7.56     6.55     5.99     5.64     5.39     5.20     5.06     4.94     4.85
11                0.05    4.84     3.98     3.59     3.36     3.20     3.09     3.01     2.95     2.90     2.85
                  0.01    9.65     7.21     6.22     5.67     5.32     5.07     4.89     4.74     4.63     4.54
12                0.05    4.75     3.89     3.49     3.26     3.11     3.00     2.91     2.85     2.80     2.75
                  0.01    9.33     6.93     5.95     5.41     5.06     4.82     4.64     4.50     4.39     4.30
13                0.05    4.67     3.81     3.41     3.18     3.03     2.92     2.83     2.77     2.71     2.67
                  0.01    9.07     6.70     5.74     5.21     4.86     4.62     4.44     4.30     4.19     4.10
14                0.05    4.60     3.74     3.34     3.11     2.96     2.85     2.76     2.70     2.65     2.60
                  0.01    8.86     6.51     5.56     5.04     4.69     4.46     4.28     4.14     4.03     3.94
15                0.05    4.54     3.68     3.29     3.06     2.90     2.79     2.71     2.64     2.59     2.54
                  0.01    8.68     6.36     5.42     4.89     4.56     4.32     4.14     4.00     3.89     3.80
16                0.05    4.49     3.63     3.24     3.01     2.85     2.74     2.66     2.59     2.54     2.49
                  0.01    8.53     6.23     5.29     4.77     4.44     4.20     4.03     3.89     3.78     3.69
17                0.05    4.45     3.59     3.20     2.96     2.81     2.70     2.61     2.55     2.49     2.45
                  0.01    8.40     6.11     5.18     4.67     4.34     4.10     3.93     3.79     3.68     3.59
18                0.05    4.41     3.55     3.16     2.93     2.77     2.66     2.58     2.51     2.46     2.41
                  0.01    8.29     6.01     5.09     4.58     4.25     4.01     3.84     3.71     3.60     3.51
19                0.05    4.38     3.52     3.13     2.90     2.74     2.63     2.54     2.48     2.42     2.38
                  0.01    8.18     5.93     5.01     4.50     4.17     3.94     3.77     3.63     3.52     3.43
20                0.05    4.35     3.49     3.10     2.87     2.71     2.60     2.51     2.45     2.39     2.35
                  0.01    8.10     5.85     4.94     4.43     4.10     3.87     3.70     3.56     3.46     3.37
22                0.05    4.30     3.44     3.05     2.82     2.66     2.55     2.46     2.40     2.34     2.30
                  0.01    7.95     5.72     4.82     4.31     3.99     3.76     3.59     3.45     3.35     3.26
24                0.05    4.26     3.40     3.01     2.78     2.62     2.51     2.42     2.36     2.30     2.25
                  0.01    7.82     5.61     4.72     4.22     3.90     3.67     3.50     3.36     3.26     3.17
26                0.05    4.23     3.37     2.98     2.74     2.59     2.47     2.39     2.32     2.27     2.22
                  0.01    7.72     5.53     4.64     4.14     3.82     3.59     3.42     3.29     3.18     3.09
28                0.05    4.20     3.34     2.95     2.71     2.56     2.45     2.36     2.29     2.24     2.19
                  0.01    7.64     5.45     4.57     4.07     3.75     3.53     3.36     3.23     3.12     3.03
30                0.05    4.17     3.32     2.92     2.69     2.53     2.42     2.33     2.27     2.21     2.16
                  0.01    7.56     5.39     4.51     4.02     3.70     3.47     3.30     3.17     3.07     2.98
35                0.05    4.12     3.27     2.87     2.64     2.49     2.37     2.29     2.22     2.16     2.11
                  0.01    7.42     5.27     4.40     3.91     3.59     3.37     3.20     3.07     2.96     2.88
40                0.05    4.08     3.23     2.84     2.61     2.45     2.34     2.25     2.18     2.12     2.08
                  0.01    7.31     5.18     4.31     3.83     3.51     3.29     3.12     2.99     2.89     2.80
45                0.05    4.06     3.20     2.81     2.58     2.42     2.31     2.22     2.15     2.10     2.05
                  0.01    7.23     5.11     4.25     3.77     3.45     3.23     3.07     2.94     2.83     2.74
50                0.05    4.03     3.18     2.79     2.56     2.40     2.29     2.20     2.13     2.07     2.03
                  0.01    7.17     5.06     4.20     3.72     3.41     3.19     3.02     2.89     2.78     2.70
60                0.05    4.00     3.15     2.76     2.53     2.37     2.25     2.17     2.10     2.04     1.99
                  0.01    7.08     4.98     4.13     3.65     3.34     3.12     2.95     2.82     2.72     2.63
80                0.05    3.96     3.11     2.72     2.49     2.33     2.21     2.13     2.06     2.00     1.95
                  0.01    6.96     4.88     4.04     3.56     3.26     3.04     2.87     2.74     2.64     2.55
100               0.05    3.94     3.09     2.70     2.46     2.31     2.19     2.10     2.03     1.97     1.93
                  0.01    6.90     4.82     3.98     3.51     3.21     2.99     2.82     2.69     2.59     2.50
150               0.05    3.90     3.06     2.66     2.43     2.27     2.16     2.07     2.00     1.94     1.89
                  0.01    6.81     4.75     3.91     3.45     3.14     2.92     2.76     2.63     2.53     2.44
300               0.05    3.87     3.03     2.63     2.40     2.24     2.13     2.04     1.97     1.91     1.86
                  0.01    6.72     4.68     3.85     3.38     3.08     2.86     2.70     2.57     2.47     2.38
500               0.05    3.86     3.01     2.62     2.39     2.23     2.12     2.03     1.96     1.90     1.85
                  0.01    6.69     4.65     3.82     3.36     3.05     2.84     2.68     2.55     2.44     2.36
1000              0.05    3.85     3.00     2.61     2.38     2.22     2.11     2.02     1.95     1.89     1.84
                  0.01    6.66     4.63     3.80     3.34     3.04     2.82     2.66     2.53     2.43     2.34

(Continued)

                          df (Numerator)
df (Denominator)  p       15       20       25       30       40       50       1000
1                 0.05    245.95   248.01   249.26   250.10   251.14   251.77   254.19
                  0.01    6157.31  6208.74  6239.83  6260.65  6286.79  6302.52  6362.70
2                 0.05    19.43    19.45    19.46    19.46    19.47    19.48    19.49
                  0.01    99.43    99.45    99.46    99.47    99.47    99.48    99.50
3                 0.05    8.70     8.66     8.63     8.62     8.59     8.58     8.53
                  0.01    26.87    26.69    26.58    26.50    26.41    26.35    26.14
4                 0.05    5.86     5.80     5.77     5.75     5.72     5.70     5.63
                  0.01    14.20    14.02    13.91    13.84    13.75    13.69    13.47
5                 0.05    4.62     4.56     4.52     4.50     4.46     4.44     4.37
                  0.01    9.72     9.55     9.45     9.38     9.29     9.24     9.03
6                 0.05    3.94     3.87     3.83     3.81     3.77     3.75     3.67
                  0.01    7.56     7.40     7.30     7.23     7.14     7.09     6.89
7                 0.05    3.51     3.44     3.40     3.38     3.34     3.32     3.23
                  0.01    6.31     6.16     6.06     5.99     5.91     5.86     5.66
8                 0.05    3.22     3.15     3.11     3.08     3.04     3.02     2.93
                  0.01    5.52     5.36     5.26     5.20     5.12     5.07     4.87
9                 0.05    3.01     2.94     2.89     2.86     2.83     2.80     2.71
                  0.01    4.96     4.81     4.71     4.65     4.57     4.52     4.32
10                0.05    2.85     2.77     2.73     2.70     2.66     2.64     2.54
                  0.01    4.56     4.41     4.31     4.25     4.17     4.12     3.92
11                0.05    2.72     2.65     2.60     2.57     2.53     2.51     2.41
                  0.01    4.25     4.10     4.01     3.94     3.86     3.81     3.61
12                0.05    2.62     2.54     2.50     2.47     2.43     2.40     2.30
                  0.01    4.01     3.86     3.76     3.70     3.62     3.57     3.37
13                0.05    2.53     2.46     2.41     2.38     2.34     2.31     2.21
                  0.01    3.82     3.66     3.57     3.51     3.43     3.38     3.18
14                0.05    2.46     2.39     2.34     2.31     2.27     2.24     2.14
                  0.01    3.66     3.51     3.41     3.35     3.27     3.22     3.02
15                0.05    2.40     2.33     2.28     2.25     2.20     2.18     2.07
                  0.01    3.52     3.37     3.28     3.21     3.13     3.08     2.88
16                0.05    2.35     2.28     2.23     2.19     2.15     2.12     2.02
                  0.01    3.41     3.26     3.16     3.10     3.02     2.97     2.76
17                0.05    2.31     2.23     2.18     2.15     2.10     2.08     1.97
                  0.01    3.31     3.16     3.07     3.00     2.92     2.87     2.66
18                0.05    2.27     2.19     2.14     2.11     2.06     2.04     1.92
                  0.01    3.23     3.08     2.98     2.92     2.84     2.78     2.58
19                0.05    2.23     2.16     2.11     2.07     2.03     2.00     1.88
                  0.01    3.15     3.00     2.91     2.84     2.76     2.71     2.50
20                0.05    2.20     2.12     2.07     2.04     1.99     1.97     1.85
                  0.01    3.09     2.94     2.84     2.78     2.69     2.64     2.43
22                0.05    2.15     2.07     2.02     1.98     1.94     1.91     1.79
                  0.01    2.98     2.83     2.73     2.67     2.58     2.53     2.32
24                0.05    2.11     2.03     1.97     1.94     1.89     1.86     1.74
                  0.01    2.89     2.74     2.64     2.58     2.49     2.44     2.22
26                0.05    2.07     1.99     1.94     1.90     1.85     1.82     1.70
                  0.01    2.81     2.66     2.57     2.50     2.42     2.36     2.14
28                0.05    2.04     1.96     1.91     1.87     1.82     1.79     1.66
                  0.01    2.75     2.60     2.51     2.44     2.35     2.30     2.08
30                0.05    2.01     1.93     1.88     1.84     1.79     1.76     1.63
                  0.01    2.70     2.55     2.45     2.39     2.30     2.25     2.02
35                0.05    1.96     1.88     1.82     1.79     1.74     1.70     1.57
                  0.01    2.60     2.44     2.35     2.28     2.19     2.14     1.90
40                0.05    1.92     1.84     1.78     1.74     1.69     1.66     1.52
                  0.01    2.52     2.37     2.27     2.20     2.11     2.06     1.82
45                0.05    1.89     1.81     1.75     1.71     1.66     1.63     1.48
                  0.01    2.46     2.31     2.21     2.14     2.05     2.00     1.75
50                0.05    1.87     1.78     1.73     1.69     1.63     1.60     1.45
                  0.01    2.42     2.27     2.17     2.10     2.01     1.95     1.70
60                0.05    1.84     1.75     1.69     1.65     1.59     1.56     1.40
                  0.01    2.35     2.20     2.10     2.03     1.94     1.88     1.62
80                0.05    1.79     1.70     1.64     1.60     1.54     1.51     1.34
                  0.01    2.27     2.12     2.01     1.94     1.85     1.79     1.51
100               0.05    1.77     1.68     1.62     1.57     1.52     1.48     1.30
                  0.01    2.22     2.07     1.97     1.89     1.80     1.74     1.45
150               0.05    1.73     1.64     1.58     1.54     1.48     1.44     1.24
                  0.01    2.16     2.00     1.90     1.83     1.73     1.66     1.35
300               0.05    1.70     1.61     1.54     1.50     1.43     1.39     1.17
                  0.01    2.10     1.94     1.84     1.76     1.66     1.59     1.25
500               0.05    1.69     1.59     1.53     1.48     1.42     1.38     1.14
                  0.01    2.07     1.92     1.81     1.74     1.63     1.57     1.20
1000              0.05    1.68     1.58     1.52     1.47     1.41     1.36     1.11
                  0.01    2.06     1.90     1.79     1.72     1.61     1.54     1.16


All values computed by author using SPSS. 
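
Most entries in this table need a proper F quantile function (for example `qf()` in R), but the column for 2 numerator degrees of freedom has a simple closed form, which gives a quick way to spot-check the table. For F with 2 and d degrees of freedom, the probability of exceeding f is (1 + 2f/d) raised to the power -d/2, so the critical value is (d/2)(p^(-2/d) - 1). The Python sketch below uses that identity; the function name is made up.

```python
def f_critical_numerator_df2(denominator_df, p):
    """Critical F for 2 numerator df, from the closed-form tail area."""
    d = denominator_df
    return (d / 2) * (p ** (-2 / d) - 1)

for d in (2, 10, 20, 1000):
    print(d,
          round(f_critical_numerator_df2(d, 0.05), 2),
          round(f_critical_numerator_df2(d, 0.01), 2))
```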
A.4 Critical values of the chi-square distribution 


        p
df      0.05      0.01
1       3.84      6.63
2       5.99      9.21
3       7.81      11.34
4       9.49      13.28
5       11.07     15.09
6       12.59     16.81
7       14.07     18.48
8       15.51     20.09
9       16.92     21.67
10      18.31     23.21
11      19.68     24.72
12      21.03     26.22
13      22.36     27.69
14      23.68     29.14
15      25.00     30.58
16      26.30     32.00
17      27.59     33.41
18      28.87     34.81
19      30.14     36.19
20      31.41     37.57
21      32.67     38.93
22      33.92     40.29
23      35.17     41.64
24      36.42     42.98
25      37.65     44.31
26      38.89     45.64
27      40.11     46.96
28      41.34     48.28
29      42.56     49.59
30      43.77     50.89
35      49.80     57.34
40      55.76     63.69
45      61.66     69.96
50      67.50     76.15
60      79.08     88.38
70      90.53     100.43
80      101.88    112.33
90      113.15    124.12
100     124.34    135.81
200     233.99    249.45
300     341.40    359.91
400     447.63    468.72
500     553.13    576.49
600     658.09    683.52
700     762.66    789.97
800     866.91    895.98
900     970.90    1001.63
1000    1074.68   1106.97


All values computed by author using SPSS. 
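
The first two rows of this table can be reproduced by hand, which is a useful check on the rest. With 1 degree of freedom, chi-square is the square of a standard normal score, so the critical value is the squared two-tailed z. With 2 degrees of freedom, the probability of exceeding x is exp(-x/2), so the critical value is -2 ln(p). The Python sketch below relies on those two facts; the function names are invented, and other rows would need a chi-square quantile function (for example `qchisq()` in R).

```python
from math import log
from statistics import NormalDist

def chi_square_critical_df1(p):
    """Critical value for 1 df: the squared two-tailed z."""
    z = NormalDist().inv_cdf(1 - p / 2)
    return z ** 2

def chi_square_critical_df2(p):
    """Critical value for 2 df, from the tail area exp(-x / 2)."""
    return -2 * log(p)

for p in (0.05, 0.01):
    print(p,
          round(chi_square_critical_df1(p), 2),
          round(chi_square_critical_df2(p), 2))
```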